A Recipe for Scaling up Text-to-Video Generation with Text-free Videos (TF-T2V)

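Uses the high-quality unlabeled videos that are abundant on video platforms for training.

Whether it is thanks to the brute-force end-to-end joint training or to the coherence loss, the model is simple and looks like nothing special, yet the results are remarkably smooth.

It reinforces my impression that the recent approaches which train the temporal layers separately end up with noticeably worse output quality.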

[Project Page]

[Github]

[arXiv](Current version v1)

Abstract

Captioned video data is scarce, while collecting videos from platforms such as YouTube is far easier.

The paper proposes TF-T2V, which can learn from text-free videos.

Method

Preliminaries of video diffusion model

Uses a diffusion model with a VAE and a 3D U-Net, trained with v-prediction (link at the bottom).
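As a reminder of what v-prediction means, here is a minimal sketch of the standard parameterization, where the network regresses v = alpha_t * eps - sigma_t * x0 instead of the noise. It assumes a variance-preserving schedule (alpha_t^2 + sigma_t^2 = 1) and uses flat Python lists in place of latent tensors; the function names are my own and not taken from the TF-T2V code.

```python
import math

def noisy_sample(x0, eps, alpha_t, sigma_t):
    """Forward diffusion: z_t = alpha_t * x0 + sigma_t * eps."""
    return [alpha_t * x + sigma_t * e for x, e in zip(x0, eps)]

def v_target(x0, eps, alpha_t, sigma_t):
    """Regression target under v-prediction: v = alpha_t * eps - sigma_t * x0."""
    return [alpha_t * e - sigma_t * x for x, e in zip(x0, eps)]

def recover_x0(z_t, v, alpha_t, sigma_t):
    """Invert the parameterization: x0 = alpha_t * z_t - sigma_t * v.

    Only valid when alpha_t**2 + sigma_t**2 == 1 (assumed here).
    """
    return [alpha_t * z - sigma_t * vi for z, vi in zip(z_t, v)]

# Illustrative values only: an angle of 0.3 rad gives alpha^2 + sigma^2 = 1.
alpha_t, sigma_t = math.cos(0.3), math.sin(0.3)
x0, eps = [0.5, -1.0], [0.1, 0.2]
z_t = noisy_sample(x0, eps, alpha_t, sigma_t)
v = v_target(x0, eps, alpha_t, sigma_t)
x0_hat = recover_x0(z_t, v, alpha_t, sigma_t)  # matches x0 up to float error
```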

TF-T2V

Spatial appearance generation

The content branch takes image and text conditions and uses them to guide content generation.

During training the conditions are randomly dropped so that each one can control the content on its own.
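The random condition dropping can be sketched roughly as follows. The drop probabilities are hypothetical placeholders (the notes above do not give the values), and `None` stands in for whatever "empty" embedding the real model would use.

```python
import random

def drop_conditions(text_cond, image_cond,
                    p_drop_text=0.1, p_drop_image=0.1, rng=random):
    """Independently drop each condition with its own probability.

    p_drop_text / p_drop_image are illustrative values, not from the paper.
    A dropped condition is returned as None.
    """
    text = None if rng.random() < p_drop_text else text_cond
    image = None if rng.random() < p_drop_image else image_cond
    return text, image

# With a seeded generator the dropping is reproducible.
rng = random.Random(0)
text, image = drop_conditions("a dog running", "center_frame", rng=rng)
```

Because the two draws are independent, training sees text-only, image-only, both, and neither, which is what lets each condition steer the content by itself at inference time.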

Motion dynamic synthesis

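The widely used WebVid10M dataset is watermarked and low-resolution.

So high-quality text-free videos are collected from platforms such as YouTube, and the model is trained on the image-to-video task with the center frame as the condition.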
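A minimal sketch of how a text-free clip could be turned into an image-to-video training sample. The dictionary layout and the choice of `len(frames) // 2` as the "center" for even-length clips are my assumptions, not details from the paper.

```python
def build_text_free_sample(frames):
    """Build an image-to-video sample from a captionless clip.

    The center frame becomes the image condition and the text
    condition is left empty (None). Field names are hypothetical.
    """
    if not frames:
        raise ValueError("video must contain at least one frame")
    center = frames[len(frames) // 2]
    return {"frames": list(frames), "image_cond": center, "text_cond": None}

sample = build_text_free_sample(["f0", "f1", "f2", "f3", "f4"])
# sample["image_cond"] is "f2", the middle of the five frames
```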

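In addition, extra high-quality text-video pairs are used to train text-to-video and image-to-video generation.

On top of the training loss applied to each frame individually, the paper proposes a temporal coherence loss that takes additional signals, such as the motion between two frames, as supervision: ~~the difference of the frame-to-frame differences~~

This loss is expected to encourage plausible temporal dynamics and to reduce flicker between frames.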
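My reading of the temporal coherence loss, as a rough sketch: take the difference between adjacent frames for both the prediction and the target, then penalize the mismatch between those two sets of differences. Which quantity is actually compared (for example the predicted v versus its target) and the exact normalization are assumptions here; frames are flat lists of floats for simplicity.

```python
def frame_diffs(frames):
    """Element-wise difference between each pair of adjacent frames."""
    return [[b - a for a, b in zip(f0, f1)]
            for f0, f1 in zip(frames, frames[1:])]

def temporal_coherence_loss(pred_frames, target_frames):
    """Mean squared error between predicted and target frame differences."""
    if len(pred_frames) != len(target_frames):
        raise ValueError("prediction and target must have the same length")
    if len(pred_frames) < 2:
        raise ValueError("need at least two frames to measure motion")
    total, count = 0.0, 0
    for dp, dt in zip(frame_diffs(pred_frames), frame_diffs(target_frames)):
        for a, b in zip(dp, dt):
            total += (a - b) ** 2
            count += 1
    return total / count

# Predicted motion is +1 per frame, target motion is +3 per frame.
loss = temporal_coherence_loss([[0.0], [1.0]], [[0.0], [3.0]])
```

Note that a constant per-pixel offset shared by all frames gives zero coherence loss, since only frame-to-frame changes are compared; the per-frame loss is what has to catch that kind of error. A single image (one frame) has no adjacent pair, so this term would presumably be skipped for image data.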

Training and inference

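To get the complementary benefits of spatial appearance generation and motion synthesis, the whole model is jointly optimized end-to-end.

Images are treated as single-frame videos.

The paper seems to leave quite a lot out, and the overview figure is somewhat confusing as well.

Anyway, as far as I understand it, there is one U-Net rather than two, and text-to-image, image-to-video, and text-to-video data are simply thrown in at random during end-to-end training?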
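To make that reading concrete, here is a toy sketch of a single model being fed a random mix of the three task types, with images wrapped as one-frame videos. The equal mixing weights, the per-step sampling, and the sample layout are all my guesses; the notes above do not say how the data is actually mixed.

```python
import random

TASKS = ("text-to-image", "image-to-video", "text-to-video")

def pick_task(rng, weights=(1.0, 1.0, 1.0)):
    """Sample which kind of data the next step uses (weights are hypothetical)."""
    return rng.choices(TASKS, weights=weights, k=1)[0]

def to_video(sample):
    """Treat an image as a single-frame video; pass videos through unchanged."""
    if "frames" in sample:
        return sample["frames"]
    return [sample["image"]]

def training_schedule(num_steps, seed=0):
    """List of task names, one per step, for a single shared model."""
    rng = random.Random(seed)
    return [pick_task(rng) for _ in range(num_steps)]

schedule = training_schedule(8)
one_frame = to_video({"image": "img0"})  # a list holding the single image
```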

Experiments

Built on ModelScopeT2V and VideoComposer.

Training: DDPM with 1000 steps. Inference: DDIM with 50 steps.
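For reference, going from 1000 training steps to 50 DDIM steps usually means sampling on an evenly spaced subset of the training timesteps. The sketch below shows one common spacing; whether TF-T2V uses exactly this spacing is an assumption.

```python
def ddim_timesteps(num_train_steps=1000, num_inference_steps=50):
    """Evenly spaced subset of training timesteps, in sampling order (high to low)."""
    if not 0 < num_inference_steps <= num_train_steps:
        raise ValueError("need 0 < num_inference_steps <= num_train_steps")
    stride = num_train_steps // num_inference_steps
    steps = [i * stride for i in range(num_inference_steps)]
    return steps[::-1]

timesteps = ddim_timesteps()
# 50 timesteps, starting at 980 and stepping down by 20 to 0
```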

Project Page

tf-t2v.github.io
