VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM

Generates consistent multi-scene videos using an LLM, an image model, and a video model

[arXiv](2024/01/02 version v1)

Abstract

Proposes VideoDrafter, which uses an LLM to generate multi-scene videos with consistent content.

1. Multi-Scene Video Script Generation

2. Entity Reference Image Generation

3. Video Scene Generation

For the LLM, ChatGLM3-6B (a Chinese-English bilingual model) is adopted in consideration of deployment flexibility and inference efficiency.

The LLM is asked to generate the script in the following format.

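The number of scenes N is decided by the LLM, and the LLM is instructed to use strictly identical names for the same entity across all scenes in which it appears.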

Then, through a multi-round dialogue, the LLM is asked to describe each entity in detail.
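The naming constraint above can be checked mechanically once a script has been parsed. The following is a minimal sketch, not the authors' code: the `Scene` fields (`prompt`, `foreground`, `background`) and both helper functions are hypothetical names chosen for illustration, and the real script format may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Scene:
    # Hypothetical fields; the actual script format is not reproduced here.
    prompt: str
    foreground: List[str]
    background: str


def collect_entities(scenes):
    """Return the set of entity names used across all scenes."""
    names = set()
    for scene in scenes:
        names.update(scene.foreground)
        names.add(scene.background)
    return names


def find_near_duplicates(names):
    """Group names that differ only by case or whitespace.

    Such groups are likely one entity spelled inconsistently, which
    would break the one-name-per-entity requirement.
    """
    by_key = {}
    for name in sorted(names):
        key = " ".join(name.lower().split())
        by_key.setdefault(key, []).append(name)
    return [group for group in by_key.values() if len(group) > 1]
```

If `find_near_duplicates` returns a non-empty list, one option is to ask the LLM to regenerate the script or to normalize the names before requesting the per-entity descriptions.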

The authors say they did not use GPT-4 because of its cost. Additional measures to make the LLM's script output more stable:

A reference image for each entity is explicitly generated with a pretrained diffusion model, and U2-Net is used to segment the foreground from the background.

VideoDrafter-Img

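The existing attention module of Stable Diffusion is modified so that it can handle text, foreground, and background conditions. When there are multiple foreground entities, they are concatenated along the channel dimension.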
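As a rough sketch of what handling three conditions could look like, one option is to run a separate cross-attention (CA) over each condition and sum the results. The additive form, the weights \alpha and \beta, and the symbol names below are assumptions for illustration; this summary does not state the exact combination rule.

```latex
% x: latent features; c_text, c_fg, c_bg: text, foreground, background conditions
% alpha, beta: assumed scalar weights (hypothetical)
y = \mathrm{CA}(x, c_{\mathrm{text}})
  + \alpha \, \mathrm{CA}(x, c_{\mathrm{fg}})
  + \beta \, \mathrm{CA}(x, c_{\mathrm{bg}}),
\qquad
c_{\mathrm{fg}} = \mathrm{concat}\big(c_{\mathrm{fg}}^{(1)}, \dots, c_{\mathrm{fg}}^{(K)}\big)
```

Here K is the number of foreground entities in the scene, so a scene with one entity reduces to a single foreground condition.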

VideoDrafter-Vid

Stable Diffusion is extended to a spatio-temporal form, and its attention module is modified as follows.

The 400 action categories of Kinetics are represented as a [0,1]⁴⁰⁰ vector, which is mapped into the feature space through an embedding.
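To make the action condition concrete, here is a minimal sketch under stated assumptions: the vector is treated as multi-hot (entries exactly 0 or 1), the embedding is a plain linear map, and `EMBED_DIM`, the random initialization, and all function names are hypothetical. The model's real embedding layer is not described in this summary.

```python
import random

NUM_ACTIONS = 400  # number of Kinetics action categories
EMBED_DIM = 8      # assumed small dimension, for illustration only


def action_vector(active_ids, num_actions=NUM_ACTIONS):
    """Build a vector in [0,1]^num_actions with 1.0 at the active categories."""
    vec = [0.0] * num_actions
    for i in active_ids:
        if not 0 <= i < num_actions:
            raise ValueError(f"action id {i} out of range")
        vec[i] = 1.0
    return vec


def make_embedding(num_actions=NUM_ACTIONS, dim=EMBED_DIM, seed=0):
    """Create a randomly initialized embedding table (one row per category)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_actions)]


def embed(vec, table):
    """Linear embedding: weighted sum of table rows, i.e. vec @ table."""
    dim = len(table[0])
    out = [0.0] * dim
    for weight, row in zip(vec, table):
        if weight:
            for j in range(dim):
                out[j] += weight * row[j]
    return out
```

With a single active category, `embed` simply returns that category's row; with several, it returns their sum, which is one simple way to let a clip carry more than one action.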

In addition, to better capture temporal dependencies, several temporal convolutions are added after the spatial convolution.
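The idea of stacking temporal convolutions after a spatial layer can be illustrated with a toy example. This is a simplified sketch, not the model's implementation: each frame is reduced to a flat feature vector that is assumed to have already passed through the spatial layer, one kernel is shared across all channels, and the function names are hypothetical.

```python
def temporal_conv(frames, kernel):
    """Apply a 1D convolution along the time axis.

    frames: list of T feature vectors (lists of floats), one per frame.
    kernel: odd-length list of weights over neighbouring frames.
    Zero padding at both ends keeps the output at T frames.
    """
    if len(kernel) % 2 == 0:
        raise ValueError("kernel length must be odd")
    if not frames:
        return []
    half = len(kernel) // 2
    num_frames = len(frames)
    dim = len(frames[0])
    out = []
    for t in range(num_frames):
        acc = [0.0] * dim
        for k, weight in enumerate(kernel):
            src = t + k - half
            if 0 <= src < num_frames:
                for j in range(dim):
                    acc[j] += weight * frames[src][j]
        out.append(acc)
    return out


def stacked_temporal_convs(frames, kernels):
    """Apply several temporal convolutions in sequence."""
    for kernel in kernels:
        frames = temporal_conv(frames, kernel)
    return frames
```

Each additional temporal convolution widens the range of frames that can influence a given output frame, which is the intuition behind adding more than one.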

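To reflect camera movement in the generated video, the frames are reportedly modified directly after a few DDIM sampling steps. The method is supposed to be described in the supplementary material, which had not been released at the time of writing.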

VideoDrafter-Img is based on SD-2.1 and VideoDrafter-Vid on SDXL; both are implemented with the Diffusers codebase.

Project page: videodrafter.github.io
