Dynamic Typography: Bringing Text to Life via Video Diffusion Prior

Abstract

An end-to-end optimization framework that animates text according to a user prompt.

[Project Page]

[Github]

[arXiv](2024/04/18 version v2)

Preliminary

Vector Representation and Fonts

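The outline of each character is extracted with the FreeType font library and converted into Bézier curves, a vector representation that is not tied to any particular resolution.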
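As a small illustration of why a control-point representation is resolution independent, the sketch below evaluates one Bézier segment at any sampling density. It assumes cubic segments (four control points); the function names `cubic_bezier` and `sample_segment` and the toy control points are hypothetical and not taken from the paper's code.

```python
def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate one cubic Bezier segment at parameter t in [0, 1]."""
    s = 1.0 - t
    # Bernstein basis weights for a degree-3 curve; they always sum to 1.
    w = (s ** 3, 3 * s ** 2 * t, 3 * s * t ** 2, t ** 3)
    pts = (p0, p1, p2, p3)
    x = sum(wi * p[0] for wi, p in zip(w, pts))
    y = sum(wi * p[1] for wi, p in zip(w, pts))
    return (x, y)


def sample_segment(ctrl, n):
    """Sample n + 1 evenly spaced parameter values along a 4-point segment."""
    return [cubic_bezier(*ctrl, i / n) for i in range(n + 1)]


# Toy control points (assumption, not from a real font outline).
segment = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
coarse = sample_segment(segment, 4)   # 5 points
fine = sample_segment(segment, 64)    # 65 points from the same 4 control points
```

The same four control points produce both the coarse and the fine polyline, which is why only the control points need to be stored or optimized.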

Score Distillation Sampling

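[SDS paper review]

SDS is used to distill the knowledge of a pretrained text-to-video model.

SDS operates on pixels and cannot be applied to vectors directly, so DiffVG is used as a differentiable rasterizer.

DiffVG converts a vector representation into a pixel representation in a differentiable way.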
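For reference, the SDS gradient as defined in the DreamFusion paper can be written as below. The symbols are the usual ones and are not notation from this post: θ are the parameters being optimized, x = R(θ) is the rendered output (here, the DiffVG rasterization), ε_φ is the pretrained noise predictor, y is the text prompt, and w(t) is a timestep weighting. How the paper adapts this to its video model is not covered here.

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\epsilon_\phi(x_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
\qquad x = \mathcal{R}(\theta)
```

The factor ∂x/∂θ is the reason a differentiable rasterizer is required: without it, the gradient could not flow from pixels back to the control points.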

Method

Problem Formulation

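The input letter is represented as a set of Bézier control points,

and each frame of the output video is likewise represented as a set of control points.

The goal is to learn, for every frame, a displacement that is added to the control points of the original letter to produce the final video.

Non-end-to-end approaches suffer from distortion and artifacts because their priors conflict.

The paper therefore proposes an end-to-end architecture that maps the original characters directly to the animation.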
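The formulation above can be sketched in a few lines: every frame is the original control points plus a per-frame displacement. The displacements here are hand-written toy values. In the actual method they are predicted by the learned fields, and `animate` is a hypothetical helper name.

```python
def animate(base_points, displacements):
    """Return per-frame control points: frame k = base_points + displacements[k]."""
    frames = []
    for frame_disp in displacements:
        if len(frame_disp) != len(base_points):
            raise ValueError("each frame needs one displacement per control point")
        frames.append([(x + dx, y + dy)
                       for (x, y), (dx, dy) in zip(base_points, frame_disp)])
    return frames


# Three control points of a toy letter (assumption, not a real glyph).
letter = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0)]
# Frame 0 leaves the letter untouched, frame 1 shifts it right by 0.1.
disp = [[(0.0, 0.0)] * 3, [(0.1, 0.0)] * 3]
video = animate(letter, disp)
```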

Base Field and Motion Field

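The coordinates are projected into a higher-dimensional space by applying the same positional encoding as NeRF to each dimension,

and then coordinate-based MLPs, called the base field and the motion field, are applied.

To model motion better, it is split into a global motion shared by all control points and a local motion applied to each control point individually.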

# Excerpt from the model constructor; requires `import torch.nn as nn`.
# Global motion: heads that predict one transform per frame, shared by all control points.
if predict_global_frame_deltas:
    # Weighting factors for each component of the global transform.
    self.rotation_weight = rotation_weight
    self.scale_weight = scale_weight
    self.shear_weight = shear_weight
    self.translation_weight = translation_weight

    # Shared trunk whose features feed every global head below.
    self.frames_rigid_shared = nn.Sequential(nn.Linear(int(input_dim * self.inter_dim / 2), inter_dim),
                                             nn.LayerNorm(inter_dim),
                                             nn.LeakyReLU(),
                                             nn.Linear(inter_dim, inter_dim),
                                             nn.LayerNorm(inter_dim),
                                             nn.LeakyReLU())

    # Translation head: 2 values per frame.
    self.frames_rigid_translation = nn.Sequential(nn.Linear(inter_dim, inter_dim),
                                                  nn.LayerNorm(inter_dim),
                                                  nn.LeakyReLU(),
                                                  nn.Linear(inter_dim, inter_dim),
                                                  nn.LayerNorm(inter_dim),
                                                  nn.LeakyReLU(),
                                                  nn.Linear(inter_dim, inter_dim),
                                                  nn.LayerNorm(inter_dim),
                                                  nn.LeakyReLU(),
                                                  nn.Linear(inter_dim, self.num_frames * 2))

    # Rotation head: 1 value per frame.
    self.frames_rigid_rotation = nn.Sequential(nn.Linear(inter_dim, inter_dim),
                                               nn.LayerNorm(inter_dim),
                                               nn.LeakyReLU(),
                                               nn.Linear(inter_dim, self.num_frames * 1))

    # Shear head: 2 values per frame.
    self.frames_rigid_shear = nn.Sequential(nn.Linear(inter_dim, inter_dim),
                                            nn.LayerNorm(inter_dim),
                                            nn.LeakyReLU(),
                                            nn.Linear(inter_dim, self.num_frames * 2))

    # Scale head: 2 values per frame.
    self.frames_rigid_scale = nn.Sequential(nn.Linear(inter_dim, inter_dim),
                                            nn.LayerNorm(inter_dim),
                                            nn.LeakyReLU(),
                                            nn.Linear(inter_dim, self.num_frames * 2))

    # Collect the global sub-networks in one ModuleList.
    self.global_layers = nn.ModuleList([self.frames_rigid_shared,
                                        self.frames_rigid_translation,
                                        self.frames_rigid_rotation,
                                        self.frames_rigid_shear,
                                        self.frames_rigid_scale])

# Local motion: MLP that predicts the per-control-point displacements.
self.local_model = nn.Sequential(
    nn.Linear(int(input_dim * self.inter_dim / 2), inter_dim),
    nn.LayerNorm(inter_dim),
    nn.LeakyReLU(),
    nn.Linear(inter_dim, inter_dim),
    nn.LayerNorm(inter_dim),
    nn.LeakyReLU(),
    nn.Linear(inter_dim, self.input_dim))
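
The global heads above emit, per frame, a translation (2 values), a rotation (1), a shear (2) and a scale (2). One plausible way to turn those numbers into motion is to compose them into an affine transform shared by all control points and then add the local displacement. The sketch below does that in plain Python. The composition order (rotation, then shear, then scale), the shear parameterization and the helper names `global_affine` and `move_points` are assumptions for illustration and may differ from the repository.

```python
import math


def global_affine(theta, scale, shear, translation):
    """Compose rotation @ shear @ scale into a 2x2 matrix, plus a translation."""
    sx, sy = scale
    hx, hy = shear
    c, s = math.cos(theta), math.sin(theta)
    # shear @ scale, with shear = [[1, hx], [hy, 1]] and scale = diag(sx, sy)
    m = [[sx, hx * sy],
         [hy * sx, sy]]
    # rotation @ (shear @ scale), with rotation = [[c, -s], [s, c]]
    a = [[c * m[0][0] - s * m[1][0], c * m[0][1] - s * m[1][1]],
         [s * m[0][0] + c * m[1][0], s * m[0][1] + c * m[1][1]]]
    return a, translation


def move_points(points, local_deltas, theta, scale, shear, translation):
    """Apply the shared global transform, then add each point's local delta."""
    a, (tx, ty) = global_affine(theta, scale, shear, translation)
    out = []
    for (x, y), (dx, dy) in zip(points, local_deltas):
        gx = a[0][0] * x + a[0][1] * y + tx
        gy = a[1][0] * x + a[1][1] * y + ty
        out.append((gx + dx, gy + dy))
    return out


# One frame: quarter-turn rotation and a small shift, no local motion.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
still = [(0.0, 0.0)] * 4
frame = move_points(square, still, math.pi / 2, (1.0, 1.0), (0.0, 0.0), (0.5, 0.0))
```

Keeping the global part to a handful of parameters per frame means large, coherent movement is cheap to express, while the local deltas only need to capture what the shared transform cannot.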

The base field and the motion field are optimized jointly through SDS, guided by the user prompt.

Legibility Regularization

Preserving the legibility of the original characters is important, so a legibility loss L_legibility is added.

Mesh-based Structure Preservation Regularization

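Frequent intersections between Bézier curves create holes, and the unconstrained degrees of freedom of the motion cause inconsistency between frames.

Delaunay triangulation is applied to the base shape, and the differences between corresponding angles of the triangular meshes across frames are penalized as a regularizer, so that the geometric structure is preserved between frames. The same regularizer is also applied between the shapes before and after the base field.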
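A minimal sketch of an angle-based penalty of this kind follows. It assumes the triangulation is already available as index triples computed once on the base shape (for example from `scipy.spatial.Delaunay(points).simplices`) and reused for every frame. The squared-difference form and the names `triangle_angles` and `angle_loss` are assumptions, since the post only says that corresponding angle differences are used.

```python
import math


def triangle_angles(a, b, c):
    """Interior angles (radians) at vertices a, b, c of a triangle."""
    def angle(p, q, r):
        # Angle at p between edges p->q and p->r.
        v1 = (q[0] - p[0], q[1] - p[1])
        v2 = (r[0] - p[0], r[1] - p[1])
        dot = v1[0] * v2[0] + v1[1] * v2[1]
        cross = v1[0] * v2[1] - v1[1] * v2[0]
        return math.atan2(abs(cross), dot)
    return angle(a, b, c), angle(b, c, a), angle(c, a, b)


def angle_loss(points_a, points_b, triangles):
    """Mean squared difference of corresponding triangle angles."""
    total, count = 0.0, 0
    for i, j, k in triangles:
        ang_a = triangle_angles(points_a[i], points_a[j], points_a[k])
        ang_b = triangle_angles(points_b[i], points_b[j], points_b[k])
        total += sum((x - y) ** 2 for x, y in zip(ang_a, ang_b))
        count += 3
    return total / count


# Toy mesh: a unit square split into two triangles (fixed connectivity).
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
tris = [(0, 1, 2), (0, 2, 3)]
moved = [(2 * x + 3.0, 2 * y - 1.0) for x, y in square]   # scale + shift
sheared = [(x + y, y) for x, y in square]                  # distorts angles

loss_moved = angle_loss(square, moved, tris)
loss_sheared = angle_loss(square, sheared, tris)
```

Because angles do not change under translation, rotation or uniform scaling, a penalty like this leaves such motion free and only discourages distortions of the mesh, as the sheared example shows.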

Frequency-based Encoding and Annealing

The NeRF positional encoding, which applies sinusoidal functions to the coordinates, lets the MLP model high-frequency information more effectively.
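A minimal sketch of the NeRF-style encoding follows, written in plain Python rather than the tensor code a real model would use. The function names and the number of frequency bands in the example are assumptions.

```python
import math


def positional_encoding(p, num_freqs):
    """Encode a scalar p as (sin(2^k * pi * p), cos(2^k * pi * p)) for k < num_freqs."""
    out = []
    for k in range(num_freqs):
        freq = (2.0 ** k) * math.pi
        out.extend((math.sin(freq * p), math.cos(freq * p)))
    return out


def encode_point(xy, num_freqs):
    """Encode each coordinate independently and concatenate the results."""
    return [v for c in xy for v in positional_encoding(c, num_freqs)]


# A 2D control point becomes a 2 * 2 * num_freqs dimensional feature.
feat = encode_point((0.25, 0.5), 4)
```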

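Following Nerfies, annealing is also performed: to model large motions first and progressively finer motions afterwards, each frequency band of the MLP's positional encoding is weighted as a function of the training iteration t.

This yields high-quality motion that is both smooth and detailed.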
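The coarse-to-fine schedule can be sketched with the window function from Nerfies, where band j gets weight (1 - cos(π · clamp(α - j, 0, 1))) / 2 and α grows linearly with the training step. The hyperparameters in the example (`anneal_steps=100`, `num_freqs=4`) and the helper names are assumptions, not values from the paper.

```python
import math


def band_weight(alpha, j):
    """Nerfies window: 0 while alpha <= j, rising smoothly to 1 at alpha >= j + 1."""
    x = min(max(alpha - j, 0.0), 1.0)
    return 0.5 * (1.0 - math.cos(math.pi * x))


def anneal_alpha(step, anneal_steps, num_freqs):
    """Linear schedule alpha(t) = num_freqs * t / anneal_steps, capped at num_freqs."""
    return num_freqs * min(step / anneal_steps, 1.0)


def band_weights(step, anneal_steps, num_freqs):
    """Weights for frequency bands 0..num_freqs-1 at a given training step."""
    alpha = anneal_alpha(step, anneal_steps, num_freqs)
    return [band_weight(alpha, j) for j in range(num_freqs)]


early = band_weights(25, 100, 4)    # only the lowest band is fully on
late = band_weights(100, 100, 4)    # every band is fully on
```

Each weight multiplies the sin/cos pair of its frequency band, so low frequencies (large motions) are available from the start and higher frequencies (fine motions) are phased in as training proceeds.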

Experiments

🪄 animate your word!

animate-your-word.github.io


Ostin X
