FateZero: Fusing Attentions for Zero-shot Text-based Video Editing

Improves temporal consistency by reusing the attention maps obtained during inversion

arXiv (version v3, 2023/10/11)

Abstract

Proposes FateZero, which performs zero-shot text-based video editing via inversion.

DDIM Sampling:
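
For reference, one deterministic DDIM denoising step in its standard form. The notation is an assumption: α_t is the cumulative noise schedule, ε_θ the U-Net noise prediction, and p the text prompt.

```latex
z_{t-1} = \sqrt{\frac{\alpha_{t-1}}{\alpha_t}}\, z_t
        + \sqrt{\alpha_{t-1}} \left( \sqrt{\frac{1}{\alpha_{t-1}} - 1} - \sqrt{\frac{1}{\alpha_t} - 1} \right)
          \varepsilon_\theta(z_t, t, p)
```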

DDIM Inversion:
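
For reference, the inversion step in the same standard form, run from z₀ toward z_T under the usual assumption that the noise prediction changes little between adjacent steps. The notation is an assumption: α_t is the cumulative noise schedule and ε_θ the U-Net noise prediction.

```latex
z_{t+1} = \sqrt{\frac{\alpha_{t+1}}{\alpha_t}}\, z_t
        + \sqrt{\alpha_{t+1}} \left( \sqrt{\frac{1}{\alpha_{t+1}} - 1} - \sqrt{\frac{1}{\alpha_t} - 1} \right)
          \varepsilon_\theta(z_t, t, p_{src})
```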

Inversion Attention Fusion

Using the inverted noise directly produces inconsistent frames, because errors accumulate over the many denoising steps and the classifier-free guidance (cfg) weight is high.

Given the source prompt p_src and the latent z₀, run DDIM inversion and store the self-attention maps, the cross-attention maps, and z_T.
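
The storing step can be sketched roughly as below. This is a minimal NumPy sketch, not the authors' code: `toy_unet` is a hypothetical stand-in for the real U-Net, `ddim_coeffs` and `invert_and_store` are illustrative names, and `alphas` is assumed to be the cumulative noise schedule ordered from t = 0 to t = T.

```python
import numpy as np

def ddim_coeffs(alpha_from, alpha_to):
    """Coefficients (a, b) of the deterministic DDIM move z_to = a * z_from + b * eps."""
    a = np.sqrt(alpha_to / alpha_from)
    b = np.sqrt(alpha_to) * (np.sqrt(1.0 / alpha_to - 1.0) - np.sqrt(1.0 / alpha_from - 1.0))
    return a, b

def toy_unet(z, t, prompt):
    """Hypothetical stand-in for the U-Net: returns (eps, self_attn, cross_attn)."""
    eps = 0.1 * z                                   # fake noise prediction
    n = z.size
    self_attn = np.full((n, n), 1.0 / n)            # fake (pixels, pixels) map
    cross_attn = np.full((n, len(prompt)), 1.0 / len(prompt))  # fake (pixels, tokens) map
    return eps, self_attn, cross_attn

def invert_and_store(z0, prompt_src, alphas):
    """Run DDIM inversion from z_0 to z_T, saving the attention maps of every step."""
    z = z0
    store = {"self": [], "cross": []}
    for t in range(len(alphas) - 1):
        eps, s, c = toy_unet(z, t, prompt_src)
        store["self"].append(s)
        store["cross"].append(c)
        a, b = ddim_coeffs(alphas[t], alphas[t + 1])
        z = a * z + b * eps                         # one inversion step toward z_T
    return z, store

alphas = np.linspace(0.999, 0.1, 6)                 # assumed toy schedule, 5 steps
z_T, store = invert_and_store(np.ones(4), ["a", "cat", "runs"], alphas)
```

The stored lists hold one map per inversion step, so the editing stage can look them up by timestep.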

In the editing stage, the stored attention maps are fused with those of the editing pass to obtain the noise to remove at each step.

p_edit is the edited prompt. For the cross-attention maps, only the parts belonging to unedited words are taken from the stored source maps.
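
A minimal sketch of the cross-attention part of this fusion, assuming p_src and p_edit have the same number of tokens and are aligned word by word. The function name, the prompts, and the array shapes are illustrative assumptions.

```python
import numpy as np

def fuse_cross_attention(c_src, c_edit, edited):
    """Keep stored source maps for unchanged tokens, editing-pass maps for edited tokens.

    c_src, c_edit: (pixels, tokens) cross-attention maps.
    edited: boolean (tokens,) array, True where p_edit differs from p_src.
    """
    return np.where(edited[None, :], c_edit, c_src)

# Hypothetical aligned prompts; only the second word is edited.
p_src = ["a", "cat", "runs"]
p_edit = ["a", "dog", "runs"]
edited = np.array([s != e for s, e in zip(p_src, p_edit)])

c_src = np.zeros((2, 3))   # placeholder for maps stored during inversion
c_edit = np.ones((2, 3))   # placeholder for maps from the editing pass
fused = fuse_cross_attention(c_src, c_edit, edited)
```

Only the column of the edited word comes from the editing pass; the other columns keep the stored source values.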

Attention Map Blending

Using s^src as-is leaks the source structure into the result (column 5), while using s^edit loses the original structure (column 4).

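A binary mask M is extracted by thresholding the source cross-attention map of the edited word at τ. Outside the mask s^src is used, and inside the mask s^edit is used.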
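
A minimal sketch of the blending, assuming M comes from thresholding the stored cross-attention map of the edited word and is broadcast over the rows (query pixels) of the self-attention map. The function name, the shapes, and the default τ = 0.3 are illustrative assumptions.

```python
import numpy as np

def blend_self_attention(s_src, s_edit, c_src_word, tau=0.3):
    """Return M * s_edit + (1 - M) * s_src with M = (c_src_word > tau).

    s_src, s_edit: (pixels, pixels) self-attention maps, rows are query pixels.
    c_src_word: (pixels,) stored cross-attention of the edited word.
    """
    mask = (c_src_word > tau).astype(s_src.dtype)[:, None]  # (pixels, 1)
    return mask * s_edit + (1.0 - mask) * s_src

s_src = np.zeros((3, 3))               # placeholder source map
s_edit = np.ones((3, 3))               # placeholder editing map
c_word = np.array([0.1, 0.5, 0.9])     # pixels 1 and 2 exceed tau
blended = blend_self_attention(s_src, s_edit, c_word)
```

Pixel 0 falls outside the mask and keeps the source row; pixels 1 and 2 fall inside it and take the editing rows.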

Spatial-Temporal Self-Attention

Similar to the causal attention in Tune-A-Video, the middle frame z^w = z^Round[n/2] is appended to K and V in self-attention.
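
A minimal sketch of this attention for a single head, assuming 0-based frame indices with `n // 2` standing in for Round[n/2]. The function names, the projection matrices, and the shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_temporal_attention(frames, w_q, w_k, w_v):
    """frames: (n, tokens, dim). Each frame attends to itself and to the middle frame."""
    n = frames.shape[0]
    mid = frames[n // 2]                              # middle frame z^w
    out = []
    for z in frames:
        kv_in = np.concatenate([z, mid], axis=0)      # [z^i; z^w], (2 * tokens, dim)
        q, k, v = z @ w_q, kv_in @ w_k, kv_in @ w_v
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (tokens, 2 * tokens)
        out.append(attn @ v)
    return np.stack(out)                              # (n, tokens, dim)

rng = np.random.default_rng(0)
frames = rng.normal(size=(4, 3, 8))                   # 4 frames, 3 tokens, dim 8
eye = np.eye(8)                                       # identity projections for the demo
out = spatial_temporal_attention(frames, eye, eye, eye)
```

Queries come only from the current frame, so the output keeps the per-frame shape while every frame shares the middle frame as a common reference.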

Project page: fate-zero-edit.github.io
