SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds

A very fast T2I diffusion model that runs on mobile devices in under two seconds

Project Page

SnapFusion

snap-research.github.io

Abstract

A text-to-image diffusion model that runs on mobile devices in under two seconds.

Efficient UNet, efficient image decoder, step distillation.

Model Analysis of Stable Diffusion

Prerequisites of Stable Diffusion

Diffusion Model

DDIM의 denoising

Classifier-free guidance

Latent Diffusion Model(LDM, Stable Diffusion)

Benchmark and Analysis

Macro Prospective

Breakdown for UNet

Formulation of the UNet forward pass:

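In addition, parameters are concentrated in the middle stages of the UNet, where the channel count is large, while at high resolutions the computational complexity drives up latency.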

Architecture Optimizations

Efficient UNet

Robust Training

Each UNet block is executed stochastically; when a block is not executed, it is replaced by an identity mapping.

Applying this during training enables stable architecture evolution and an accurate evaluation of each block.
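The stochastic execution described above can be sketched in a few lines. This is only an illustration under stated assumptions: the blocks are toy functions standing in for ResNet and cross-attention blocks, and `keep_prob` is a hypothetical single probability, since these notes do not give the paper's actual schedule.

```python
import random

def stochastic_forward(x, blocks, keep_prob, rng):
    """Run each block with probability keep_prob; otherwise skip it (identity)."""
    executed = []
    for i, block in enumerate(blocks):
        if rng.random() < keep_prob:
            x = block(x)
            executed.append(i)
        # Otherwise the block acts as an identity mapping: x passes through unchanged.
    return x, executed

# Toy stand-ins for UNet blocks (hypothetical, for illustration only).
blocks = [lambda v: v + 1.0, lambda v: v * 2.0, lambda v: v - 3.0]

rng = random.Random(0)
out, executed = stochastic_forward(1.0, blocks, keep_prob=0.5, rng=rng)
print(out, executed)
```

Because the network is trained while blocks are randomly skipped, it learns to tolerate missing blocks, which is what makes it reasonable to remove a block later and measure the effect.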

Evaluation and Architecture Evolving

The network is modified using an evolution action set A.

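Blocks are added or removed one at a time while the resulting performance and latency are measured. Blocks that contribute strongly to the CLIP score or have low latency are kept, and the rest are removed. Repeating this procedure evolves the architecture.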
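A rough sketch of the removal half of this loop follows. It assumes a simple ranking by CLIP-score gain per millisecond and a fixed latency budget; both the ranking rule and all numbers are made up for illustration and are not taken from the paper, which also considers adding blocks and re-measures after each change.

```python
def evolve(blocks, latency_budget_ms):
    """Greedily drop the block with the worst CLIP gain per ms until within budget.

    blocks maps a block name to (clip_gain, latency_ms). All values are hypothetical.
    """
    kept = dict(blocks)
    removed = []
    while kept and sum(lat for _, lat in kept.values()) > latency_budget_ms:
        worst = min(kept, key=lambda name: kept[name][0] / kept[name][1])
        removed.append(worst)
        del kept[worst]
    return kept, removed

# Hypothetical measurements: high-resolution attention is slow, middle blocks are cheap.
candidate_blocks = {
    "down_attn_hi_res": (0.002, 40.0),
    "mid_resnet": (0.010, 5.0),
    "mid_attn": (0.020, 8.0),
    "up_attn_hi_res": (0.003, 35.0),
}

kept, removed = evolve(candidate_blocks, latency_budget_ms=50.0)
print(sorted(kept), removed)
```

In a real run the gain and latency of each block would be re-measured on the evolved network after every action rather than read from a fixed table.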

Efficient Image Decoder

An efficient image decoder is trained to replace the VAE decoder.

The efficient image decoder is trained on latent-image pairs obtained by feeding random text prompts to the original SD.

Step Distillation

Beyond architecture optimization, the paper proposes reducing the number of denoising steps through step distillation.

(progressive distillation = step distillation)

Overview of Distillation Pipeline

Following progressive distillation, Stable Diffusion is first fine-tuned to v-prediction.

Step distillation

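From the 32-step SD, a 16-step UNet with the performance of the 50-step model is obtained. (Empirically, direct distillation works better than progressive distillation, so the model is distilled directly from 32 steps instead of progressively from 128 steps.)
A 16-step efficient UNet is obtained in the same way.
With the 16-step UNet as the teacher and the 16-step efficient UNet as the student, step distillation yields an 8-step efficient UNet.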

CFG-Aware Step Distillation

Vanilla Step Distillation

The student is trained so that its 1-step output matches the teacher's 2-step output. (This is exactly the same as progressive distillation.)
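The idea of matching one student step to two teacher steps can be sketched on a scalar toy problem. The code below assumes a cosine noise schedule and compares the resulting latents directly with an unweighted squared error; the paper's exact loss parameterization and weighting are not given in these notes, so treat this as an illustration only. The `oracle` and `zero_student` models are hypothetical.

```python
import math

def alpha_sigma(t):
    """Cosine schedule (an assumption): alpha^2 + sigma^2 = 1."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)

def ddim_step(model, z, t, s):
    """One deterministic DDIM step from time t to s < t with a v-prediction model."""
    a_t, s_t = alpha_sigma(t)
    a_s, s_s = alpha_sigma(s)
    v = model(z, t)
    x_hat = a_t * z - s_t * v    # predicted clean sample
    eps_hat = s_t * z + a_t * v  # predicted noise
    return a_s * x_hat + s_s * eps_hat

def vanilla_distill_loss(teacher, student, z_t, t, dt):
    """Squared error between the student's 1-step and the teacher's 2-step result."""
    z_mid = ddim_step(teacher, z_t, t, t - dt)
    z_teacher = ddim_step(teacher, z_mid, t - dt, t - 2 * dt)
    z_student = ddim_step(student, z_t, t, t - 2 * dt)
    return (z_student - z_teacher) ** 2

# Toy setup: a single scalar "image" X0 and noise EPS (hypothetical values).
X0, EPS = 0.7, -0.3

def oracle(z, t):
    """A model that knows the true v = alpha * eps - sigma * x."""
    a, s = alpha_sigma(t)
    return a * EPS - s * X0

def zero_student(z, t):
    """An untrained student that always predicts v = 0."""
    return 0.0

a0, s0 = alpha_sigma(0.8)
z_start = a0 * X0 + s0 * EPS
good = vanilla_distill_loss(oracle, oracle, z_start, t=0.8, dt=0.1)
bad = vanilla_distill_loss(oracle, zero_student, z_start, t=0.8, dt=0.1)
print(good, bad)
```

A student that reproduces the teacher's two-step result in one step drives this loss to zero, which is how each round of distillation halves the number of sampling steps.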

CFG-Aware Step Distillation

Classifier-free guidance is added to the efficient UNet during distillation, and the resulting loss is denoted L_{cfg_dstl}.
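As a minimal sketch of this idea, the code below applies the usual classifier-free guidance formula to both the teacher and the student predictions before comparing them. The guidance scale `w`, the plain mean squared error, and the toy prediction vectors are assumptions for illustration; the space in which the paper applies guidance and its loss weighting are not reproduced here.

```python
def cfg(cond, uncond, w):
    """Classifier-free guidance: move from the unconditional prediction toward the conditional one."""
    return [u + w * (c - u) for c, u in zip(cond, uncond)]

def cfg_distill_loss(teacher_cond, teacher_uncond, student_cond, student_uncond, w):
    """Mean squared error between the guided teacher and the guided student predictions."""
    guided_teacher = cfg(teacher_cond, teacher_uncond, w)
    guided_student = cfg(student_cond, student_uncond, w)
    diffs = [(a - b) ** 2 for a, b in zip(guided_teacher, guided_student)]
    return sum(diffs) / len(diffs)

# Hypothetical predictions for a two-element latent.
teacher_c, teacher_u = [1.0, 2.0], [0.0, 1.0]
student_c, student_u = [0.5, 2.0], [0.0, 1.0]

loss = cfg_distill_loss(teacher_c, teacher_u, student_c, student_u, w=3.0)
print(loss)
```

With `w = 1` the guided output equals the conditional prediction, and larger values push further away from the unconditional one, so the student is trained on the same guided outputs it will be sampled with.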

Total Loss Function

Because the vanilla loss helps FID and the CFG loss helps the CLIP score, the two losses are mixed stochastically during training.
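One simple way to mix the two losses stochastically is to pick one of them per training iteration. The sketch below assumes a hypothetical probability `cfg_prob` for choosing the CFG-aware loss; the value 0.3 is only a placeholder and is not taken from the paper.

```python
import random

def pick_loss(vanilla_loss, cfg_loss, cfg_prob, rng):
    """Return the CFG-aware loss with probability cfg_prob, otherwise the vanilla loss."""
    return cfg_loss if rng.random() < cfg_prob else vanilla_loss

# Count how often each loss is chosen over many iterations (labels stand in for loss values).
rng = random.Random(0)
choices = [pick_loss("vanilla", "cfg", cfg_prob=0.3, rng=rng) for _ in range(10000)]
cfg_fraction = choices.count("cfg") / len(choices)
print(cfg_fraction)
```

Raising `cfg_prob` shifts training toward the loss that favors the CLIP score, and lowering it shifts training toward the loss that favors FID, so this probability acts as a knob for the trade-off between the two metrics.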

Experiment


Ostin X
