PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation

arXiv (2024/03/07, version v1)

Abstract

Efficiently fine-tunes a pre-trained Diffusion Transformer to generate images at 4K resolution.

Judging from the reactions on Reddit, the output quality does not seem to be great, apparently because there is not enough high-quality data.

33M high-quality images, filtered with an aesthetic scoring model (AES).

A state-of-the-art captioning model is used to improve caption length and accuracy.

To reduce the cost of computing self-attention, the K and V matrices are compressed to reduce the number of tokens, as in PVTv2.

A 2×2 group convolution is used for the compression.
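The compression step can be sketched roughly as follows. This is a pure-Python illustration under stated assumptions, not the paper's implementation: the tokens are assumed to lie on an H×W grid, the group convolution is assumed to be depthwise (one group per channel) with stride 2, and every kernel weight is set to 1/4 so that the operation reduces to average pooling. The function name `compress_kv` and the weight layout are hypothetical.

```python
def compress_kv(tokens, height, width, weights=None):
    """Compress an H*W token sequence to (H/2)*(W/2) tokens.

    tokens  : list of H*W token vectors (row-major), each a list of C floats
    weights : per-channel 2x2 kernels, weights[c][dy][dx]; defaults to 1/4
              everywhere (assumption: equivalent to average pooling)
    """
    if height % 2 or width % 2:
        raise ValueError("height and width must be even")
    channels = len(tokens[0])
    if weights is None:
        weights = [[[0.25, 0.25], [0.25, 0.25]] for _ in range(channels)]

    compressed = []
    for y in range(0, height, 2):          # stride 2 over rows
        for x in range(0, width, 2):       # stride 2 over columns
            out = []
            for c in range(channels):      # depthwise: channels are not mixed
                acc = 0.0
                for dy in range(2):
                    for dx in range(2):
                        token = tokens[(y + dy) * width + (x + dx)]
                        acc += weights[c][dy][dx] * token[c]
                out.append(acc)
            compressed.append(out)
    return compressed


# 4x4 grid of 2-channel tokens: channel 0 holds the token index, channel 1 is constant.
H, W = 4, 4
kv_input = [[float(i), 1.0] for i in range(H * W)]
kv = compress_kv(kv_input, H, W)

# Queries keep all N tokens; only K and V shrink, so the attention
# score matrix has N x (N/4) entries instead of N x N.
n_queries, n_keys = len(kv_input), len(kv)
```

With a 2×2 window the number of K/V tokens drops by a factor of 4 while the queries are untouched, so the output sequence length stays the same and only the attention cost shrinks.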

Training a T2I model from scratch is inefficient.

(a) Replace PixArt-α's VAE with SDXL's VAE and fine-tune the diffusion model.

(b) Applying positional embedding interpolation to the low-resolution model and then fine-tuning at high resolution leads to fast convergence.
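One common way to realize positional embedding interpolation for sin-cos embeddings is to rescale the position coordinates so that the larger grid spans the same range as the base grid. The sketch below shows that idea in 1D (a 2D version would apply it per axis). Whether this matches the authors' exact implementation is an assumption, and `sincos_pos_embed_1d` and `base_size` are hypothetical names.

```python
import math


def sincos_pos_embed_1d(grid_size, dim, base_size):
    """1D sin-cos positional embeddings with interpolated positions.

    Positions 0..grid_size-1 are rescaled by base_size / grid_size, so a
    larger grid covers the same coordinate range as the base grid.
    """
    if dim % 2:
        raise ValueError("dim must be even")
    scale = base_size / grid_size
    half = dim // 2
    freqs = [1.0 / (10000 ** (i / half)) for i in range(half)]
    table = []
    for p in range(grid_size):
        pos = p * scale                      # interpolated coordinate
        table.append([math.sin(pos * f) for f in freqs]
                     + [math.cos(pos * f) for f in freqs])
    return table


# Low-resolution model: 4 positions. High-resolution model: 8 positions,
# squeezed into the same coordinate range [0, 4).
low = sincos_pos_embed_1d(grid_size=4, dim=8, base_size=4)
high = sincos_pos_embed_1d(grid_size=8, dim=8, base_size=4)
```

Every second embedding of the high-resolution table coincides with a low-resolution embedding, so the fine-tuned model starts from positional signals it has already seen rather than from unfamiliar coordinates.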

A pre-trained PixArt-α and SDXL's VAE were used.

GitHub - PixArt-alpha/PixArt-sigma-project: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation

