PIXART-δ: Fast and Controllable Image Generation with Latent Consistency Models

[arXiv](2024/01/10 version v1)

Abstract

LCD Algorithm

Identical to LCM, except that a fixed guidance scale w is used instead of a variable one.
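To make the difference concrete, here is a minimal sketch of how the guidance scale could enter the teacher's prediction during distillation. It assumes the LCM-style classifier-free guidance form (1 + w) * cond - w * uncond. The function names, the sampling range, and the value of `w_fix` are illustrative assumptions and are not taken from the paper's code.

```python
import random

def guided_prediction(cond, uncond, w):
    """Combine conditional and unconditional predictions.

    Assumed LCM-style form: (1 + w) * cond - w * uncond.
    """
    return [(1.0 + w) * c - w * u for c, u in zip(cond, uncond)]

def pick_guidance_scale(w_fix=None, w_range=(2.0, 14.0), rng=random):
    """Return the fixed scale if one is given (LCD-style),
    otherwise sample one uniformly from w_range (LCM-style).

    The default w_range is an illustrative assumption.
    """
    if w_fix is not None:
        return w_fix
    return rng.uniform(*w_range)

# Toy predictions standing in for the teacher's noise estimates.
cond, uncond = [1.0, 2.0], [0.0, 1.0]

w_fixed = pick_guidance_scale(w_fix=4.0)   # same w for every training step
w_sampled = pick_guidance_scale()          # a different w for each step

target_fixed = guided_prediction(cond, uncond, w_fixed)
target_sampled = guided_prediction(cond, uncond, w_sampled)
```

With a fixed w, the student does not need an extra embedding for the guidance scale, which corresponds to the w_fix setting as opposed to w_Embed.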

Effect of Hyper-parameters

bs = batch size, w_fix = fixed w, w_Embed = variable w

Training Efficiency and Inference Speedup

Compared with U-Net-based models, memory constraints are greatly reduced, so the model can be trained even on consumer-grade GPUs.

8-bit inference makes inference very fast and efficient.

Training Details

Detailed training settings. Omitted here.

Zero-convolution is replaced with zero-linear, and two architectures are proposed.
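As a rough illustration of what "zero-linear" means, the sketch below builds a linear layer whose weights and bias start at zero, so the control branch contributes nothing at the beginning of training. The class name `ZeroLinear` and the plain-Python implementation are assumptions made for illustration, not the paper's code.

```python
class ZeroLinear:
    """Linear layer y = W x + b with W and b initialized to zero (sketch)."""

    def __init__(self, in_dim, out_dim):
        # All-zero weights and bias: the output is zero until training updates them.
        self.weight = [[0.0] * in_dim for _ in range(out_dim)]
        self.bias = [0.0] * out_dim

    def __call__(self, x):
        return [sum(w * v for w, v in zip(row, x)) + b
                for row, b in zip(self.weight, self.bias)]

layer = ZeroLinear(in_dim=3, out_dim=2)
hidden = [0.5, -1.0, 2.0]   # toy token feature
control = layer(hidden)     # all zeros at initialization
```

Because the output starts at zero, adding it to a frozen model's activations leaves the pretrained behaviour unchanged at step 0.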

ControlNet-UNet

Blocks 1 to 14 are treated as the encoder and blocks 15 to 28 as the decoder. However, this approach departs from the transformer architecture, so it is less efficient.

ControlNet-Transformer

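The first N blocks are copied, and the output of the i-th copy block is added directly to the output of the i-th fixed (frozen) block. This structure follows the transformer's original data flow and greatly improves performance.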
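The data flow described above can be sketched with toy blocks. This is a simplified illustration under stated assumptions: blocks are plain functions on lists of floats, the zero-linear layer is reduced to a single scalar weight per block, the control signal is added to the input of the first copy block, and copy blocks are chained on their own outputs. None of these names or details come from the paper's code.

```python
def zero_linear(x, weight=0.0):
    """Stand-in for a zero-initialized linear layer (scalar weight for brevity)."""
    return [weight * v for v in x]

def add(a, b):
    return [u + v for u, v in zip(a, b)]

def forward(x, control, frozen_blocks, copy_blocks, weights):
    """Run the frozen blocks in order; for the first N blocks, add the
    zero-linear-projected output of the i-th copy block to the output
    of the i-th frozen block."""
    h = x
    c = add(x, control)  # assumption: the copy branch sees input + control
    for i, frozen in enumerate(frozen_blocks):
        h = frozen(h)
        if i < len(copy_blocks):
            c = copy_blocks[i](c)
            h = add(h, zero_linear(c, weights[i]))
    return h

# Toy setup: 4 frozen blocks, N = 2 trainable copies.
def block(h):
    return [v + 1.0 for v in h]

frozen_blocks = [block] * 4
copy_blocks = [block] * 2

x, control = [0.0, 1.0], [1.0, 1.0]
baseline = forward(x, control, frozen_blocks, [], [])
at_init = forward(x, control, frozen_blocks, copy_blocks, [0.0, 0.0])
trained = forward(x, control, frozen_blocks, copy_blocks, [1.0, 0.0])
```

In this sketch the frozen blocks keep their sequential order, and with zero weights `at_init` matches `baseline`, so the control branch only starts to influence the output once the zero-linear weights move away from zero.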
