Progressive Knowledge Distillation Of Stable Diffusion XL Using Layer Level Loss

[arXiv](2024/01/05 version v1)

Abstract

SDXL을 증류하여 30% 크기에 100% 속도 향상을 제공하는 [Segmind-Vega], 50% 크기에 60% 속도 향상을 제공하는 [SSD-1B] 모델 소개

일단 SDXL의 U-Net은 Stable Diffusion에서 Down/Up block을 하나씩 줄이고 블록 당 attention layer를 10개씩 사용한다. Stable Diffusion, SDXL architecture 차이

따라서 SD의 블록 자체를 제거한 BK-SDM과 다르게 attention의 수만 줄이는 방법을 사용했다.

(Up Blocks인데 그림에서 Down Blocks라고 표기되어 있다. 잘못 적은 거 아닌가? 근데 밑에 그림도 다 잘못 적혀있는데 뭐징;)

SSD-1B는 Mid block의 attention을 제거하고 Down 2, Up blocks에서 attention을 조금씩 제거.

Segmind-Vega에서는 모든 블록의 attention을 조금 더 제거했다.

Loss function은 BK-SDM과 동일하지만 한 가지 다른 점이 있다.

Attention layer의 수를 줄였기 때문에 FeatKD loss에서 block-level 비교를 사용한 BK-SDM과 달리 layer-level 출력을 비교한다.

처음에는 SDXL Base를 교사로 사용한 뒤 fine-tuned model인 ZavychromaXL로 교체하고 또다시 JuggernautXL로 교체하였다. 동일한 데이터셋을 사용하더라도 교사를 교체하며 훈련했을 때 품질이 크게 향상됐다고 한다.

SSD-1B, Segmind-Vega에서 제거된 attention layer는 무작위가 아닌 출력 결과에 대한 인간 평가에 의하여 결정되었다고 한다.

빠름

인간 선호도 평가 (의외로 이김)

질적 비교

FateZero: Fusing Attentions for Zero-shot Text-based Video Editing (0)	2024.01.18
Parrot: Pareto-optimal Multi-Reward Reinforcement Learning Framework for Text-to-Image Generation (0)	2024.01.17
Score Distillation Sampling with Learned Manifold Corrective (LMC-SDS) (0)	2024.01.16
BK-SDM: A Lightweight, Fast, and Cheap Version of Stable Diffusion (0)	2024.01.10
Instruct-Imagen: Image Generation with Multi-modal Instruction (0)	2024.01.10
Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions (0)	2024.01.09