InstantFamily: Masked Attention for Zero-shot Multi-ID Image Generation

Abstract

Solves the multi-ID generation problem with a multi-modal embedding stack and masked cross-attention.

[arXiv](2024/04/30 version v1)

As shown in figure (a) above, the face encoder's 2D local features and 1D global feature are concatenated with the text condition and used as the cross-attention input.
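To make the concatenation concrete, here is a minimal sketch of how the combined token sequence could be laid out. The token counts (77 text tokens, a 4x4 local feature map, one global token per face), the text-first ordering, and the names `TokenSpan` and `build_token_layout` are all assumptions for illustration, not values taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class TokenSpan:
    kind: str     # "text" or "face"
    face_id: int  # -1 for text tokens
    start: int    # inclusive index into the concatenated sequence
    stop: int     # exclusive index


def build_token_layout(num_text_tokens, num_faces, local_hw, num_global=1):
    """Describe a concatenated [text | face_0 | face_1 | ...] sequence.

    Each face contributes its flattened 2D local features (h * w tokens)
    plus its 1D global feature (num_global tokens). Ordering is assumed.
    """
    spans = [TokenSpan("text", -1, 0, num_text_tokens)]
    cursor = num_text_tokens
    tokens_per_face = local_hw[0] * local_hw[1] + num_global
    for face_id in range(num_faces):
        spans.append(TokenSpan("face", face_id, cursor, cursor + tokens_per_face))
        cursor += tokens_per_face
    return spans, cursor


# Hypothetical sizes: 77 text tokens, two faces, 4x4 local feature map.
spans, total_tokens = build_token_layout(77, 2, (4, 4))
```

Keeping the span of each face's tokens around is what later lets an attention mask decide which image regions may look at which identity.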

Masked cross-attention is carried out in 3 stages and is used in both the UNet and the ControlNet.

Not much needs explaining here. It simply masks the attention to match each face region.
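The region-wise masking can be sketched in a few lines of plain Python. This is a simplified illustration, not the paper's 3-stage procedure: it assumes text tokens stay visible to every pixel while each face token is visible only to pixels inside its own face region, and the names `build_allowed` and `masked_attention_weights` are hypothetical.

```python
import math


def build_allowed(pixel_face_ids, key_face_ids):
    """Boolean mask allowed[q][k].

    pixel_face_ids[q]: id of the face region covering query pixel q, -1 for background.
    key_face_ids[k]:   id of the face owning key token k, -1 for text tokens.
    Assumption: text tokens are visible everywhere; face tokens only in their region.
    """
    return [[kf == -1 or kf == pf for kf in key_face_ids] for pf in pixel_face_ids]


def masked_attention_weights(scores, allowed):
    """Row-wise softmax over the allowed keys; masked keys get weight 0."""
    weights = []
    for row, row_allowed in zip(scores, allowed):
        visible = [s for s, ok in zip(row, row_allowed) if ok]
        if not visible:  # no visible key: return an all-zero row
            weights.append([0.0] * len(row))
            continue
        peak = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(row, row_allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Three query pixels: inside face 0, inside face 1, background.
# Three key tokens: one text token, one token of face 0, one token of face 1.
allowed = build_allowed([0, 1, -1], [-1, 0, 1])
weights = masked_attention_weights([[0.0, 0.0, 0.0]] * 3, allowed)
```

With equal scores, a pixel inside face 0 splits its attention between the text token and face 0's token, and never attends to face 1, which is the effect that keeps identities from mixing.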

The face mask is expanded by 25% so that it captures the overall characteristics of the face and blends with the background and style.
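One way to read the 25% expansion is as growing a face bounding box around its center. The sketch below assumes exactly that: width and height are each scaled by 1.25 and the result is clamped to the image. The notes do not say whether the 25% applies per side, per dimension, or to area, so treat `expand_box` and its interpretation as hypothetical.

```python
def expand_box(box, image_size, ratio=0.25):
    """Grow an (x0, y0, x1, y1) box by `ratio` of its size, centered, clamped to the image."""
    x0, y0, x1, y1 = box
    width, height = image_size
    pad_x = (x1 - x0) * ratio / 2  # half of the extra width goes to each side
    pad_y = (y1 - y0) * ratio / 2
    return (
        max(0.0, x0 - pad_x),
        max(0.0, y0 - pad_y),
        min(float(width), x1 + pad_x),
        min(float(height), y1 + pad_y),
    )


# A 100x100 face box in a 512x512 image grows to 125x125.
expanded = expand_box((100, 100, 200, 200), (512, 512))
# A box touching the top-left corner is clamped at 0.
clamped = expand_box((0, 0, 80, 80), (512, 512))
```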

OpenPose is adopted because it can control pose without constraining the face shape.

To guide the diffusion model, colored circles indicating the face positions are added to the control image.
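As a rough illustration of the face-position markers, the sketch below paints one filled circle per face onto an RGB grid. A blank canvas stands in for the OpenPose control image, and the function name `draw_face_markers`, the radius, and the color palette are all assumptions rather than details from the paper.

```python
def draw_face_markers(height, width, centers, radius, palette):
    """Return an RGB grid (list of rows) with one filled circle per face center.

    centers: list of (cx, cy) pixel coordinates, one per identity.
    palette: list of RGB tuples; face i uses palette[i % len(palette)].
    """
    canvas = [[(0, 0, 0) for _ in range(width)] for _ in range(height)]
    for index, (cx, cy) in enumerate(centers):
        color = palette[index % len(palette)]
        # Only scan the bounding square of the circle, clipped to the canvas.
        for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
            for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
                if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                    canvas[y][x] = color
    return canvas


# Two faces on a tiny 8x8 canvas, marked red and green (hypothetical colors).
canvas = draw_face_markers(8, 8, [(2, 2), (6, 5)], 1, [(255, 0, 0), (0, 255, 0)])
```

Giving each identity its own color is one plausible way for the control image to tell the model which face belongs where.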
