All posts (565)

VideoPoet: A Large Language Model for Zero-Shot Video Generation
Of the video generation models I have seen so far, this one seems to have the smoothest motion. [Google Research Blog] [Project Page] [arXiv] (Current version v1)
Abstract: Proposes VideoPoet, which can process multimodal inputs and synthesize high-quality audio and video.
Introduction: This paper investigates applying LLMs to video generation. VideoPoet uses a decoder-only LLM architecture that outputs each modality as discrete tokens. Training consists of pretraining followed by task adaptation, and everything is integrated into a single LLM without relying on a separate diffusion model. As an LLM, VideoPoet is zero-shot..

DreamTuner: Single Image is Enough for Subject Driven Generation
DreamBooth + Subject Encoder + Self Subject Attention. [Project Page] [arXiv] (Current version v1)
Abstract: Proposes DreamTuner, which injects coarse and fine information to achieve subject-driven image generation effectively.
Method: Subject-Encoder, Self-Subject-Attention, Subject-Driven Fine-Tuning.
Subject-Encoder: A segmentation model separates the background from the reference image, and ResBlocks are added to the CLIP image encoder for projection. In the transformer blocks of the U-Net, Subject-Encoder At..

PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models
Adds an affinity score to the image to condition each frame. [Project Page] [Github] [arXiv] (Current version v1)
Abstract: Proposes PIA (Personalized Image Animator), which aligns the output with a given image and controls motion through text.
PIA: Your Personalized Image Animator
Plug-and-Play Modules for Animation: The condition image I is encoded into the latent space, E(I) = z_I. An affinity score s is introduced to quantify the degree of motion. During training, s is computed for each frame from its L1 distance to the first frame and normalized to [0, 1]. To align with z_I, s_i is expanded to 1×h×w and co..
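The affinity score described in the PIA excerpt can be sketched in a few lines. This is only an illustration, not the authors' code: the function names `affinity_scores` and `expand_score` are hypothetical, frames are flat lists of pixel values, and the excerpt does not say whether 1 means "identical to the first frame" or "most different". The sketch assumes the former, since "affinity" suggests similarity.

```python
def affinity_scores(frames):
    """Return one score in [0, 1] per frame (hypothetical helper).

    frames: list of frames, each a flat list of floats of equal length.
    """
    first = frames[0]
    # Mean absolute (L1) distance of every frame to the first frame.
    dists = [
        sum(abs(a - b) for a, b in zip(frame, first)) / len(first)
        for frame in frames
    ]
    lo, hi = min(dists), max(dists)
    if hi == lo:
        # All frames are equally far from the first one (e.g. a still clip).
        return [1.0] * len(frames)
    # Min-max normalize, then flip so that 1 means "same as first frame".
    # The flip is an assumption; the excerpt only states the [0, 1] range.
    return [1.0 - (d - lo) / (hi - lo) for d in dists]


def expand_score(score, h, w):
    """Broadcast a scalar score to a 1 x h x w map (nested lists)."""
    return [[[score] * w for _ in range(h)]]


frames = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
scores = affinity_scores(frames)
score_map = expand_score(scores[1], h=2, w=3)
```

The expanded map has the same spatial size as the latent, which is what allows it to be combined with z_I per frame; how the two are combined is cut off in the excerpt, so it is not shown here.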

Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis
Improves temporal consistency using only cross-frame attention and equivariant fine-tuning, without temporal attention. [Project Page] [arXiv] (Current version v1)
Abstract: Proposes Fairy, a video-editing diffusion model that can generate a 30 fps, 4 s, 512x384 video in 14 seconds.
Implicit Tracking via Cross-frame Attention: Cross-frame attention acts as a temporal correspondence tracker, especially on high-resolution features.
Fairy: Fast Video-to-Video Synthesis
Anchor-Based Model: The K and V of all anchor frames are added to a cache. Query ..

StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation
[Github] [arXiv] (Current version v1)
Abstract: Proposes StreamDiffusion, which can generate images at up to 91.07 fps on an RTX 4090 GPU.
StreamDiffusion Pipeline: Stream Batch, Residual Classifier-Free Guidance, Input-Output Queue, Stochastic Similarity Filter, Pre-Computation, Model Acceleration Tools with a Tiny-Autoencoder.
Batching the denoise step: As the figure shows, using a stream batch lets generation of a new image start without waiting for the previous image to finish. Inference..

InstructVideo: Instructing Video Diffusion Models with Human Feedback
[Project Page] [arXiv] (Current version v1)
Abstract: Proposes InstructVideo, which fine-tunes text-to-video diffusion models with human feedback.
InstructVideo
Reward Fine-tuning as Editing: The goal is not to change the output drastically but to adjust it subtly toward human preferences. For an input video-text pair (x, c), x is encoded into a latent z, a moderate amount of noise is added (SDEdit), and denoising is run for only a fraction (τ) of the D DDIM sampling steps to obtain z_0, which is then decoded into x_0^g.
Reward Fine-tuning with Image Reward M..

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
[Project Page] [Github] [arXiv] (Current version v1)
Abstract: Proposes IP-Adapter, which prompts with image features that are kept separate through decoupled cross-attention.
Method: Image Prompt Adapter
Image Encoder: A projection layer maps the output of the CLIP image encoder to a feature sequence of length N with the same dimension as the text embedding.
Decoupled Cross-Attention: Instead of merging the text embedding and the image embedding, a new cross-attention layer is added, and each cross-attention computed for the same query..
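A minimal sketch of the decoupled cross-attention idea in the IP-Adapter excerpt, in plain Python. The excerpt is cut off before it says how the two attention outputs are combined, so the sum used here is an assumption (labeled in the code), as is the `image_scale` weight. All function names are hypothetical, and the keys and values are passed in already projected, whereas a real model would produce them with learned projection matrices.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return out


def decoupled_cross_attention(queries, text_kv, image_kv, image_scale=1.0):
    """Run text and image cross-attention separately on the same queries.

    text_kv / image_kv: (keys, values) tuples, one per modality.
    ASSUMPTION: the two outputs are combined by a weighted sum; the
    excerpt above is truncated before it states the combination rule.
    """
    text_out = attention(queries, *text_kv)
    image_out = attention(queries, *image_kv)
    return [
        [t + image_scale * i for t, i in zip(t_row, i_row)]
        for t_row, i_row in zip(text_out, image_out)
    ]


queries = [[1.0, 0.0]]
text_kv = ([[1.0, 0.0]], [[2.0, 3.0]])
image_kv = ([[0.0, 1.0]], [[10.0, 20.0]])
result = decoupled_cross_attention(queries, text_kv, image_kv, image_scale=0.5)
```

With `image_scale=0.0` the function reduces to ordinary text cross-attention, which is one way to see that the image branch is an add-on to the existing text branch.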

SCEdit: Efficient and Controllable Image Diffusion Generation via Skip Connection Editing
[Project Page] [arXiv] (Current version v1)
Abstract: Proposes SCEdit, which edits skip connections through SC-Tuner.
Introduction: It can generate higher-quality images with less memory than methods such as LoRA, ControlNet, and T2I-Adapter. Removing the skip connections (SC) reduces variance and loses detail. It uses only 7.9% of ControlNet's parameters and cuts memory usage by 30%.
Method: Tuner modules
SC-Tuner (ϕ = GELU)
Controllable SC-Tuner: Multiple conditions can be input at the same time.
SCEdit framework: CSC-Tu..

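UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models
Designs a text encoder specialized for text synthesis. [Project Page] [Github] [arXiv] (Current version v1)
Abstract: Proposes UDiffText, which can synthesize text with high accuracy using a pretrained diffusion model and a lightweight text encoder.
Method: Character-level Text Encoder
Text encoders such as CLIP and T5 do not capture the structure of individual characters. The new text encoder works as follows: the word is mapped to indices, converted to embeddings through a codebook, and passed through a transformer to produce the final output. The multi-label classification head H_MLC is set up so that it predicts the text index Id from the text embedding, cross-e..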

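VidToMe: Video Token Merging for Zero-Shot Video Editing
[arXiv] (Current version v1)
Abstract: Proposes VidToMe, which merges self-attention tokens across frames to reduce computation cost while improving temporal consistency.
Introduction: Tokens in one frame are merged with the most similar tokens in another frame to align them. Because all frames cannot be processed at once, the video is split into chunks, and local merging within chunks plus global merging across chunks ensures short-term and long-term consistency.
Preliminaries: Latent Diffusion Model (LDM), Token Merging
Token Merging first splits the token sequence into src and dst, links tokens with high similarity, and then the r most similar pairs ..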
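The Token Merging step in the VidToMe excerpt can be sketched as below. The excerpt breaks off before saying what happens to the r selected pairs; this sketch assumes they are merged by plain averaging into the dst token, in the spirit of the original Token Merging method (which uses a size-weighted average). The alternating src/dst split, the cosine similarity, and the name `merge_tokens` are likewise assumptions made for illustration.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def merge_tokens(tokens, r):
    """Merge the r most similar (src, dst) pairs; needs at least 2 tokens.

    ASSUMPTIONS: alternating split into src/dst, cosine similarity,
    and merging by unweighted mean. None of these is stated in the excerpt.
    """
    src, dst = tokens[0::2], tokens[1::2]
    # Link every src token to its single most similar dst token.
    links = []
    for i, s in enumerate(src):
        best_j, best_sim = 0, cosine(s, dst[0])
        for j in range(1, len(dst)):
            sim = cosine(s, dst[j])
            if sim > best_sim:
                best_j, best_sim = j, sim
        links.append((best_sim, i, best_j))
    # Keep only the r most similar links.
    links.sort(key=lambda link: link[0], reverse=True)
    r = max(0, min(r, len(src)))
    groups = {j: [d] for j, d in enumerate(dst)}
    merged_src = set()
    for _, i, j in links[:r]:
        groups[j].append(src[i])
        merged_src.add(i)
    kept_src = [s for i, s in enumerate(src) if i not in merged_src]
    # Each dst token becomes the mean of itself and the src tokens merged in.
    merged_dst = [
        [sum(col) / len(groups[j]) for col in zip(*groups[j])]
        for j in range(len(dst))
    ]
    return kept_src + merged_dst


tokens = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.5, 0.5]]
reduced = merge_tokens(tokens, r=1)
```

Each call reduces the sequence length by r, which is where the compute saving comes from. VidToMe applies this kind of matching across frames rather than within one frame, which the sketch does not model.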

SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models
Controllable video generation from sparse signals through a Sparse Condition Encoder. [Project Page] [Github] [arXiv] (Current version v1)
Abstract: Text-to-video generation has advanced greatly in recent years. However, relying only on a text prompt makes control difficult, and dense signals add to the inference cost. To address this, the paper proposes SparseCtrl, which enables flexible control through sparse signals.
SparseCtrl: T2V Diffusion Model, Sparse Condition Encoder, Application
Text-to-Video Diffusion Models: In a typical recent T2V model, the spatial ..

FineControlNet: Fine-level Text Control for Image Generation with Spatially Aligned Text Control Injection
When there are multiple instances, attention is run separately for each one to improve control. The paper's explanation is somewhat ambiguous, so parts of this summary are my own inference. It is an early version and seems to need some revision. [Project Page] [arXiv] (Current version v1)
Abstract: FineControlNet > ControlNet
FineControlNet
Spatial Alignment of Text and 2D Pose: Controlling each instance through the text prompt is difficult. Given a list of 2D poses {p_i^2D}, pose occupancy maps (attention masks) {m_i} are extracted, and the latent embedding h at timestep t is defined as follows. h is copied into h_i, one per instance, and t..
