

Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model [Github] [arXiv](2024/01/17 version v1) Abstract: Proposes Vision Mamba (Vim), a new vision backbone built from bidirectional Mamba blocks. Method: Preliminaries, Vision Mamba, Vim Block. Preliminaries: required reading!!! Mamba. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. Adds selectivity to SSMs and optimizes them for hardware. [Github] [arXiv](2023/12/01 version v1) Abstract: To address the computational inefficiency that Transformer-based models show on long sequences, Mamba, a new neural..

InstantID: Zero-shot Identity-Preserving Generation in Seconds. ID conditioning via IP-Adapter and ControlNet. [Project Page] [Github] [arXiv](2024/01/15 version v1) Abstract: Proposes InstantID, a plug-and-play method that handles personalization well from a single face image. Methods: Preliminaries (Stable Diffusion, ControlNet, IP-Adapter). Methodology: Problems with IP-Adapter: the CLIP encoder fails to capture the fine details of the reference image, and cross-attention alone cannot finely control the token sequence. Improvements in this paper: features are extracted with a pre-trained, off-the-shelf face model. For fine-grained control of the generated image, C..

Scalable Pre-training of Large Autoregressive Image Models (AIM) [Github] [arXiv](2024/01/16 version v1) Abstract: Builds AIM (Autoregressive Image Model), a large-scale vision model that scales ViT up substantially with an autoregressive objective and performs strongly on downstream tasks. Pre-training Dataset: The DFN dataset holds 12.8B text-image pairs from Common Crawl filtered with a Data Filtering Network, and DFN-2B is its subset with the top 15% of alignment scores. Following common practice in LLM pre-training, images are sampled from DFN-2B with p = 0.8 and from ImageNet-1K with p = 0.2..
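The p = 0.8 / p = 0.2 mixture mentioned in the AIM excerpt can be pictured with a small sketch. This is only an illustration of choosing a source dataset per sample; the function name `sample_source`, the fixed seed, and the number of draws are assumptions made here, not details from the paper or its code.

```python
import random

def sample_source(rng, p_dfn=0.8):
    """Pick the dataset the next image is drawn from (hypothetical helper)."""
    # With probability p_dfn use DFN-2B, otherwise fall back to ImageNet-1K.
    return "DFN-2B" if rng.random() < p_dfn else "ImageNet-1K"

rng = random.Random(0)  # fixed seed so the illustration is repeatable
draws = [sample_source(rng) for _ in range(10_000)]
frac_dfn = draws.count("DFN-2B") / len(draws)
print(f"fraction drawn from DFN-2B: {frac_dfn:.3f}")
```

Over many draws the DFN-2B fraction should settle close to 0.8, with the remainder coming from ImageNet-1K.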

PALP: Prompt Aligned Personalization of Text-to-Image Models. The prompt personalization method proposed in the paper has to be personalized one prompt at a time, so it is not a very practical technique. [Project Page] [arXiv](2024/01/11 version v1) Abstract: Proposes Prompt-Aligned Personalization (PALP), which personalizes for a single prompt so that accurate images can be generated from complex prompts. Prompt Alignment Method: Diffusion model G: Overview: Personalization: the self- and cross-attention layers of model G are updated via LoRA. Prompt-Aligned Score Sampling: the sample x̂0 estimated by the model in a single step..

Delta Denoising Score (DDS) [Project Page] [Github] [arXiv](2023/04/14 version v1) Abstract: Introduces Delta Denoising Score (DDS), which adapts Score Distillation Sampling to edit images with minimal modification. Delta Denoising Score (DDS): Score Distillation Sampling. Editing with SDS: When the initial image z is initialized with a panda image and SDS is run to turn the panda into a squirrel, the image gradually blurs and loses detail, as shown in the figure below. Calling the text-guided direction δtext and the remaining direction δbias, our goal is to separate the two directions and update only along δtext..

Towards Conversational Diagnostic AI (AMIE). AI is said to have surpassed doctors, yikes. [arXiv](2024/01/11 version v1) Nature article: Google AI has better bedside manner than human doctors and makes better diagnoses. Abstract: Proposes AMIE (Articulate Medical Intelligence Explorer), an AI system optimized for medical diagnosis through a self-play simulated environment. AMIE: An LLM based AI System for Diagnostic Dialogue. Real-world Datasets for AMIE: MedQA, multiple-choice questions in the style of the US medical licensing exam; expert-written long-form Medical QA answers to QA questions from MultiMedBench; medical..

Object-Centric Diffusion for Efficient Video Editing. Reduces computation in background regions. [arXiv](2024/01/11 version v1) Abstract: Proposes Object-Centric Diffusion (OCD), which edits video quickly by allocating more computation to the important regions. Off-the-shelf acceleration: based on FateZero. Faster self-attention: ToMe, ToMe for Stable Diffusion. Pairing token locations from inversion: FateZero relies on attention maps obtained from inversion, so it is important that tokens form the same pairs in inversion and in sampling. Tokens are merged during inversion, and in sampling the same pairs..

FateZero: Fusing Attentions for Zero-shot Text-based Video Editing. Improves temporal consistency using attention maps obtained through inversion. [Project Page] [Github] [arXiv](2023/10/11 version v3) Abstract: Proposes FateZero, which performs zero-shot video editing through inversion. Methods: based on Tune-A-Video. Preliminary: Latent Diffusion and Inversion: LDM, DDIM Sampling:, DDIM Inversion:. FateZero Video Editing: Inversion Attention Fusion: Using the inversion noise directly leads to frame inconsistency, because errors accumulate over many denoising steps and the cfg weight is high..

Parrot: Pareto-optimal Multi-Reward Reinforcement Learning Framework for Text-to-Image Generation. Only samples that are Pareto-optimal across the multiple rewards are used for training. [arXiv](2024/01/11 version v1) Abstract: Introduces Parrot, a multi-reward Reinforcement Learning framework for Text-to-Image generation. It is named Parrot because it uses Pareto optimal selection to balance the rewards. Preliminary: objective function J of reward model r: ; notation for the pre-trained diffusion model pθ: . Method: Parrot Overview: Parrot consists of a Prompt Expansion Network (PEN) and a T2I model. Batch-wise Pareto-optimal Se..
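As a rough picture of what "using only Pareto-optimal samples" in the Parrot excerpt means, the sketch below filters a batch down to its non-dominated set. It uses the textbook definition of Pareto dominance and is not taken from the Parrot paper; the reward tuples are made-up numbers, and the function names are hypothetical.

```python
def dominates(a, b):
    """True if reward vector a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_optimal_indices(rewards):
    """Return indices of samples that no other sample in the batch dominates."""
    return [
        i for i, r in enumerate(rewards)
        if not any(dominates(other, r) for j, other in enumerate(rewards) if j != i)
    ]

# Made-up batch: each tuple holds two reward scores for one generated image.
batch_rewards = [(0.9, 0.2), (0.5, 0.5), (0.4, 0.4), (0.1, 0.8), (0.9, 0.1)]
selected = pareto_optimal_indices(batch_rewards)
print(selected)  # [0, 1, 3]
```

Here sample 2 is dominated by sample 1 and sample 4 by sample 0, so only samples 0, 1, and 3 would be kept for the update under this definition.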

PIXART-δ: Fast and Controllable Image Generation with Latent Consistency Models [Project Page] [Github] [arXiv](2024/01/10 version v1) Abstract: Introduces PixArt-δ, which integrates the Latent Consistency Model and ControlNet into PixArt-α. LCM in PixArt-δ: Algorithm and Modification, Training Efficiency and Inference Speedup, Training Details. Algorithm and Modification: LCD Algorithm: identical to LCM except that a fixed guidance scale w is used instead of a variable one. Effect of Hyper-parameters: bs = batch size, w_fix = fixed w, w_Embed = variable w. C..

Score Distillation Sampling with Learned Manifold Corrective (LMC-SDS) [arXiv](2024/01/10 version v1) Abstract: Analyzes Score Distillation Sampling (SDS) in depth and proposes Score Distillation Sampling with Learned Manifold Corrective (LMC-SDS), which provides cleaner gradients. Analysis: Diffusion loss: Score Distillation Sampling: Classifier-free Guidance: SDS can be rewritten as follows. A high w produces oversaturated images and artifacts, while a low w produces overly blurry images. To analyze this, for the case where the SDS rendering function g() is the identity, i.e. z = θ, each scor..

MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts [arXiv](2024/01/08 version v1) [Mamba paper review] Model Architecture: Uses the switch-based MoE from Switch Transformer. Originally two Mamba blocks correspond to one Transformer block, but as the figure above shows, adding MoE gives a one-to-one correspondence with a Transformer block. Main Results
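For readers unfamiliar with the switch-based MoE mentioned in the MoE-Mamba excerpt, here is a minimal sketch of top-1 routing for a single token: the router picks the highest-probability expert and scales that expert's output by the gate value. The toy experts, the logits, and all function names are assumptions for illustration; load balancing and expert capacity are omitted, and nothing here is taken from the MoE-Mamba code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def switch_route(router_logits):
    """Top-1 routing: return (expert index, gate value) for one token."""
    probs = softmax(router_logits)
    expert = max(range(len(probs)), key=probs.__getitem__)
    return expert, probs[expert]

def switch_layer(token, router_logits, experts):
    """Send the token to its single chosen expert and scale by the gate."""
    expert, gate = switch_route(router_logits)
    return [gate * v for v in experts[expert](token)]

# Two toy "experts" standing in for feed-forward networks (hypothetical).
experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [v + 1.0 for v in x],
]
expert, gate = switch_route([0.0, 2.0])
out = switch_layer([1.0, 2.0], [0.0, 2.0], experts)
print(expert, round(gate, 4), [round(v, 4) for v in out])
```

Because each token activates only one expert, the parameter count grows with the number of experts while the per-token compute stays roughly that of a single feed-forward layer.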

