
All categories

(582)

Mistral 7B An efficient model based on LLaMA2 that outperforms LLaMA2. [Project Page] [Github] [arXiv](2023/10/10 version v1) Model Architecture Sliding Window Attention Each token attends to the previous W tokens, where W is the window size. Because those tokens in turn attended to their own previous tokens in the layer below, a token at the last layer is influenced by up to about 130,000 (4096x32) tokens. FlashAttention and xFormers are additionally adopted, giving a 2x speedup over vanilla attention. Rolling Buffer Cache A fixed cache size is used. The figure below shows cache size =..
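The sliding-window idea in the excerpt above can be sketched in a few lines of plain Python. This is a minimal illustration, not Mistral's implementation: the function names are hypothetical, and it assumes the window of W tokens includes the current token.

```python
def sliding_window_mask(seq_len, window):
    """Return mask[i][j] == True when query i may attend to key j.

    Assumption for this sketch: the window counts the current token,
    so position i sees positions i-window+1 .. i (causal, no lookahead).
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def receptive_field_upper_bound(window, num_layers):
    """Upper bound on how far back information can propagate.

    Each layer lets a token reach at most `window` positions further
    back, so stacking layers multiplies the reach.
    """
    return window * num_layers


mask = sliding_window_mask(seq_len=5, window=3)
# The last token (index 4) sees only positions 2, 3 and 4.
last_row = mask[4]

# Figures quoted in the excerpt: W = 4096 and 32 layers.
reach = receptive_field_upper_bound(4096, 32)
```

With the values quoted in the excerpt, `reach` evaluates to 4096 x 32 = 131072, which is the "about 130,000" figure.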

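Denoising Vision Transformers (DVT) A network that removes the noise artifacts caused by positional embeddings [Project Page] [Github] [arXiv](2024/01/05 version v1) Abstract Proposes Denoising Vision Transformers (DVT), which can separate and remove the noise artifacts that appear in ViT outputs Introduction The figure below shows that applying a clustering algorithm to raw ViT outputs produces noisy clusters. The authors hypothesize that positional embeddings contribute to this phenomenon for three reasons. Feeding a zero tensor as input still produces similar noise patterns. Models trained without positional embeddings do not show the noise patterns. Input..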

Progressive Knowledge Distillation Of Stable Diffusion XL Using Layer Level Loss [arXiv](2024/01/05 version v1) Abstract Introduces two models distilled from SDXL: [Segmind-Vega], which offers a 100% speedup at 30% of the size, and [SSD-1B], which offers a 60% speedup at 50% of the size Methodology Architecture First, compared with Stable Diffusion, the SDXL U-Net has one fewer Down block and one fewer Up block and uses 10 attention layers per block. Architectural differences between Stable Diffusion and SDXL Therefore, unlike BK-SDM, which removes whole SD blocks, this work only reduces the number of attention layers. (These are Up Blocks, but the figure labels them Down Blocks. Isn't that a mistake?..

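BK-SDM: A Lightweight, Fast, and Cheap Version of Stable Diffusion [Github] [arXiv](2023/11/16 version v3) Abstract Proposes BK-SDM, which shrinks Stable Diffusion by removing residual and attention blocks and applying distillation Compact U-Net architecture Sensitivity analysis: measure the change in generation score after pruning each block. A higher value means the block can be removed. Fewer blocks in the down and up stages Down stages The first RA pair matters more because it processes the changed spatial information, which agrees with the sensitivity analysis. The second RA pair is therefore removed. Up stages Connected by a residual connection to the second RA pair of the down stage, the RA p..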

Instruct-Imagen: Image Generation with Multi-modal Instruction Multi-modal instruction tuning so that a generative model can be controlled with natural language. [arXiv](2024/01/03 version v1) Abstract Proposes Instruct-Imagen, which uses natural language to unify different modalities so that diverse generation intents are expressed in one standardized format Multi-modal Instructions for Generation A multi-modal instruction has two key components: a text instruction that describes the task together with markers ([ref#1]), and a multi-modal context paired with those markers. Instruct-Imagen Imagen with Multi-modal Instruction Traini..

TinyLlama: An Open-Source Small Language Model [Github] [arXiv](2024/01/04 version v1) Abstract Introduces TinyLlama, a 1.1B language model built on the LLaMA2 architecture and tokenizer and pretrained on 1T tokens for 3 epochs Introduction To explore how a small model behaves when trained on far more tokens than the Chinchilla scaling law suggests, a 1.1B decoder-only transformer is trained on 3T tokens. Pretraining Pre-training data SlimPajama: derived from RedPajama, a 1.2T-token dataset that is mostly English, by filtering out low-quality data and removing duplicates ..
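The token budget in the TinyLlama excerpt can be checked with simple arithmetic. The sketch below uses hypothetical helper names, and the figure of roughly 20 training tokens per parameter is the commonly cited Chinchilla rule of thumb, assumed here for comparison rather than taken from the excerpt.

```python
def total_tokens_seen(unique_tokens, epochs):
    """Tokens processed during training when the dataset is repeated."""
    return unique_tokens * epochs


def tokens_per_parameter(tokens, parameters):
    """Training tokens divided by model parameters."""
    return tokens / parameters


# Figures from the excerpt: about 1T tokens, 3 epochs, 1.1B parameters.
seen = total_tokens_seen(10**12, 3)
ratio = tokens_per_parameter(seen, 1.1e9)

# ASSUMPTION: ~20 tokens per parameter as a rough Chinchilla-style budget.
CHINCHILLA_TOKENS_PER_PARAM = 20
times_over_budget = ratio / CHINCHILLA_TOKENS_PER_PARAM
```

Under that assumption, the 3T-token run sees well over a hundred times more tokens per parameter than the rule of thumb would allocate, which is the regime the post says the authors wanted to explore.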

Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions Video generation by adding temporal attention to IP-Adapter [Project Page] [arXiv](2024/01/03 version v1) Robotic eagle, 8k unreal engine render, wires and gears A disoriented astronaut, lost in a galaxy of swirling colors, floating in zero gravity Abstract Introduces Moonshot, a video generation model that handles image and text conditions through the Multimodal Video Block (MVB) Architecture and Adaptations Multimodal Video Block Goal of the MVB: high-quality video frames, consistently..

DocLLM: A layout-aware generative language model for multimodal document understanding [arXiv](2023/12/31 version v1) Abstract Introduces DocLLM, a lightweight extension of an LLM that reasons over visual documents while taking layout into account DocLLM Framework Model Architecture Baseline: LLaMA2 The text tokens and spatial information obtained via OCR are treated as separate modalities and use separate vectors. Disentangled Spatial Attention Standard self-attention over the hidden vectors H of the text tokens t: In DocLLM, for the input {(x, b)}, the bboxes are embedded into hidden vectors S, and text-to-text, text-to-spatial, spatial-to-text, spati..
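One way to write the attention described in the DocLLM excerpt is sketched below. The projection matrices $W^{\cdot,\cdot}$ and the scalar weights $\lambda$ are notation assumed for this sketch; the excerpt itself only names the text hidden vectors H, the spatial hidden vectors S, and the four interaction types.

```latex
% Text and spatial projections (matrix names are assumed notation)
Q^{t} = H W^{t,q}, \quad K^{t} = H W^{t,k}, \quad
Q^{s} = S W^{s,q}, \quad K^{s} = S W^{s,k}

% Standard self-attention score between text tokens i and j
A^{t}_{i,j} = Q^{t}_{i} \, {K^{t}_{j}}^{\top}

% Disentangled score: text-to-text, text-to-spatial,
% spatial-to-text and spatial-to-spatial terms
A_{i,j} = Q^{t}_{i} {K^{t}_{j}}^{\top}
        + \lambda_{t,s} \, Q^{t}_{i} {K^{s}_{j}}^{\top}
        + \lambda_{s,t} \, Q^{s}_{i} {K^{t}_{j}}^{\top}
        + \lambda_{s,s} \, Q^{s}_{i} {K^{s}_{j}}^{\top}
```

Setting all three $\lambda$ weights to zero recovers the standard text-only score, which is why the extension can be described as lightweight.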

LLaMA Beyond English: An Empirical Study on Language Capability Transfer A study of transfer learning to other languages [arXiv](2024/01/02 version v1) Abstract Focusing on how to effectively transfer language generation and instruction-following capabilities to non-English languages, the authors analyze how factors such as vocabulary extension, further pretraining, and instruction tuning affect transfer, accumulating over 1440 GPU hours. Background and Overview Introduces the essential steps for developing an instruction-following LLM. Step 1: Pretraining to acquire language capability and knowledge Given a large corpus D, the following loss, conditioned on the prefix sequence, is minimi..

TrailBlazer: Trajectory Control for Diffusion-Based Video Generation [Project Page] [Github] [arXiv](2023/12/31 version v1) Abstract Proposes TrailBlazer, which can guide a subject through a video using simple bounding boxes Method ZeroScope cerspense, a fine-tuned version of VideoFusion, which is known(?) for generating high-quality video without flicker, is used as-is with no additional training. VideoFusion separately predicts a base noise shared by all frames and a residual noise. Pipeline Spatial Cross Attention Guidance Temporal Cross-Frame Attention Guidance Scene compositin..

Directed Diffusion: Direct Control of Object Placement through Attention Guidance [Project Page] [Github] [arXiv](2023/09/26 version v3) Abstract Proposes Directed Diffusion, which controls object position by creating activations in the cross-attention maps Method The two rows below show the cross-attention maps at the first and last denoising steps, respectively. Positions are established early in the process, and the cross-attention has a clear spatial interpretation. Pipeline Based on LDM (Stable Diffusion). The region information R = {B, I} consists of a bbox B and the prompt index I for that bbox, e.g. I = {2} = "cat" Cross-Attention Map..

VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM Generates consistent multi-scene video using an LLM, an image model, and a video model [Project Page] [arXiv](2024/01/02 version v1) Abstract Proposes VideoDrafter, which uses an LLM to generate multi-scene videos with consistent content VideoDrafter 1. Multi-Scene Video Script Generation 2. Entity Reference Image Generation 3. Video Scene Generation VideoDrafter-Img VideoDrafter-Vid Multi-Scene Video Script Generation For the LLM, considering deployment flexibility and inference efficiency, Ch..

