LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

LLaMA-Adapter improvements + multi-modal

Github

arXiv

Abstract

Strengthens LLaMA-Adapter
Early fusion strategy
Joint training paradigm

Introduction

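LLaMA-Adapter: can be fine-tuned with very few parameters, but does not support multi-modal training.

MiniGPT: supports multi-modal input, but is heavy and needs a large amount of high-quality data.

Starting from LLaMA-Adapter, the model can be improved by optimizing the visual projection layer.

However, the visual features were observed to dominate the prompt.

The paper therefore proposes early fusion of visual knowledge, which resolves the interference between image-text alignment and language instruction tuning.

The visual prompt is fed only into the early K layers, not the late L layers.

With this approach, joint training with disjoint parameters is reported to work even without multi-modal instruction data.

LLaMA-Adapter is also strengthened with bias tuning, which unlocks additional learnable parameters.

Expert models such as captioning, detection, and OCR systems can be brought in to differentiate the model further.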

LLaMA-Adapter

LLaMA-Adapter paper review

LLaMA-Adapter V2

Bias Tuning of Linear Layers

LLaMA-Adapter leaves LLaMA's weights completely untouched, which limits how far it can be fine-tuned.

V2 unfreezes all of LLaMA's normalization layers and adds a bias and a scale factor to each linear layer.
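
The following is a minimal plain-Python sketch of the bias-and-scale idea, not the repository's implementation. It assumes the layer computes y = s * (W x + b), with b initialized to zero and s to one so that the output matches the frozen layer before any training. The class name `BiasScaleLinear` and its methods are hypothetical.

```python
class BiasScaleLinear:
    """Frozen linear layer with a trainable bias and scale (illustrative only)."""

    def __init__(self, weight):
        self.weight = weight                 # frozen weights, one row per output
        self.bias = [0.0] * len(weight)      # trainable, zero-initialized (assumption)
        self.scale = [1.0] * len(weight)     # trainable, one-initialized (assumption)

    def frozen_forward(self, x):
        # Output of the original layer: y = W x
        return [sum(w * v for w, v in zip(row, x)) for row in self.weight]

    def forward(self, x):
        # Output with the added parameters: y = s * (W x + b)
        base = self.frozen_forward(x)
        return [s * (y + b) for s, y, b in zip(self.scale, base, self.bias)]

    def trainable_parameter_count(self):
        # Only bias and scale are counted; the weight matrix stays frozen.
        return len(self.bias) + len(self.scale)


layer = BiasScaleLinear([[1.0, 2.0, 0.0], [0.5, -1.0, 2.0]])
x = [3.0, 4.0, 1.0]
initial_output = layer.forward(x)
```

With this initialization `initial_output` equals `layer.frozen_forward(x)`, and the layer adds only two trainable values per output unit.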

Joint Training with Disjoint Parameters

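Because of the size gap between the 500K image-text caption samples and the 50K instruction samples, naive training on the combined data can severely damage the instruction-following ability.

Therefore, the visual projection and early zero-initialized attention in the early K layers are trained on the image-text caption data, while the late L layers and the unfrozen LLaMA parameters are trained on the instruction data.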
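
A rough sketch of how such a split might look in a training loop: each batch type updates only its own parameter group, and the two groups never overlap. The parameter names, the group contents, and the plain SGD update are all assumptions for illustration and are not taken from the paper's code.

```python
# Hypothetical parameter names, grouped by the data that is allowed to update them.
CAPTION_GROUP = {"visual_projection", "early_zero_init_gate"}
INSTRUCTION_GROUP = {"late_adaptation_prompt", "late_zero_init_gate",
                     "norm", "bias", "scale"}


def trainable_names(batch_kind):
    """Return the parameter group that a batch of the given kind may update."""
    if batch_kind == "caption":
        return CAPTION_GROUP
    if batch_kind == "instruction":
        return INSTRUCTION_GROUP
    raise ValueError(f"unknown batch kind: {batch_kind}")


def sgd_step(params, grads, batch_kind, lr=0.5):
    """Apply one SGD update, touching only the group tied to this batch kind."""
    active = trainable_names(batch_kind)
    return {
        name: (value - lr * grads[name]) if name in active else value
        for name, value in params.items()
    }


all_names = CAPTION_GROUP | INSTRUCTION_GROUP
params = {name: 1.0 for name in all_names}
grads = {name: 1.0 for name in all_names}

after_caption = sgd_step(params, grads, "caption")
after_instruction = sgd_step(params, grads, "instruction")
```

In `after_caption` only the two caption-side parameters change, so the larger caption set cannot overwrite what the instruction data teaches the other group.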

Early Fusion of Visual Knowledge

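The visual prompt is added directly to the word tokens at the first transformer layer with zero-initialized attention, while the adaptation prompts are inserted in the late L layers, so the two are not fused in the same layers.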
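
The gated addition can be illustrated with a small sketch. It simplifies zero-initialized attention to a single scalar gate and treats the visual prompt as one vector added to every word token; both are assumptions made for brevity, and `fuse_visual_prompt` is a hypothetical helper.

```python
def fuse_visual_prompt(word_tokens, visual_prompt, gate):
    """Add a gated visual prompt vector to every word token embedding."""
    return [
        [t + gate * v for t, v in zip(token, visual_prompt)]
        for token in word_tokens
    ]


word_tokens = [[1.0, 2.0], [3.0, 4.0]]   # two toy token embeddings
visual_prompt = [2.0, -2.0]              # toy projected image feature

# With a zero gate the word tokens pass through unchanged.
at_init = fuse_visual_prompt(word_tokens, visual_prompt, gate=0.0)

# As the gate grows during training, visual information is mixed in.
later = fuse_visual_prompt(word_tokens, visual_prompt, gate=0.5)
```

Starting the gate at zero means the pre-trained language behavior is preserved at the beginning of training, and the visual signal is introduced gradually.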

Integration with Experts

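Because the adapter itself prioritizes efficiency, it is more likely to give inaccurate responses than a model fine-tuned for a long time on large-scale data.

Instead of using a stronger module or collecting more data, this gap is covered by integrating expert systems.

(The figure above shows a pre-trained captioning model supplying additional information.)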
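
One way to picture this is that the expert's output is passed to the model as extra text context. The sketch below assumes a caption string has already been produced by some captioning model; the prompt template and the `build_prompt` helper are hypothetical and do not reproduce the project's actual prompt format.

```python
def build_prompt(instruction, expert_caption=None):
    """Assemble a text prompt, optionally prefixed with an expert's caption."""
    parts = []
    if expert_caption:
        # Extra context from an external captioning expert (assumed to be given).
        parts.append(f"Context (from captioning expert): {expert_caption}")
    parts.append(f"Instruction: {instruction}")
    parts.append("Response:")
    return "\n".join(parts)


plain = build_prompt("Describe the image.")
with_expert = build_prompt("Describe the image.",
                           expert_caption="a dog running on a beach")
```

Detection or OCR results could be added in the same way, as further context lines before the instruction.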

Experiments

Hugging Face space

LLaMA Adapter - a Hugging Face Space by csuhan

huggingface.co

Youtube

Github (short examples and a comparison with V1 are at the bottom)

GitHub - OpenGVLab/LLaMA-Adapter: Fine-tuning LLaMA to follow Instructions within 1 Hour and 1.2M Parameters

github.com


Ostin X
