Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

Abstract

Proposes Infini-attention, which combines a compressive memory with linear attention so that an infinitely long context can be processed with a bounded amount of memory.

[arXiv](2024/04/10 version v1)

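Infini-attention does not discard old KV states. It stores them in a compressive memory and, when processing later tokens, retrieves values from that memory and aggregates them. As explained below, though, no literal store-and-lookup takes place.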

Scaled Dot-product Attention
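
For reference, the local attention computed inside a single segment is ordinary causal scaled dot-product attention. The NumPy sketch below is illustrative only: the function names `softmax` and `local_attention` are hypothetical, and it handles a single head with no batching.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def local_attention(Q, K, V, causal=True):
    """Scaled dot-product attention within one segment.

    Q, K: (n, d_k); V: (n, d_v). Returns an (n, d_v) array.
    """
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    if causal:
        # Position i may only attend to positions j <= i.
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ V
```

With the causal mask, the first token can only attend to itself, so the first output row equals the first row of `V`.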

Compressive Memory

1. Memory retrieval

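The word "retrieval" is used throughout, but as "compressive" implies, the memory is not an actual list of KV pairs that is stored and searched.

Both the way information is stored and the attention computation borrow heavily from the Linear Transformer.

In practice, the compressive memory holds a single state that summarizes all information seen so far, and the current queries attend to that state.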
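
The "single state" idea can be sketched as a linear-attention read. This is a minimal illustration, not the authors' implementation: it assumes the memory is a matrix `M` of shape (d_k, d_v) plus a normalization vector `z` of shape (d_k,), and that the feature map is ELU + 1 as in the Linear Transformer. The names `elu_plus_one` and `retrieve` and the `eps` term guarding against division by zero are assumptions made here.

```python
import numpy as np

def elu_plus_one(x):
    # sigma(x) = ELU(x) + 1, which is strictly positive.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def retrieve(Q, M, z, eps=1e-8):
    """Read from the compressive memory.

    Q: (n, d_k) queries, M: (d_k, d_v) memory, z: (d_k,) normalizer.
    Returns A_mem with shape (n, d_v).
    eps is an assumption added here for the empty-memory case.
    """
    sQ = elu_plus_one(Q)
    numerator = sQ @ M            # (n, d_v)
    denominator = sQ @ z          # (n,)
    return numerator / (denominator[:, None] + eps)
```

The size of `M` and `z` depends only on the head dimensions, not on how many tokens have been absorbed, which is what keeps the memory footprint bounded.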

2. Memory update

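Standard update:

Update using the delta rule:

Subtracting the information that the existing memory already associates with the current K minimizes conflicts between old and new information. If the new information duplicates what is already stored, the existing memory is left unchanged.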
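
The two update rules can be compared with a small NumPy sketch. It is written under the same assumptions as the rest of this post's notation (memory matrix `M`, normalizer `z`, ELU + 1 feature map); the function names and the `eps` term are hypothetical and not taken from the paper's code.

```python
import numpy as np

def elu_plus_one(x):
    # sigma(x) = ELU(x) + 1, which is strictly positive.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def update_linear(M, z, K, V):
    """Standard update: add the new key-value associations to the memory."""
    sK = elu_plus_one(K)                      # (n, d_k)
    return M + sK.T @ V, z + sK.sum(axis=0)

def update_delta(M, z, K, V, eps=1e-8):
    """Delta-rule update: store only what the memory does not already return.

    eps is an assumption added here to avoid dividing by zero
    while the memory is still empty.
    """
    sK = elu_plus_one(K)
    already_stored = (sK @ M) / ((sK @ z)[:, None] + eps)   # (n, d_v)
    return M + sK.T @ (V - already_stored), z + sK.sum(axis=0)

# Write one key-value pair, then present the same pair again.
rng = np.random.default_rng(0)
k = rng.normal(size=(1, 4))
v = rng.normal(size=(1, 3))
M1, z1 = update_linear(np.zeros((4, 3)), np.zeros(4), k, v)
M_delta, _ = update_delta(M1, z1, k, v)
M_plain, _ = update_linear(M1, z1, k, v)
```

In this toy case `M_delta` stays equal to `M1` (up to the `eps` term) because the memory already returns `v` for `k`, while `M_plain` doubles the stored association.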

3. Long-term context injection

Multi-head aggregation:

β and W_O are learnable.
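
A rough sketch of how the gate and the output projection fit together is given below. It assumes one scalar β per head that is passed through a sigmoid to mix the memory read `A_mem` with the local attention output `A_dot`, followed by concatenation of the heads and projection with `W_O`. The function names are hypothetical and the values are random placeholders, not trained parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def inject(A_mem, A_dot, beta):
    """Blend long-term (memory) and local (dot-product) attention outputs."""
    gate = sigmoid(beta)
    return gate * A_mem + (1.0 - gate) * A_dot

def combine_heads(head_outputs, W_O):
    """Concatenate per-head outputs along the feature axis and project."""
    return np.concatenate(head_outputs, axis=-1) @ W_O

# Toy example: 2 heads, 5 tokens, d_v = 3 per head, model width 6.
rng = np.random.default_rng(0)
betas = [-1.0, 1.0]                       # one gate scalar per head
heads = [
    inject(rng.normal(size=(5, 3)), rng.normal(size=(5, 3)), b)
    for b in betas
]
W_O = rng.normal(size=(2 * 3, 6))
O = combine_heads(heads, W_O)             # shape (5, 6)
```

A strongly negative β makes a head rely almost entirely on local attention, and a strongly positive β makes it rely on the compressive memory, so each head can settle on its own mix.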

Comparison with the settings of other transformer models that have segment-level memory.

Long-context Language Modeling

LLM Continual Pre-training
