You Only Cache Once: Decoder-Decoder Architectures for Language Models (YOCO)

Abstract

YOCO splits the decoder into a self-decoder, which produces the KV cache, and a cross-decoder, which reuses that cache, improving efficiency and extending the usable context length.

[arXiv](2024/05/09 version v2)

Of the L blocks, the first L/2 form the self-decoder and the remaining L/2 form the cross-decoder.

The self-decoder uses efficient self-attention (ESA).

ESA is not a new method in itself: any memory-efficient attention, such as sliding window attention, will do.
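As one concrete instance of ESA, a causal sliding-window mask can be sketched as follows. This is only an illustration: the function name `sliding_window_mask` and the `window` parameter are hypothetical, and the choice of window size is an assumption, not something taken from the paper.

```python
def sliding_window_mask(seq_len, window):
    """Return a seq_len x seq_len boolean mask.

    mask[i][j] is True when query position i may attend to key position j,
    i.e. j is not in the future (j <= i) and lies within the last
    `window` positions (j > i - window).
    """
    return [
        [(i - window < j <= i) for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    # Print the mask for 5 tokens with a window of 2 ("x" = visible).
    for row in sliding_window_mask(5, 2):
        print("".join("x" if allowed else "." for allowed in row))
```

Each query sees at most `window` keys, so the per-layer cache needed during decoding stays constant in the sequence length.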

The K, V cache is generated once from the output of the self-decoder,

and every cross-decoder layer reads K and V from that cache when it computes attention.
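The flow above can be sketched in plain Python. This is a toy, not the paper's implementation: it is single-head, has no feed-forward blocks or normalization, and the helper `scale` is a hypothetical stand-in for learned projection matrices. The point is only that K and V are built once after the self-decoder and then shared by every cross-decoder layer.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d)]


def scale(x, s):
    # Hypothetical stand-in for a learned linear projection (assumption).
    return [s * a for a in x]


def self_decoder_layer(h, window, w):
    """Causal sliding-window self-attention; K is computed inside the layer."""
    out = []
    for i, x in enumerate(h):
        ctx = h[max(0, i - window + 1): i + 1]
        keys = [scale(c, w) for c in ctx]
        a = attend(x, keys, ctx)
        out.append([p + q for p, q in zip(x, a)])  # residual connection
    return out


def cross_decoder_layer(h, k_cache, v_cache):
    """Causal cross-attention: queries come from h, K and V from the cache."""
    out = []
    for i, x in enumerate(h):
        a = attend(x, k_cache[: i + 1], v_cache[: i + 1])
        out.append([p + q for p, q in zip(x, a)])  # residual connection
    return out


def yoco_forward(x, num_layers=4, window=2):
    half = num_layers // 2
    h = x
    for layer in range(half):  # first half: self-decoder
        h = self_decoder_layer(h, window, w=1.0 + 0.1 * layer)
    # K and V are produced once from the self-decoder output ...
    k_cache = [scale(t, 0.5) for t in h]
    v_cache = [scale(t, 2.0) for t in h]
    for _ in range(num_layers - half):  # ... and reused by every cross-decoder layer
        h = cross_decoder_layer(h, k_cache, v_cache)
    return h, k_cache, v_cache


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    hidden, k_cache, v_cache = yoco_forward(tokens)
    print(len(hidden), len(k_cache), len(v_cache))
```

Note that the cross-decoder layers keep no K, V of their own: the cache holds one entry per token regardless of how many cross-decoder layers there are.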

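Because half of the layers (the cross-decoder) do not compute K and V of their own, inference is considerably faster, and the global KV cache is stored only once.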
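To get a feel for the saving, the number of cached K/V elements can be counted in a back-of-the-envelope way. This is a rough sketch under stated assumptions: the self-decoder is assumed to use sliding window attention with a fixed window, multi-head details are ignored, and the model sizes in the example are hypothetical, not figures from the paper.

```python
def kv_elements_standard(num_layers, seq_len, d_model):
    """Standard decoder: every layer caches K and V for every token."""
    return 2 * num_layers * seq_len * d_model


def kv_elements_yoco(num_layers, seq_len, d_model, window):
    """YOCO-style count (assumes sliding-window self-decoder layers).

    The L/2 self-decoder layers each keep only the last `window` tokens,
    and one global K/V cache is shared by all cross-decoder layers.
    """
    local = 2 * (num_layers // 2) * min(window, seq_len) * d_model
    shared = 2 * seq_len * d_model
    return local + shared


if __name__ == "__main__":
    # Hypothetical configuration, chosen only for illustration.
    L, N, D, W = 32, 65536, 4096, 1024
    standard = kv_elements_standard(L, N, D)
    yoco = kv_elements_yoco(L, N, D, W)
    print(standard, yoco, standard / yoco)
```

With these hypothetical numbers the standard decoder caches 17,179,869,184 elements and the YOCO-style layout 671,088,640, roughly 25.6 times fewer, and the gap widens as the sequence gets longer.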
