Mixture-of-Depths: Dynamically allocating compute in transformer-based language models

Abstract

Proposes MoD (Mixture-of-Depths), which allocates FLOPs dynamically by limiting the number of tokens that participate in computation at a given layer.

[arXiv](2024/04/02 version v1)

The setup routes each token to one of two paths: (1) self-attention & MLP, or (2) the residual connection (which bypasses the computation).

There are two routing schemes.

(Left) Token-choice routing produces a distribution of expert preferences for each token and routes accordingly. It suffers from unbalanced allocation.

(Middle) Expert-choice routing works the other way around: each expert selects its tokens. This guarantees perfect load balancing, but some tokens end up over-processed or under-processed.
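To make the difference between the two schemes concrete, here is a minimal sketch in plain Python. The score table, `capacity`, and the function names `token_choice` and `expert_choice` are all made up for illustration and are not taken from the paper.

```python
# Toy router scores: scores[t][e] is the affinity of token t for expert e.
# All numbers are invented for illustration.
scores = [
    [0.90, 0.1],
    [0.80, 0.2],
    [0.70, 0.3],
    [0.85, 0.6],
]
capacity = 2  # assumed number of tokens each expert may process


def token_choice(scores):
    """Each token picks its highest-scoring expert; loads can be uneven."""
    assignment = {e: [] for e in range(len(scores[0]))}
    for t, row in enumerate(scores):
        best = max(range(len(row)), key=lambda e: row[e])
        assignment[best].append(t)
    return assignment


def expert_choice(scores, capacity):
    """Each expert picks its top-`capacity` tokens; loads are always equal."""
    assignment = {}
    for e in range(len(scores[0])):
        ranked = sorted(range(len(scores)), key=lambda t: scores[t][e], reverse=True)
        assignment[e] = sorted(ranked[:capacity])
    return assignment


tc = token_choice(scores)             # every token prefers expert 0
ec = expert_choice(scores, capacity)  # each expert gets exactly 2 tokens
print(tc)
print(ec)
```

In this toy case token-choice sends all four tokens to expert 0, while expert-choice gives each expert exactly two tokens. The price is that token 3 is processed by both experts and token 1 by neither.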

Reasons for choosing expert-choice routing:

A token x whose router weight is not in the top-k bypasses the block; tokens in the top-k pass through self-attention & MLP.

In addition, the block output is multiplied by the router weight r, so that the router lies on the gradient path and can be trained.
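The routing rule above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the router is taken to be a dot product with a hypothetical weight vector `router_w`, and `toy_block` is a stand-in for self-attention & MLP that mixes only the selected tokens. `mod_block` and `toy_block` are names invented here.

```python
def toy_block(selected):
    """Stand-in for self-attention + MLP (assumption): every selected token
    receives the mean of the selected tokens as its update."""
    dim = len(selected[0])
    mean = [sum(x[d] for x in selected) / len(selected) for d in range(dim)]
    return [list(mean) for _ in selected]


def mod_block(xs, router_w, block_fn, k):
    """Route the top-k tokens through block_fn; the rest bypass it."""
    # One scalar router weight per token (assumed linear router).
    r = [sum(w * v for w, v in zip(router_w, x)) for x in xs]
    ranked = sorted(range(len(xs)), key=lambda i: r[i], reverse=True)
    chosen = sorted(ranked[:k])  # keep the original token order
    updates = block_fn([xs[i] for i in chosen])
    out = [list(x) for x in xs]  # default path: residual bypass
    for i, f in zip(chosen, updates):
        # Selected tokens: x + r * f(x). Multiplying by r keeps the router
        # on the gradient path in a real (differentiable) implementation.
        out[i] = [xi + r[i] * fi for xi, fi in zip(xs[i], f)]
    return out, chosen


xs = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 3.0]]
out, chosen = mod_block(xs, router_w=[1.0, 0.0], block_fn=toy_block, k=2)
print(chosen)
print(out)
```

With these toy inputs, tokens 0 and 2 have the largest router weights and are updated, while tokens 1 and 3 come out unchanged. Note that the block only ever sees k tokens, which is where the FLOP savings come from.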

There is one problem: the top-k operation is non-causal, so it cannot be applied during autoregressive sampling, where future tokens are not accessible.
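A small sketch of why top-k is non-causal: whether a token is selected can change once later tokens arrive. The router scores below and the helper `topk_mask` are invented for illustration, and k is held fixed for simplicity.

```python
def topk_mask(scores, k):
    """Return a per-token flag: True if the token is among the k highest scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:k])
    return [i in chosen for i in range(len(scores))]


prefix = [0.2, 0.9, 0.5]     # router scores seen so far (made up)
full = prefix + [0.7, 0.8]   # the same sequence after two more tokens

mask_prefix = topk_mask(prefix, k=2)
mask_full = topk_mask(full, k=2)
print(mask_prefix)
print(mask_full)
```

Token 2 is selected when only the prefix is visible, but drops out of the top-k once the later tokens are scored. During sampling those later scores do not exist yet, so the training-time decision cannot be reproduced directly.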

Two workarounds:

IsoFLOP comparison.

There is a slight performance drop for the autoregressive model.

MoD can be integrated with MoE (staged routing or integrated routing).
