Octopus v3: Technical Report for On-device Sub-billion Multimodal AI Agent

논문 리뷰/Language Model

Ostin 2024. 5. 1. 15:21

Abstract

Octopus v2 + vision : on-device function calling

[arXiv](2024/04/18 version v2)

Encoding visual information

CLIP 기반 vision encoder를 사용한다.

Functional token

Octopus v2와 똑같이 functional token을 사용하여 기능을 호출한다.

Multi-stage training

사전 훈련된 vision encoder, LLM의 정렬을 학습 → functional token 학습

모델의 피라미터 수는 1B 미만이며 아래 결과들은 출력 parser가 필요하지 않은 직접 출력이다.

이메일 보내기

구글 검색하기

아마존 구매

DoorDash (음식 배달 회사)