Tag2Text: Guiding Vision-Language Model via Image Tagging

A vision-language pretraining framework that uses tagging instead of a detector

Github

arXiv

Abstract

Proposes Tag2Text, a vision-language pretraining (VLP) framework that introduces image tagging into vision-language models.

Introduction

(1) Existing detector-based vision-language (V+L) task frameworks

(2) A new approach based on image tagging

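Automatic text semantic parsing makes it possible to use large-scale data.
Tag categories go beyond plain objects to cover scenes, attributes, and actions, which gives a better link between images and text.
Tagging is lighter than a detector.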

Tag2Text is proposed for the reasons above.

Approach

Mining Tags from Texts

The Text Semantic Parser identifies entities (head + modifier) and relationships in the input sentence to obtain tags for the image.

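The Tag Category System is built on the principle that higher-frequency tags matter more, because they reflect the common elements of image descriptions. On that basis, roughly 3,000 tags are selected from 4 million open-source image-text pairs.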
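
The frequency-based selection described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: it assumes the semantic parser has already produced a list of candidate tags per caption, and the function names `build_tag_vocabulary` and `tags_for_image` are hypothetical.

```python
from collections import Counter


def build_tag_vocabulary(parsed_tags_per_caption, vocab_size):
    """Keep the `vocab_size` most frequent tags across all captions."""
    counts = Counter()
    for tags in parsed_tags_per_caption:
        # Count each tag once per caption so a repeated word
        # does not inflate its frequency.
        counts.update(set(tags))
    # Sort by descending count; break ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tag for tag, _ in ranked[:vocab_size]]


def tags_for_image(parsed_tags, vocabulary):
    """Keep only the parsed tags that are in the vocabulary (order preserved)."""
    vocab = set(vocabulary)
    kept = []
    for tag in parsed_tags:
        if tag in vocab and tag not in kept:
            kept.append(tag)
    return kept


# Toy input standing in for parser output (assumed, not real data).
parsed = [
    ["dog", "grass", "run"],
    ["dog", "beach"],
    ["cat", "sofa", "sit"],
    ["dog", "sit"],
]
vocabulary = build_tag_vocabulary(parsed, vocab_size=2)
image_tags = tags_for_image(["cat", "sofa", "sit"], vocabulary)
```

With this toy input, rare tags such as "cat" and "sofa" fall outside the vocabulary, so the third caption contributes only the tags that survive the frequency cut.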

Overview Framework

Tag2Text Pre-training

Image Tagging

Uses an image-tag recognition decoder whose goal is to associate image features with the corresponding tags.

Trained with Asymmetric Loss.
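
For reference, Asymmetric Loss (ASL) is a multi-label loss that down-weights easy negatives more aggressively than positives and discards negatives whose probability falls below a margin. The sketch below follows the published ASL formulation in plain Python; the hyperparameter values (`gamma_pos`, `gamma_neg`, `margin`) are illustrative assumptions and are not taken from this post.

```python
import math


def asymmetric_loss(logits, targets, gamma_pos=0.0, gamma_neg=4.0,
                    margin=0.05, eps=1e-8):
    """Sum of per-tag ASL terms for one image.

    logits  : raw scores, one per tag
    targets : 1 if the tag is present, 0 otherwise
    """
    total = 0.0
    for z, y in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))  # sigmoid probability for this tag
        if y == 1:
            # Positive term: focal-style weighting with gamma_pos.
            total += -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
        else:
            # Negative term: shift the probability down by the margin,
            # so very easy negatives contribute exactly zero.
            p_m = max(p - margin, 0.0)
            total += -(p_m ** gamma_neg) * math.log(max(1.0 - p_m, eps))
    return total


easy_negative = asymmetric_loss([-5.0], [0])  # probability below the margin
hard_negative = asymmetric_loss([3.0], [0])   # confidently wrong prediction
```

Because most tags are absent from any given image, this asymmetry keeps the many easy negatives from dominating the gradient.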

Image-Tag-Text Generation

An autoregressive, unidirectional language model takes the tags and image features as input and is trained to generate the text.
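
"Unidirectional" here means each text position may only attend to earlier positions, and training minimizes the negative log-likelihood of each next token. The following is a generic sketch of those two pieces, not the model's actual decoder: `causal_mask` and `lm_loss` are hypothetical helpers, and the per-step probabilities are assumed to already be conditioned on the tags and image features.

```python
import math


def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def lm_loss(step_probs, target_ids):
    """Average negative log-likelihood of the target tokens.

    step_probs[t] : predicted distribution over the vocabulary at step t,
                    assumed to be conditioned on tags, image features,
                    and tokens before t
    target_ids[t] : index of the ground-truth token at step t
    """
    nll = -sum(math.log(probs[tok]) for probs, tok in zip(step_probs, target_ids))
    return nll / len(target_ids)


mask = causal_mask(3)
loss = lm_loss([[0.5, 0.5], [0.25, 0.75]], [0, 1])
```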

Image-Text Alignment

Image-text alignment is learned through the Image-Text Contrastive Loss (ITC) and the Image-Text Matching Loss (ITM).
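
As a rough illustration of the contrastive part, the sketch below computes a symmetric InfoNCE-style loss over a batch of paired embeddings: matched image-text pairs sit on the diagonal of the similarity matrix and are pushed above the mismatched pairs. It is a generic formulation, not necessarily the exact variant used in the paper; the temperature value and the function name `itc_loss` are assumptions, and ITM (a binary matched/unmatched classifier) is not covered.

```python
import math


def _normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _cross_entropy(logits, target):
    # Numerically stable log-sum-exp minus the target logit.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]


def itc_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; row k of each input is a matched pair."""
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine similarity scaled by the temperature.
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
           for i in imgs]
    # Image-to-text: each row should select its own column.
    i2t = sum(_cross_entropy(sim[k], k) for k in range(n)) / n
    # Text-to-image: each column should select its own row.
    t2i = sum(_cross_entropy([sim[r][k] for r in range(n)], k)
              for k in range(n)) / n
    return 0.5 * (i2t + t2i)


aligned = itc_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = itc_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each image embedding is closest to its own caption and grows when the pairs are swapped.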


Ostin X
