Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Converts DINO, a closed-set detector, into an open-set detector.

Abstract

Proposes Grounding DINO, which combines the Transformer-based detector DINO with grounded pre-training.

Detecting arbitrary objects specified by a human is termed open-set object detection.

Adopts DINO, a Transformer-based detector that can process both language and images and can take advantage of large-scale data.

Existing approach to extending a closed-set detector into an open-set detector:

The closed-set detector is split into three parts, feature fusion is performed three times, and a contrastive loss is computed on the outputs of the neck and the head.

Improvements in Grounding DINO:

Given an (image, text) pair, Grounding DINO outputs object boxes and the corresponding noun phrases.

REC (referring expression comprehension): for each text input, outputs the highest-scoring object among the predicted bounding boxes.
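The REC output rule above can be sketched as a simple arg-max over scored boxes. This is a minimal illustration, not the paper's code: the function name `select_rec_box`, the `(x1, y1, x2, y2)` box format, and the example scores are all assumptions made for this sketch.

```python
from typing import List, Tuple

# Assumed box format for this sketch: (x1, y1, x2, y2).
Box = Tuple[float, float, float, float]


def select_rec_box(boxes: List[Box], scores: List[float]) -> Box:
    """Return the single highest-scoring box for one referring expression."""
    if not boxes or len(boxes) != len(scores):
        raise ValueError("boxes and scores must be non-empty and equally long")
    best = max(range(len(scores)), key=lambda i: scores[i])
    return boxes[best]


# Hypothetical predictions for one text input.
boxes = [(10.0, 20.0, 50.0, 80.0), (30.0, 40.0, 90.0, 120.0), (0.0, 0.0, 5.0, 5.0)]
scores = [0.35, 0.81, 0.10]
best_box = select_rec_box(boxes, scores)
```

Unlike ordinary detection, where every box above a threshold is kept, this setting keeps exactly one box per expression.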

The backbones extract multi-scale image features and text features, which are then refined with self-attention and fused with cross-attention.

The image features most relevant to the text features are selected and used as the positional queries of the DINO decoder.
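The selection step above can be sketched as follows: score every image token by its best match against any text token, then keep the top-scoring tokens as query positions. This is a simplified pure-Python sketch under stated assumptions. The function name, the use of a plain dot product as the similarity, and the toy feature values are illustrative, and real implementations operate on batched tensors.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product between two feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def language_guided_query_selection(
    image_feats: List[Sequence[float]],
    text_feats: List[Sequence[float]],
    num_queries: int,
) -> List[int]:
    """Return indices of the image tokens most relevant to the text."""
    # Relevance of an image token = its maximum similarity over all text tokens.
    relevance = [max(dot(img, txt) for txt in text_feats) for img in image_feats]
    # Rank image tokens by relevance (highest first) and keep the top ones.
    order = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    return order[:num_queries]


# Toy example: 4 image tokens and 2 text tokens, each with 2 channels.
image_feats = [[-1.0, 0.0], [0.25, 0.25], [1.0, 0.0], [0.0, 1.0]]
text_feats = [[2.0, 0.0], [0.0, 1.0]]
selected = language_guided_query_selection(image_feats, text_feats, num_queries=2)
```

The returned indices identify which image features seed the decoder queries. Taking the maximum over text tokens means an image token only has to match one word well to be selected.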

A text cross-attention layer is added to the DINO decoder.

(a) Sentence-level representation: the influence between words is removed, but fine-grained information is lost.

(b) Word-level representation: unnecessary dependencies arise between unrelated words.

Therefore, an attention mask as in (c) is introduced.
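The masking idea in (c) can be illustrated with a block-diagonal mask: a token may attend only to tokens of the same phrase, so unrelated category names do not influence each other while the words inside one phrase still interact. The sketch below is an assumption-laden simplification. The function name and the `phrase_ids` input are hypothetical, and separator and special tokens are ignored.

```python
from typing import List


def sub_sentence_attention_mask(phrase_ids: List[int]) -> List[List[bool]]:
    """Build a mask where mask[i][j] is True if token i may attend to token j.

    phrase_ids[i] is the index of the phrase that token i belongs to.
    """
    n = len(phrase_ids)
    return [[phrase_ids[i] == phrase_ids[j] for j in range(n)] for i in range(n)]


# Hypothetical prompt with two phrases: "cat" and "baseball glove".
tokens = ["cat", "baseball", "glove"]
phrase_ids = [0, 1, 1]
mask = sub_sentence_attention_mask(phrase_ids)
# mask:
# [[True,  False, False],
#  [False, True,  True ],
#  [False, True,  True ]]
```

Here "baseball" and "glove" attend to each other, keeping the word-level detail that (a) loses, while "cat" is isolated from both, avoiding the cross-phrase dependencies of (b).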

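As in DETR-like models, L1 loss and GIoU loss are used for bounding box regression.

As in GLIP, a contrastive loss between predicted objects and language tokens is used for classification.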

(The paper does not write these out as formal equations; they are only mentioned briefly.)
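Since no formula is given, the two box regression terms can be sketched for a single predicted/target pair as below. This is a minimal illustration under stated assumptions: boxes use `(x1, y1, x2, y2)` corners with positive area, the function names are made up for this sketch, and the weights `w_l1` and `w_giou` are placeholder parameters rather than values taken from the paper. The contrastive classification term is not covered here.

```python
from typing import Sequence


def l1_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Sum of absolute coordinate differences between two boxes."""
    return sum(abs(p - t) for p, t in zip(pred, target))


def giou(a: Sequence[float], b: Sequence[float]) -> float:
    """Generalized IoU of two (x1, y1, x2, y2) boxes, in the range (-1, 1]."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    if area_a <= 0 or area_b <= 0:
        raise ValueError("boxes must have positive area")
    # Intersection rectangle (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (hull - union) / hull


def box_regression_loss(
    pred: Sequence[float],
    target: Sequence[float],
    w_l1: float = 1.0,    # placeholder weight, not from the paper
    w_giou: float = 1.0,  # placeholder weight, not from the paper
) -> float:
    """Weighted sum of the L1 term and the GIoU term (1 - GIoU)."""
    return w_l1 * l1_loss(pred, target) + w_giou * (1.0 - giou(pred, target))


perfect = box_regression_loss((0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0))
shifted = box_regression_loss((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
```

The L1 term penalizes raw coordinate error, while the GIoU term stays informative even when the boxes do not overlap, which is why DETR-style detectors typically combine the two.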
