Coincidence, Categorization, and Consolidation: Learning to Recognize Sounds with Minimal Supervision

Training an audio embedding network with clustering-based representation learning

[arXiv](Current version v1)

Introduction

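A learning framework that acquires knowledge in a way similar to human infants, rather than from a large set of labeled examples.

It proposes a clustering procedure to discover categorical structure in a semantically structured representation.

It adopts a cluster-based active learning procedure to assign weak labels to the discovered categories.

The goal is to train an audio embedding network.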

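The paper's approach is a generalization of Look, Listen and Learn, with the following improvements.

Human experience suggests that we can roughly guess what is making a sound without seeing it, so a less restrictive coincidence prediction is used
Both uni-modal (Audio-Audio) and cross-modal (Visual-Audio) data are used
All non-coincident pairs in a minibatch are used, which improves optimization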

The AA prediction task (red line) is performed with a binary classification network p_AA, using a balanced cross-entropy over the B coincident pairs and B(B-1) non-coincident pairs of a minibatch:
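As a rough illustration of the balanced cross-entropy described above, the sketch below averages the loss over the B coincident pairs and over the B(B-1) non-coincident pairs separately, so each group contributes equally. The function name `balanced_coincidence_loss`, the score-matrix layout (coincident pairs on the diagonal), and the equal weighting of the two averages are assumptions made for this example, not details confirmed by the post.

```python
import numpy as np

def balanced_coincidence_loss(scores):
    """Balanced binary cross-entropy over one minibatch (illustrative sketch).

    scores[i, j] is the predicted probability that item i of the first
    stream coincides with item j of the second stream. Assumed layout:
    the B diagonal entries are coincident pairs, the B(B-1) off-diagonal
    entries are non-coincident pairs.
    """
    scores = np.asarray(scores, dtype=float)
    B = scores.shape[0]
    assert scores.shape == (B, B) and B >= 2, "need a square matrix with B >= 2"
    eps = 1e-12  # guards against log(0)
    pos_mask = np.eye(B, dtype=bool)
    # Average over the B coincident pairs (target = 1).
    pos_term = -np.log(scores[pos_mask] + eps).mean()
    # Average over the B(B-1) non-coincident pairs (target = 0).
    neg_term = -np.log(1.0 - scores[~pos_mask] + eps).mean()
    return pos_term + neg_term
```

Averaging each group on its own keeps the far more numerous non-coincident pairs from dominating the gradient, which is one plausible reading of "balanced" here.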

AV prediction task (blue line):

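A cluster mapping p_clust is added (green line).

Reducing the data point-wise entropy encourages confident cluster assignments, but to prevent the trivial solution in which everything collapses into a single cluster, a term that increases the overall entropy of p_clust is added.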
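The two entropy terms can be sketched as follows: the mean per-point entropy is minimized while the entropy of the batch-averaged cluster distribution is maximized. This is a minimal sketch under assumptions; the function names and the weighting parameter `gamma` are hypothetical and introduced only for illustration.

```python
import numpy as np

def entropy(p, axis=-1):
    """Shannon entropy (in nats) of distributions along `axis`."""
    eps = 1e-12  # guards against log(0)
    return -(p * np.log(p + eps)).sum(axis=axis)

def cluster_entropy_loss(p_clust, gamma=1.0):
    """Entropy-based clustering objective (illustrative sketch).

    p_clust: array of shape (N, K); row n is the cluster distribution
    predicted for data point n. `gamma` is a hypothetical weight on the
    anti-collapse term.
    """
    p_clust = np.asarray(p_clust, dtype=float)
    # Low per-point entropy means confident assignments.
    pointwise = entropy(p_clust).mean()
    # High entropy of the averaged distribution means clusters are used evenly.
    marginal = entropy(p_clust.mean(axis=0))
    return pointwise - gamma * marginal
```

Under this formulation, confident assignments spread evenly over clusters score lower (better) than confident assignments that all land in one cluster.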

A label is requested for a single data point in each cluster, and that label is propagated to all members of the cluster.
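The label propagation step can be sketched roughly as below. The hard assignment via argmax, the random choice of which point to query, and the names `propagate_cluster_labels` and `query_label` (a stand-in for the human annotator) are all assumptions for this example; the post does not specify how the queried point is selected.

```python
import numpy as np

def propagate_cluster_labels(p_clust, query_label, rng=None):
    """Query one label per cluster and copy it to every member (sketch).

    p_clust: array of shape (N, K) of cluster probabilities.
    query_label: hypothetical callable mapping a data point index to its
        label, standing in for the annotator.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    # Assumption: each point belongs to its most probable cluster.
    assignments = np.argmax(np.asarray(p_clust), axis=1)
    labels = np.full(len(assignments), None, dtype=object)
    for k in np.unique(assignments):
        members = np.flatnonzero(assignments == k)
        # Assumption: the queried point is drawn uniformly from the cluster.
        queried = int(rng.choice(members))
        labels[members] = query_label(queried)
    return labels
```

The resulting labels are weak: every member inherits the label of the queried point, so impure clusters introduce label noise, and the number of queries equals the number of non-empty clusters.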

A classifier is then trained with cross-entropy on the propagated labels (yellow line).

According to the paper, requesting labels for more data points is no more effective than simply using more clusters.

L_AV is first optimized until convergence, after which optimization proceeds in the order L_coin → L_joint → L_class.
