Flow-Guided Transformer for Video Inpainting (FGT)

Flow-guided video inpainting using temporal and spatial transformers

[Github]

[arXiv], [Supplementary]

Abstract

Proposes the Flow-Guided Transformer (FGT).

Introduction

FGT consists of two parts:

Flow completion network
Flow-guided transformer

Flow completion network:

Integrates spatio-temporally decoupled P3D blocks into a U-Net.

Introduces an edge loss.

Flow-guided transformer:

Decouples spatial and temporal attention.

Optical flow is used only in the spatial transformer.

Temporal attention operates spatio-temporally (across frames), whereas spatial attention operates only within a single frame.

Flow-reweight module

Window partitioning

Dual perspective spatial MHSA (multi-head self-attention)

Method

Network overview

The network consists of the Local Aggregation Flow Completion network (LAFC) for flow completion and the Flow-Guided Transformer (FGT).

Local aggregation flow completion network

Local flow aggregation

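LAFC adopts pseudo-3D (P3D) blocks and residual connections.

F_forward and F_backward are both denoted F.

Given a corrupted flow sequence around the target flow, the sequence is initialized with Laplacian filling to obtain F̃, which is then fed to LAFC.

With f^m denoting the input flow of the m-th P3D block, each block applies a temporal convolution and a spatial convolution to f^m.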
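One way to write such a block, combining the two convolutions with the residual connection mentioned above, is sketched below. The symbols TC (temporal convolution) and SC (spatial convolution) and the order in which they are applied are assumptions made for illustration, not something stated in this post.

```latex
f^{m+1} = \mathrm{TC}\bigl(\mathrm{SC}(f^{m})\bigr) + f^{m}
```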

Edge loss

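Because flow fields are smooth in most regions, edges can become blurry.

For the completed flow F̂_t, a simple projection network P_e predicts an edge map, and the edge loss compares it with the edges that a Canny edge detector extracts from the ground truth.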
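A minimal sketch of such a comparison, assuming the edge loss is a binary cross-entropy between predicted edge probabilities and a binary edge map. The function name `edge_bce_loss` is hypothetical, and the plain nested lists stand in for the output of P_e and for the Canny edges.

```python
import math

def edge_bce_loss(pred_edges, gt_edges, eps=1e-7):
    """Mean binary cross-entropy between edge probabilities and a binary edge map.

    pred_edges: 2D list of probabilities in [0, 1] (stand-in for the output of P_e).
    gt_edges:   2D list of 0/1 values (stand-in for Canny edges of the ground truth).
    """
    total, count = 0.0, 0
    for pred_row, gt_row in zip(pred_edges, gt_edges):
        for p, g in zip(pred_row, gt_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(g * math.log(p) + (1 - g) * math.log(1.0 - p))
            count += 1
    return total / count

gt = [[0, 1], [1, 0]]
sharp = [[0.1, 0.9], [0.9, 0.1]]    # confident, correct edge prediction
blurry = [[0.5, 0.5], [0.5, 0.5]]   # ambiguous edges everywhere
print(edge_bce_loss(sharp, gt) < edge_bce_loss(blurry, gt))  # True
```

An ambiguous edge map is penalized more than a sharp, correct one, which is the behavior the edge loss is meant to encourage.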

Loss function

L1 loss:

Smoothness loss:

Warp loss L_w, which prevents large warping errors:

Total loss:
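The sketch below illustrates how an L1 term and a smoothness term could be computed and combined into a weighted sum. It is a simplified illustration under assumptions: the smoothness term here is a plain first-order difference, the edge and warp terms are left out, and the function names and weights are hypothetical rather than values from the paper.

```python
def l1_loss(pred, target):
    """Mean absolute difference between two 2D fields."""
    vals = [abs(p - t) for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    return sum(vals) / len(vals)

def smoothness_loss(flow):
    """Mean absolute first-order difference along x and y (assumed form)."""
    diffs = []
    for y, row in enumerate(flow):
        for x, v in enumerate(row):
            if x + 1 < len(row):
                diffs.append(abs(row[x + 1] - v))      # horizontal neighbour
            if y + 1 < len(flow):
                diffs.append(abs(flow[y + 1][x] - v))  # vertical neighbour
    return sum(diffs) / len(diffs)

def total_loss(terms, weights):
    """Weighted sum of named loss terms."""
    return sum(weights[name] * value for name, value in terms.items())

pred = [[0.0, 1.0], [0.0, 1.0]]     # one flow component on a 2x2 grid
target = [[0.0, 0.0], [0.0, 0.0]]
terms = {"l1": l1_loss(pred, target), "smooth": smoothness_loss(pred)}
weights = {"l1": 1.0, "smooth": 0.5}  # hypothetical weights
print(terms, total_loss(terms, weights))
```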

Flow-guided transformer for video inpainting

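Because flows of temporally distant frames are not highly correlated, flow is not integrated into temporal attention.

After the first transformer block, the positional encoding method proposed in CVPT is used.

Temporal transformer

The token map is partitioned into large windows, and self-attention is performed within each window across different frames.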
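The grouping described above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function name `temporal_window_tokens` and the window sizes are hypothetical, and tokens are plain Python objects rather than feature vectors.

```python
def temporal_window_tokens(frames, win_h, win_w):
    """Group tokens by spatial window, gathering the same window from every frame.

    frames: list of T token maps, each an H x W nested list of tokens.
    Returns a dict mapping (window_row, window_col) to the tokens of all frames
    that fall inside that window.
    """
    height, width = len(frames[0]), len(frames[0][0])
    assert height % win_h == 0 and width % win_w == 0, "map must tile evenly"
    windows = {}
    for frame in frames:
        for y in range(height):
            for x in range(width):
                key = (y // win_h, x // win_w)
                windows.setdefault(key, []).append(frame[y][x])
    return windows

# Two 2x4 frames; each token is labelled (frame, y, x) so the grouping is visible.
frames = [[[(t, y, x) for x in range(4)] for y in range(2)] for t in range(2)]
windows = temporal_window_tokens(frames, win_h=2, win_w=2)
print(len(windows), len(windows[(0, 0)]))  # 2 8
```

Self-attention would then run over each group, so a token attends to tokens in the same window from every frame.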

Flow guidance integration

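In optical flow, the difference between foreground and background motion reveals how the two relate, and tokens with similar motion are more relevant to each other. Because flow carries this useful information, it is integrated with the frame tokens.

However, directly concatenating the flow tokens (T_F) with the frame tokens (T_I) is problematic:

The completed flow is not perfect.
Appearance can vary widely within an object while its flow stays similar, which can cause confusion.

A flow-reweight module is therefore designed.

(+ concatenation with the frame token map)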
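A reweighting of this kind can be written as below. This is an assumed form for illustration: the use of an MLP over the concatenated tokens and the element-wise product are assumptions, and \hat{T}_F and T are names introduced here for the reweighted flow tokens and the final combined tokens.

```latex
\hat{T}_F = T_F \odot \mathrm{MLP}\bigl(\mathrm{Concat}(T_I, T_F)\bigr),
\qquad
T = \mathrm{Concat}\bigl(T_I, \hat{T}_F\bigr)
```

The frame tokens decide how much each flow token is trusted before the two are concatenated, which addresses both problems listed above.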

Dual perspective spatial MHSA

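The input token map is partitioned into windows, and self-attention is performed within each window.

However, accuracy drops when a window contains many tokens from the missing region.

Therefore, global tokens condensed by a depth-wise convolution are brought in and used as keys and values.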
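The idea can be sketched with a single attention head in plain Python. This is a simplified illustration under assumptions: the learned query/key/value projections and the multiple heads are omitted, the global tokens are passed in directly instead of being produced by a depth-wise convolution, and the function names are hypothetical.

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        outputs.append([
            sum(w * v[i] for w, v in zip(weights, values)) / norm
            for i in range(len(values[0]))
        ])
    return outputs

def dual_perspective_attention(window_tokens, global_tokens):
    """Queries come from the local window; keys and values are local plus global tokens."""
    keys_values = window_tokens + global_tokens
    return attention(window_tokens, keys_values, keys_values)

window = [[0.0, 0.0], [0.0, 0.0]]  # a window made only of empty (missing-region) tokens
global_ctx = [[1.0, 2.0]]          # stand-in for tokens condensed from the whole map
local_only = attention(window, window, window)
out = dual_perspective_attention(window, global_ctx)
print(local_only[0], out[0])
```

With local attention alone the empty window can only reproduce empty tokens, while the extra global keys and values let information from outside the window flow in.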

Loss function

Reconstruction loss (M = mask, Y = ground truth)

Adversarial hinge loss based on T-PatchGAN
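For reference, the standard hinge formulation of a GAN loss can be sketched as follows. The function names are hypothetical and the scores are plain floats standing in for discriminator outputs; the T-PatchGAN discriminator itself is not modeled here.

```python
def relu(x):
    return max(0.0, x)

def discriminator_hinge_loss(real_scores, fake_scores):
    """Hinge loss for the discriminator: push real scores above +1 and fake scores below -1."""
    real_term = sum(relu(1.0 - s) for s in real_scores) / len(real_scores)
    fake_term = sum(relu(1.0 + s) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term

def generator_adversarial_loss(fake_scores):
    """Adversarial loss for the generator: raise the discriminator's score on generated videos."""
    return -sum(fake_scores) / len(fake_scores)

real_scores = [1.5, 0.5]    # discriminator outputs on ground-truth videos
fake_scores = [-2.0, 0.0]   # discriminator outputs on inpainted videos
print(discriminator_hinge_loss(real_scores, fake_scores))  # 0.75
print(generator_adversarial_loss(fake_scores))             # 1.0
```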

Total loss:

Experiments

https://www.youtube.com/watch?v=BC32n-NncPs
