Diffusion-Refined VQA Annotations for Semi-supervised Gaze Following

Link
Abstract

Training gaze following models requires a large number of images with gaze target coordinates annotated by human annotators, which is a laborious and inherently ambiguous process. We propose the first semi-supervised method for gaze following by introducing two novel priors to the task. We obtain the first prior using a large pretrained Visual Question Answering (VQA) model, where we compute Grad-CAM heatmaps by ‘prompting’ the VQA model with a gaze following question. These heatmaps can be noisy and not suited for use in training. The need to refine these noisy annotations leads us to incorporate a second prior. We utilize a diffusion model trained on limited human annotations and modify the reverse sampling process to refine the Grad-CAM heatmaps. By tuning the diffusion process we achieve a trade-off between the human annotation prior and the VQA heatmap prior, which retains the useful VQA prior information while exhibiting similar properties to the training data distribution. Our method outperforms simple pseudo-annotation generation baselines on the GazeFollow image dataset. More importantly, our pseudo-annotation strategy, applied to a widely used supervised gaze following model (VAT), reduces the annotation need by 50%. Our method also performs the best on the VideoAttentionTarget dataset. Code is available at https://github.com/cvlab-stonybrook/GCDR-Gaze.git.
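The abstract's key mechanism is the modified reverse sampling: the Grad-CAM heatmap is forward-noised to an intermediate timestep and then denoised by the diffusion model, so the choice of starting timestep trades off the VQA prior against the learned annotation distribution. A minimal numpy sketch of this idea, assuming a standard DDPM schedule; `dummy_denoiser` is a hypothetical placeholder for the trained noise-prediction network, not the paper's actual model:

```python
import numpy as np

def make_schedule(T=100, beta_start=1e-4, beta_end=0.02):
    # Linear DDPM beta schedule and its cumulative alpha products.
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def dummy_denoiser(x_t, t):
    # Hypothetical stand-in for the trained heatmap diffusion model.
    return np.zeros_like(x_t)

def refine_heatmap(vqa_heatmap, t0, denoiser=dummy_denoiser, seed=0):
    """Noise the VQA prior to intermediate step t0, then run reverse sampling.
    Larger t0 pulls the result toward the learned annotation distribution;
    smaller t0 retains more of the VQA heatmap prior."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule()
    eps = rng.standard_normal(vqa_heatmap.shape)
    # Forward diffusion to step t0.
    x = np.sqrt(alpha_bars[t0]) * vqa_heatmap + np.sqrt(1 - alpha_bars[t0]) * eps
    # Reverse sampling from t0 down to 0.
    for t in range(t0, -1, -1):
        eps_hat = denoiser(x, t)
        coef = betas[t] / np.sqrt(1 - alpha_bars[t])
        x = (x - coef * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            x = x + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
    return x
```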

Synth

Problem:: Labeling gaze target estimation datasets is extremely laborious

Solution:: Use Grad-CAM from a VQA model to propose candidate regions, then refine them with a diffusion model to generate pseudo-labels

Novelty:: First semi-supervised gaze following study

Note:: The idea is implemented straightforwardly with existing VQA and diffusion models; every component already existed, but they are composed well
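The first prior comes from prompting a VQA model with a gaze-following question and computing Grad-CAM. The core Grad-CAM computation, given conv-layer activations and the gradient of the answer score with respect to them, is a gradient-weighted channel sum followed by a ReLU. A minimal numpy sketch; the array shapes are illustrative assumptions, not the paper's actual VQA backbone:

```python
import numpy as np

def grad_cam(activations, gradients):
    """activations, gradients: (C, H, W) feature maps and d(score)/d(activations).
    Returns an (H, W) heatmap normalized to [0, 1]."""
    # Channel weights: global-average-pool the gradients over spatial dims.
    weights = gradients.mean(axis=(1, 2))             # (C,)
    cam = np.tensordot(weights, activations, axes=1)  # (H, W)
    cam = np.maximum(cam, 0.0)                        # ReLU keeps positive evidence only
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam
```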

Summary

Motivation

Method

![[file-20250320013619032.png|725]]

Mean Teacher is used here simply as a technique to improve the diffusion model's performance
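Mean Teacher keeps a second copy of the model whose weights are an exponential moving average (EMA) of the student's weights. A minimal sketch of that update, assuming parameters stored as a name-to-value dict (names hypothetical):

```python
def ema_update(teacher_params, student_params, decay=0.999):
    """Mean Teacher step: teacher <- decay * teacher + (1 - decay) * student,
    applied parameter-wise. The teacher is never updated by gradients."""
    return {
        name: decay * teacher_params[name] + (1.0 - decay) * student_params[name]
        for name in teacher_params
    }
```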

Diffusion Model Training

![[file-20250320014819685.png|700]]
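The diffusion model itself is trained on the limited human-annotated heatmaps. Assuming the standard DDPM epsilon-prediction objective (the usual choice; the paper's exact parameterization may differ), one training step's loss can be sketched as follows, with `eps_model` a hypothetical stand-in for the network:

```python
import numpy as np

def make_alpha_bars(T=100, beta_start=1e-4, beta_end=0.02):
    # Cumulative alpha products of a linear beta schedule.
    return np.cumprod(1.0 - np.linspace(beta_start, beta_end, T))

def diffusion_loss(x0, eps_model, T=100, seed=0):
    """One DDPM training step on a clean annotation heatmap x0:
    sample a timestep and noise, form the noised x_t, and regress
    the model's noise prediction against the true noise (MSE)."""
    rng = np.random.default_rng(seed)
    alpha_bars = make_alpha_bars(T)
    t = rng.integers(0, T)
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps
    eps_hat = eps_model(x_t, t)
    return float(np.mean((eps_hat - eps) ** 2))
```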

Method Validation