Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders

Link
Abstract

We address the problem of gaze target estimation, which aims to predict where a person is looking in a scene. Predicting a person's gaze target requires reasoning both about the person's appearance and the contents of the scene. Prior works have developed increasingly complex, hand-crafted pipelines for gaze target estimation that carefully fuse features from separate scene encoders, head encoders, and auxiliary models for signals like depth and pose. Motivated by the success of general-purpose feature extractors on a variety of visual tasks, we propose Gaze-LLE, a novel transformer framework that streamlines gaze target estimation by leveraging features from a frozen DINOv2 encoder. We extract a single feature representation for the scene, and apply a person-specific positional prompt to decode gaze with a lightweight module. We demonstrate state-of-the-art performance across several gaze benchmarks and provide extensive analysis to validate our design choices. Our code is available at: http://github.com/fkryan/gazelle

Synth

Problem:: Predicting a gaze target requires joint reasoning about the person's appearance and the scene content / prior methods build complex hand-crafted pipelines that separately fuse features from a scene encoder, a head encoder, and auxiliary models for depth and pose

Solution:: Proposes Gaze-LLE, a transformer framework that leverages frozen features from a DINOv2 encoder / extracts a single feature representation of the scene and applies a person-specific positional prompt to decode gaze

Novelty:: Uses a general-purpose feature extractor instead of a complex hand-crafted pipeline / simplifies the architecture via a single feature representation and a lightweight decoding module / decodes gaze effectively through a person-specific positional prompt

Note:: Code is released on GitHub (http://github.com/fkryan/gazelle)

Summary

Motivation

file-20250321015945039.png|450

Method

file-20250321020002488.png
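The core idea of the method (encode the scene once with a frozen backbone, inject a learned embedding at the queried person's head location, then decode a gaze heatmap with a small transformer) can be sketched as below. This is a minimal illustration under assumed shapes and module names (`GazeLLESketch`, `head_prompt`, a stand-in conv in place of the real frozen DINOv2 encoder), not the authors' implementation:

```python
import torch
import torch.nn as nn

class GazeLLESketch(nn.Module):
    """Minimal sketch of the Gaze-LLE idea (hypothetical sizes/names,
    not the official implementation).

    - A frozen backbone (stand-in for DINOv2) encodes the scene once.
    - A single learned vector is added at the queried person's head
      location: the "person-specific positional prompt".
    - A lightweight transformer decodes a gaze-target heatmap.
    """

    def __init__(self, feat_dim=256, num_layers=3):
        super().__init__()
        # Stand-in for frozen DINOv2: maps a 224x224 image to a 16x16 feature grid.
        self.backbone = nn.Conv2d(3, feat_dim, kernel_size=14, stride=14)
        for p in self.backbone.parameters():
            p.requires_grad = False  # backbone stays frozen; only the head trains

        # The person-specific positional prompt: one learned feature vector.
        self.head_prompt = nn.Parameter(torch.zeros(feat_dim))
        layer = nn.TransformerEncoderLayer(feat_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.to_heatmap = nn.Linear(feat_dim, 1)

    def forward(self, image, head_xy=None):
        # image: (B, 3, 224, 224); head_xy: (B, 2) head center, normalized to [0, 1]
        feats = self.backbone(image)               # (B, C, 16, 16)
        B, C, H, W = feats.shape
        tokens = feats.flatten(2).transpose(1, 2)  # (B, H*W, C)
        if head_xy is not None:
            # Add the prompt at the token nearest the head position.
            ix = (head_xy[:, 0] * (W - 1)).round().long()
            iy = (head_xy[:, 1] * (H - 1)).round().long()
            tokens[torch.arange(B), iy * W + ix] += self.head_prompt
        tokens = self.decoder(tokens)
        return self.to_heatmap(tokens).view(B, H, W)  # gaze heatmap

model = GazeLLESketch()
img = torch.randn(2, 3, 224, 224)
heatmap = model(img, head_xy=torch.tensor([[0.3, 0.4], [0.7, 0.2]]))
print(heatmap.shape)  # torch.Size([2, 16, 16])
```

Because the head position enters only as an additive prompt, calling `model(img)` without `head_xy` still produces a heatmap from scene features alone, which mirrors the note below about the model working without a given head position.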

Method Validation

Benchmark Performance

Ablation Study

Even so, the features are strong enough that, in scenes with few people, the model produces meaningful results even when no head position is given.