Un-Gaze: A Unified Transformer for Joint Gaze-Location and Gaze-Object Detection

Link
Abstract

This paper proposes an efficient and effective method for joint gaze location detection (GL-D) and gaze object detection (GO-D), i.e., gaze following detection. Current approaches frame GL-D and GO-D as two separate tasks, employing a multi-stage framework in which human head crops must first be detected and then fed into a subsequent GL-D sub-network, which is in turn followed by an additional object detector for GO-D. In contrast, we reframe the gaze following detection task as detecting human head locations and their gaze followings simultaneously, aiming to jointly detect human gaze location and gaze object in a unified, single-stage pipeline. To this end, we propose GTR, short for Gaze following detection TRansformer, which streamlines the gaze following detection pipeline by eliminating all additional components, leading to the first unified paradigm that unites GL-D and GO-D in a fully end-to-end manner. GTR enables an iterative interaction between holistic semantics and human head features through a hierarchical structure, inferring the relations between salient objects and human gaze from the global image context and resulting in impressive accuracy. Concretely, GTR achieves a 12.1 mAP gain (25.1%) on GazeFollowing and an 18.2 mAP gain (43.3%) on VideoAttentionTarget for GL-D, as well as a 19 mAP improvement (45.2%) on GOO-Real for GO-D. Meanwhile, unlike existing systems that detect gaze following sequentially because they need a human head as input, GTR has the flexibility to comprehend any number of people's gaze followings simultaneously, resulting in high efficiency. Specifically, GTR introduces over a 9× improvement in FPS, and the relative gap becomes more pronounced as the number of people grows.
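As a quick sanity check on the reported gains, the sketch below back-computes the baseline scores they imply. It assumes the percentage in parentheses is the gain relative to the previous best result; the abstract does not say so explicitly, so the derived numbers are implied estimates and not figures from the paper. The helper name `implied_baseline` is made up for illustration.

```python
def implied_baseline(abs_gain: float, rel_gain_pct: float) -> float:
    """Baseline score implied by an absolute gain and a relative gain in percent.

    Assumption: rel_gain_pct = 100 * abs_gain / baseline.
    """
    return abs_gain / (rel_gain_pct / 100.0)


# (absolute mAP gain, relative gain in %) exactly as quoted in the abstract.
reported = {
    "GazeFollowing (GL-D)": (12.1, 25.1),
    "VideoAttentionTarget (GL-D)": (18.2, 43.3),
    "GOO-Real (GO-D)": (19.0, 45.2),
}

for name, (gain, pct) in reported.items():
    base = implied_baseline(gain, pct)
    # Both numbers are estimates derived under the assumption stated above.
    print(f"{name}: implied baseline ~{base:.1f} mAP, implied GTR ~{base + gain:.1f} mAP")
```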

Synth

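Problem:: Existing gaze-following detection systems treat GL-D and GO-D as separate tasks/They require human head crops as input, so an additional head detector is needed/They can process only one person's gaze following at a time, which is inefficient when many people are present/They extract whole-scene features and head-pose features separately, so contextual relations are poorly understood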

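Solution:: Reframe gaze-following detection as detecting human head locations and their gaze followings simultaneously/Design a single-stage pipeline consisting of a visual encoder and two decoders/Use a weight-guided embedding (w-GE) to enable dynamic interaction between human queries and the global scene context/Reason about relations effectively through a hierarchical, iterative information flow between the two decoders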

Novelty:: The first work to handle face detection and gaze-target estimation in separate branches

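Note:: It still uses a single, unified Hungarian matching → the two tasks may conflict/The paper reports that performance drops when the two branches are not connected → handling the two tasks with a single feature, without separate processing, lowers performance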
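The matching concern in the note can be illustrated with a toy example. This is a minimal sketch and not the paper's matcher: the cost terms (L1 distance on head boxes, Euclidean distance on gaze points), the weights `w_head` and `w_gaze`, and all function names are assumptions chosen for illustration. It also finds the optimal assignment by brute force over permutations; the Hungarian algorithm would return the same assignment, only faster.

```python
from itertools import permutations
from math import hypot


def pair_cost(pred, gt, w_head=1.0, w_gaze=1.0):
    """Joint cost of assigning one prediction to one ground-truth person."""
    # Head term: L1 distance between (x1, y1, x2, y2) boxes (assumed form).
    head = sum(abs(p - g) for p, g in zip(pred["head"], gt["head"]))
    # Gaze term: Euclidean distance between gaze points (assumed form).
    gaze = hypot(pred["gaze"][0] - gt["gaze"][0], pred["gaze"][1] - gt["gaze"][1])
    return w_head * head + w_gaze * gaze


def match(preds, gts, **weights):
    """Minimum-cost one-to-one assignment; requires len(preds) >= len(gts).

    Returns (assignment, cost), where assignment[g] is the index of the
    prediction matched to ground truth g.
    """
    best_cost, best = float("inf"), None
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(pair_cost(preds[p], gts[g], **weights) for g, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best = cost, perm
    return best, best_cost


# Two candidate queries for one person: query 0 has the exact head box but a
# poor gaze point; query 1 has a slightly shifted head box but the exact gaze.
preds = [
    {"head": (0.10, 0.10, 0.30, 0.30), "gaze": (0.90, 0.90)},
    {"head": (0.15, 0.15, 0.35, 0.35), "gaze": (0.50, 0.50)},
]
gts = [{"head": (0.10, 0.10, 0.30, 0.30), "gaze": (0.50, 0.50)}]

head_only, _ = match(preds, gts, w_head=1.0, w_gaze=0.0)
gaze_only, _ = match(preds, gts, w_head=0.0, w_gaze=1.0)
joint, _ = match(preds, gts, w_head=1.0, w_gaze=1.0)
print(head_only, gaze_only, joint)
```

Here the head-only cost and the gaze-only cost prefer different queries, and the joint cost has to pick one of them, so one task is supervised on a query that is not its own best match. That is the kind of conflict a single unified matching can introduce.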

Summary

Motivation

Method

file-20250324201534740.png

Method Validation

Main Results

Ablation Studies