Gaze Target Detection Based on Head-Local-Global Coordination

#gaze-target-estimation #contrastive-learning #local-global-coordination #representation-consistency

Link

https://link.springer.com/chapter/10.1007/978-3-031-73383-3_18

Abstract

This paper introduces a novel approach to gaze target detection leveraging a head-local-global coordination framework. Unlike traditional methods that rely heavily on estimating gaze direction and identifying salient objects in global view images, our method incorporates a FOV-based local view to more accurately predict gaze targets. We also propose a unique global-local position and representation consistency mechanism to integrate the features from head view, local view, and global view, significantly improving prediction accuracy. Through extensive experiments, our approach demonstrates state-of-the-art performance on multiple significant gaze target detection benchmarks, showcasing its scalability and the effectiveness of the local view and view-coordination mechanisms. The method’s scalability is further evidenced by enhancing the performance of existing gaze target detection methods within our proposed head-local-global coordination framework.

Synth

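Problem:: Inaccurate gaze direction degrades Gaze Cone-based prediction/The Gaze Cone fails to make proper use of task-irrelevant features/Irrelevant features contained in the Global View degrade feature quality

Solution:: Introduce a Local View to minimize the impact of inaccurate gaze direction/Use the Global View to exploit the contextual information of the whole image/Secure Representation Consistency through Contrastive Learning between Local and Global features

Novelty:: Introduction of the Local View concept/Two methods (Position/Representation) for securing consistency between features of different views/Uses both views to minimize sensitivity to the quality of the gaze information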

Note::

Summary

Motivation

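Previous methods used a Gaze Cone defined around the gaze direction. Assuming the gaze direction is estimated accurately, the Gaze Cone theoretically contains the features that are effective for Gaze Target Estimation: facial features, head position, gaze target features, and gaze target location.

Gaze Cone example

However, when the predicted gaze direction is wrong, a Gaze Cone built from it actually hurts performance.

Therefore, to use the Gaze Cone effectively while weakening the adverse effect of an inaccurate gaze direction, the paper uses the Local View: the smallest rectangular region that contains the person's head position and the entire gaze estimation range. To also make use of task-irrelevant features, it uses the Global View, which is the entire image.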
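The Local View definition above can be sketched as a bounding-box computation. This is a minimal illustration, not the paper's implementation: it assumes the gaze direction is a 2D vector in image coordinates, that the gaze estimation range is a cone with a full opening angle `fov_deg` centered on that direction, and that the range is clipped to the image. The function name `local_view_box` and all of its parameters are hypothetical.

```python
import math

def local_view_box(head_box, gaze_dir, fov_deg, img_w, img_h):
    """Smallest axis-aligned box covering the head box and every in-image pixel of the gaze cone.

    head_box: (x0, y0, x1, y1) in pixels (assumed format).
    gaze_dir: (dx, dy) 2D gaze direction in image coordinates (assumed format).
    fov_deg:  full opening angle of the cone in degrees (hypothetical parameter).
    """
    # Cone apex: center of the head box.
    hx = (head_box[0] + head_box[2]) / 2.0
    hy = (head_box[1] + head_box[3]) / 2.0
    norm = math.hypot(gaze_dir[0], gaze_dir[1])
    gx, gy = gaze_dir[0] / norm, gaze_dir[1] / norm
    cos_half = math.cos(math.radians(fov_deg) / 2.0)

    # Start from the head box and grow it to include every pixel inside the cone.
    x0, y0, x1, y1 = head_box
    for y in range(img_h):
        for x in range(img_w):
            vx, vy = x - hx, y - hy
            dist = math.hypot(vx, vy)
            if dist == 0:
                continue
            # A pixel is inside the cone if its angle to the gaze direction is <= fov/2.
            if (vx * gx + vy * gy) / dist >= cos_half:
                x0, y0 = min(x0, x), min(y0, y)
                x1, y1 = max(x1, x), max(y1, y)
    return (x0, y0, x1, y1)

# Head near the image center, looking right with a 60-degree cone.
box = local_view_box((40, 40, 50, 50), (1.0, 0.0), 60.0, 100, 100)
print(box)
```

A wider `fov_deg` makes the box more tolerant of gaze-direction error, at the cost of admitting more irrelevant context. That is the trade-off the Local View is meant to balance against the Global View.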

Method

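The dashed boxes are used only in the Training Phase
The Local View and Global View use the same Network

Overall process

Estimate the gaze direction from the Head View
Generate the Local View using the estimated gaze direction
Feed Depth/Head Position/Image into each View to generate Features
Merge the Local/Global View Features, then merge the result with the Head View Feature to estimate the Heatmap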

Head Attention Module

Pool the Head View Branch feature $F_{h}$ to produce $f_{h}$
Resize $M_{l h / g h}$ to $28 \times 28$, flatten it, and concat it with $f_{h}$
Feed this into the FC layer $F_{h}$ to produce $A_{l h / g h}$ of size $1 \times 7 \times 7$
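The three steps above can be traced at the shape level with a small sketch. It is written under several assumptions that the note does not confirm: the pooling is global average pooling, $M_{l h / g h}$ is a 2D head position mask, the resize is nearest-neighbour, and the FC layer is a single linear map with no normalization afterwards. The function names and the random weights are hypothetical and only serve to show how the dimensions fit together.

```python
import random

def global_avg_pool(feat):
    """feat: C x H x W nested lists -> length-C vector (assumed pooling type)."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feat]

def resize_nearest(mask, size):
    """Nearest-neighbour resize of an H x W mask to size x size (assumed interpolation)."""
    h, w = len(mask), len(mask[0])
    return [[mask[r * h // size][c * w // size] for c in range(size)] for r in range(size)]

def head_attention(feat_h, head_mask, weights, bias):
    """Return a 1 x 7 x 7 attention map from a head feature and a head mask."""
    f_h = global_avg_pool(feat_h)                      # pooled head feature, length C
    m = resize_nearest(head_mask, 28)                  # 28 x 28 mask
    x = f_h + [v for row in m for v in row]            # concat: C + 784 values
    # One linear (FC) layer with 49 outputs.
    out = [sum(w * v for w, v in zip(w_row, x)) + b for w_row, b in zip(weights, bias)]
    return [[out[r * 7:(r + 1) * 7] for r in range(7)]]  # reshape to 1 x 7 x 7

random.seed(0)
C = 4  # hypothetical channel count, kept small for the demo
feat_h = [[[random.random() for _ in range(7)] for _ in range(7)] for _ in range(C)]
head_mask = [[1.0 if 10 <= r < 20 and 30 <= c < 40 else 0.0 for c in range(56)]
             for r in range(56)]
weights = [[random.uniform(-0.01, 0.01) for _ in range(C + 28 * 28)] for _ in range(49)]
bias = [0.0] * 49

attn = head_attention(feat_h, head_mask, weights, bias)
print(len(attn), len(attn[0]), len(attn[0][0]))  # 1 7 7
```

The same function would be called once with the local-view mask and once with the global-view mask to obtain the two attention maps.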

Representation Consistency

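In the figure, Global is the query and Local is the key, but in the paper's equation Global is the key and Local is the query → only the Global feature that matches the Local one is positive; the rest are negative

The Global-view covers a wide area beyond the Local-view region associated with the Gaze Cone, so it contains many task-irrelevant features → these lower feature quality and hurt model performance
The Local-view consists of task-relevant features, so its quality is high → raising the mutual information between the local-view features and the global-view features from the same region improves overall quality
How? Contrastive Learning between Features of the same region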
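One common way to write such a contrastive objective is an InfoNCE-style loss, sketched below with Local features as queries and Global features as keys. The exact loss used in the paper is not reproduced in this note, so the InfoNCE form, the cosine similarity, the `temperature` value, and the function name `info_nce` are all assumptions made for illustration. Index `i` in both lists is assumed to refer to the same image region.

```python
import math

def info_nce(local_feats, global_feats, temperature=0.1):
    """InfoNCE-style loss: local_feats[i] is the query, global_feats[i] its positive key,
    and every other global_feats[j] a negative key."""
    def normalize(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    q = [normalize(v) for v in local_feats]
    k = [normalize(v) for v in global_feats]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of this query against all keys, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]  # -log softmax probability of the positive key
    return total / len(q)

local_feats = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(local_feats, [[1.0, 0.0], [0.0, 1.0]])     # same-region features agree
misaligned = info_nce(local_feats, [[0.0, 1.0], [1.0, 0.0]])  # same-region features disagree
print(aligned < misaligned)  # True
```

Minimizing this loss pulls each Global feature toward the Local feature of the same region and pushes it away from the others, which is the direction described above: the higher-quality Local-view representation acts as the reference for the Global-view one.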