Relation DETR: Exploring Explicit Position Relation Prior for Object Detection

Link
Abstract

This paper presents a general scheme for enhancing the convergence and performance of DETR (DEtection TRansformer). We investigate the slow convergence problem in transformers from a new perspective, suggesting that it arises from the self-attention that introduces no structural bias over inputs. To address this issue, we explore incorporating position relation prior as attention bias to augment object detection, following the verification of its statistical significance using a proposed quantitative macroscopic correlation (MC) metric. Our approach, termed Relation-DETR, introduces an encoder to construct position relation embeddings for progressive attention refinement, which further extends the traditional streaming pipeline of DETR into a contrastive relation pipeline to address the conflicts between non-duplicate predictions and positive supervision. Extensive experiments on both generic and task-specific datasets demonstrate the effectiveness of our approach. Under the same configurations, Relation-DETR achieves a significant improvement (+2.0% AP compared to DINO), state-of-the-art performance (51.7% AP for 1× and 52.1% AP for 2× settings), and a remarkably faster convergence speed (over 40% AP with only 2 training epochs) than existing DETR detectors on COCO val2017. Moreover, the proposed relation encoder serves as a universal plug-in-and-play component, bringing clear improvements for theoretically any DETR-like methods. Furthermore, we introduce a class-agnostic detection dataset, SA-Det-100k. The experimental results on the dataset illustrate that the proposed explicit position relation achieves a clear improvement of 1.3% AP, highlighting its potential towards universal object detection. The code and dataset are available at https://github.com/xiuqhou/Relation-DETR.

Synth

Problem:: Self-attention introduces no explicit structural bias, so convergence is slow / The advantage of one-to-one matching (duplicate removal) conflicts with its drawback (insufficient positive supervision)

Solution:: Explicitly inject position relations into the decoder's self-attention / Apply one-to-many matching to the queries left unmatched by one-to-one matching

Novelty:: Statistically shows, via the proposed MC metric, that position relations actually help detection

Note:: The problem and solution are cleanly written, and the proposed MC metric supplies the evidence for why this method was proposed

Summary

Motivation

Statistical Significance of Object Position Relation

file-20250317194907023.png|525

Method

Position Relation Encoder

file-20250317195313289.png

Left: existing DETR, right: proposed method. Relations are injected into the self-attention over the decoder's object queries.

$$\mathrm{Attn}_{\mathrm{self}}(Q^l) = \mathrm{Softmax}\!\left(\mathrm{Rel}(b^{l-1}, b^l) + \frac{\mathrm{Que}(Q^l)\,\mathrm{Key}(Q^l)^\top}{\sqrt{d_{\mathrm{model}}}}\right)\mathrm{Val}(Q^l)$$
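The position relation embedding is added to the attention logits before the softmax. A minimal single-head NumPy sketch of this relation-biased self-attention, where `relation_bias` stands in for the relation encoder output built from the boxes of layers l-1 and l (names are illustrative, not the paper's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def relation_self_attention(Q, K, V, relation_bias):
    """Self-attention over object queries with an additive position
    relation bias of shape [n_queries, n_queries] applied to the
    logits before the softmax, as in the formula above."""
    d_model = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d_model) + relation_bias
    return softmax(logits) @ V
```

With a zero `relation_bias` this reduces to standard scaled dot-product self-attention; a learned bias lets query pairs with strong positional relations attend to each other more.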

Contrast Relation Pipeline

file-20250317200136435.png|500
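The contrast relation pipeline keeps one-to-one matching for non-duplicate predictions while giving the remaining unmatched queries one-to-many positive supervision. A rough sketch of that assignment split, using a greedy IoU match as a stand-in for the paper's Hungarian matching (function names and the `iou_thresh` value are illustrative):

```python
import numpy as np

def iou_matrix(a, b):
    """Pairwise IoU between boxes a [N,4] and b [M,4] in (x1,y1,x2,y2)."""
    tl = np.maximum(a[:, None, :2], b[None, :, :2])
    br = np.minimum(a[:, None, 2:], b[None, :, 2:])
    wh = np.clip(br - tl, 0, None)
    inter = wh[..., 0] * wh[..., 1]
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def contrast_assign(pred_boxes, gt_boxes, iou_thresh=0.5):
    """Split queries into a one-to-one set (greedy best match per GT,
    standing in for Hungarian matching) and a one-to-many set of the
    remaining queries that still overlap a GT well enough to receive
    auxiliary positive supervision."""
    ious = iou_matrix(pred_boxes, gt_boxes)
    one2one, taken = {}, set()
    for j in range(gt_boxes.shape[0]):
        for i in np.argsort(-ious[:, j]):   # best unassigned prediction
            if int(i) not in taken:
                one2one[int(i)] = j
                taken.add(int(i))
                break
    one2many = {int(i): int(np.argmax(ious[i]))
                for i in range(pred_boxes.shape[0])
                if int(i) not in taken and ious[i].max() >= iou_thresh}
    return one2one, one2many
```

The extra one-to-many targets add positive supervision without disturbing the one-to-one branch that guarantees non-duplicate predictions at inference time.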

Method Validation