MS-DETR: Efficient DETR Training with Mixed Supervision

Link
Abstract

DETR accomplishes end-to-end object detection through iteratively generating multiple object candidates based on image features and promoting one candidate for each ground-truth object. The traditional training procedure using one-to-one supervision in the original DETR lacks direct supervision for the object detection candidates. We aim at improving the DETR training efficiency by explicitly supervising the candidate generation procedure through mixing one-to-one supervision and one-to-many supervision. Our approach, namely MS-DETR, is simple, and places one-to-many supervision to the object queries of the primary decoder that is used for inference. In comparison to existing DETR variants with one-to-many supervision, such as Group DETR and Hybrid DETR, our approach does not need additional decoder branches or object queries; the object queries of the primary decoder in our approach directly benefit from one-to-many supervision and thus are superior in object candidate prediction. Experimental results show that our approach outperforms related DETR variants, such as DN-DETR, Hybrid DETR, and Group DETR, and the combination with related DETR variants further improves the performance. Code is available at: https://github.com/Atten4Vis/MS-DETR.

Synth

Problem:: DETR's candidate generation stage lacks explicit supervision, which degrades candidate quality

Solution:: Proposes mixed supervision, which adds one-to-many supervision on top of the existing one-to-one supervision

Novelty:: Unlike prior work, improves candidate quality without additional decoder branches or queries

Note::

Summary

Motivation

Row 1: GT, row 2: DETR, row 3: proposed method → row 2 shows that, because of the problem described above, candidate generation does not work properly

Method

file-20250313220928201.png

(a) DETR, (b) MS-DETR, (c) Group DETR/DN-DETR, (d) Hybrid DETR
The proposed method requires neither a weight-shared decoder nor additional queries → practical
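To make the idea of mixing the two supervisions concrete, here is a minimal pure-Python sketch that builds a one-to-one and a one-to-many target assignment for the same set of queries. Everything specific in it is an assumption made for illustration and is not taken from this note: the matching score (a weighted blend of classification score and IoU with weight `alpha`), the top-`k` selection with threshold `tau`, the single-class simplification, and the brute-force matcher standing in for Hungarian matching. All function names are hypothetical.

```python
from itertools import permutations

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def match_scores(boxes, cls_scores, gt_boxes, alpha=0.3):
    """scores[q][j]: quality of query q for ground truth j (assumed blend)."""
    return [[alpha * s + (1 - alpha) * iou(b, g) for g in gt_boxes]
            for b, s in zip(boxes, cls_scores)]

def one_to_one(scores):
    """Exactly one query per ground truth, maximizing the total score.
    Brute force over permutations: only suitable for toy sizes."""
    n_q, n_gt = len(scores), len(scores[0])
    best = max(permutations(range(n_q), n_gt),
               key=lambda qs: sum(scores[q][j] for j, q in enumerate(qs)))
    return {q: j for j, q in enumerate(best)}

def one_to_many(scores, k=2, tau=0.4):
    """Up to k queries per ground truth, keeping only scores >= tau.
    A query claimed by several ground truths keeps the best-scoring one."""
    n_q, n_gt = len(scores), len(scores[0])
    assign = {}
    for j in range(n_gt):
        top = sorted(range(n_q), key=lambda q: scores[q][j], reverse=True)[:k]
        for q in top:
            if scores[q][j] < tau:
                continue
            if q not in assign or scores[q][j] > scores[q][assign[q]]:
                assign[q] = j
    return assign

# Toy example: 4 queries from the same (primary) decoder, 2 ground-truth boxes.
gt_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
pred_boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30), (50, 50, 60, 60)]
pred_cls = [0.9, 0.8, 0.7, 0.1]

scores = match_scores(pred_boxes, pred_cls, gt_boxes)
o2o = one_to_one(scores)   # query index -> ground-truth index, one per GT
o2m = one_to_many(scores)  # query index -> ground-truth index, several per GT
print(o2o, o2m)
```

In this toy case query 1 is a near-duplicate of query 0: it receives no target under the one-to-one assignment but does receive one under the one-to-many assignment. Both target sets apply to the same queries, so no extra decoder branch or query group is involved.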

Method Validation

Interesting Points

file-20250313221423780.png

(c) > (d) > (b) > (a)
In (a) and (b), $box_{11}$ and $box_{1m}$ are merged into one → the paper does not explain why

file-20250313223354787.png|750