Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity

Link
Abstract

DETR is the first end-to-end object detector using a transformer encoder-decoder architecture and demonstrates competitive performance but low computational efficiency on high-resolution feature maps. The subsequent work, Deformable DETR, enhances the efficiency of DETR by replacing dense attention with deformable attention, achieving 10x faster convergence and improved performance. Deformable DETR uses multi-scale features to improve performance; however, the number of encoder tokens increases by 20x compared to DETR, and the computation cost of the encoder attention remains a bottleneck. In a preliminary experiment, the authors observe that detection performance hardly deteriorates even if only a portion of the encoder tokens is updated. Inspired by this observation, they propose Sparse DETR, which selectively updates only the tokens expected to be referenced by the decoder, thus helping the model detect objects effectively. In addition, they show that applying an auxiliary detection loss on the selected encoder tokens improves performance while minimizing computational overhead. Sparse DETR achieves better performance than Deformable DETR even with only 10% of the encoder tokens on the COCO dataset. Although only the encoder tokens are sparsified, the total computation cost decreases by 38% and the frames per second (FPS) increases by 42% compared to Deformable DETR. Code is available at https://github.com/kakaobrain/sparse-detr

Synth

Problem:: In Deformable DETR, the number of encoder tokens increases sharply compared to DETR, making the encoder the computational bottleneck

Solution:: Sparsification that updates only the important tokens / a decoder-based criterion for selecting which tokens to keep
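The core idea of updating only salient tokens can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `scoring_net` and `encoder_layer` are hypothetical stand-ins for the scoring network and an encoder refinement layer, and `rho` is the keep ratio (the paper reports strong results even at 10%).

```python
import torch

def sparse_encoder_update(tokens, scoring_net, encoder_layer, rho=0.1):
    """tokens: (N, d) flattened multi-scale feature tokens."""
    # The scoring network predicts a per-token salience score.
    scores = scoring_net(tokens).squeeze(-1)            # (N,)
    k = max(1, int(rho * tokens.size(0)))
    topk = torch.topk(scores, k).indices                # indices of salient tokens
    updated = tokens.clone()
    # Only the selected tokens are refined by the encoder layer;
    # the remaining tokens pass through unchanged.
    updated[topk] = encoder_layer(tokens[topk])
    return updated, topk
```

With `rho=0.1`, only 10% of the tokens pay the cost of the encoder layer, which is where the 38% FLOPs reduction comes from.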

Novelty:: Proposes query sparsification that can be used alongside Deformable DETR's key sampling / applies a layer-wise auxiliary loss to the encoder

Note:: Appears to be the first to introduce an encoder auxiliary loss and top-k decoder queries into Deformable DETR / two impressive points: showing that the objectness score is less effective than the proposed DAM, and measuring the correlation between the queries selected from the encoder and the queries the decoder actually references
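The DAM (Decoder cross-Attention Map) criterion mentioned above can be sketched as a training objective for the scoring network: accumulate the decoder's cross-attention weights over layers and queries, binarize the top-rho tokens as a pseudo-label, and supervise the scoring network with a BCE loss. A minimal sketch under assumed tensor shapes (not the paper's exact implementation, which accumulates deformable attention offsets):

```python
import torch
import torch.nn.functional as F

def dam_bce_loss(cross_attn_maps, scores, rho=0.1):
    """cross_attn_maps: (L, Q, N) decoder cross-attention over L layers,
    Q object queries, N encoder tokens; scores: (N,) scoring-net logits."""
    dam = cross_attn_maps.sum(dim=(0, 1))               # aggregate over layers & queries
    k = max(1, int(rho * dam.numel()))
    target = torch.zeros_like(dam)
    target[torch.topk(dam, k).indices] = 1.0            # binarized top-rho pseudo-label
    return F.binary_cross_entropy_with_logits(scores, target)
```

The point of the note holds here: the pseudo-label comes from what the decoder actually attends to, rather than from a generic objectness score.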

Summary

Motivation

Method

file-20250317012737609.png

Key components: the Scoring Network, the Encoder Aux. Head, and using only the salient queries inside the encoder as self-attention queries
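The third component, using only the salient tokens as self-attention queries while all tokens remain available as keys/values, can be sketched like this (dense attention shown for clarity; the paper uses deformable attention, and `idx` would come from the scoring network):

```python
import torch

def sparse_self_attention(x, idx, attn):
    """x: (N, d) all encoder tokens; idx: indices of salient tokens;
    attn: torch.nn.MultiheadAttention with batch_first=True."""
    q = x[idx].unsqueeze(0)        # (1, k, d) queries: salient tokens only
    kv = x.unsqueeze(0)            # (1, N, d) keys/values: all tokens
    out, _ = attn(q, kv, kv)
    y = x.clone()
    y[idx] = out.squeeze(0)        # write refined tokens back in place
    return y
```

Because the query set shrinks from N to k while the key set stays intact, the attention cost drops roughly by the keep ratio without discarding context from unselected tokens.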

Key Idea

file-20250317013150932.png|725

Method Validation