Towards Efficient Use of Multi-Scale Features in Transformer-Based Object Detectors

Link
Abstract

Multi-scale features have been proven highly effective for object detection but often come with huge and even prohibitive extra computation costs, especially for the recent Transformer-based detectors. In this paper, we propose Iterative Multi-scale Feature Aggregation (IMFA) – a generic paradigm that enables efficient use of multi-scale features in Transformer-based object detectors. The core idea is to exploit sparse multi-scale features from just a few crucial locations, and it is achieved with two novel designs. First, IMFA rearranges the Transformer encoder-decoder pipeline so that the encoded features can be iteratively updated based on the detection predictions. Second, IMFA sparsely samples scale-adaptive features for refined detection from just a few keypoint locations under the guidance of prior detection predictions. As a result, the sampled multi-scale features are sparse yet still highly beneficial for object detection. Extensive experiments show that the proposed IMFA boosts the performance of multiple Transformer-based object detectors significantly yet with only slight computational overhead.

Synth

Problem:: Exploiting multi-scale features in DETR-style detectors demands heavy computation

Solution:: Aggregate sampled features into the object queries/since the full multi-scale features are not used, the sampled features are updated at every iteration

Novelty:: Proposes a stacked encoder-decoder structure that departs from the conventional stacked-encoder design/first study of a multi-scale feature scheme that can replace FPN in DETR

Note:: Both the code and the paper were easy to follow. The authors mention QueryDet as similar work

Summary

Motivation

Method

Iterative Update of Encoded Features

file-20250317112334497.png

Left: existing DETR-based models; right: the proposed method
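A minimal Python sketch (not the authors' code) of the rearranged pipeline shown above: instead of running all encoder layers before the decoder, encoder and decoder stages alternate, so each stage's detection predictions can guide the next round of feature encoding. `encoder`, `decoder`, and `num_stages` are hypothetical stand-ins for the real network modules.

```python
def iterative_detection(features, encoder, decoder, num_stages=3):
    """Alternate encoder and decoder stages (the IMFA rearrangement, sketched).

    Each decoder stage produces predictions, which are fed back as priors to
    the next encoder stage so the encoded features can be iteratively updated.
    Conventional DETR would instead run the full encoder stack once, then the
    full decoder stack.
    """
    predictions = None
    for _ in range(num_stages):
        # Update encoded features, optionally guided by the previous predictions.
        features = encoder(features, predictions)
        # Refine detections from the freshly updated features.
        predictions = decoder(features)
    return predictions
```

With toy stand-ins (an encoder that increments every feature and a decoder that sums them), three stages yield predictions computed from three rounds of feature updates.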

Sparse Multi-Scale Feature Sampling and Aggregation

file-20250317112444723.png
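A minimal pure-Python sketch (assumptions, not the paper's implementation) of the sparse sampling idea: bilinearly sample every pyramid level at a few keypoint locations, then fuse the per-level values with softmax weights so the aggregation is scale-adaptive. Feature maps are flattened single-channel lists here for simplicity; `level_logits` is a hypothetical per-keypoint level score.

```python
import math

def bilinear_sample(feat, H, W, x, y):
    """Bilinearly sample a single-channel H*W feature map (row-major list)
    at normalized coordinates (x, y) in [0, 1]."""
    fx, fy = x * (W - 1), y * (H - 1)
    x0, y0 = int(math.floor(fx)), int(math.floor(fy))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = fx - x0, fy - y0
    v00, v01 = feat[y0 * W + x0], feat[y0 * W + x1]
    v10, v11 = feat[y1 * W + x0], feat[y1 * W + x1]
    top = v00 * (1 - wx) + v01 * wx
    bot = v10 * (1 - wx) + v11 * wx
    return top * (1 - wy) + bot * wy

def sample_sparse_multiscale(pyramid, keypoints, level_logits):
    """For each keypoint, sample every pyramid level at that location and fuse
    the values with softmax weights over per-level scores (a sketch of
    scale-adaptive aggregation)."""
    fused = []
    for (x, y), logits in zip(keypoints, level_logits):
        exps = [math.exp(l) for l in logits]
        z = sum(exps)
        weights = [e / z for e in exps]          # softmax over levels
        vals = [bilinear_sample(f, H, W, x, y)   # one sample per level
                for (f, H, W) in pyramid]
        fused.append(sum(w * v for w, v in zip(weights, vals)))
    return fused
```

The cost scales with the number of keypoints rather than the full spatial resolution of every level, which is the point of sampling sparsely.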

Method Validation

Visualization of the selected locations and levels

file-20250317113430714.png