DI-MaskDINO: A Joint Object Detection and Instance Segmentation Model

Link
Abstract

This paper is motivated by an interesting phenomenon: the performance of object detection lags behind that of instance segmentation (i.e., performance imbalance) when investigating the intermediate results from the beginning transformer decoder layer of MaskDINO (i.e., the SOTA model for joint detection and segmentation). This phenomenon inspires us to think about a question: will the performance imbalance at the beginning layer of transformer decoder constrain the upper bound of the final performance? With this question in mind, we further conduct qualitative and quantitative pre-experiments, which validate the negative impact of detection-segmentation imbalance issue on the model performance. To address this issue, this paper proposes DI-MaskDINO model, the core idea of which is to improve the final performance by alleviating the detection-segmentation imbalance. DI-MaskDINO is implemented by configuring our proposed De-Imbalance (DI) module and Balance-Aware Tokens Optimization (BATO) module to MaskDINO. DI is responsible for generating balance-aware query, and BATO uses the balance-aware query to guide the optimization of the initial feature tokens. The balance-aware query and optimized feature tokens are respectively taken as the Query and Key&Value of transformer decoder to perform joint object detection and instance segmentation. DI-MaskDINO outperforms existing joint object detection and instance segmentation models on COCO and BDD100K benchmarks, achieving +1.2 APbox and +0.9 APmask improvements compared to SOTA joint detection and segmentation model MaskDINO. In addition, DI-MaskDINO also obtains +1.0 APbox improvement compared to SOTA object detection model DINO and +3.0 APmask improvement compared to SOTA segmentation model Mask2Former.

Synth

Problem:: Performance imbalance between object detection and instance segmentation

Solution:: Inject detection-relevant information into the decoder Query / adapt the decoder Key&Value to suit the task

Novelty:: First to propose dedicated modules that address the detection-segmentation performance imbalance

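Note:: Both the query and the features are reworked when the encoder features are passed to the decoder / simply adjusting the loss weights has little effect on performance, whether gain or drop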

Summary

Motivation

(a), (c): MaskDINO, (b), (d): DI-MaskDINO

Method

file-20250313163249640.png

Gray region: MaskDINO, light-green region: proposed method
For simplicity, the light-green region does not distinguish between Content Token and Position Token → in practice both exist
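
The data flow described in the abstract (DI produces the balance-aware query, BATO uses that query to refine the initial feature tokens, and the decoder takes the query as Query and the refined tokens as Key&Value) can be sketched roughly as below. This is only a wiring sketch, not the paper's implementation: `di_module`, `bato_module`, and `decoder_layer` are hypothetical names, the norm-based token selection inside `di_module` is a placeholder assumption, and plain single-head attention stands in for whatever the real modules do.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention on plain Python lists."""
    dim = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, values))
                    for c in range(len(values[0]))])
    return out


def di_module(tokens, num_queries):
    """Hypothetical stand-in for DI: build the balance-aware query from tokens.
    ASSUMPTION: picks the tokens with the largest L2 norm as a placeholder."""
    ranked = sorted(tokens, key=lambda t: sum(x * x for x in t), reverse=True)
    return [list(t) for t in ranked[:num_queries]]


def bato_module(query, tokens):
    """Hypothetical stand-in for BATO: tokens attend to the balance-aware
    query and are updated with a residual connection."""
    update = cross_attention(tokens, query, query)
    return [[t + u for t, u in zip(tok, upd)] for tok, upd in zip(tokens, update)]


def decoder_layer(query, tokens):
    """Decoder step: query is the Query, tokens are Key and Value."""
    update = cross_attention(query, tokens, tokens)
    return [[q + u for q, u in zip(qv, uv)] for qv, uv in zip(query, update)]


random.seed(0)
initial_tokens = [[random.gauss(0, 1) for _ in range(8)] for _ in range(16)]
balance_aware_query = di_module(initial_tokens, num_queries=4)
optimized_tokens = bato_module(balance_aware_query, initial_tokens)
decoded = decoder_layer(balance_aware_query, optimized_tokens)
print(len(decoded), len(decoded[0]))  # number of queries, feature dimension
```

The point of the sketch is the ordering: the query is produced first, the tokens are refined with it second, and only then does the decoder see both.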

Method Validation

Tolerance Test

file-20250313165913218.png

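Loss Weight Constraint: during training, the detection loss weight is set to 1/10 of the segmentation loss weight
Position Token Constraint: the Position Tokens, which are effective for detection, are randomly initialized before performance is evaluated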
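
The two constraints above can be illustrated with a small sketch. It rests on assumptions: `joint_loss` and `randomize_position_tokens` are hypothetical helpers, the loss values are made-up scalars, and each Position Token is assumed to be a normalized 4-value box. The paper's actual loss terms and token format may differ.

```python
import random


def joint_loss(det_loss, seg_loss, seg_weight=1.0, det_ratio=1.0):
    """Weighted sum of detection and segmentation losses.
    det_ratio is the detection weight relative to the segmentation weight;
    the Loss Weight Constraint corresponds to det_ratio = 0.1."""
    det_weight = seg_weight * det_ratio
    return det_weight * det_loss + seg_weight * seg_loss


def randomize_position_tokens(position_tokens, rng):
    """Position Token Constraint: discard the existing position tokens and
    replace every value with a uniform random number in [0, 1).
    ASSUMPTION: tokens are normalized boxes, so [0, 1) is a valid range."""
    return [[rng.random() for _ in token] for token in position_tokens]


# Illustrative (made-up) per-batch loss values.
baseline = joint_loss(det_loss=2.0, seg_loss=3.0)
constrained = joint_loss(det_loss=2.0, seg_loss=3.0, det_ratio=0.1)
print(baseline, constrained)

rng = random.Random(0)
tokens = [[0.5, 0.5, 0.2, 0.3], [0.1, 0.9, 0.4, 0.4]]
print(randomize_position_tokens(tokens, rng))
```

With `det_ratio=0.1`, the detection term contributes one tenth as much to the total as it does in the baseline, which is the handicap the tolerance test applies.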