Representation Surgery for Multi-Task Model Merging

#multi-task-learning #model-merging #representation-bias #unsupervised-optimization #representation-surgery

Link

https://proceedings.mlr.press/v235/yang24t.html

Abstract

Multi-task learning (MTL) compresses the information from multiple tasks into a unified backbone to improve computational efficiency and generalization. Recent work directly merges multiple independently trained models to perform MTL instead of collecting their raw data for joint training, greatly expanding the application scenarios of MTL. However, by visualizing the representation distribution of existing model merging schemes, we find that the merged model often suffers from the dilemma of representation bias. That is, there is a significant discrepancy in the representation distribution between the merged and individual models, resulting in poor performance of merged MTL. In this paper, we propose a representation surgery solution called “Surgery” to reduce representation bias in the merged model. Specifically, Surgery is a lightweight task-specific plugin that takes the representation of the merged model as input and attempts to output the biases contained in the representation from the merged model. We then designed an unsupervised optimization objective that updates the Surgery plugin by minimizing the distance between the merged model’s representation and the individual model’s representation. Extensive experiments demonstrate significant MTL performance improvements when our Surgery plugin is applied to state-of-the-art (SOTA) model merging schemes.

Synth

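Problem:: Existing model merging methods suffer from representation bias: the representation distribution of the merged model differs from that of the individual models

Solution:: Add a lightweight task-specific module that takes the merged model's representation as input and removes the bias / Train it without labeled target-task data by minimizing the distance between the merged model's representation and the individual model's representation

Novelty:: First to identify and analyze representation bias as a key problem in model merging / Unlike existing weight-space merging methods, proposes an orthogonal approach that addresses the problem in representation space at the post-merging stage / Designs an unsupervised scheme that trains the bias-removal module using only unlabeled test data and the individual models, with no labeled data

Note:: The authors apply the method across domains (CV, NLP), datasets, models, and merging methods, and report that it is effective in all of them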

Summary

Motivation

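Existing model merging methods combine several independently trained models into a single model for MTL, but the merged model loses performance
Representation Bias: Visualizing the representation distributions produced by existing methods reveals a substantial gap between the merged model and the individual models
- This representation bias appears across tasks, architectures, and merging methods
- The bias shrinks going from Weight Averaging to Task Arithmetic, Ties-Merging, and AdaMerging, in step with their improving performance → reducing representation bias is a route to better performance
The authors therefore conclude that representation bias is a major cause of the performance drop in merging-based MTL, and that resolving it is the key to improving performance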

Method

Goal: Minimize the representation bias between the representation $Z_{t}^{mtl}$ of the merged model $f_{\theta_{mtl}^{m}}$ and the representation $Z_{t}^{ind}$ of the individual model $f_{\theta_{t}}$
Representation Surgery Module ( $\Phi_{t}$ ):
- A lightweight task-private module that takes the merged model's representation $Z_{t}^{mtl}$ as input and is designed to filter out the representation bias $\Phi_{t}(Z_{t}^{mtl})$
- Implemented with an adapter-like structure: $\Phi_{t}(Z_{t}^{mtl}) = W_{up} \cdot \mathrm{ReLU}(W_{down} \cdot {Z_{t}^{mtl}}^{\top})$
  - $W_{up} \in \mathbb{R}^{k \times r}$ and $W_{down} \in \mathbb{R}^{r \times k}$ are learnable matrices, $k$ is the representation dimension, and $r$ is the rank (a hyperparameter)
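The module definition above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names (`init_surgery`, `surgery_bias`, `apply_surgery`), the initialization scheme (small random $W_{down}$, zero $W_{up}$), and the example sizes $k = 512$, $r = 16$ are assumptions made for this note. Samples are stored as rows, so the transpose in the formula appears explicitly in the code.

```python
import numpy as np

def init_surgery(k, r, seed=0):
    """Create (W_up, W_down) for one task, with shapes k x r and r x k."""
    rng = np.random.default_rng(seed)
    W_down = rng.normal(0.0, 0.02, size=(r, k))  # assumed init scale
    W_up = np.zeros((k, r))                      # assumed zero init: Phi_t(Z) = 0 at start
    return W_up, W_down

def surgery_bias(Z, W_up, W_down):
    """Estimate the bias Phi_t(Z) for a batch Z of shape (n, k), one sample per row."""
    H = np.maximum(W_down @ Z.T, 0.0)  # ReLU(W_down . Z^T), shape (r, n)
    return (W_up @ H).T                # back to shape (n, k)

def apply_surgery(Z, W_up, W_down):
    """Return the post-surgery representation Z - Phi_t(Z)."""
    return Z - surgery_bias(Z, W_up, W_down)

# Random stand-in for merged-model features (hypothetical data, not real model outputs).
Z_mtl = np.random.default_rng(1).normal(size=(4, 512))
W_up, W_down = init_surgery(k=512, r=16)
Z_hat = apply_surgery(Z_mtl, W_up, W_down)
print(Z_hat.shape)  # (4, 512)
```

With the assumed zero initialization of `W_up`, the module starts out as a no-op (`Z_hat` equals `Z_mtl`), so any change to the representation comes from training.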
Unsupervised Optimization Objective:
- The Surgery module's parameters $\theta_{\Phi_{t}} = \{W_{up}, W_{down}\}$ are learned by minimizing the $L_{1}$ distance between the post-surgery representation $\hat{Z}_{t}^{mtl} = Z_{t}^{mtl} - \Phi_{t}(Z_{t}^{mtl})$ and the individual model's representation $Z_{t}^{ind}$
- Objective: $\arg\min_{\{\theta_{\Phi_{1}}, \ldots, \theta_{\Phi_{T}}\}} \sum_{t=1}^{T} \frac{1}{|D_{te}^{t}|} \| \hat{Z}_{t}^{mtl} - Z_{t}^{ind} \|_{1}$
- Trained without labeled training data, using unlabeled test data and the individual models as the self-supervised signal
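As a rough illustration of this objective for a single task, the sketch below fits $W_{up}$ and $W_{down}$ on synthetic data with hand-written subgradients of the $L_{1}$ loss. Everything here is an assumption made for illustration: the toy data (the individual representation is generated from the merged one by a known adapter-shaped bias), the plain full-batch subgradient descent (the optimizer is not specified in this note), and all hyperparameters. Because each task has its own module, the sum over tasks splits into one such problem per task.

```python
import numpy as np

def l1_loss(Z_hat, Z_ind):
    """L1 distance between representations, averaged over the n samples."""
    return np.abs(Z_hat - Z_ind).sum() / Z_hat.shape[0]

def forward(Z_mtl, W_up, W_down):
    """Return pre-activation A, hidden H, and post-surgery Z_hat = Z - Phi(Z)."""
    A = Z_mtl @ W_down.T           # (n, r)
    H = np.maximum(A, 0.0)         # ReLU
    return A, H, Z_mtl - H @ W_up.T

def train_surgery(Z_mtl, Z_ind, r=4, lr=0.01, steps=500, seed=0):
    """Fit one task's module by subgradient descent (assumed optimizer)."""
    n, k = Z_mtl.shape
    rng = np.random.default_rng(seed)
    W_down = rng.normal(0.0, 0.1, size=(r, k))  # assumed init
    W_up = np.zeros((k, r))                     # assumed init: no-op at start
    losses = []
    for _ in range(steps):
        A, H, Z_hat = forward(Z_mtl, W_up, W_down)
        losses.append(l1_loss(Z_hat, Z_ind))
        G = np.sign(Z_hat - Z_ind) / n                   # dL/dZ_hat
        grad_W_up = -G.T @ H                             # (k, r)
        grad_W_down = ((-G @ W_up) * (A > 0)).T @ Z_mtl  # (r, k)
        W_up = W_up - lr * grad_W_up
        W_down = W_down - lr * grad_W_down
    losses.append(l1_loss(forward(Z_mtl, W_up, W_down)[2], Z_ind))
    return W_up, W_down, losses

# Synthetic stand-ins: Z_ind differs from Z_mtl by an adapter-shaped bias.
rng = np.random.default_rng(42)
Z_mtl = rng.normal(size=(256, 16))
D_true = 0.25 * rng.normal(size=(4, 16))
U_true = 0.5 * rng.normal(size=(16, 4))
Z_ind = Z_mtl - np.maximum(Z_mtl @ D_true.T, 0.0) @ U_true.T

W_up, W_down, losses = train_surgery(Z_mtl, Z_ind)
print(f"L1 loss before: {losses[0]:.3f}, after: {losses[-1]:.3f}")
```

No labels appear anywhere in the loop: the only supervision is `Z_ind`, which in the real setting would come from running the individual model on the same unlabeled test inputs.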
Properties:
- Orthogonal to existing merging methods: they focus on "seeking common ground" (Seek Common) in weight space, whereas Surgery "reserves differences" (Reserve Differences) in representation space and operates at the post-merging stage
- Lightweight: very few parameters are added (about 0.014% for ViT-B/32)
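The lightweight claim can be sanity-checked from the matrix shapes alone: one task's module has $2kr$ parameters. The sketch below assumes $k = 512$ (the image embedding size of CLIP ViT-B/32) and treats the ranks shown as example values. The note does not say which rank or which reference model size the 0.014% figure uses, so the last lines only back out what total size it would imply under the stated assumption of one module at $r = 16$.

```python
def surgery_params(k, r):
    """Parameters of one task's module: W_up (k x r) plus W_down (r x k)."""
    return k * r + r * k

K = 512  # assumed representation dimension (CLIP ViT-B/32 image embedding size)

for r in (4, 16, 64):
    print(r, surgery_params(K, r))
# 4 4096
# 16 16384
# 64 65536

# Assumption: the 0.014% figure refers to a single module at r = 16.
implied_total = surgery_params(K, 16) / (0.014 / 100)
print(round(implied_total / 1e6))  # 117 (million parameters)
```

The parameter count grows linearly with the rank, which is why rank acts as the capacity knob for the module.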

Method Validation

Representation bias reduction: Representation distributions are visualized and $L_{1}$ distances compared before and after Surgery, across merging methods and architectures
- Without Surgery, the representation distributions of the merged model (red) and the individual model (blue) differ; with Surgery, the two become much closer → Surgery corrects the merged model toward the individual model in representation space
- Comparing the $L_{1}$ distance with Surgery (red bars) and without (blue bars) → Surgery effectively reduces the gap between the merged and individual models' representation distributions
- Insight: Representation Surgery meaningfully mitigates representation bias, both visually and quantitatively
MTL performance (ViT-B/32): ViT-B/32 models for 8 CV tasks are merged with various merging methods, and performance is compared before and after Surgery
- Baselines: Pretrained, Individual, Traditional MTL, Weight Averaging, Fisher Merging, RegMean, Task Arithmetic, Ties-Merging, Concrete TA, Concrete AM, AdaMerging
- Results (Avg. Accuracy):
  - Weight Averaging: 65.8% → 80.0% (w/ Surgery, +14.2 pp)
  - Task Arithmetic: 69.1% → 80.9% (w/ Surgery, +11.8 pp)
  - Ties-Merging: 72.9% → 83.1% (w/ Surgery, +10.2 pp)
  - AdaMerging (SOTA): 80.1% → 86.1% (w/ Surgery, +6.0 pp) / 87.5% (rank=64, +7.4 pp)
- Insight: Surgery consistently improves existing merging methods, and combined with the SOTA method it reaches accuracy close to Traditional MTL
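The percentage-point gains listed above can be re-derived from the before/after accuracies. The short sketch below uses only the ViT-B/32 numbers quoted in this note; the dictionary layout and variable names are ad hoc choices for this example.

```python
# (Avg. Accuracy without Surgery, with Surgery) on ViT-B/32, as listed in this note.
results = {
    "Weight Averaging": (65.8, 80.0),
    "Task Arithmetic": (69.1, 80.9),
    "Ties-Merging": (72.9, 83.1),
    "AdaMerging": (80.1, 86.1),
}

gains = {name: round(after - before, 1) for name, (before, after) in results.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain:.1f} pp")
# Weight Averaging: +14.2 pp
# Task Arithmetic: +11.8 pp
# Ties-Merging: +10.2 pp
# AdaMerging: +6.0 pp
```

Read this way, the gain shrinks as the baseline gets stronger, which fits the Motivation observation that stronger merging methods already carry less representation bias.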
MTL performance (ViT-L/14 & ViT-B/16): Performance is compared before and after Surgery on a larger (ViT-L/14) and a mid-sized (ViT-B/16) architecture
- Results (ViT-L/14, AdaMerging Avg. Accuracy): 90.8% → 92.3% (w/ Surgery, +1.5 pp)
- Results (ViT-B/16, AdaMerging Avg. Accuracy): 84.9% → 88.8% (w/ Surgery, +3.9 pp)
- Insight: Surgery's benefit is not tied to one model size → the improvement appears consistently across model sizes
Effect of Surgery module rank: Avg. Accuracy of Task Arithmetic and AdaMerging as the Surgery module's rank ( $r$ ) varies (ViT-B/32)
- Results (AdaMerging Avg. Accuracy): Accuracy rises steadily from rank 4 (83.5%) to rank 64 (87.5%) → the larger the Surgery module's capacity, the larger the improvement
- Insight: Rank trades added parameters against MTL performance, so it can be raised when higher accuracy is needed
Effect of training iterations: Avg. Accuracy of 4 merging methods (w/ Surgery) as the number of Surgery training iterations increases (ViT-B/32)
- Results: Accuracy improves sharply early in training (around 200 iterations) and then plateaus → a substantial share of the benefit comes at low training cost
- Insight: Surgery trains efficiently and improves performance quickly
