Steering CLIP's vision transformer with sparse autoencoders

Link
Abstract

While vision models are highly capable, their internal mechanisms remain poorly understood -- a challenge which sparse autoencoders (SAEs) have helped address in language, but which remains underexplored in vision. We address this gap by training SAEs on CLIP's vision transformer and uncover key differences between vision and language processing, including distinct sparsity patterns for SAEs trained across layers and token types. We then provide the first systematic analysis of the steerability of CLIP's vision transformer, introducing metrics to quantify how precisely SAE features can be steered to affect the model's output. We find that 10-15% of neurons and features are steerable, with SAEs providing thousands more steerable features than the base model. Through targeted suppression of SAE features, we then demonstrate improved performance on three vision disentanglement tasks (CelebA, Waterbirds, and typographic attacks), finding optimal disentanglement in middle model layers, and achieving state-of-the-art performance on defense against typographic attacks.

Synth

Problem:: The internal workings of vision transformers are poorly understood / interpretability techniques used on language models (SAEs) are still at an early stage in the vision domain

Solution:: Train sparse autoencoders (SAEs) on CLIP-ViT activations to decompose them into interpretable features / introduce a 'steerability' metric that quantifies the effect of manipulating a feature / suppress identified spurious features to improve performance on disentanglement tasks
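The SAE piece of the solution can be sketched minimally. The dimensions (768-d activations, 4x expansion), the ReLU encoder with a subtracted decoder bias, and the MSE + L1 objective are assumptions matching common SAE practice, not the paper's exact architecture; the training loop itself is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 768, 3072           # hypothetical: CLIP-ViT hidden size, 4x expansion

# Randomly initialized SAE parameters (a trained SAE would learn these)
W_enc = rng.normal(0, 0.02, (d_sae, d_model))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0, 0.02, (d_model, d_sae))
b_dec = np.zeros(d_model)

def sae_encode(x):
    """Sparse feature activations: ReLU((x - b_dec) @ W_enc.T + b_enc)."""
    return np.maximum(0.0, (x - b_dec) @ W_enc.T + b_enc)

def sae_decode(f):
    """Reconstruct the activation as a linear combination of decoder directions."""
    return f @ W_dec.T + b_dec

x = rng.normal(0, 1, d_model)        # stand-in for one CLIP token activation
f = sae_encode(x)                    # overcomplete, non-negative feature vector
x_hat = sae_decode(f)

# The usual training objective (not optimized here): reconstruction + sparsity
mse = ((x - x_hat) ** 2).mean()
l1 = np.abs(f).sum()
```

Sparsity patterns across layers and token types (CLS vs. patch tokens) are then read off from how many entries of `f` are active per input.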

Novelty:: First systematic, quantitative steerability analysis of CLIP-ViT features / identifies fundamental differences in sparsity patterns between vision and language models / achieves SOTA defense against typographic attacks via SAE feature steering

Note:: Only about 10-15% of features in later layers are steerable / the optimal disentanglement layer depends on the type of spurious feature (background: early layers; hair color etc.: later layers)
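Steering a feature amounts to adding its decoder direction to the model's residual-stream activation with some coefficient. A minimal sketch below; `alpha`, the unit-normalization, and the feature index are illustrative choices, not the paper's exact steering procedure or steerability metric.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 768, 3072
W_dec = rng.normal(0, 0.02, (d_model, d_sae))  # hypothetical trained SAE decoder

def steer(x, feature_idx, alpha):
    """Push the activation along one SAE feature's decoder direction."""
    direction = W_dec[:, feature_idx]
    return x + alpha * direction / np.linalg.norm(direction)

x = rng.normal(0, 1, d_model)                  # stand-in CLIP activation
x_steered = steer(x, feature_idx=42, alpha=8.0)

# A feature counts as "steerable" when sweeping alpha moves the model's
# downstream output (e.g. zero-shot logits) in a consistent, targeted way;
# the paper reports only ~10-15% of features passing such a test.
```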

Summary

Motivation

Presents an overview of the approach: improving gender classification on the CelebA dataset by suppressing the blondeness feature.
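The suppression itself can be sketched as ablating the spurious features in SAE space while passing the SAE's reconstruction error through unchanged, so only the targeted features are removed. The indices and the error-preservation trick are assumptions about the mechanics; the paper identifies which features encode blondeness empirically.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 768, 3072
W_enc = rng.normal(0, 0.02, (d_sae, d_model))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0, 0.02, (d_model, d_sae))
b_dec = np.zeros(d_model)

def encode(x):
    return np.maximum(0.0, (x - b_dec) @ W_enc.T + b_enc)

def decode(f):
    return f @ W_dec.T + b_dec

def suppress(x, spurious_idx):
    """Zero out spurious SAE features; keep the reconstruction error intact."""
    f = encode(x)
    err = x - decode(f)              # what the SAE fails to capture, kept as-is
    f[list(spurious_idx)] = 0.0      # ablate e.g. the "blondeness" features
    return decode(f) + err

x = rng.normal(0, 1, d_model)
x_clean = suppress(x, spurious_idx=[7, 99])   # hypothetical feature indices
```

The downstream classifier then runs on `x_clean` instead of `x`; with an empty index list the activation passes through unchanged.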

Method

Method validation