Scaling and evaluating sparse autoencoders

Link
Abstract

Sparse autoencoders provide a promising unsupervised approach for extracting interpretable features from a language model by reconstructing activations from a sparse bottleneck layer. Since language models learn many concepts, autoencoders need to be very large to recover all relevant features. However, studying the properties of autoencoder scaling is difficult due to the need to balance reconstruction and sparsity objectives and the presence of dead latents. We propose using k-sparse autoencoders [Makhzani and Frey, 2013] to directly control sparsity, simplifying tuning and improving the reconstruction-sparsity frontier. Additionally, we find modifications that result in few dead latents, even at the largest scales we tried. Using these techniques, we find clean scaling laws with respect to autoencoder size and sparsity. We also introduce several new metrics for evaluating feature quality based on the recovery of hypothesized features, the explainability of activation patterns, and the sparsity of downstream effects. These metrics all generally improve with autoencoder size. To demonstrate the scalability of our approach, we train a 16 million latent autoencoder on GPT-4 activations for 40 billion tokens. We release training code and autoencoders for open-source models, as well as a visualizer.

Synth

Problem:: Existing sparse autoencoders (SAEs) are hard to scale / balancing the reconstruction and sparsity objectives is difficult, and dead latents arise

Solution:: Use a TopK activation function instead of an L1 penalty to control sparsity directly / introduce encoder-decoder weight initialization and an auxiliary loss to prevent dead latents
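The TopK idea above can be sketched as a forward pass: encoder pre-activations are computed, and only the k largest are kept, so sparsity is exactly k by construction rather than tuned via an L1 coefficient. This is an illustrative sketch (names and shapes are my own, not the authors' code), assuming the architecture described in the paper: encode as TopK(W_enc(x - b_pre)), decode as W_dec z + b_pre.

```python
import numpy as np

def topk_sae_forward(x, W_enc, W_dec, b_pre, k):
    """Sketch of a TopK SAE forward pass.

    x: (d_model,) model activation; returns (reconstruction, sparse latents).
    """
    # Encode: subtract the pre-bias, project up to n_latents.
    pre = W_enc @ (x - b_pre)                   # (n_latents,)
    # Keep only the k largest pre-activations; everything else is exactly zero.
    z = np.zeros_like(pre)
    top_idx = np.argpartition(pre, -k)[-k:]     # indices of the k largest entries
    z[top_idx] = np.maximum(pre[top_idx], 0.0)  # ReLU on the kept values
    # Decode: project back down and re-add the pre-bias.
    x_hat = W_dec @ z + b_pre                   # (d_model,)
    return x_hat, z
```

Because at most k latents are nonzero per token, there is no sparsity penalty to tune and no pressure to shrink the kept activation values.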

Novelty:: Presents a method for stably training large-scale SAEs by combining the TopK activation function with dead-latent prevention techniques / proposes and validates new qualitative evaluation metrics beyond reconstruction loss, such as downstream loss and probe loss

Note:: Larger autoencoders generally learn higher-quality features / learned features have far sparser and more interpretable effects than the base model's channels (ablation effects: 10-14% vs. 60%) / TopK avoids the activation shrinkage that is a side effect of the L1 penalty

Summary

Motivation

Method

Method validation

Experiment 1: Scaling Laws

file-20250620051713177.png|775

L(C) and L(N, K)
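As a reminder of what these fits mean, the scaling laws here relate reconstruction loss to compute C, number of latents N, and active latents K. A plausible sketch of the form of such a fit (my notation; the paper's fitted coefficients are not reproduced here):

```latex
% Assumed power-law form at fixed sparsity K:
%   N = number of latents, e = irreducible error floor.
L(N) \approx c \, N^{-\alpha} + e,
% and an analogous joint fit over both N and K yields L(N, K).
```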

L(N) and Subject Model Size

Experiment 2: Downstream Loss

file-20250620052009566.png|850

Left: performance with the number of latents fixed; right: with the sparsity level fixed.

Experiment 3: Probe Loss (Feature Recovery)

file-20250620052317020.png|500

Experiment 4: Explainability (N2G)

Experiment 5: Sparsity of Ablation Effects

file-20250620052442458.png|500

Further Analysis: TopK Activation Function

Activation Shrinkage

file-20250620053819544.png

Comparison with other activation functions

file-20250620053927475.png

Left: performance with the number of latents fixed; right: with the sparsity level fixed.

Progressive Recovery
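Progressive recovery asks whether reconstruction keeps improving as more latents are allowed at test time. A hedged sketch of such a check (hypothetical helpers, not the authors' code): decode using only the k largest-magnitude latents for a range of k values and record the error; a "progressive" code should give a non-increasing error curve.

```python
import numpy as np

def reconstruction_error_at_k(x, z, W_dec, b_pre, k_test):
    """MSE when decoding only the k_test largest-magnitude latents of z."""
    z_trunc = np.zeros_like(z)
    idx = np.argpartition(np.abs(z), -k_test)[-k_test:]
    z_trunc[idx] = z[idx]                 # keep only the k_test strongest latents
    x_hat = W_dec @ z_trunc + b_pre
    return float(np.mean((x - x_hat) ** 2))

def progressive_recovery_curve(x, z, W_dec, b_pre, ks):
    """Error as a function of the test-time latent budget."""
    return [reconstruction_error_at_k(x, z, W_dec, b_pre, k) for k in ks]
```

Plotting this curve for a trained SAE shows whether using more latents at test time than at training time continues to reduce error.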

file-20250620054021328.png|625