Why Warmup the Learning Rate? Underlying Mechanisms and Improvements

Link
Abstract

It is common in deep learning to warm up the learning rate η, often by a linear schedule between ηinit=0 and a predetermined target ηtrgt. In this paper, we show through systematic experiments using SGD and Adam that the overwhelming benefit of warmup arises from allowing the network to tolerate larger ηtrgt by forcing the network to more well-conditioned areas of the loss landscape. The ability to handle larger ηtrgt makes hyperparameter tuning more robust while improving the final performance. We uncover different regimes of operation during the warmup period, depending on whether training starts off in a progressive sharpening or sharpness reduction phase, which in turn depends on the initialization and parameterization. Using these insights, we show how ηinit can be properly chosen by utilizing the loss catapult mechanism, which saves on the number of warmup steps, in some cases completely eliminating the need for warmup. We also suggest an initialization for the variance in Adam which provides benefits similar to warmup.
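The linear warmup schedule described in the abstract (ramping from ηinit to ηtrgt over Twrm steps) can be sketched minimally as follows; the function name and argument names are my own, not from the paper:

```python
def linear_warmup_lr(step, eta_init, eta_trgt, t_wrm):
    """Linearly interpolate the learning rate from eta_init to eta_trgt
    over t_wrm warmup steps, then hold it constant at eta_trgt."""
    if step >= t_wrm:
        return eta_trgt
    return eta_init + (eta_trgt - eta_init) * step / t_wrm
```

Standard warmup corresponds to `eta_init=0`; the paper's proposal amounts to replacing that choice with an estimate of the instability threshold ηc.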

Synth

Problem:: The standard warmup scheme (ηinit=0) can be inefficient / Adam is unstable at initialization due to a high initial pre-conditioned sharpness (λP⁻¹H), and without warmup it is prone to training failure or degraded performance / The warmup duration (Twrm) is an additional hyperparameter to tune

Solution:: Analyzes the warmup mechanism in detail through the lens of (pre-conditioned) sharpness and the catapult effect / Proposes an improved initial learning-rate choice: estimate the initial instability threshold ηc and set ηinit=ηc / Proposes GI-Adam, which initializes Adam's second moment as v0=g0² / Proposes persistent catapult warmup, a parameter-free scheme that repeatedly re-estimates ηc to induce catapults
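A minimal scalar sketch of the GI-Adam idea (second moment initialized as v0=g0² rather than 0), assuming the standard Adam update with bias correction is kept unchanged; variable names are mine:

```python
import math

def gi_adam_init(g0):
    # GI-Adam-style initialization (sketch): the second moment starts at
    # the squared initial gradient, v0 = g0**2, instead of v0 = 0.
    return 0.0, g0 * g0  # (m0, v0)

def adam_step(param, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # One standard (scalar) Adam update with bias correction; t starts at 1.
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v
```

With the usual v0=0, the first bias-corrected step has magnitude ≈ lr; with v0=g0² it shrinks by roughly a factor of sqrt(1−β2) ≈ 0.03, which is the warmup-like effect the initialization provides.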

Novelty:: Shows that the main benefit of warmup is enabling a larger ηtrgt by reducing the (pre-conditioned) sharpness / Identifies the root cause of Adam's initial instability as a high initial pre-conditioned sharpness (λP⁻¹H), independent of the sharpness λH
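To make the pre-conditioned sharpness λP⁻¹H concrete: it is the top eigenvalue of P⁻¹H, where P is the optimizer's preconditioner (for Adam, roughly diag(sqrt(v))+ε). A toy illustration with an assumed diagonal Hessian and assumed second-moment values, using plain power iteration:

```python
import random

def power_iteration(matvec, dim, iters=200, seed=0):
    """Estimate the dominant eigenvalue of a linear operator via power
    iteration, normalizing by the max-abs entry each step."""
    rng = random.Random(seed)
    x = [rng.random() + 0.1 for _ in range(dim)]
    lam = 0.0
    for _ in range(iters):
        y = matvec(x)
        lam = max(abs(c) for c in y)
        x = [c / lam for c in y]
    return lam

# Toy diagonal Hessian H and Adam-style preconditioner P = sqrt(v) + eps
# (all numbers are assumed for illustration, not from the paper).
h_diag = [200.0, 2.0]   # eigenvalues of H, so the sharpness λH = 200
v_adam = [400.0, 1.0]   # Adam second-moment estimate
eps = 1e-8
p_diag = [vi ** 0.5 + eps for vi in v_adam]

# Pre-conditioned sharpness λP⁻¹H: top eigenvalue of P⁻¹H.
lam_pre = power_iteration(
    lambda x: [h / p * xi for h, p, xi in zip(h_diag, p_diag, x)], dim=2)
```

Here λH = 200 but λP⁻¹H = 10, showing that the two quantities can differ greatly, which is why the paper tracks the pre-conditioned sharpness rather than λH when analyzing Adam's stability.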

Note:: A paper that helped clarify the previously murky role of warmup in training

Summary

Motivation

Row 1: large initialization; row 2: small initialization. The dashed line in column 2 marks the η>ηc boundary, which is generally the inverse of λH.

![[Figure3.excalidraw.png|575]]

Shows test performance on the ηtrgt–Twrm plane.

Method

Method Validation