Schedule-Free Optimizer: Does It Work for LLMs?
Experiments with Llama 3.2: schedule-free vs. standard AdamW
In machine learning, the learning rate determines the size of the steps the model takes to minimize error during training. A high learning rate can cause instability by overshooting the optimal solution, while a low rate may lead to slow convergence or getting stuck in suboptimal solutions. That’s why a learning rate schedule is typically used to adjust the rate over the course of training.
Early in training, a higher learning rate helps the model learn quickly and capture general patterns. As training progresses, the rate is reduced so the model can refine its parameters and converge to a good solution. Common schedules include step decay, exponential decay, and adaptive methods, each designed to make training both efficient and accurate. Key hyperparameters include the schedule type (e.g., linear or cosine), the number of warmup steps, and the weight decay, all of which often require careful tuning.
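To make this concrete, here is a minimal sketch of how AdamW is typically paired with a cosine schedule and warmup, using PyTorch and Hugging Face Transformers. The toy model, learning rate, warmup steps, and step count are placeholder values chosen purely for illustration:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

# Toy model and data standing in for a real network and dataset
model = torch.nn.Linear(10, 1)
x, y = torch.randn(64, 10), torch.randn(64, 1)

# AdamW with a peak learning rate and weight decay
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

# Cosine schedule with warmup: the rate ramps up linearly for the first
# 100 steps, then decays toward zero following a cosine curve
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=1000
)

for step in range(1000):
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()  # the schedule advances once per optimizer step
    optimizer.zero_grad()
```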
Although these schedules generally yield satisfactory results, they can still be suboptimal. Schedule-free alternatives exist for popular optimizers but remain underexplored for fine-tuning large language models (LLMs).
In this article, we will first explore how schedule-free optimizers work and why, in theory, they can outperform their schedule-based counterparts. Next, we will experiment with a schedule-free AdamW in a common training scenario: fine-tuning Llama 3.2 with LoRA for chat applications. We will show that while schedule-free AdamW can indeed surpass the performance of traditional AdamW for LLM fine-tuning, it also comes with certain drawbacks that can make it difficult to use in some configurations.
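As a preview of the setup we will evaluate, here is a minimal sketch of what swapping standard AdamW for its schedule-free variant looks like in code, assuming the schedulefree package from the official schedule_free repository. The toy model and hyperparameter values are placeholders; the key differences are that no scheduler object is created and that the optimizer itself must be switched between training and evaluation modes:

```python
import torch
import schedulefree  # pip install schedulefree

# Toy model and data standing in for the LoRA-wrapped LLM (placeholders)
model = torch.nn.Linear(10, 1)
x, y = torch.randn(64, 10), torch.randn(64, 1)

# Schedule-free AdamW: no learning rate scheduler is needed; an optional
# internal warmup is controlled by warmup_steps
optimizer = schedulefree.AdamWScheduleFree(
    model.parameters(), lr=1e-4, warmup_steps=100, weight_decay=0.01
)

optimizer.train()  # the optimizer itself must be put in training mode
for step in range(1000):
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

optimizer.eval()  # and switched to eval mode before evaluation or saving
```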
I have implemented schedule-free QLoRA and LoRA fine-tuning for Llama 3.2 in the following notebook:
Schedule-Free Optimizer: How Does It Work?
In AdamW, learning rate schedules like cosine decay are commonly used to improve convergence. Here is an example from the first chapter of my book showing the evolution of the learning rate with different schedules: