The Kaitchup – AI on a Budget

Schedule-Free Optimizer: Does It Work for LLMs?

Experiments with Llama 3.2: schedule-free vs. standard AdamW

Benjamin Marie
Dec 16, 2024

In machine learning, the learning rate determines the size of steps the model takes to minimize error during training. A high learning rate can cause instability by overshooting the optimal solution, while a low rate may lead to slow convergence or getting stuck in suboptimal solutions. That’s why a learning rate schedule is typically used to adjust the rate over time during training.
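Concretely, in plain gradient descent the learning rate $\eta$ scales every parameter update:

$$\theta_{t+1} = \theta_t - \eta \,\nabla_\theta \mathcal{L}(\theta_t)$$

The larger $\eta$ is, the bigger the step taken at each iteration, which is why both its value and how it changes over the course of training matter so much.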

Early in training, a higher learning rate helps the model learn quickly and capture general patterns. As training progresses, the rate is reduced to fine-tune and converge to an optimal solution. Common schedules include step decay, exponential decay, and adaptive methods, each aiding in efficient and accurate training. Key hyperparameters include schedule type (e.g., linear or cosine), warmup steps, and weight decay, which often require careful tuning.
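To make this concrete, here is a minimal sketch of a cosine schedule with warmup using the get_cosine_schedule_with_warmup helper from transformers; the model, learning rate, and step counts are placeholders, not values from the experiments below.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

# Placeholder model and data; in practice this would be the LLM being fine-tuned.
model = torch.nn.Linear(10, 10)
inputs, targets = torch.randn(8, 10), torch.randn(8, 10)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

# Cosine schedule: the learning rate warms up linearly for 100 steps,
# then decays along a cosine curve toward zero at step 1,000.
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=1_000
)

for step in range(1_000):
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    loss.backward()
    optimizer.step()
    scheduler.step()          # update the learning rate after each optimizer step
    optimizer.zero_grad()
```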

Although these schedules generally yield satisfactory results, they can still be suboptimal. Schedule-free alternatives exist for popular optimizers but remain underexplored for fine-tuning large language models (LLMs).
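The main implementation is the schedulefree package (pip install schedulefree) released by the method's authors. Below is a minimal sketch of how its AdamWScheduleFree optimizer slots into a training loop; the model and hyperparameters are placeholders. One practical difference from standard AdamW is that the optimizer must be explicitly switched between training and evaluation modes.

```python
import torch
import schedulefree  # pip install schedulefree

# Placeholder model and data standing in for an LLM and its batches.
model = torch.nn.Linear(10, 10)
inputs, targets = torch.randn(8, 10), torch.randn(8, 10)

# No learning rate scheduler is attached: warmup is handled by the optimizer itself.
optimizer = schedulefree.AdamWScheduleFree(
    model.parameters(), lr=1e-3, warmup_steps=100, weight_decay=0.01
)

optimizer.train()  # required: the optimizer keeps separate train/eval parameter states
for step in range(1_000):
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

optimizer.eval()   # required before evaluation or before saving a checkpoint
```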


In this article, we will first explore how schedule-free optimizers work and why, in theory, they can outperform their schedule-based counterparts. Next, we will experiment with a schedule-free AdamW in a common training scenario: fine-tuning Llama 3.2 with LoRA for chat applications. We will show that while schedule-free AdamW can indeed surpass the performance of traditional AdamW for LLM fine-tuning, it also comes with certain drawbacks that can make it difficult to use in some configurations.
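For a sense of what such a run can look like, the sketch below shows one way to request schedule-free AdamW through transformers' TrainingArguments together with a peft LoRA configuration. The optim value "schedule_free_adamw" requires the schedulefree package to be installed, and all hyperparameters here are illustrative assumptions, not the exact settings used in the notebook.

```python
from peft import LoraConfig
from transformers import TrainingArguments

# Illustrative LoRA setup (rank and target modules are assumptions, not the notebook's values).
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# With a schedule-free optimizer there is no decay schedule to configure:
# the scheduler is set to constant and the optimizer handles the rest.
training_args = TrainingArguments(
    output_dir="./llama3.2-schedule-free-lora",
    per_device_train_batch_size=8,
    learning_rate=1e-4,
    num_train_epochs=1,
    optim="schedule_free_adamw",   # requires: pip install schedulefree
    lr_scheduler_type="constant",  # no decay; schedule-free replaces it
)
```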

I have implemented schedule-free QLoRA and LoRA fine-tuning for Llama 3.2 in the following notebook:

Get the notebook (#130)

Schedule-Free Optimizer: How Does It Work?

In AdamW, learning rate schedules like cosine decay are commonly used to improve convergence. The first chapter of my book includes an example showing how the learning rate evolves under different schedules.
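A quick way to visualize this yourself is to drive a dummy optimizer through transformers' get_scheduler helper and plot the resulting learning rates; the warmup length, step count, and schedule names below are placeholders.

```python
import torch
import matplotlib.pyplot as plt
from transformers import get_scheduler

total_steps, warmup_steps = 1_000, 100

plt.figure()
for name in ["linear", "cosine", "constant_with_warmup"]:
    # A single dummy parameter is enough to drive the scheduler.
    optimizer = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=1e-4)
    scheduler = get_scheduler(
        name, optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps
    )
    lrs = []
    for _ in range(total_steps):
        lrs.append(scheduler.get_last_lr()[0])
        optimizer.step()
        scheduler.step()
    plt.plot(lrs, label=name)

plt.xlabel("Training step")
plt.ylabel("Learning rate")
plt.legend()
plt.show()
```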
