The Weekly Kaitchup #13
linear scheduler - 8-bit training - Undo LLMs' safety training - Google Colab's storage for secret keys
Hi Everyone,
In this edition of The Weekly Kaitchup:
Use a linear learning rate scheduler
8-bit training with Microsoft’s MS-AMP
Undoing the safety training of Llama 2
Google Colab’s new storage of secret keys
Next week, I’ll write again about the Mistral and Zephyr models. Now that Zephyr Beta is out, we will see what makes it so good and how we can fine-tune a similar model.
The Kaitchup now has 953 subscribers. This was The Kaitchup’s fastest-growing week yet. The 1,000-subscriber milestone is not far off! Thanks a lot for your support!
If you are a free subscriber, consider upgrading to paid to access all the notebooks and articles. There is a 7-day trial that you can cancel anytime.
If you are a monthly paid subscriber, switch to a yearly subscription to get a 17% discount (2 months free)!
Use a “Linear” Learning Rate Scheduler!
In my recent notebooks and tutorials, I have set the learning rate scheduler to “linear” mainly because it seems to perform better than the other popular schedulers (e.g., “cosine” and “constant”).
I also noticed that more and more papers and tutorials tend to use linear schedulers.
And now, thanks to recent work co-authored by researchers from Google, Meta, Samsung, and Boston University, we have a paper to point to that justifies the use of a linear scheduler:
When, Why and How Much? Adaptive Learning Rate Scheduling by Refinement
They show that a linear scheduler outperforms, or performs similarly to, the other types of schedulers.
What convinced me the most in the paper’s tables is that they tried very different types of models from various areas and sub-areas: computer vision, NLP, natural language generation, machine translation, and more. A linear scheduler works best for all of them.
Conclusion: We have one fewer hyperparameter to search when fine-tuning: the learning rate scheduler. Set it to “linear”.
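With Hugging Face’s Trainer, for instance, this is a single argument. Here is a minimal sketch; everything other than lr_scheduler_type is a placeholder value, not a recommendation:
from transformers import TrainingArguments

# Only lr_scheduler_type matters here; the other hyperparameters are placeholders
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-4,
    lr_scheduler_type="linear",  # linear decay of the learning rate
    warmup_ratio=0.03,           # optional short warmup before the decay
    num_train_epochs=3,
    per_device_train_batch_size=4,
)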
8-bit Training with Microsoft’s MS-AMP
Microsoft has presented its automatic mixed precision package for deep learning, MS-AMP.
With it, we can train LLMs using (almost) exclusively the FP8 data type (8-bit precision). MS-AMP has three different levels of optimization (O1, O2, and O3) that gradually introduce FP8 in the different training components: weights, gradients, and optimizer states.
Note that with O3, only the first-order states of Adam/AdamW are in FP8. The second-order states must remain at a higher precision, FP16, to avoid any loss in accuracy. This still results in a 62.5% reduction in the GPU memory consumed by the optimizer states. I wonder how it compares to the paged 8-bit Adam introduced with QLoRA. I assume MS-AMP is faster (because it seems there isn’t any CPU paging) but consumes more VRAM.
You can find the description of MS-AMP and an evaluation using an FP8 GPT-3-like model in this paper:
FP8-LM: Training FP8 Large Language Models
MS-AMP can be downloaded from GitHub, but note that the documentation is rather shallow for now.
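From my reading of the GitHub README, wrapping an existing model and optimizer looks roughly like the sketch below (the exact API and opt_level behavior are my assumptions based on that README, and FP8 requires a recent GPU such as an H100):
import torch
import msamp  # Microsoft's MS-AMP package, installed from its GitHub repository

# A toy model and optimizer; any PyTorch model with an Adam/AdamW optimizer should work
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Wrap both with MS-AMP. "O1" keeps weights and gradients in FP8,
# "O2" also moves the optimizer states to lower precision.
model, optimizer = msamp.initialize(model, optimizer, opt_level="O2")

# The training loop itself is unchanged: forward, backward, optimizer.step()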
Undoing Safety Training of Llama 2 Chat Models
If you have used Llama 2 Chat models, you might have noticed that Meta fine-tuned them to refuse to answer potentially harmful questions/instructions. This safety training is supposed to prevent harmful uses of the model.
In the case of Llama 2, I personally think Meta didn’t do a great job with this safety training. Very often, the Llama 2 chat models refuse to answer prompts that look completely safe to me.
For instance, when I tested the Llama 2 chat models for translation, they refused to translate some sentences that contained the names of political figures, even though the content of the sentences themselves was not political.
Recent work found that undoing this safety training is quite cheap and easy while preserving the performance of the model.
LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B
Basically, we just need to fine-tune LoRA adapters on top of Llama 2 Chat to exploit its full potential.
Google Colab Can Now Store Your HF Access Token
I use Google Colab and Hugging Face libraries. Every time I want to download a gated model (e.g., Llama 2) from the Hugging Face Hub, I have to enter my access token. Usually, I run a cell with this code:
from huggingface_hub import notebook_login
notebook_login()
Then, I enter my token, which I paste from another open tab. There are alternatives, but none have been completely satisfying.
Starting this week, Google Colab can store secrets: private key/value pairs (such as an HF access token) attached to your account instead of being written in the notebook.
It seems easy to set up and keeps everything private. I’ll slowly transition toward this in my next notebooks.
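For reference, reading a stored secret from a notebook looks like this (a minimal sketch; "HF_TOKEN" is simply the name I would give the secret in Colab’s Secrets panel):
from google.colab import userdata  # Colab's secrets storage
from huggingface_hub import login

# Read the token stored under the name "HF_TOKEN" in the Secrets panel
hf_token = userdata.get("HF_TOKEN")

# Log in to the Hugging Face Hub without pasting the token by hand
login(token=hf_token)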
What To Read On Substack
Nothing for this week. Don’t hesitate to recommend articles in the comments.
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers:
Have a nice weekend!