Hi Everyone,
In this edition of The Weekly Kaitchup:
LoRA-the-Explorer: Pre-training LLMs from Scratch with LoRA
Accurate 1.5-bit LLMs
gpt-fast with Hugging Face Transformers
The Kaitchup now has 2,245 subscribers. Thanks a lot for your support!
If you are a free subscriber, consider upgrading to paid to access all the notebooks and articles. There is a 7-day trial that you can cancel anytime.
Evergreen Kaitchup
There are now nearly 50 notebooks listed on the AI Notebooks page of The Kaitchup. I think it’s a good time to start refreshing them regularly, as most of them should remain relevant for many more months, if not years.
In this new section of The Weekly Kaitchup, I’ll mention which notebook I have checked and updated, with a brief description of what I have done.
Since Microsoft has recently made many modifications to Phi-2’s code, I have updated the following notebook to optimize it and make sure it still works:
#35 Phi-2: Fine-tuning, quantization, and inference
To sum up, I removed PyTorch’s autocast for inference, updated the support for FlashAttention-2 (it only works with Ampere and more recent GPUs), changed the target modules for fine-tuning with LoRA, removed the special model revision that gradient checkpointing used to require, and made a few other small modifications.
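As a quick reference, here is a minimal sketch of what the refreshed setup can look like with Hugging Face Transformers and PEFT. The hyperparameters and the LoRA target modules listed below are my assumptions for illustration, not necessarily what the notebook uses; check the notebook for the definitive configuration.

```python
# Minimal sketch: loading Phi-2 with FlashAttention-2 and a LoRA config.
# The target_modules below are an assumption for the updated Phi-2 code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "microsoft/phi-2"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # Ampere or more recent GPUs only
    device_map="auto",
)

# Gradient checkpointing now works with the default model revision
model.gradient_checkpointing_enable()

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "dense", "fc1", "fc2"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```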
The article has also been updated to reflect these changes:
LoRA-the-Explorer: Pre-training LLMs from Scratch with LoRA
This is not the first time that a paper has proposed LoRA to pre-train LLMs from scratch. Last year, I presented ReLoRA:
LoRA-the-Explorer (LTE) uses a very different approach.
Training Neural Networks from Scratch with Parallel Low-Rank Adapters
LTE is a method for collaborative model training. Initially, every device (i.e., GPU) is equipped with a unique set of LoRA parameters. As training progresses, each device works on a distinct segment of the dataset, utilizing a local optimizer for adjustments.
Periodically, after every T iterations, the devices exchange their LoRA parameters with each other or with a central parameter server. These parameters are then averaged to update the main parameters of the model. The authors draw a parallel with the commit/merge workflow of git repositories, where the repository would be the model and the commits the parameter updates. The refreshed main parameters are then sent back to the devices, and the cycle repeats.
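Here is a minimal sketch of what that periodic merge step could look like (not the authors’ implementation), assuming each device’s LoRA update is B @ A and the merge is a plain average folded into the main weights:

```python
# Sketch of an LTE-style merge step. Each "device" holds its own LoRA factors
# (A_i, B_i) for a frozen weight W. Every T iterations, the averaged low-rank
# updates are folded into W, then the adapters are re-initialized.
import torch

def merge_lora_heads(W: torch.Tensor,
                     loras: list[tuple[torch.Tensor, torch.Tensor]],
                     scaling: float = 1.0) -> torch.Tensor:
    """Average the B @ A updates from all devices into the main weights."""
    delta = torch.zeros_like(W)
    for A, B in loras:            # A: (r, in_features), B: (out_features, r)
        delta += scaling * (B @ A)
    return W + delta / len(loras)

# Toy example: 4 devices, rank-8 adapters for a 1024x1024 weight matrix
W = torch.randn(1024, 1024)
loras = [(torch.randn(8, 1024) * 0.01, torch.zeros(1024, 8)) for _ in range(4)]
W = merge_lora_heads(W, loras)    # called every T iterations, adapters then reset
```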
When using a high rank (r) and multiple heads, the method seems to perform on par with standard pre-training:
Accurate 1.5-bit LLMs
Last week, I presented AQLM, which can quantize LLMs to 2-bit with high accuracy.
This week, Microsoft presented impressive work: training LLMs with 1.58-bit parameters.
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
The paper presents BitNet b1.58, a 1-bit LLM variant characterized by its ternary parameters {-1, 0, 1}. The model matches its FP16 counterpart in both perplexity and task-specific performance, for the same model dimensions and training data volume.
BitNet b1.58 also offers substantial improvements in latency, memory usage, throughput, and energy efficiency.
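For intuition, here is a rough sketch of the absmean ternary quantization as I understand it from the paper; this is illustrative only, not released code (there is none):

```python
# Rough sketch of absmean ternary quantization: scale weights by the mean
# absolute value, then round and clip to {-1, 0, 1}. Illustrative only.
import torch

def quantize_ternary(w: torch.Tensor, eps: float = 1e-5):
    gamma = w.abs().mean()                         # absmean scale
    w_q = (w / (gamma + eps)).round().clamp_(-1, 1)
    return w_q, gamma                              # gamma rescales outputs later

w = torch.randn(4096, 4096)
w_q, gamma = quantize_ternary(w)
print(torch.unique(w_q))                           # tensor([-1., 0., 1.])
```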
What’s the catch?
This paper only reports perplexity and downstream task results for models with up to 3.9B parameters. Strangely, it seems that they also trained models with up to 70B parameters but only report latency, memory usage, throughput, and energy efficiency for these larger models. What about their perplexity and accuracy on downstream tasks?
They didn’t release any code or models with the paper, so we will have to wait for someone to independently reproduce their results.
gpt-fast with Hugging Face Transformers
In case you missed it, the PyTorch team released a project named gpt-fast that accelerates inference for Llama 2 and Mixtral-8x7B models. It implements the following optimizations:
Quantization with INT4 and INT8 (they also propose a pure PyTorch implementation of GPTQ in the same repository)
Speculative decoding (sketched after this list)
Tensor parallelism
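To illustrate the speculative decoding part, here is a minimal greedy-verification sketch; this is not gpt-fast’s implementation. `target` and `draft` stand for any two Hugging Face-style causal LMs that return logits, and batch size 1 is assumed:

```python
# Minimal sketch of speculative decoding with greedy verification: a small
# draft model proposes k tokens, the large target model verifies them in a
# single forward pass, and the longest accepted prefix is kept.
import torch

@torch.no_grad()
def speculative_step(target, draft, input_ids: torch.Tensor, k: int = 4):
    # 1) The small draft model proposes k tokens autoregressively (greedy)
    draft_ids = input_ids
    for _ in range(k):
        next_tok = draft(draft_ids).logits[:, -1].argmax(-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_tok], dim=-1)
    proposed = draft_ids[:, input_ids.shape[1]:]

    # 2) The large target model verifies all k proposals in one forward pass
    target_preds = target(draft_ids).logits[:, input_ids.shape[1] - 1:-1].argmax(-1)

    # 3) Keep the longest prefix where target and draft agree (batch size 1)
    n_accepted = (target_preds == proposed).int().cumprod(dim=-1).sum().item()
    # A full implementation would also append the target's own next token
    # after the accepted prefix; omitted here for brevity.
    return torch.cat([input_ids, proposed[:, :n_accepted]], dim=-1)
```

In practice, sampling-based variants also use the target model’s probabilities to resample when a proposal is rejected; the sketch above only keeps the greedy-matching prefix.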
The speedup is quite significant according to their benchmark:
MDK8888 has extended this project to make it compatible with Hugging Face Transformers:
GitHub: MDK8888/GPTFast
The Salt
In case you missed it, I published one new article in The Salt this week:
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers:
Have a nice weekend!