Hi Everyone,
In this edition of The Weekly Kaitchup:
SPIN: The Framework
Make a Cheap MoE with LoRA Experts
AQLM: 2-bit and 3-bit LLMs Getting More Accurate
The Kaitchup now has 2,036 subscribers. Thanks a lot for your support!
If you are a free subscriber, consider upgrading to paid to access all the notebooks and articles. There is a 7-day trial that you can cancel anytime.
SPIN: The Framework
In The Weekly Kaitchup #22, I presented SPIN: a framework for fine-tuning an LLM on self-generated data. Since then, similar ideas have appeared on arXiv, such as the self-rewarding language models by Meta.
However, until now, we only had papers and no concrete examples of self-trained LLMs or frameworks. This is no longer the case: the authors of SPIN have released their framework, along with several checkpoints of SPIN applied to Zephyr.
GitHub: uclaml/SPIN (Apache 2.0 license)
The improvements after several SPIN iterations are significant:
I’ll test the framework to find out whether we can use it with QLoRA. Note: The examples in the GitHub repository seem to fully fine-tune Zephyr, which wouldn’t be possible on consumer hardware. Nonetheless, their code appears to be based on Hugging Face’s TRL, so using QLoRA should be straightforward.
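For reference, here is a minimal sketch of how QLoRA is usually set up with TRL-based training scripts, assuming SPIN’s trainer accepts a peft_config the way TRL’s DPOTrainer does. This is not the official SPIN recipe; the hyperparameters are illustrative.

```python
# Hypothetical QLoRA setup for a TRL-style trainer (not SPIN's official recipe)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig

model_id = "HuggingFaceH4/zephyr-7b-beta"

# 4-bit NF4 quantization of the base model (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# LoRA adapters trained on top of the frozen 4-bit base model
peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
# peft_config would then be passed to the trainer,
# e.g., DPOTrainer(..., peft_config=peft_config) in TRL
```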
Make a Cheap MoE with LoRA Experts
This paper introduces Post-Hoc Adaptive Tokenwise Gating Over an Ocean of Specialized Experts (PHATGOOSE), a new approach that enables zero-shot generalization by exploiting a set of specialized PEFT modules such as LoRA adapters:
Learning to Route Among Specialized Experts for Zero-Shot Generalization
The method freezes the entire model, including the PEFT modules, and trains a gate for each module, akin to the router network used in mixture-of-experts (MoE) models. You can see this method as a cheap way to create an MoE model in which each expert is a LoRA adapter.
This gate training is computationally light. At inference, a top-k routing strategy dispatches each token to the most relevant experts, which improves the model's ability to handle tasks it wasn't explicitly trained on.
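To make the routing idea more concrete, here is a minimal PyTorch sketch of token-wise top-k routing over frozen LoRA experts. The class, shapes, and gating details are illustrative assumptions, not the paper’s actual code.

```python
# Toy token-wise top-k routing over frozen LoRA experts (illustrative, not PHATGOOSE's code)
import torch
import torch.nn as nn
import torch.nn.functional as F

class RoutedLoRALinear(nn.Module):
    """A frozen linear layer augmented with frozen LoRA experts and a small
    trainable gate per expert, with token-wise top-k routing at inference."""
    def __init__(self, base: nn.Linear, lora_As, lora_Bs, top_k: int = 2):
        super().__init__()
        self.base = base.requires_grad_(False)                      # frozen base weights
        self.lora_As = [A.requires_grad_(False) for A in lora_As]   # (r, d_in) per expert
        self.lora_Bs = [B.requires_grad_(False) for B in lora_Bs]   # (d_out, r) per expert
        # One gate vector per expert: the only trainable parameters
        self.gates = nn.Parameter(torch.zeros(len(lora_As), base.in_features))
        self.top_k = top_k

    def forward(self, x):                                 # x: (batch, seq, d_in)
        out = self.base(x)
        # Token-expert affinities, then keep only the top-k experts per token
        scores = torch.einsum("bsd,ed->bse", x, self.gates)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)          # (batch, seq, top_k)
        # Dense toy implementation: compute every expert's LoRA update,
        # then zero out the experts that were not selected for each token
        deltas = torch.stack(
            [(x @ A.T) @ B.T for A, B in zip(self.lora_As, self.lora_Bs)], dim=-2
        )                                                 # (batch, seq, n_experts, d_out)
        mask = torch.zeros_like(scores).scatter_(-1, topk_idx, weights)
        return out + torch.einsum("bse,bsed->bsd", mask, deltas)
```

In PHATGOOSE, each gate is trained post hoc on the data its expert was trained on, with everything else frozen; the experts and gates are only combined at inference time.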
The effectiveness of PHATGOOSE was tested on T5-family models, showing superior performance in zero-shot generalization on standard benchmarks compared to previous methods that merged experts or relied on a single PEFT module. It also sometimes outperformed explicit multitask training.
The code is available:
GitHub: r-three/phatgoose (it seems to work only with the T5 architecture)
AQLM: 2-bit and 3-bit LLMs Getting More Accurate
New quantization algorithms are published every week. Progress is steady, and 4-bit quantization is now almost lossless for the largest LLMs.
However, lower precision, such as 2-bit and 3-bit, tends to significantly degrade the model's performance. AQLM is a new method that makes substantial progress for 2-bit and 3-bit quantization.
Extreme Compression of Large Language Models via Additive Quantization
[It] generalizes the classic Additive Quantization (AQ) approach for information retrieval to advance the state-of-the-art in LLM compression, via two innovations: 1) learned additive quantization of weight matrices in input-adaptive fashion, and 2) joint optimization of codebook parameters across entire layer blocks. Broadly, AQLM is the first scheme that is Pareto optimal in terms of accuracy-vs-model-size when compressing to less than 3 bits per parameter, and significantly improves upon all known schemes in the extreme compression (2bit) regime.
It is a very different approach from previous work such as GPTQ, AWQ, or SqueezeLLM.
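To give an intuition for the additive quantization idea, here is a toy dequantization example: each group of weights is reconstructed as the sum of one codeword from each of several codebooks. The group size, number of codebooks, and code width below are illustrative, not AQLM’s actual configuration.

```python
# Toy illustration of additive quantization (not AQLM's actual data format)
import torch

d_group = 8          # weights per group
n_codebooks = 2      # sum codewords from several codebooks
codebook_size = 256  # 8-bit code per codebook

codebooks = torch.randn(n_codebooks, codebook_size, d_group)  # learned offline
codes = torch.randint(0, codebook_size, (n_codebooks,))       # stored per weight group

# Dequantization: the weight group is the sum of the selected codewords
w_group = sum(codebooks[m, codes[m]] for m in range(n_codebooks))
print(w_group.shape)  # torch.Size([8])

# Effective bit-rate: n_codebooks * log2(codebook_size) / d_group
bits_per_weight = n_codebooks * 8 / d_group  # = 2 bits per weight in this toy setup
```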
This is what they mean by “Pareto optimal”:
Llama 2 70B quantized to 2-bit with AQLM performs better than the original Llama 2 13B. It is also smaller: the 2-bit Llama 2 70B would occupy roughly 20 GB of memory, while Llama 2 13B (fp16) occupies 26 GB.
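These memory numbers are easy to check with a back-of-the-envelope calculation (ignoring the quantization codebooks and other overhead):

```python
# Rough memory estimate: parameters * bits per parameter / 8 bits per byte
params_70b = 70e9
params_13b = 13e9

aqlm_70b_gb = params_70b * 2 / 8 / 1e9   # ~17.5 GB at 2 bits/weight, ~20 GB with overhead
fp16_13b_gb = params_13b * 16 / 8 / 1e9  # 26 GB at 16 bits/weight

print(f"Llama 2 70B @ 2-bit: ~{aqlm_70b_gb:.1f} GB (+ overhead)")
print(f"Llama 2 13B @ fp16:  {fp16_13b_gb:.0f} GB")
```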
They have released their code here:
GitHub: Vahe1994/AQLM
The Salt
In case you missed them, I published two new articles in The Salt this week:
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers:
Have a nice weekend!