Hi Everyone,
In this edition of The Weekly Kaitchup:
LQ-LoRA: Jointly Quantize and Fine-tune LLMs
Orca 2: An LLM Fine-tuned for Reasoning
Use Only a Few Neurons to Exponentially Accelerate LLMs
The Kaitchup now has 1,119 subscribers. Thanks a lot for your support!
For Black Friday, the yearly subscription to The Kaitchup is 30% off! It’s 42% cheaper than subscribing monthly:
It’s also a good time to upgrade your hardware configuration as most sellers offer Black Friday discounts. You can find my recommended computer parts (GPU, CPU, …) on this page:
LQ-LoRA: Jointly Quantize and Fine-tune LLMs
QLoRA is one of the most popular methods for fine-tuning adapters on top of quantized LLMs. While QLoRA is very effective, it also has several drawbacks that we have discussed in previous articles:
There are alternatives to QLoRA. For instance, we have tried QA-LoRA, which fine-tunes quantization-aware LoRA adapters. QA-LoRA is a good alternative to QLoRA, but its official implementation didn't support recent LLMs without modifying the code, and it has since been removed from GitHub by its authors. The project looks dead.
Another alternative to QLoRA and QA-LoRA has just been proposed:
LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning
In a nutshell, LQ-LoRA takes the pre-trained LLM and breaks it into 2 parts:
Frozen quantized parameters
Trainable parameters in the form of a LoRA adapter
The method finds the optimal split/decomposition, quantizes, and fine-tunes jointly. The quantization works similarly to the method implemented in ExLlamaV2: we set a target precision, e.g., 3-bit, and the algorithm quantizes the model's layers/modules with mixed precision, keeping the most important parameters at a higher precision, so that the average precision hits the target.
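To make the decomposition more concrete, here is a minimal sketch of the idea in PyTorch: each pre-trained weight matrix W is approximated as Q + L1·L2, where Q is quantized and frozen and L1, L2 form the trainable LoRA adapter. The quantize function and the alternating loop below are simplified placeholders of my own, not the authors' implementation, which additionally allocates a mixed-precision bit budget across matrices.

```python
import torch

def quantize(w: torch.Tensor, n_bits: int = 3) -> torch.Tensor:
    # Placeholder quantizer: simulate low-bit quantization with absmax rounding.
    scale = w.abs().max() / (2 ** (n_bits - 1) - 1)
    return torch.round(w / scale) * scale

def lq_decompose(w: torch.Tensor, rank: int = 16, n_iters: int = 5):
    # Alternate between: Q <- quantize(W - L1 L2) and (L1, L2) <- best rank-r fit of W - Q.
    l1 = torch.zeros(w.shape[0], rank)
    l2 = torch.zeros(rank, w.shape[1])
    for _ in range(n_iters):
        q = quantize(w - l1 @ l2)                      # frozen, quantized part
        u, s, vh = torch.linalg.svd(w - q, full_matrices=False)
        l1 = u[:, :rank] * s[:rank]                    # trainable LoRA factors
        l2 = vh[:rank, :]
    return q, l1, l2

w = torch.randn(512, 512)                              # toy weight matrix
q, l1, l2 = lq_decompose(w)
print((w - (q + l1 @ l2)).norm() / w.norm())           # relative reconstruction error
```

During fine-tuning, only l1 and l2 would receive gradients, while q stays frozen in its quantized form.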
LQ-LoRA performs on par with QLoRA according to the authors.
Note that one significant drawback of this approach is that it modifies the pre-trained model, i.e., it saves a new quantized base model along with the adapter.
The implementation seems already mature enough to try it:
So we will try it! I’ll write an article giving more details on how it works and answering the following questions:
How fast is it for fine-tuning?
Finding the optimal split and quantizing at the same time could significantly slow down fine-tuning.
How fast is it for inference?
Since the model uses mixed precision, which is hardware-inefficient, inference might be slow.
Does it consume more memory, during fine-tuning, than QLoRA?
If everything works well, I’ll publish it next week.
Orca 2: An LLM Fine-tuned for Reasoning
Microsoft released Orca 2 7B and 13B: two LLMs that outperform much larger ones, such as Llama 2 70B, on reasoning tasks.
I find the paper very interesting as it describes very well how to build a dataset for reasoning tasks and how to fine-tune an LLM with it:
Orca 2: Teaching Small Language Models How to Reason
Orca 2 models are based on Llama 2 and fine-tuned exclusively on synthetic datasets generated by other models, including GPT-4. Similar to Hugging Face's Zephyr 7B and Microsoft's phi-1.5, the Orca 2 models are student models: they learn from much larger and better LLMs, which explains how they can surpass bigger models.
Unfortunately, Orca 2 is not open, even though in the abstract of the paper Microsoft claims that the model is open-source:
We open-source Orca 2 to encourage further research on the development, evaluation, and alignment of smaller LMs.
It’s not.
The models can only be used for research purposes, according to the license’s terms. Moreover, Microsoft didn’t release:
the training data
the training source code
In other words, we can't reproduce it or build a similar model. This doesn't look "open-source" to me.
Use Only a Few Neurons to Exponentially Accelerate LLMs
It is well known that, during inference, LLMs mainly rely on a small subset of their neurons. However, knowing in advance which neurons will be needed, and how to exploit this knowledge to speed up inference, is challenging.
Peter Belcak and Roger Wattenhofer (ETH Zurich) propose a solution (with an implementation) in this paper:
Exponentially Faster Language Modeling
They replace the Transformer's feedforward layers with fast feedforward networks. These new feedforwards organize their neurons into a balanced binary tree and, conditionally on the input, execute only one branch of the tree.
Since we don't need all the neurons, we can perform conditional matrix multiplication (CMM) that involves only the needed neurons. The issue is that there was no efficient implementation of CMM. The authors propose one that exploits the CPU through the BLAS library.
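To illustrate the tree routing, here is a minimal, simplified sketch of my own, not the authors' CMM implementation: neurons are laid out as a balanced binary tree and, for each token, only the neurons on one root-to-leaf path are evaluated, so a layer with 2^depth − 1 neurons costs only "depth" neuron activations. In the paper, intermediate nodes mainly route while leaves do the computation; here, for brevity, every visited neuron contributes to the output.

```python
import torch

class FastFeedforwardSketch(torch.nn.Module):
    def __init__(self, dim: int, depth: int = 8):
        super().__init__()
        self.depth = depth
        n_nodes = 2 ** depth - 1                       # neurons in the tree
        self.w_in = torch.nn.Parameter(torch.randn(n_nodes, dim) * 0.02)
        self.w_out = torch.nn.Parameter(torch.randn(n_nodes, dim) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (dim,), one token
        y = torch.zeros_like(x)
        node = 0
        for _ in range(self.depth):
            act = x @ self.w_in[node]                  # evaluate one neuron only
            y = y + torch.nn.functional.gelu(act) * self.w_out[node]
            # The sign of the pre-activation decides which child to visit next.
            node = 2 * node + (1 if act > 0 else 2)
        return y

ff = FastFeedforwardSketch(dim=768, depth=8)
out = ff(torch.randn(768))   # 8 neurons evaluated instead of 255
print(out.shape)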
They claim that their approach, applied to BERT, provides an exponential acceleration for inference, performs on par with the original BERT, and can be applied to other models.
The implementation is available here:
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers:
Have a nice weekend!