Hi Everyone,
In this edition of The Weekly Kaitchup:
LQ-LoRA: Jointly Quantize and Fine-tune LLMs
Orca 2: An LLM Fine-tuned for Reasoning
Use Only a Few Neurons to Exponentially Accelerate LLMs
The Kaitchup now has 1,119 subscribers. Thanks a lot for your support!
For Black Friday, the yearly subscription to The Kaitchup is 30% off! It’s 42% cheaper than subscribing monthly:
It’s also a good time to upgrade your hardware configuration as most sellers offer Black Friday discounts. You can find my recommended computer parts (GPU, CPU, …) on this page:
LQ-LoRA: Jointly Quantize and Fine-tune LLMs
QLoRA is one of the most popular methods for fine-tuning adapters on top of quantized LLMs. While QLoRA is very effective, it also has several drawbacks that we have discussed in previous articles:
There are alternatives to QLoRA. For instance, we have tried QA-LoRA, which fine-tunes quantization-aware LoRA adapters. QA-LoRA is a good alternative to QLoRA, but its official implementation didn't support recent LLMs without modifying the code, and it has since been removed from GitHub by its authors. The project looks dead.
Another alternative to QLoRA and QA-LoRA has just been proposed:
LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning
In a nutshell, LQ-LoRA takes the pre-trained LLM and breaks it into 2 parts:
Frozen quantized parameters
Trainable parameters in the form of a LoRA adapter
The method finds the optimal split/decomposition, quantizes, and fine-tunes jointly. The quantization works similarly to the method implemented in ExLlamaV2: we set a target precision, e.g., 3-bit, and the algorithm quantizes the model's layers/modules with mixed precision, keeping the most important parameters at a higher precision, so that the average precision hits the target.
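To make the decomposition more concrete, here is a minimal sketch of the idea in PyTorch: each pre-trained weight matrix W is approximated as Q + L1·L2, where Q is quantized and frozen and L1, L2 form the trainable LoRA adapter. The quantize function and the alternating loop below are simplified placeholders of my own, not the authors' implementation, which additionally allocates a mixed-precision bit budget across matrices.

```python
import torch

def quantize(w: torch.Tensor, n_bits: int = 3) -> torch.Tensor:
    # Placeholder quantizer: simulate low-bit quantization with absmax rounding.
    scale = w.abs().max() / (2 ** (n_bits - 1) - 1)
    return torch.round(w / scale) * scale

def lq_decompose(w: torch.Tensor, rank: int = 16, n_iters: int = 5):
    # Alternate between: Q <- quantize(W - L1 L2) and (L1, L2) <- best rank-r fit of W - Q.
    l1 = torch.zeros(w.shape[0], rank)
    l2 = torch.zeros(rank, w.shape[1])
    for _ in range(n_iters):
        q = quantize(w - l1 @ l2)                      # frozen, quantized part
        u, s, vh = torch.linalg.svd(w - q, full_matrices=False)
        l1 = u[:, :rank] * s[:rank]                    # trainable LoRA factors
        l2 = vh[:rank, :]
    return q, l1, l2

w = torch.randn(512, 512)                              # toy weight matrix
q, l1, l2 = lq_decompose(w)
print((w - (q + l1 @ l2)).norm() / w.norm())           # relative reconstruction error
```

During fine-tuning, only l1 and l2 would receive gradients, while q stays frozen in its quantized form.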
LQ-LoRA performs on par with QLoRA according to the authors.
Note that one significant drawback of this approach is that it modifies the pre-trained model, i.e., it saves a new quantized base model along with the adapter.
The implementation seems already mature enough to try it:
So we will try it! I’ll write an article giving more details on how it works and answering the following questions:
How fast is it for fine-tuning?
Finding the optimal split and quantizing at the same time could significantly slow down fine-tuning.
How fast is it for inference?
Since the model uses mixed precision, which is hardware-inefficient, inference might be slow.
Does it consume more memory, during fine-tuning, than QLoRA?
If everything works well, I’ll publish it next week.
Orca 2: An LLM Fine-tuned for Reasoning
Microsoft released Orca 2 7B and 13B: two LLMs that outperform much larger ones, such as Llama 2 70B, on reasoning tasks.
I find the paper very interesting as it describes very well how to build a dataset for reasoning tasks and how to fine-tune an LLM with it:
Orca 2: Teaching Small Language Models How to Reason
Orca 2 models are based on Llama 2 and fine-tuned exclusively on synthetic datasets generated by other models, including GPT-4. Similar to Hugging Face's Zephyr 7B and Microsoft's phi-1.5, the Orca 2 models are student models: they learn from much larger and better LLMs, which explains how they can surpass bigger models.
Unfortunately, Orca 2 is not open, even though in the abstract of the paper Microsoft claims that the model is open-source:
We open-source Orca 2 to encourage further research on the development, evaluation, and alignment of smaller LMs.
It’s not.
The models can only be used for research purposes, according to the license’s terms. Moreover, Microsoft didn’t release:
the training data
the training source code
In other words, we can't reproduce it or build a similar model. This doesn't look "open-source" to me.
Use Only a Few Neurons to Exponentially Accelerate LLMs
It is well known that, during inference, LLMs mainly rely on a small subset of their neurons. However, knowing in advance which neurons will be needed, and how to exploit this knowledge to speed up inference, is challenging.
Peter Belcak and Roger Wattenhofer (ETH Zurich) propose a solution (with an implementation) in this paper:
Exponentially Faster Language Modeling
They replace the Transformer's feedforward layers with fast feedforward networks. These new feedforwards organize their neurons into a balanced binary tree and, conditionally on the input, execute only one branch of the tree.
Since we don't need all the neurons, we can perform conditional matrix multiplication (CMM) that involves only the needed neurons. The issue is that there was no efficient implementation of CMM. The authors propose one that exploits the CPU through the BLAS library.
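To illustrate the tree routing, here is a minimal, simplified sketch of my own, not the authors' CMM implementation: neurons are laid out as a balanced binary tree and, for each token, only the neurons on one root-to-leaf path are evaluated, so a layer with 2^depth − 1 neurons costs only "depth" neuron activations. In the paper, intermediate nodes mainly route while leaves do the computation; here, for brevity, every visited neuron contributes to the output.

```python
import torch

class FastFeedforwardSketch(torch.nn.Module):
    def __init__(self, dim: int, depth: int = 8):
        super().__init__()
        self.depth = depth
        n_nodes = 2 ** depth - 1                       # neurons in the tree
        self.w_in = torch.nn.Parameter(torch.randn(n_nodes, dim) * 0.02)
        self.w_out = torch.nn.Parameter(torch.randn(n_nodes, dim) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (dim,), one token
        y = torch.zeros_like(x)
        node = 0
        for _ in range(self.depth):
            act = x @ self.w_in[node]                  # evaluate one neuron only
            y = y + torch.nn.functional.gelu(act) * self.w_out[node]
            # The sign of the pre-activation decides which child to visit next.
            node = 2 * node + (1 if act > 0 else 2)
        return y

ff = FastFeedforwardSketch(dim=768, depth=8)
out = ff(torch.randn(768))   # 8 neurons evaluated instead of 255
print(out.shape)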
They claim that their approach, applied to BERT, provides an exponential acceleration for inference, performs on par with the original BERT, and can be applied to other models.
The implementation is available here:
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers:
Have a nice weekend!