Hi Everyone,
In this edition of The Weekly Kaitchup:
HQQ 4-bit Quantization: (Almost) Lossless for Llama 3.1 8B
Gemma 2 2B: Another Good Student
Are We Overfitting MMLU?
If you are a free subscriber, consider upgrading to paid to access all the notebooks (90+) and more than 100 articles.
If you are looking for custom AI notebooks on request, priority support, or professional LLM services, have a look at The Kaitchup Pro:
AI Notebooks and Articles Published this Week by The Kaitchup
Notebook: #90 Fine-tune Llama 3.1 on Your Computer with QLoRA and LoRA -- Focus on the padding side
Notebook: #91 Serve Multiple LoRA Adapters with vLLM -- Example with Llama 3
HQQ 4-bit Quantization: (Almost) Lossless for Llama 3.1 8B
Mobius Labs released on the Hugging Face Hub a 4-bit version of Llama 3.1 8B that achieves 99.3% of the performance of the original 16-bit Llama 3.1 8B.
HQQ is a very efficient quantization method that doesn’t require a calibration step: you can quantize the model at loading time, as you would with bitsandbytes for QLoRA.
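As an illustration, here is a minimal sketch of on-the-fly HQQ quantization with Transformers. The nbits/group_size values are illustrative assumptions, not necessarily the settings used by Mobius Labs for their released checkpoints:

```python
# Minimal sketch: 4-bit HQQ quantization at loading time, no calibration data needed.
# Assumes a recent transformers with HQQ support and the hqq package installed.
# nbits/group_size are illustrative, not necessarily Mobius Labs' settings.
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

quant_config = HqqConfig(nbits=4, group_size=64)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    quantization_config=quant_config,  # the model is quantized while it loads
)
```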
Mobius Labs proposes 4-bit versions obtained with and without calibration. The calibrated version yields slightly better results.
Note: The code to run the model is provided in the model card.
Once loaded, the resulting model consumes only 6.1 GB of GPU memory, which is 9.9 GB less than the original model.
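As a rough back-of-the-envelope check (an approximation, not an exact accounting of the HQQ checkpoint):

```python
# Rough estimate of the weight memory for an ~8B-parameter model
n_params = 8.03e9

fp16_gb = n_params * 2 / 1e9    # 2 bytes per parameter -> ~16.1 GB
int4_gb = n_params * 0.5 / 1e9  # 4 bits per parameter  -> ~4.0 GB

print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f"4-bit weights:  ~{int4_gb:.1f} GB")
```

The gap between this naive ~4 GB figure and the observed 6.1 GB is plausibly due to quantization metadata (per-group scales and zero-points), modules kept in higher precision, and runtime overhead.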
The performance of the model has been evaluated on popular benchmarks:
The “99.3%” looks very good, but it is probably not very different from the performance of AWQ and GPTQ, especially since we don’t know the quantization hyperparameters used for these other two models (group size, symmetric vs. asymmetric quantization, etc.).
As for the absolute performance itself, remember that there is a growing body of evidence that quantization degrades models more than the benchmarks suggest. For instance, have a look at this paper by Cohere, which I discussed two weeks ago.
Once the model is quantized, evaluate it on your own data and tasks, rather than on public benchmarks, to make sure that it works as expected.
In this article, I explained how you can fine-tune an adapter for HQQ models (still valid for Llama 3.1):
Gemma 2 2B: Another Good Student
Nearly one month after Gemma 2 9B and 27B, Google released Gemma 2 2B:
With only 2.61B parameters, Gemma 2 2B is smaller than Microsoft’s Phi-3 mini.
The model achieves high scores (given its size) on public benchmarks:
Gemma 2 2B’s scores are also higher than those obtained by GPT-3.5.
Similar to other LLMs of this size, Gemma 2 2B is a product of knowledge distillation: it was trained using a larger teacher model. Which one? We don’t know. It could be Gemma 2 27B or Gemini 1.5 Pro.
To run this model, you will only need a 6 GB GPU, or a 4 GB GPU once the model is quantized to 4-bit. Given its size, Gemma 2 2B is also a good candidate for fast inference on CPU, for instance with llama.cpp.
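For instance, here is a minimal sketch of loading Gemma 2 2B in 4-bit with bitsandbytes. It assumes the google/gemma-2-2b-it checkpoint, which is gated on the Hub, so you may need to accept the license first:

```python
# Minimal sketch: load Gemma 2 2B in 4-bit with bitsandbytes (~4 GB of VRAM)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "google/gemma-2-2b-it"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=bnb_config,
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```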
Are We Overfitting MMLU?
On social networks, you might have seen this plot by Maxime Labonne showing the performance of a large number of LLMs on MMLU:
It's interesting to observe how open LLMs are narrowing the gap with closed-source LLMs on this benchmark. Since MMLU assesses both world knowledge and language understanding, these results suggest that open LLMs are improving rapidly and becoming more competitive.
But, we have to keep in mind that:
MMLU is a 4-year-old benchmark. It has likely leaked into the training data of many of these models (i.e., the models might have seen the benchmark during training).
Closed-source models are products. Anthropic, OpenAI, Google, etc. want to make money with their models, and there is no good reason for them to avoid training on public benchmarks like MMLU. Models with higher MMLU scores are likely to attract more attention and more users.
MMLU does not evaluate the language generation capabilities of LLMs. Each MMLU question comes with 4 possible answers, and the LLM only has to pick the correct one. A model can pick the right answer among such a limited set of choices while being poor at generating language: it might know the answer but not know how to communicate it to the user. The sketch below illustrates how this multiple-choice scoring typically works.
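To make this concrete, here is a minimal sketch of multiple-choice scoring in the spirit of common evaluation harnesses (not an official MMLU implementation, and the model choice is arbitrary). The model never writes an answer; it only has to put the highest probability on the right letter:

```python
# Minimal sketch: answer an MMLU-style question by comparing the logits of A/B/C/D.
# The benchmark only checks the predicted letter, never any generated text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-2b-it"  # arbitrary choice for the example
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = (
    "Question: Which planet is the largest in the Solar System?\n"
    "A. Earth\nB. Jupiter\nC. Mars\nD. Venus\n"
    "Answer:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]

choices = ["A", "B", "C", "D"]
choice_ids = [tokenizer.encode(" " + c, add_special_tokens=False)[0] for c in choices]
scores = {c: next_token_logits[i].item() for c, i in zip(choices, choice_ids)}
print(scores, "->", max(scores, key=scores.get))
```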
It would also be very interesting to see how expert humans perform on the MMLU questions that top-performing LLMs can’t answer. It might be that these questions are nearly impossible to answer correctly (ambiguous questions or answers, poorly written questions, several possible correct answers but only one accepted by the benchmark, etc.).
I’m writing an article about MMLU. It will highlight how an LLM can have 0% accuracy on this benchmark while answering all the questions correctly (and the reverse: a perfect score while being unable to generate the correct answers).
Note: Maxime Labonne regularly produces interesting resources that you might want to check out. He wrote an extensive LLM course on GitHub and is publishing his first book on LLMs (especially interesting if you want to deploy LLMs in production):
* (Amazon affiliate link)
GPU Cost Tracker
In this edition of The Weekly Kaitchup, I inaugurate a new section that will keep track, week after week, of the cost of GPUs. I will only cover consumer GPUs, from mid-range (e.g., RTX 4060) to high-end (e.g., RTX 4090).
While consumer GPUs have much less memory than GPUs dedicated to AI, they are by far more cost-effective for small-batch inference and for fine-tuning LLMs of up to ~35B parameters with PEFT methods.
Note: To get the prices of GPUs, I use Amazon. If the price of a GPU drops on Amazon, there is a high chance that it will also be lower at your favorite GPU provider. All the links in this section are Amazon affiliate links.
RTX 4090 (24 GB): PNY GeForce RTX™ 4090 24GB VERTO™ ($1,738.99)
RTX 4080 SUPER (16 GB): ZOTAC GAMING GeForce RTX 4080 SUPER Trinity Black Edition ($1,029.99)
RTX 4070 Ti SUPER (16 GB): ZOTAC GAMING GeForce RTX 4070 Ti SUPER Trinity Black Edition ($779.99)
RTX 4060 Ti (16 GB): PNY GeForce RTX™ 4060 Ti 16GB XLR8 ($459.99)
Evergreen Kaitchup
In this section of The Weekly Kaitchup, I mention which of the Kaitchup’s AI notebooks and articles I have checked and updated, with a brief description of the changes.
10 days ago, I published an estimation of the memory consumption of Llama 3.1 405B. We saw that most of the memory is consumed by the model’s activations: fine-tuning on sequences of 128k tokens consumes TBs of memory just for the activations.
However, if we use PEFT methods such as LoRA and QLoRA, we freeze the model’s parameters and don’t need to store all of these activations; we only need the activations required to train the LoRA adapter. This significantly reduces the memory consumed by the activations. Nonetheless, when I wrote the article, bitsandbytes, which we use for QLoRA fine-tuning, was still reserving memory for activations related to the frozen parameters. This bug has since been corrected.
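As a minimal sketch of what this means in practice with peft (the rank and target modules below are illustrative, not the exact configuration from my article): the base model’s weights are frozen and only the small LoRA adapter receives gradients and optimizer states.

```python
# Minimal sketch: with LoRA, the base model is frozen and only the adapter trains.
# r and target_modules are illustrative, not the settings used in the article.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B")

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a tiny fraction of the parameters is trainable
```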
I’m working on a new estimation of the memory consumption for fine-tuning Llama 3.1 405B with QLoRA. It should be possible with 1 TB of GPU memory, even with a batch size larger than 512 tokens.
The Salt
The Salt is my other newsletter that takes a more scientific approach. In The Salt, I primarily feature short reviews of recent papers (for free), detailed analyses of noteworthy publications, and articles centered on LLM evaluation.
This week, I reviewed:
⭐Compact Language Models via Pruning and Knowledge Distillation
PrimeGuard: Safe and Helpful LLMs through Tuning-Free Routing
SaulLM-54B & SaulLM-141B: Scaling Up Domain Adaptation for the Legal Domain
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers (there is a 20% discount for group subscriptions):
Have a nice weekend!