Hi Everyone,
In this edition of The Weekly Kaitchup:
HQQ 4-bit Quantization: (Almost) Lossless for Llama 3.1 8B
Gemma 2 2B: Another Good Student
Are We Overfitting MMLU?
If you are a free subscriber, consider upgrading to paid to access all the notebooks (90+) and more than 100 articles.
If you are looking for custom AI notebooks on request, priority support, or professional LLM services, have a look at The Kaitchup Pro:
AI Notebooks and Articles Published this Week by The Kaitchup
Notebook: #90 Fine-tune Llama 3.1 on Your Computer with QLoRA and LoRA -- Focus on the padding side
Notebook: #91 Serve Multiple LoRA Adapters with vLLM -- Example with Llama 3
HQQ 4-bit Quantization: (Almost) Lossless for Llama 3.1 8B
Mobius Labs released on the Hugging Face Hub a 4-bit version of Llama 3.1 8B that achieves 99.3% of the performance of the original 16-bit Llama 3.1 8B.
HQQ is a very efficient quantization method that doesn’t require a calibration step: you can quantize the model at loading time, as you would with bitsandbytes for QLoRA.
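As an illustration, here is a minimal sketch of on-the-fly HQQ quantization with Transformers. The nbits/group_size values are illustrative assumptions, not necessarily the settings used by Mobius Labs for their released checkpoints:

```python
# Minimal sketch: 4-bit HQQ quantization at loading time, no calibration data needed.
# Assumes a recent transformers with HQQ support and the hqq package installed.
# nbits/group_size are illustrative, not necessarily Mobius Labs' settings.
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

quant_config = HqqConfig(nbits=4, group_size=64)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    quantization_config=quant_config,  # the model is quantized while it loads
)
```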
Mobius Labs proposes 4-bit versions obtained with and without calibration. The calibrated version yields slightly better results.
Note: The code to run the model is provided in the model card.
Once loaded, the resulting model consumes only 6.1 GB of GPU memory, which is 9.9 GB less than the original model.
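As a rough back-of-the-envelope check (an approximation, not an exact accounting of the HQQ checkpoint):

```python
# Rough estimate of the weight memory for an ~8B-parameter model
n_params = 8.03e9

fp16_gb = n_params * 2 / 1e9    # 2 bytes per parameter -> ~16.1 GB
int4_gb = n_params * 0.5 / 1e9  # 4 bits per parameter  -> ~4.0 GB

print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f"4-bit weights:  ~{int4_gb:.1f} GB")
```

The gap between this naive ~4 GB figure and the observed 6.1 GB is plausibly due to quantization metadata (per-group scales and zero-points), modules kept in higher precision, and runtime overhead.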
The performance of the model has been evaluated on popular benchmarks:
The “99.3%” looks very good, but it is probably not very different from the performance of AWQ and GPTQ, especially since we don’t know the quantization hyperparameters used for these other two models (group size, symmetric vs. asymmetric quantization, etc.).
As for the absolute performance itself, remember that there is a growing body of evidence that quantization degrades models more than the benchmarks suggest. For instance, have a look at this paper by Cohere, which I discussed two weeks ago.
Once the model is quantized, evaluate it on your own data and tasks, rather than on public benchmarks, to make sure that it works as expected.
In this article, I explained how you can fine-tune an adapter for HQQ models (still valid for Llama 3.1):
Gemma 2 2B: Another Good Student
Nearly one month after Gemma 2 9B and 27B, Google released Gemma 2 2B:
With only 2.61B parameters, Gemma 2 2B is smaller than Microsoft’s Phi-3 mini.
The model achieves high scores (given its size) on public benchmarks:
Gemma 2 2B’s scores are also higher than those obtained by GPT-3.5.
Similar to other LLMs of this size, Gemma 2 2B is a product of knowledge distillation: it was trained using a larger teacher model. Which one? We don’t know. It could be Gemma 2 27B or Gemini 1.5 Pro.
To run this model, you will only need a 6 GB GPU, or a 4 GB GPU once the model is quantized to 4-bit. Given its size, Gemma 2 2B is also a good candidate for fast inference on CPU, for instance with llama.cpp.
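For instance, here is a minimal sketch of loading Gemma 2 2B in 4-bit with bitsandbytes. It assumes the google/gemma-2-2b-it checkpoint, which is gated on the Hub, so you may need to accept the license first:

```python
# Minimal sketch: load Gemma 2 2B in 4-bit with bitsandbytes (~4 GB of VRAM)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "google/gemma-2-2b-it"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=bnb_config,
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```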
Are We Overfitting MMLU?
On social networks, you might have seen this plot by Maxime Labonne showing the performance of a large number of LLMs on MMLU:
It's interesting to observe how open LLMs are narrowing the gap with closed-source LLMs on this benchmark. Since MMLU assesses both world knowledge and language understanding, these results suggest that open LLMs are improving rapidly and becoming more competitive.
But, we have to keep in mind that:
MMLU is a 4-year-old benchmark. It has likely leaked into the training data of many of these models (i.e., the models might have seen the benchmark during training).
Closed-source models are products. Anthropic, OpenAI, Google, etc. want to make money with their models, and there is no good reason for them to avoid training on public benchmarks like MMLU. Models with higher MMLU scores are likely to attract more attention and more users.
MMLU does not evaluate the language generation capabilities of LLMs. Each MMLU question comes with 4 possible answers, and the LLM only has to pick the correct one. A model can pick the right answer among such a limited set of choices while being poor at generating language: it might know the answer but not know how to communicate it to the user. The sketch below illustrates how this multiple-choice scoring typically works.
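To make this concrete, here is a minimal sketch of multiple-choice scoring in the spirit of common evaluation harnesses (not an official MMLU implementation, and the model choice is arbitrary). The model never writes an answer; it only has to put the highest probability on the right letter:

```python
# Minimal sketch: answer an MMLU-style question by comparing the logits of A/B/C/D.
# The benchmark only checks the predicted letter, never any generated text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-2b-it"  # arbitrary choice for the example
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = (
    "Question: Which planet is the largest in the Solar System?\n"
    "A. Earth\nB. Jupiter\nC. Mars\nD. Venus\n"
    "Answer:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]

choices = ["A", "B", "C", "D"]
choice_ids = [tokenizer.encode(" " + c, add_special_tokens=False)[0] for c in choices]
scores = {c: next_token_logits[i].item() for c, i in zip(choices, choice_ids)}
print(scores, "->", max(scores, key=scores.get))
```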
It would also be very interesting to see how expert humans perform on the MMLU questions that top-performing LLMs can’t answer. It might be that these questions are nearly impossible to answer correctly (ambiguous questions or answers, poorly written questions, several possible correct answers but only one accepted by the benchmark, etc.).
I’m writing an article about MMLU. It will highlight how an LLM can have 0% accuracy on this benchmark while answering all the questions correctly (and the reverse: a perfect score while being unable to generate the correct answers).
Note: Maxime Labonne regularly produces interesting resources that you might want to check out. He wrote an extensive LLM course on GitHub and is publishing his first book on LLMs (especially interesting if you want to deploy LLMs in production):
* (Amazon affiliate link)
GPU Cost Tracker
In this edition of The Weekly Kaitchup, I inaugurate a new section that will keep track, week after week, of the cost of GPUs. I will only cover consumer GPUs, from mid-range (e.g., RTX 4060) to high-end (e.g., RTX 4090).
While consumer GPUs have much less memory than GPUs dedicated to AI, they are by far more cost-effective for small-batch inference and for fine-tuning LLMs of up to ~35B parameters with PEFT methods.
Note: To get the prices of GPUs, I use Amazon. If the price of a GPU drops on Amazon, there is a high chance that it will also be lower at your favorite GPU provider. All the links in this section are Amazon affiliate links.
RTX 4090 (24 GB): PNY GeForce RTX™ 4090 24GB VERTO™ ($1,738.99)
RTX 4080 SUPER (16 GB): ZOTAC GAMING GeForce RTX 4080 SUPER Trinity Black Edition ($1,029.99)
RTX 4070 Ti SUPER (16 GB): ZOTAC GAMING GeForce RTX 4070 Ti SUPER Trinity Black Edition ($779.99)
RTX 4060 Ti (16 GB): PNY GeForce RTX™ 4060 Ti 16GB XLR8 ($459.99)
Evergreen Kaitchup
In this section of The Weekly Kaitchup, I mention which of the Kaitchup’s AI notebooks and articles I have checked and updated, with a brief description of the changes.
10 days ago, I published an estimation of the memory consumption of Llama 3.1 405B. We saw that most of the memory is consumed by the model’s activations: fine-tuning on sequences of 128k tokens consumes TBs of memory just for the activations.
However, if we use PEFT methods such as LoRA and QLoRA, we freeze the model’s parameters and don’t need to store all of these activations; we only need the activations required to train the LoRA adapter. This significantly reduces the memory consumed by the activations. Nonetheless, when I wrote the article, bitsandbytes, which we use for QLoRA fine-tuning, was still reserving memory for activations related to the frozen parameters. This bug has since been corrected.
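As a minimal sketch of what this means in practice with peft (the rank and target modules below are illustrative, not the exact configuration from my article): the base model’s weights are frozen and only the small LoRA adapter receives gradients and optimizer states.

```python
# Minimal sketch: with LoRA, the base model is frozen and only the adapter trains.
# r and target_modules are illustrative, not the settings used in the article.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B")

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a tiny fraction of the parameters is trainable
```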
I’m working on a new estimation of the memory consumption for fine-tuning Llama 3.1 405B with QLoRA. It should be possible with 1 TB of GPU memory, even with a batch size larger than 512 tokens.
The Salt
The Salt is my other newsletter that takes a more scientific approach. In The Salt, I primarily feature short reviews of recent papers (for free), detailed analyses of noteworthy publications, and articles centered on LLM evaluation.
This week, I reviewed:
⭐Compact Language Models via Pruning and Knowledge Distillation
PrimeGuard: Safe and Helpful LLMs through Tuning-Free Routing
SaulLM-54B & SaulLM-141B: Scaling Up Domain Adaptation for the Legal Domain
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers (there is a 20% discount for group subscriptions):
Have a nice weekend!