Avoid Quantizing Llama 3 8B with GPTQ and Use BitsandBytes Instead
Llama 3 vs. Llama 2 vs. Mistral 7B, quantized
With quantization, we can reduce the size of large language models (LLMs). Quantized LLMs are easier to run on GPUs with less memory. While 8-bit quantization typically has almost no impact on the model’s quality, 3-bit or 2-bit quantization can significantly degrade its accuracy.
Preliminary results that I discussed in The Weekly Kaitchup #39 suggest that once quantized, Llama 3 8B underperforms quantized Llama 2.
Is quantized Llama 2 really better than quantized Llama 3?
Or, more generally, if Llama 3 is better than Mistral 7B and Llama 2 (Llama 3 > Mistral 7B > Llama 2 7B), is the quantized version also better than these models quantized (quantized Llama 3 > quantized Mistral 7B > quantized Llama 2 7B)?
In this article, we will answer these questions. I quantized all the models with bitsandbytes to 8-bit and 4-bit, and with GPTQ to 8-bit, 4-bit, 3-bit, and 2-bit, and checked their performance on three different tasks. We will see that 8-bit quantization works reasonably well for Llama 3 with both quantization algorithms. I also found that while GPTQ significantly degrades Llama 3 at lower precisions, bitsandbytes quantization seems to work well.
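To give an idea of the bitsandbytes setup, here is a minimal sketch of loading Llama 3 8B with 4-bit (NF4) quantization through Transformers; switching to `BitsAndBytesConfig(load_in_8bit=True)` gives the 8-bit variant. The model ID and compute dtype shown here are assumptions, not necessarily the exact settings used in my experiments:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B"  # assumed model ID for illustration

# 4-bit NF4 quantization with bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumed compute dtype
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```

Since bitsandbytes quantizes on the fly at load time, no calibration data is needed, unlike GPTQ.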
The code I used to quantize and evaluate the models is in this notebook:
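For reference (this is not the notebook itself), a comparable GPTQ quantization with Transformers’ `GPTQConfig` might look like the sketch below; the model ID, calibration dataset, and output directory are assumptions, and `bits` can be set to 8, 4, 3, or 2:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Meta-Llama-3-8B"  # assumed model ID for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)

# GPTQ requires a calibration dataset; "c4" is a common default (assumed here)
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Quantization runs during loading, then the quantized model is saved
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto",
)
model.save_pretrained("Llama-3-8B-gptq-4bit")
tokenizer.save_pretrained("Llama-3-8B-gptq-4bit")
```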
Note: The original version of this article didn’t include the quantization with bitsandbytes. I added the results on May 20th, 2024.