From 16-bit to 2-bit: Finding the Best Trade-off Between Memory-Efficiency and Accuracy

Mistral 7B and Llama 2 under pressure

Benjamin Marie
Feb 01, 2024

With quantization, we can reduce the size of large language models to run them on consumer hardware.

However, quantization is lossy: it discards information in the process. Typically, large LLMs can be aggressively quantized to lower precision with only a small loss of accuracy, while smaller LLMs are much harder to quantize accurately.

When is it better to use a small LLM rather than quantizing a larger one?


In this article, we will answer this question by applying 8-bit, 4-bit, 3-bit, and 2-bit GPTQ quantization to Mistral 7B, Llama 2 7B, and Llama 2 13B. We will compare their memory consumption using optimum-benchmark and their accuracy using the LLM Evaluation Harness.
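
For readers who want to see what this looks like in code, here is a minimal sketch of GPTQ quantization with Hugging Face Transformers' GPTQConfig. The model ID, bit width, and calibration dataset below are illustrative assumptions, not necessarily the settings used in the notebook.

```python
# Minimal GPTQ quantization sketch with Hugging Face Transformers.
# Requires the optimum and auto-gptq packages and a GPU.
# The model ID, bit width, and calibration dataset are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "mistralai/Mistral-7B-v0.1"  # assumed; Llama 2 7B/13B work the same way
tokenizer = AutoTokenizer.from_pretrained(model_id)

# bits can be 8, 4, 3, or 2; "c4" is one of the built-in calibration datasets
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Quantization runs during loading: weights are calibrated and packed layer by layer
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=gptq_config,
)

# Save the quantized model for later benchmarking and evaluation
model.save_pretrained("Mistral-7B-v0.1-gptq-4bit")
tokenizer.save_pretrained("Mistral-7B-v0.1-gptq-4bit")
```

The quantized model saved this way can then be scored with the LLM Evaluation Harness CLI, for example `lm_eval --model hf --model_args pretrained=Mistral-7B-v0.1-gptq-4bit --tasks hellaswag --batch_size 8` (the task and batch size here are only examples, not the article's exact benchmark settings).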

The quantization and benchmarking are implemented in this notebook:

Get the notebook (#42)
