From 16-bit to 2-bit: Finding the Best Trade-off Between Memory-Efficiency and Accuracy

Mistral 7B and Llama 2 under pressure

Benjamin Marie
Feb 01, 2024

With quantization, we can reduce the size of large language models to run them on consumer hardware.

However, quantization is lossy: it discards information in the process. Typically, large LLMs can be aggressively quantized to lower precision with only a small loss of accuracy, while smaller LLMs are much harder to quantize accurately.

When is it better to use a small LLM rather than quantizing a larger one?


In this article, we will answer this question by applying 8-bit, 4-bit, 3-bit, and 2-bit GPTQ quantization to Mistral 7B, Llama 2 7B, and Llama 2 13B. We will compare their memory consumption using optimum-benchmark and their accuracy using the LLM Evaluation Harness.
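
For readers who want to see what this looks like in code, here is a minimal sketch of GPTQ quantization with Hugging Face Transformers' GPTQConfig. The model ID, bit width, and calibration dataset below are illustrative assumptions, not necessarily the settings used in the notebook.

```python
# Minimal GPTQ quantization sketch with Hugging Face Transformers.
# Requires the optimum and auto-gptq packages and a GPU.
# The model ID, bit width, and calibration dataset are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "mistralai/Mistral-7B-v0.1"  # assumed; Llama 2 7B/13B work the same way
tokenizer = AutoTokenizer.from_pretrained(model_id)

# bits can be 8, 4, 3, or 2; "c4" is one of the built-in calibration datasets
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Quantization runs during loading: weights are calibrated and packed layer by layer
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=gptq_config,
)

# Save the quantized model for later benchmarking and evaluation
model.save_pretrained("Mistral-7B-v0.1-gptq-4bit")
tokenizer.save_pretrained("Mistral-7B-v0.1-gptq-4bit")
```

The quantized model saved this way can then be scored with the LLM Evaluation Harness CLI, for example `lm_eval --model hf --model_args pretrained=Mistral-7B-v0.1-gptq-4bit --tasks hellaswag --batch_size 8` (the task and batch size here are only examples, not the article's exact benchmark settings).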

The quantization and benchmarking are implemented in this notebook:

Get the notebook (#42)
