The Kaitchup – AI on a Budget

Run Llama 3 70B on Your GPU with ExLlamaV2

2.5 bits per weight, on average, is good enough

Benjamin Marie
May 06, 2024

Llama 3 70B is currently one of the best LLMs. On public leaderboards such as Chatbot Arena, it ranks above GPT-3.5 and some versions of GPT-4. With 70 billion parameters, however, it is a very large model, and inference consumes a lot of GPU memory: at 16-bit precision, the weights alone take about 140 GB, so fast inference on GPUs requires two 80 GB GPUs. This is far from an affordable configuration, but we can significantly reduce these requirements.
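The two-GPU figure follows from simple arithmetic on the weights. Here is a minimal sketch, assuming a nominal 70 billion parameters stored in 16-bit precision and ignoring the KV cache and activations. The function name and the rounded parameter count are illustrative, not taken from any library.

```python
import math

def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

NUM_PARAMS = 70e9   # nominal parameter count (assumption: rounded)
GPU_SIZE_GB = 80    # size of one high-end GPU, as in the text

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
gpus_needed = math.ceil(fp16_gb / GPU_SIZE_GB)
print(f"{fp16_gb:.0f} GB of weights -> at least {gpus_needed} x {GPU_SIZE_GB} GB GPUs")
```

Since this counts weights only, the real requirement at inference time is somewhat higher.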

Related article: Estimate the Memory Consumption of LLMs for Inference and Fine-tuning (Benjamin Marie, April 25, 2024)

Moving the model to CPU RAM and using a framework optimized for CPU inference, such as Neural Speed, is an appealing alternative. However, CPUs are slower than GPUs, especially for batch decoding.

In this article, I show how to quantize Llama 3 70B with mixed precision using ExLlamaV2 and a 24 GB GPU. ExLlamaV2’s quantization method protects the most important weights from aggressive quantization while quantizing the remaining weights more heavily. I quantized Llama 3 70B to 4, 3.5, 3, 2.5, and 2.18 bits per weight, on average, and benchmarked the resulting models. We will see that quantization below 2.5 bits per weight makes the model small enough to run on a 24 GB GPU.
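To see why the lower bit widths matter for a 24 GB card, here is a rough estimate of weight storage at each average precision listed above. It is only a sketch, assuming a nominal 70 billion parameters, weights only, and 1 GB = 1e9 bytes. Real quantized checkpoints carry some overhead, and inference needs extra room for the KV cache, so these numbers are lower bounds.

```python
NUM_PARAMS = 70e9     # nominal parameter count (assumption: rounded)
GPU_MEMORY_GB = 24    # the GPU size targeted in this article

def quantized_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

# Average bits per weight benchmarked in the article
sizes = {bpw: quantized_size_gb(NUM_PARAMS, bpw) for bpw in (4, 3.5, 3, 2.5, 2.18)}

for bpw, size in sizes.items():
    verdict = "under" if size < GPU_MEMORY_GB else "over"
    print(f"{bpw:>4} bpw: ~{size:.1f} GB ({verdict} {GPU_MEMORY_GB} GB, weights only)")
```

Under these assumptions, the variants at 3 bits per weight and above exceed 24 GB on weights alone, while 2.5 and 2.18 bits per weight come in under it before the KV cache and other overhead are counted.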

The notebook implementing Llama 3 70B quantization with ExLlamaV2 and benchmarking the quantized models is here:

Get the notebook (#67)
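The notebook remains the reference for the exact steps. As a rough orientation only, the sketch below assembles the kind of command used to run the `convert.py` script from the ExLlamaV2 repository. The flags (`-i`, `-o`, `-cf`, `-b`) are assumed from that repository's documentation and should be verified against the version you install. The helper name and all paths are hypothetical, and this is not the notebook's code.

```python
import shlex

def build_convert_command(model_dir, work_dir, output_dir, bits_per_weight):
    """Assemble an argv list for ExLlamaV2's convert.py (hypothetical helper)."""
    return [
        "python", "convert.py",
        "-i", model_dir,               # input: directory of the unquantized model
        "-o", work_dir,                # working directory for intermediate files
        "-cf", output_dir,             # directory for the final quantized model
        "-b", str(bits_per_weight),    # target average bits per weight
    ]

# Hypothetical paths, for illustration only
cmd = build_convert_command(
    "./Meta-Llama-3-70B", "./work", "./Llama-3-70B-2.5bpw", 2.5
)
print(shlex.join(cmd))
# To actually launch it, pass `cmd` to subprocess.run(cmd, check=True)
# from the root of a local ExLlamaV2 checkout.
```

Building the command as a list rather than a single string avoids shell-quoting problems if a path contains spaces.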

Llama 3 70B Requirements
