Llama 3 70B is currently one of the best LLMs. According to public leaderboards such as Chatbot Arena, it outperforms GPT-3.5 and some versions of GPT-4. However, with 70 billion parameters, it is a very large model: in 16-bit precision, its weights alone occupy 140 GB, so inference requires at least that much GPU RAM. For fast GPU inference, we would need two 80 GB GPUs, which is far from an affordable configuration.
Moving the model to CPU RAM and using a framework optimized for CPU inference, such as Neural Speed, is an appealing alternative. However, CPUs remain much slower than GPUs for LLM inference, especially for batch decoding.
With quantization, we can shrink the model so that it fits on a single GPU. 4-bit quantization divides the 16-bit model size by nearly 4, but the result would still require a costly 40 GB GPU. With 2-bit quantization, Llama 3 70B could fit on a 24 GB consumer GPU, but at such low precision the model's accuracy can drop significantly.
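The size figures above follow from simple arithmetic on the parameter count. As a rough sketch (70e9 is an approximation of the parameter count, and runtime overheads such as the KV cache and activations are ignored):

```python
# Rough weight-storage arithmetic for Llama 3 70B at various bit widths.
# 70e9 approximates the parameter count; KV cache and activation memory
# are not included, which is why a bit width must land comfortably
# below the GPU's capacity, not just at it.
PARAMS = 70e9

def model_size_gb(bits_per_weight: float) -> float:
    """Approximate weight storage in GB for a given average bit width."""
    return PARAMS * bits_per_weight / 8 / 1e9

for bpw in (16, 4, 3.5, 3, 2.5, 2.18):
    size = model_size_gb(bpw)
    fits = "<= 24 GB" if size <= 24 else "> 24 GB"
    print(f"{bpw:>5} bpw -> {size:6.1f} GB of weights ({fits})")
```

At 16 bits this yields the 140 GB figure, 4 bits gives 35 GB, and only averages around 2.5 bits and below bring the weights near the 24 GB mark.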
In this article, I show how to quantize Llama 3 70B to mixed precision with ExLlamaV2 and a 24 GB GPU. ExLlamaV2's quantization method shields the most important weights from aggressive quantization while quantizing the remaining weights more coarsely. I quantized Llama 3 70B to 4, 3.5, 3, 2.5, and 2.18 bits per weight, on average, and benchmarked the resulting models. We will see that quantization below 2.5 bits per weight makes the model small enough to run on a 24 GB GPU.
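For reference, ExLlamaV2 performs the conversion with its convert.py script. The command below is a sketch, not the article's exact invocation: the directory paths are placeholders, and the flags reflect the script's usage at the time of writing, so check the ExLlamaV2 repository for the current interface.

```shell
# Hedged sketch of quantizing Llama 3 70B to ~2.18 bits per weight
# with ExLlamaV2's convert.py. Paths are placeholders.
#   -i  : directory containing the original (fp16) model
#   -o  : working directory for temporary files
#   -cf : output directory for the compiled quantized model
#   -b  : target average bits per weight
python convert.py \
  -i ./Meta-Llama-3-70B-Instruct \
  -o ./working \
  -cf ./Llama-3-70B-2.18bpw \
  -b 2.18
```

The `-b` value is an average: ExLlamaV2 allocates more bits to the layers it measures as most sensitive, which is how the mixed-precision scheme described above is realized.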
The notebook implementing Llama 3 70B quantization with ExLlamaV2 and benchmarking the quantized models is here: