The Kaitchup – AI on a Budget


Run Llama 2 70B on Your GPU: 4-bit VRAM requirement (ExLlamaV2)

4-bit VRAM requirements + mixed-precision quantization for your hardware

Benjamin Marie
Sep 27, 2023

The largest and best model of the Llama 2 family has 70 billion parameters. One fp16 parameter weighs 2 bytes. Loading Llama 2 70B requires 140 GB of memory (70 billion * 2 bytes).

What are Llama 2 70B’s 4-bit VRAM requirements?

This is challenging. A high-end consumer GPU, such as the NVIDIA RTX 3090 or 4090, has at most 24 GB of VRAM. At 4-bit precision, Llama 2 70B requires ~35 GB (70 billion * 0.5 bytes), so it won’t fit on a single 24 GB GPU, but it could fit across 2 consumer GPUs. Note: I provide more details on the GPU requirements in the next section.
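As a back-of-the-envelope check, weight memory scales linearly with the number of bits per parameter. A minimal sketch of the arithmetic above (weights only; activations and the KV cache add further overhead):

```python
def weight_vram_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GB."""
    return num_params * (bits_per_param / 8) / 1e9

LLAMA2_70B = 70e9  # 70 billion parameters

print(weight_vram_gb(LLAMA2_70B, 16))  # fp16: 140.0 GB
print(weight_vram_gb(LLAMA2_70B, 4))   # 4-bit: 35.0 GB
```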

Related: GPU Benchmarking: What Is the Best GPU for LoRA, QLoRA, and Inference? (July 18, 2024)


We could reduce the precision to 2-bit. The model would then fit into 24 GB of VRAM, but its performance would also drop significantly.

To avoid losing too much of the model’s performance, we can quantize the important layers, or parts, of the model to a higher precision and the less important parts to a lower precision. In other words, the model is quantized with mixed precision.

ExLlamaV2 (MIT license) implements mixed-precision quantization.

In this article, I show how to use ExLlamaV2 to quantize models with mixed precision. In particular, we will see how to quantize Llama 2 70B to an average precision lower than 3-bit. For smaller GPUs, I show how to quantize Llama 2 13B with mixed precision. I also benchmark ExLlamaV2’s computational cost for quantization. We will see that the resulting models are very fast for inference.
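For reference, ExLlamaV2 ships a conversion script in its repository. The invocation below is a sketch based on the project’s README at the time of writing; the paths are placeholders and the flags may differ across versions, so check the current documentation before running it:

```shell
# Quantize Llama 2 70B to an average of 2.5 bits per weight with ExLlamaV2.
#  -i  : directory of the original (fp16) model
#  -o  : working directory for temporary files
#  -cf : output directory for the final quantized model
#  -b  : target average bits per weight
python convert.py \
  -i /path/to/Llama-2-70b-hf \
  -o /path/to/workdir \
  -cf /path/to/Llama-2-70b-exl2-2.5bpw \
  -b 2.5
```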

The notebook demonstrating mixed-precision quantization of Llama 2 with ExLlamaV2 is available here:

Get the notebook (#18)

Update (September 6th, 2024): This post for Llama 2 is a bit outdated. I wrote a follow-up article showing how to do it with Llama 3, here:

Run Llama 3 70B on Your GPU with ExLlamaV2 (May 6, 2024)

Llama 2 70B VRAM requirements (4-bit vs mixed precision)
