Run Llama 2 70B on Your GPU: 4-bit VRAM requirement (ExLlamaV2)
4-bit VRAM requirements + mixed-precision quantization for your hardware
The largest and most capable model of the Llama 2 family has 70 billion parameters. One fp16 parameter occupies 2 bytes, so loading Llama 2 70B requires 140 GB of memory (70 billion * 2 bytes).
What are Llama 2 70B’s 4-bit VRAM requirements?
This is challenging. A high-end consumer GPU, such as the NVIDIA RTX 3090 or 4090, has at most 24 GB of VRAM. Quantized to 4-bit, Llama 2 70B requires ~35 GB (70 billion * 0.5 bytes), so it won’t fit on a single 24 GB GPU. It could, however, fit across 2 consumer GPUs. Note: I provide more details on the GPU requirements in the next section.
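The arithmetic above can be sketched in a few lines. This is a rough estimate of the memory taken by the weights alone; it ignores the KV cache, activations, and framework overhead, so real usage is higher.

```python
def weights_vram_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory occupied by the model weights alone, in GB."""
    return n_params * bits_per_weight / 8 / 1e9  # bits -> bytes -> GB

# Llama 2 70B at different precisions
for bits in (16, 4, 2):
    print(f"{bits:>2}-bit: {weights_vram_gb(70e9, bits):.1f} GB")
```

Running this prints 140.0 GB for fp16, 35.0 GB for 4-bit, and 17.5 GB for 2-bit, matching the figures above.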
We could reduce the precision to 2-bit. The model would then fit into 24 GB of VRAM (~17.5 GB), but its performance would also drop significantly.
To avoid losing too much model performance, we can quantize the important layers, or parts, of the model to a higher precision and the less important parts to a lower precision. The model is then quantized with mixed precision.
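To see why mixed precision helps, consider the average number of bits per weight it produces. The split below (10% of the weights at 4-bit, the rest at 2.5-bit) is a hypothetical example, not ExLlamaV2's actual allocation, which it chooses per layer by measuring quantization error:

```python
def average_bpw(splits):
    """splits: list of (fraction_of_weights, bits) pairs; fractions sum to 1."""
    return sum(frac * bits for frac, bits in splits)

# Hypothetical split: keep 10% of weights at 4-bit, the rest at 2.5-bit
avg = average_bpw([(0.10, 4.0), (0.90, 2.5)])
print(f"average: {avg:.2f} bits/weight")
print(f"70B weights: {70e9 * avg / 8 / 1e9:.1f} GB")
```

With this split, the average is 2.65 bits per weight, and the 70B weights take about 23.2 GB, below the 24 GB budget of a single consumer GPU.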
ExLlamaV2 (MIT license) implements mixed-precision quantization.
In this article, I show how to use ExLlamaV2 to quantize models with mixed precision. In particular, we will see how to quantize Llama 2 70B to an average precision below 3 bits per weight. For smaller GPUs, I also show how to quantize Llama 2 13B with mixed precision. Finally, I benchmark ExLlamaV2’s computational cost for quantization. We will see that the resulting models are very fast for inference.
The notebook demonstrating mixed-precision quantization of Llama 2 with ExLlamaV2 is available here:
Update (September 6th, 2024): This post for Llama 2 is a bit outdated. I wrote a follow-up article showing how to do it with Llama 3, here: