Run Llama 2 70B on Your GPU with ExLlamaV2
Finding the optimal mixed-precision quantization for your hardware
The largest and best model of the Llama 2 family has 70 billion parameters. One fp16 parameter weighs 2 bytes. Loading Llama 2 70B requires 140 GB of memory (70 billion * 2 bytes).
What are Llama 2 70B’s GPU requirements?
This is challenging. A high-end consumer GPU, such as the NVIDIA RTX 3090* or 4090*, has at most 24 GB of VRAM. If we quantize Llama 2 70B to 4-bit precision, we still need 35 GB of memory (70 billion * 0.5 bytes). The model could fit across 2 consumer GPUs. Note: I provide more details on the GPU requirements in the next section.
We could reduce the precision to 2-bit. The model would then fit into 24 GB of VRAM, but its performance would also drop significantly.
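To make these numbers concrete, here is a quick back-of-the-envelope calculation (a simple illustrative Python snippet, not part of ExLlamaV2):

n_params = 70e9  # Llama 2 70B has 70 billion parameters
for bits in (16, 4, 2):
    # bits per parameter -> bytes -> gigabytes
    print(f"{bits}-bit: {n_params * bits / 8 / 1e9:.1f} GB")
# prints: 16-bit: 140.0 GB, 4-bit: 35.0 GB, 2-bit: 17.5 GB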
To avoid losing too much of the model's performance, we could quantize the important layers, or parts, of the model to a higher precision and the less important parts to a lower precision. The model would be quantized with mixed precision.
ExLlamaV2 (MIT license) implements mixed-precision quantization.
In this article, I show how to use ExLlamaV2 to quantize models with mixed precision. More particularly, we will see how to quantize Llama 2 70B to an average precision lower than 3-bit. For smaller GPUs, I show how to quantize Llama 2 13B with mixed precision. I also benchmark ExLlamaV2’s computational cost for quantization. We will see that the resulting models are very fast for inference.
The notebook demonstrating mixed-precision quantization of Llama 2 with ExLlamaV2 is available here:
Update (September 6th, 2024): This post for Llama 2 is a bit outdated. I wrote a follow-up article showing how to do it with Llama 3, here:
Note: The links marked with a “*” are Amazon affiliate links.
Llama 2 70B Requirements
Software Requirements
To quantize Llama 2 70B with mixed precision and run it, we need to install ExLlamaV2.
Install it from source:
git clone https://github.com/turboderp/exllamav2
cd exllamav2
pip install -r requirements.txt
We will download models from Hugging Face Hub. We need to install transformers:
pip install transformers
Llama 2 70B GPU Requirements
As for the hardware requirements, we aim to run models on consumer GPUs. I benchmarked various GPUs to run LLMs, here:
Llama 2 70B: We target 24 GB of VRAM. NVIDIA RTX 3090/4090 GPUs would work.
The NVIDIA RTX 3090* is less expensive but slower than the RTX 4090*. If you do a lot of AI experiments, I recommend the RTX 4090*. It will save you a lot of time.
Llama 2 13B: We target 12 GB of VRAM. Many GPUs with at least 12 GB of VRAM are available. The RTX 3060/3080/4060/4080 are some of them.
If you are looking for a GPU under $500, the 16 GB version of the RTX 4060 Ti* has the best value. It is fast and has enough VRAM for Llama 2 13B.
How to quantize with mixed precision using ExLlamaV2
The quantization algorithm used by ExLlamaV2 is similar to GPTQ. But instead of choosing one precision type, ExLlamaV2 tries different precisions for each layer while measuring the quantization errors. All the tries and their associated error rates are saved. Then, given a target precision provided by the user, ExLlamaV2 quantizes the model by choosing, for each layer's module, the quantization precision that, on average over the model, reaches the target precision with the lowest error rate.
During quantization, ExLlamaV2 outputs all the tries:
Quantization tries for the 10th layer’s up_proj module of Llama 2 13B
-- Linear: model.layers.10.mlp.up_proj
-- 0.05:3b/0.95:2b 32g s4 2.18 bpw rfn_error: 0.21867
-- 0.25:3b/0.75:2b 32g s4 2.38 bpw rfn_error: 0.20617
-- 0.25:4b/0.75:2b 32g s4 2.63 bpw rfn_error: 0.20230
-- 0.1:4b/0.4:3b/0.5:2b 32g s4 2.73 bpw rfn_error: 0.18449
-- 0.1:4b/0.9:3b 32g s4 3.23 bpw rfn_error: 0.10229
-- 0.2:6b/0.8:3b 32g s4 3.73 bpw rfn_error: 0.09791
-- 1.0:3b 128g s4 3.03 bpw rfn_error: 0.11354
-- 1.0:3b 32g s4 3.13 bpw rfn_error: 0.10491
-- 0.05:4b/0.95:3b 32g s4 3.18 bpw rfn_error: 0.10363
-- 0.4:4b/0.6:3b 32g s4 3.53 bpw rfn_error: 0.09272
-- 0.6:4b/0.4:3b 64g s4 3.66 bpw rfn_error: 0.08835
-- 1.0:4b 128g s4 4.03 bpw rfn_error: 0.05756
-- 1.0:4b 32g s4 4.13 bpw rfn_error: 0.05007
-- 0.1:5b/0.9:4b 32g s4 4.23 bpw rfn_error: 0.04889
-- 0.1:6b/0.9:4b 32g s4 4.33 bpw rfn_error: 0.04861
-- 1.0:5b 128g s4 5.03 bpw rfn_error: 0.02879
-- 0.1:6b/0.9:5b 32g s4 5.23 bpw rfn_error: 0.02494
-- 0.05:8b/0.05:6b/0.9:5b 32g s4 5.33 bpw rfn_error: 0.02486
-- 0.4:6b/0.6:5b 32g s4 5.53 bpw rfn_error: 0.02297
-- 0.1:8b/0.3:6b/0.6:5b 32g s4 5.73 bpw rfn_error: 0.02280
-- 1.0:6b 128g s4 6.03 bpw rfn_error: 0.01503
-- 1.0:6b 32g s4 6.13 bpw rfn_error: 0.01471
-- 0.1:8b/0.9:6b 128g s4 6.23 bpw rfn_error: 0.01463
-- 1.0:8b 32g s4 8.13 bpw rfn_error: 0.00934
-- Time: 19.57 seconds
We can see that the error rate decreases as the quantization precision (bpw, i.e., bits per weight) increases, as expected.
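To give an intuition of how such per-module measurements can be turned into a mixed-precision plan, here is a deliberately simplified sketch: start every module at its cheapest option, then greedily spend the remaining bit budget where it reduces the error the most. This only illustrates the idea; it is not ExLlamaV2's actual optimization code.

def plan_mixed_precision(options_per_module, target_bpw):
    # options_per_module: one list of (bpw, rfn_error) tuples per module,
    # as printed by convert.py above.
    plan = [min(opts, key=lambda o: o[0]) for opts in options_per_module]
    budget = target_bpw * len(plan) - sum(bpw for bpw, _ in plan)
    while True:
        best = None
        for i, opts in enumerate(options_per_module):
            cur_bpw, cur_err = plan[i]
            for bpw, err in opts:
                extra = bpw - cur_bpw
                if 0 < extra <= budget and err < cur_err:
                    gain = (cur_err - err) / extra  # error reduction per extra bit
                    if best is None or gain > best[0]:
                        best = (gain, i, (bpw, err))
        if best is None:
            break
        _, i, choice = best
        budget -= choice[0] - plan[i][0]
        plan[i] = choice
    return plan  # one (bpw, error) choice per module, averaging at most target_bpw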
Quantization with ExLlamaV2 is as simple as running the convert.py script:
Note: convert.py is in the root directory of ExLlamaV2
python convert.py \
-i ./Llama-2-13b-hf/ \
-o ./Llama-2-13b-hf/temp/ \
-c test.parquet \
-cf ./Llama-2-13b-hf/3.0bpw/ \
-b 3.0
ExLlamaV2's convert.py doesn't pull models or data from the Hugging Face Hub. It expects the model and the calibration dataset to be stored locally (a preparation sketch follows the argument list below).
The script’s main arguments are the following:
input model (-i): A local directory that contains the model in the “safetensors” format.
dataset used for calibration (-c): ExLlamaV2 uses the GPTQ algorithm. We need a dataset for calibration. It must be stored locally in the “parquet” format.
output directory (-cf): The local directory in which the quantized model will be saved.
Target precision of the quantization (-b): The model will be quantized with a mixed precision which will be on average the targeted precision. Here, I chose to target a 3-bit precision.
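For reference, here is one way to prepare these local files, assuming you have accepted Meta's license for the gated meta-llama/Llama-2-13b-hf repository and are logged in with the Hugging Face CLI. I use wikitext as calibration data here, but any representative text dataset saved as parquet would work (this sketch also requires the huggingface_hub and datasets packages):

from huggingface_hub import snapshot_download
from datasets import load_dataset

# Download the original fp16 model (safetensors files) to a local directory.
snapshot_download(repo_id="meta-llama/Llama-2-13b-hf",
                  local_dir="./Llama-2-13b-hf/",
                  local_dir_use_symlinks=False)

# Save a calibration dataset locally in the parquet format.
calibration = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
calibration.to_parquet("test.parquet")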
This quantization took 2 hours and 5 minutes. I used Google Colab PRO with the T4 GPU and high CPU RAM. It didn’t consume more than 5 GB of VRAM during the entire process, but there was a peak consumption of 20 GB of CPU RAM.
The T4 is quite slow. The quantization time could be reduced with Google Colab V100 or an RTX GPU. Note: It’s unclear to me how much the GPU is used during quantization. It might be that the CPU speed has more impact on the quantization time than the GPU.
To quantize Llama 2 70B, you can do the same.
What precision should we target so that the quantized Llama 2 70B fits into 24 GB of VRAM?
Here is the method you can apply to decide on the precision of a model given your hardware.
Let’s say we have 24 GB of VRAM. We should also always expect some memory overhead for inference. So let’s target a quantized model size of 22 GB.
First, we need to convert 22 GB into bits:
22 GB = 2.2e+10 bytes = 1.76e+11 bits (since 1 byte = 8 bits)
We have 1.76e+11 bits (b) available. Llama 2 70B has 7e+10 parameters (p) to be quantized. We target a precision that I denote bpw.
bpw = b/p
bpw = 176 000 000 000 / 70 000 000 000 = 2.51
So we can afford an average precision of 2.51 bits per parameter.
I round it down to 2.5 bits.
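The same calculation, as a small helper you can adapt to your own GPU (an illustrative snippet; the 2 GB overhead is the assumption made above):

def target_bpw(vram_gb, n_params, overhead_gb=2.0):
    # Reserve some VRAM for inference overhead, convert the rest to bits,
    # then divide by the number of parameters to be quantized.
    budget_bits = (vram_gb - overhead_gb) * 1e9 * 8
    return budget_bits / n_params

print(round(target_bpw(24, 70e9), 2))  # 2.51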
To quantize Llama 2 70B to an average precision of 2.5 bits, we run:
python convert.py \
-i ./Llama-2-70b-hf/ \
-o ./Llama-2-70b-hf/temp/ \
-c test.parquet \
-cf ./Llama-2-70b-hf/2.5bpw/ \
-b 2.5
This quantization is also feasible on consumer hardware with a 24 GB GPU. It can take up to 15 hours. If you want to use Google Colab for this one, note that you will have to store the original model outside of Google Colab's hard drive, since the drive is too small when using the A100 GPU.
Running Llama 2 70B on Your GPU with ExLlamaV2
ExLlamaV2 already provides all you need to run models quantized with mixed precision.
There is a chat.py script that will run the model as a chatbot for interactive use. You can also simply test the model with test_inference.py. This is what we will do to check the model speed and memory consumption.
If you didn’t quantize Llama 2 70B by yourself, you can download one quantized by ExLlamaV2 from the Hugging Face Hub:
The repository has many branches for different mixed-precision versions of the model. Here, we choose the branch “2.5bpw” since 2.5 bpw is our target precision.
If you use Hugging Face transformers, you can download it with snapshot_download:
from huggingface_hub import snapshot_download
snapshot_download(repo_id="turboderp/Llama2-70B-exl2", ignore_patterns=["*.bin"], revision="2.5bpw", local_dir="./Llama-2-70b-2.5bpw/", local_dir_use_symlinks=False)
To test this model, we run:
python test_inference.py -m ./Llama-2-70b-2.5bpw/ -p "Once upon a time,"
Note: “-p” is the testing prompt.
It should take several minutes the first time (8 minutes on an A100 GPU). ExLlamaV2 uses “torch.compile”. According to the PyTorch documentation:
torch.compile makes PyTorch code run faster by JIT-compiling PyTorch code into optimized kernels, all while requiring minimal code changes.
This compilation is time-consuming, but its result is cached. If you run test_inference.py again, it should take only 30 seconds.
The model itself weighs exactly 22.15 GB. During my inference experiments, it occupied exactly 24 GB. It barely fits on our consumer GPU.
Why doesn’t it consume only 22.15 GB?
The model in memory actually occupies 22.15 GB but the inference itself also consumes additional memory. For instance, we have to encode the prompt and store it in memory. Also, if you set a higher max sequence length or do batch decoding, inference will consume more memory.
I used the A100 of Google Colab for this experiment. If you use a GPU with 24 GB, you will likely get a CUDA out-of-memory error during inference, especially if you also use the GPU to run your OS graphical user interface (e.g., Ubuntu Desktop consumes around 1.5 GB of VRAM).
To give yourself some margin, target a lower bpw: 2.4 or even 2.3 would leave several GB of VRAM available for inference.
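Another option, if you load the model from Python rather than through test_inference.py, is to cap the maximum sequence length so that the KV cache stays small. The snippet below follows the pattern of ExLlamaV2's example scripts; since the library is very new, class names and default settings may differ between versions:

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "./Llama-2-70b-2.5bpw/"
config.prepare()
config.max_seq_len = 2048  # a shorter context keeps the KV cache small

model = ExLlamaV2(config)
model.load()
tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)  # cache allocated for max_seq_len tokens

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.85

print(generator.generate_simple("Once upon a time,", settings, 128))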
With 2.5 bpw, the quality of the generation remains acceptable. I obtained this output from test_inference.py:
Once upon a time, there was a little girl who lived in a castle. She had everything a princess could want, and she was the most beautiful girl in the world.
One day, she looked in the mirror and said, “Mirror, mirror, on the wall, who is the fairest of them all?”
The mirror replied, “You are the most beautiful girl in the world.”
But the princess wasn’t satisfied. She wanted to be more beautiful. So she went to a witch and asked her to make her even more beautiful.
The witch gave her a potion, and the princess d
Prompt processed in 0.08 seconds, 5 tokens, 65.70 tokens/second
Response generated in 7.43 seconds, 128 tokens, 17.23 tokens/second
ExLlamaV2 models are also extremely fast. I observed a generation speed between 15 and 30 tokens/second. To give you a point of comparison, when I benchmarked Llama 2 7B quantized to 4-bit with GPTQ, a model 10 times smaller, I obtained a speed of around 28 tokens/sec using Hugging Face transformers for generation.
Conclusion
Quantization with mixed precision is intuitive: we aggressively lower the precision of the model where it has the least impact.
Running huge models such as Llama 2 70B is possible on a single consumer GPU.
Note also that, at the time of writing, ExLlamaV2 is only two weeks old. The framework is likely to become faster and easier to use. Hopefully, ExLlamaV2 will be natively supported by Hugging Face transformers.