Llama 3 70B is currently one of the best LLMs. According to public leaderboards such as Chatbot Arena, it outperforms GPT-3.5 and some versions of GPT-4. However, with 70 billion parameters, it is a very large model: in 16-bit precision, its weights alone occupy 140 GB, so inference requires at least that much GPU RAM. For fast GPU inference, we would need two 80 GB GPUs, which is far from an affordable configuration.
Moving the model to CPU RAM and using a framework optimized for CPU inference, such as Neural Speed, is an appealing alternative. However, CPUs remain much slower than GPUs for LLM inference, especially for batch decoding.
With quantization, we can shrink the model so that it fits on a single GPU. 4-bit quantization divides the 16-bit model size by nearly 4, but the result would still require a costly 40 GB GPU. With 2-bit quantization, Llama 3 70B could fit on a 24 GB consumer GPU, but at such low precision the model's accuracy can drop significantly.
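The size figures above follow from simple arithmetic on the parameter count. As a rough sketch (70e9 is an approximation of the parameter count, and runtime overheads such as the KV cache and activations are ignored):

```python
# Rough weight-storage arithmetic for Llama 3 70B at various bit widths.
# 70e9 approximates the parameter count; KV cache and activation memory
# are not included, which is why a bit width must land comfortably
# below the GPU's capacity, not just at it.
PARAMS = 70e9

def model_size_gb(bits_per_weight: float) -> float:
    """Approximate weight storage in GB for a given average bit width."""
    return PARAMS * bits_per_weight / 8 / 1e9

for bpw in (16, 4, 3.5, 3, 2.5, 2.18):
    size = model_size_gb(bpw)
    fits = "<= 24 GB" if size <= 24 else "> 24 GB"
    print(f"{bpw:>5} bpw -> {size:6.1f} GB of weights ({fits})")
```

At 16 bits this yields the 140 GB figure, 4 bits gives 35 GB, and only averages around 2.5 bits and below bring the weights near the 24 GB mark.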
In this article, I show how to quantize Llama 3 70B to mixed precision with ExLlamaV2 and a 24 GB GPU. ExLlamaV2's quantization method shields the most important weights from aggressive quantization while quantizing the remaining weights more coarsely. I quantized Llama 3 70B to 4, 3.5, 3, 2.5, and 2.18 bits per weight, on average, and benchmarked the resulting models. We will see that quantization below 2.5 bits per weight makes the model small enough to run on a 24 GB GPU.
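For reference, ExLlamaV2 performs the conversion with its convert.py script. The command below is a sketch, not the article's exact invocation: the directory paths are placeholders, and the flags reflect the script's usage at the time of writing, so check the ExLlamaV2 repository for the current interface.

```shell
# Hedged sketch of quantizing Llama 3 70B to ~2.18 bits per weight
# with ExLlamaV2's convert.py. Paths are placeholders.
#   -i  : directory containing the original (fp16) model
#   -o  : working directory for temporary files
#   -cf : output directory for the compiled quantized model
#   -b  : target average bits per weight
python convert.py \
  -i ./Meta-Llama-3-70B-Instruct \
  -o ./working \
  -cf ./Llama-3-70B-2.18bpw \
  -b 2.18
```

The `-b` value is an average: ExLlamaV2 allocates more bits to the layers it measures as most sensitive, which is how the mixed-precision scheme described above is realized.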
The notebook implementing Llama 3 70B quantization with ExLlamaV2 and benchmarking the quantized models is here: