Llama 3 70B is currently one of the best LLMs. According to public leaderboards such as Chatbot Arena, Llama 3 70B outperforms GPT-3.5 and some versions of GPT-4. However, with its 70 billion parameters, it is a very large model. Inference with Llama 3 70B consumes a lot of GPU RAM: for fast inference on GPUs, we would need 2x80 GB GPUs. This is far from an affordable configuration, but we can significantly reduce these requirements.
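To see where the 2x80 GB figure comes from, here is a back-of-the-envelope estimate of the memory footprint, assuming 16-bit weights; the extra margin for the KV cache and activations is an illustrative assumption, not a measured value.

```python
# Rough GPU memory estimate for Llama 3 70B with 16-bit weights.
num_params = 70e9               # 70 billion parameters
bytes_per_param_fp16 = 2        # float16 / bfloat16
weights_gb = num_params * bytes_per_param_fp16 / 1024**3
print(f"Weights alone: ~{weights_gb:.0f} GB")   # ~130 GB

# Add room for the KV cache and activations (assumed ~10-20% extra),
# and the total exceeds a single 80 GB GPU, hence 2x80 GB for 16-bit inference.
```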
Moving the model to CPU RAM and using a framework optimized for CPU inference, such as Neural Speed, is an appealing alternative. However, CPUs remain much slower than GPUs, especially for batch decoding.
In this article, I show how to quantize Llama 3 70B with mixed precision using ExLlamaV2 and a 24 GB GPU. ExLlamaV2’s quantization method protects the most important weights from aggressive quantization while quantizing the remaining weights to lower precision. I quantized Llama 3 70B to 4, 3.5, 3, 2.5, and 2.18 bits per weight, on average, and benchmarked the resulting models. We will see that quantization below 2.5 bits per weight makes the model small enough to run on a 24 GB GPU.
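To give a concrete idea of how such a quantization run is launched, here is a minimal sketch that calls ExLlamaV2’s convert.py for one target bitrate. The model path, output directories, and exact flag names are assumptions based on the ExLlamaV2 repository and may differ across versions; the full, tested commands are in the notebook linked below.

```python
# Minimal sketch of launching an EXL2 quantization run with ExLlamaV2's convert.py.
# Paths and flags below are assumptions; check the ExLlamaV2 repository for your version.
import subprocess

target_bpw = "2.5"   # average bits per weight (e.g., 4.0, 3.5, 3.0, 2.5, 2.18)

subprocess.run([
    "python", "exllamav2/convert.py",
    "-i", "./Meta-Llama-3-70B-Instruct",            # directory with the original 16-bit model (assumed path)
    "-o", "./work",                                 # working directory for intermediate files
    "-cf", f"./Llama-3-70B-{target_bpw}bpw-exl2",   # output directory for the quantized model
    "-b", target_bpw,                               # target average bits per weight
], check=True)
```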
The notebook implementing Llama 3 70B quantization with ExLlamaV2 and benchmarking the quantized models is here: