Run Llama 3.3 70B on Your GPU with ExLlamaV3
Fast Llama 3.3 70B at 1.75 bits per weight, using only 19 GB!
ExLlama is a framework similar to llama.cpp, but with a primary focus on accelerating inference on GPUs, whereas llama.cpp is designed primarily for CPU-based execution. In a previous article, we explored how ExLlamaV2 handled quantization and inference to produce fast and small LLMs.
When ExLlamaV2 was released, it quickly became the fastest framework for running quantized models. However, the landscape has since evolved, and strong competition has emerged: highly efficient quantization formats like Marlin, and ultra-fast inference frameworks such as vLLM that support them.
Despite this, the ExLlama team has continued to refine its framework, leading to the release of ExLlamaV3. This new version promises even faster inference and a significantly improved quantization algorithm.
In this article, we'll see how to use ExLlamaV3 and how it works under the hood. We'll examine its quantization approach and how it enables very large models to run on a single GPU at impressive speeds. In short, ExLlamaV3 is blazing fast, and the development team is actively working to make it even faster. Its quantization accuracy is also among the best currently available.
The notebook below contains all the code and commands needed to quantize and run LLMs using ExLlamaV3:
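As a preview of what the notebook covers, here is a minimal sketch of the quantization step. The repository URL, flag names, and paths below are assumptions based on the project's conversion script at the time of writing; check the ExLlamaV3 README for the exact, current arguments.

```bash
# Minimal sketch: quantize Llama 3.3 70B to the EXL3 format at 1.75 bits per weight.
# The flags (-i, -o, -w, -b) and repo URL are assumed from the project's README; verify before running.
git clone https://github.com/turboderp-org/exllamav3
cd exllamav3
pip install -r requirements.txt

# -i: original (unquantized) model directory
# -o: output directory for the quantized model
# -w: working directory for intermediate files
# -b: target bits per weight
python convert.py \
  -i /path/to/Llama-3.3-70B-Instruct \
  -o /path/to/Llama-3.3-70B-Instruct-exl3 \
  -w /path/to/work_dir \
  -b 1.75
```

The target bitrate is what determines the memory footprint: at 1.75 bits per weight, the 70B model fits in roughly the 19 GB of VRAM quoted in the subtitle, at the cost of some accuracy compared with higher bitrates.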
ExLlamaV3’s Quantization Algorithm Explained