Run Llama 3.3 70B on Your GPU with ExLlamaV3
Fast Llama 3.3 70B at 1.75 bits per weight, using only 19 GB!
ExLlama is a framework similar to llama.cpp, but with a primary focus on accelerating inference on GPUs, whereas llama.cpp is designed primarily for CPU-based execution. In a previous article, we explored how ExLlamaV2 handled quantization and inference to produce fast and small LLMs.
When ExLlamaV2 was released, it quickly became the fastest framework for running quantized models. However, the landscape has since evolved, and strong competition has emerged: highly efficient quantization formats like Marlin, and ultra-fast inference frameworks such as vLLM that support them.
Despite this, the ExLlama team has continued to refine its framework, leading to the release of ExLlamaV3. This new version promises even faster inference and a significantly improved quantization algorithm.
In this article, we'll see how to use ExLlamaV3 and how it works under the hood. We'll examine its quantization approach and how it enables very large models to run on a single GPU at impressive speeds. In short, ExLlamaV3 is blazing fast, and the development team is actively working to make it even faster. Its quantization accuracy is also among the best currently available.
The notebook below contains all the code and commands needed to quantize and run LLMs using ExLlamaV3:
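As a preview of what the notebook covers, here is a minimal sketch of the quantization step. The repository URL, flag names, and paths below are assumptions based on the project's conversion script at the time of writing; check the ExLlamaV3 README for the exact, current arguments.

```bash
# Minimal sketch: quantize Llama 3.3 70B to the EXL3 format at 1.75 bits per weight.
# The flags (-i, -o, -w, -b) and repo URL are assumed from the project's README; verify before running.
git clone https://github.com/turboderp-org/exllamav3
cd exllamav3
pip install -r requirements.txt

# -i: original (unquantized) model directory
# -o: output directory for the quantized model
# -w: working directory for intermediate files
# -b: target bits per weight
python convert.py \
  -i /path/to/Llama-3.3-70B-Instruct \
  -o /path/to/Llama-3.3-70B-Instruct-exl3 \
  -w /path/to/work_dir \
  -b 1.75
```

The target bitrate is what determines the memory footprint: at 1.75 bits per weight, the 70B model fits in roughly the 19 GB of VRAM quoted in the subtitle, at the cost of some accuracy compared with higher bitrates.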
ExLlamaV3’s Quantization Algorithm Explained