Run Llama 3.3 70B on Your GPU with ExLlamaV3

Fast Llama 3.3 70B at 1.75 bits per weight, using only 19 GB!

Benjamin Marie
Apr 17, 2025
Image generated with ChatGPT

ExLlama is a framework similar to llama.cpp, but its primary focus is accelerating inference on GPUs, whereas llama.cpp is designed with CPU-based execution in mind. In a previous article, we explored how ExLlamaV2 handles quantization and inference to produce fast, small LLMs.

Run Llama 3.1 70B Instruct on Your GPU with ExLlamaV2 (2.2, 2.5, 3.0, and 4.0-bit)
Benjamin Marie · August 29, 2024
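As a quick refresher, inference with an EXL2-quantized model through ExLlamaV2's Python API looks roughly like the following. This is a minimal sketch based on the project's documented dynamic generator; the model directory is a placeholder for any EXL2 checkpoint:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

model_dir = "path/to/an-exl2-quantized-model"  # placeholder EXL2 checkpoint

# Load the model, splitting it automatically across available GPU memory
config = ExLlamaV2Config(model_dir)
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

# The dynamic generator handles batching and scheduling internally
generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)
print(generator.generate(prompt="Once upon a time,", max_new_tokens=100))
```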

When ExLlamaV2 was released, it quickly became the fastest framework for running quantized models. However, the landscape has since evolved, with strong competition emerging: highly efficient quantization formats like Marlin, and ultra-fast inference frameworks such as vLLM that support them.
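To illustrate that competition, here is a minimal sketch of running a 4-bit GPTQ checkpoint with vLLM, which on supported GPUs can dispatch such weights to its Marlin kernels. The model id below is a placeholder for any compatible GPTQ checkpoint:

```python
from vllm import LLM, SamplingParams

# Placeholder: any 4-bit GPTQ checkpoint compatible with the Marlin kernels
llm = LLM(
    model="path/or/hub-id-of-a-gptq-4bit-model",
    quantization="gptq_marlin",  # request the Marlin GPTQ kernels explicitly
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["What makes Marlin kernels fast?"], params)
print(outputs[0].outputs[0].text)
```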

Despite this, the ExLlama team has continued to refine its framework, leading to the release of ExLlamaV3. This new version promises even faster inference and a significantly improved quantization algorithm.


In this article, we'll see how to use ExLlamaV3 and understand how it works under the hood. We'll examine its quantization approach and how it enables very large models to run on a single GPU at impressive speeds. In short, ExLlamaV3 is blazing fast, and the development team is actively working to make it even faster. Its quantization accuracy is also among the best currently available.
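The headline figure is easy to sanity-check with back-of-the-envelope arithmetic (the parameter count and overhead below are approximations):

```python
# Rough memory estimate for Llama 3.3 70B quantized to 1.75 bits per weight
n_params = 70.6e9   # Llama 3.3 70B has roughly 70.6 billion parameters
bpw = 1.75          # target bits per weight

weights_gb = n_params * bpw / 8 / 1e9   # bits -> bytes -> GB
print(f"Quantized weights: ~{weights_gb:.1f} GB")  # ~15.4 GB

# A few extra GB for the KV cache, activations, and runtime overhead bring
# the total close to the ~19 GB reported in this article's subtitle.
```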

The notebook below contains all the code and commands needed to quantize and run LLMs using ExLlamaV3:

Get the notebook (#158)
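For reference, quantizing a model to ExLlamaV3's EXL3 format goes through the conversion script shipped in the repository, along these lines. The flag names below (input, output, working directory, and target bits per weight) reflect my reading of the ExLlamaV3 repository and should be treated as an assumption; check the repo's README for the exact interface:

```
# Quantize an original Hugging Face checkpoint to EXL3 at 1.75 bits per weight
# -i: input model dir, -o: output dir, -w: working dir, -b: bits per weight
python convert.py -i ./Llama-3.3-70B-Instruct -o ./Llama-3.3-70B-Instruct-exl3 -w ./work -b 1.75
```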

ExLlamaV3’s Quantization Algorithm Explained
