The Kaitchup – AI on a Budget


Mixtral-8x7B VRAM requirements (FP16 vs 4-bit/Q4): run it on consumer GPUs with expert offloading

VRAM usage & speed benchmarks (≈13 GB on a 16 GB GPU; ≈11.7 GB with more offload) with expert-aware quantization

Benjamin Marie
Jan 08, 2024

While Mixtral-8x7B is one of the best open LLMs, it is also huge (46.7B parameters), and its VRAM requirements are correspondingly high: even with 4-bit (Q4) quantization, the model doesn't fully fit on a consumer GPU (an RTX 3090 with 24 GB of VRAM is not enough). Later in this post, I report VRAM usage and tokens/sec with expert offloading and compare against FP16 and other higher-precision settings.
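Back-of-envelope arithmetic makes the problem concrete. At 2 bytes per parameter (FP16), the weights alone need roughly 93 GB; at 4 bits (0.5 bytes per parameter), roughly 23 GB, which still leaves almost no room for the KV cache and activations on a 24 GB card:

```python
# Back-of-envelope VRAM estimate for Mixtral-8x7B weights.
# Assumes 46.7B parameters and ignores KV-cache/activation overhead,
# which only adds to these totals.
N_PARAMS = 46.7e9

def weight_gb(bytes_per_param: float) -> float:
    """Approximate weight memory in GB at a given precision."""
    return N_PARAMS * bytes_per_param / 1e9

print(f"FP16 weights:  ~{weight_gb(2.0):.1f} GB")   # 2 bytes/param
print(f"4-bit weights: ~{weight_gb(0.5):.1f} GB")   # 0.5 bytes/param
```

Even quantized to 4-bit, the weights nearly fill a 24 GB GPU before a single token is generated, which is why offloading part of the model is unavoidable on consumer hardware.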

Mixtral-8x7B is a mixture of experts (MoE). It is made of 8 expert sub-networks of 6 billion parameters each. I explained in more detail how the model works in this article:

Mixtral-8x7B: Understanding and Running the Sparse Mixture of Experts by Mistral AI

Benjamin Marie · December 12, 2023

Since only 2 of the 8 experts are active for each token during decoding, the 6 remaining experts can be moved, or offloaded, to another device, e.g., CPU RAM, to free up GPU VRAM. In practice, this offloading is complicated.

Which experts to activate is decided at inference time, for each input token and at each layer of the model. Naively moving parts of the model to CPU RAM, as with Accelerate's device_map, would create a communication bottleneck between the CPU and the GPU.
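For contrast, this is roughly what the naive approach looks like with Transformers and Accelerate (a sketch, assuming enough CPU RAM and disk; the model download itself is large). The split is decided once at load time, so every decoding step that routes to an offloaded expert waits on CPU-to-GPU transfers:

```python
# Naive static offloading with Hugging Face Transformers + Accelerate.
# device_map="auto" assigns whole modules to GPU/CPU/disk once, at load
# time; it does not adapt to which experts each token actually routes to.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    torch_dtype=torch.float16,   # halve weight memory vs. FP32
    device_map="auto",           # static GPU/CPU split decided at load time
    offload_folder="offload",    # spill what doesn't fit in RAM to disk
)
```

This loads and runs, but generation is throttled by the constant traffic of expert weights over the PCIe bus, which is exactly the bottleneck mixtral-offloading is designed to avoid.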

Mixtral-offloading (MIT license) is a project that proposes a much more efficient solution to reduce VRAM consumption while preserving a reasonable inference speed.


In this article, I explain how mixtral-offloading implements expert-aware quantization and expert offloading to save memory and maintain a good inference speed. Using this framework, we will see how to run Mixtral-8x7B on consumer hardware and benchmark its inference speed.
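The core idea behind expert offloading can be sketched as a cache: keep a budget of recently used experts resident on the GPU, and pull the rest from CPU RAM only on a miss. This is an illustration of the caching principle, not mixtral-offloading's actual implementation (the names `ExpertCache`, `fetch`, and `gpu_budget` here are my own):

```python
from collections import OrderedDict

# Conceptual sketch of expert offloading via an LRU cache, assuming a
# fixed budget of experts resident on the GPU at any one time.
class ExpertCache:
    def __init__(self, gpu_budget: int):
        self.gpu_budget = gpu_budget  # max experts kept "on GPU"
        self.on_gpu = OrderedDict()   # expert_id -> weights (insertion order = LRU order)
        self.transfers = 0            # CPU -> GPU copies performed

    def fetch(self, expert_id, load_from_cpu):
        """Return an expert's weights, copying from CPU RAM on a miss."""
        if expert_id in self.on_gpu:
            self.on_gpu.move_to_end(expert_id)  # mark as most recently used
            return self.on_gpu[expert_id]
        self.transfers += 1
        if len(self.on_gpu) >= self.gpu_budget:
            self.on_gpu.popitem(last=False)     # evict least recently used
        weights = load_from_cpu(expert_id)
        self.on_gpu[expert_id] = weights
        return weights

# Simulated routing: repeated experts hit the cache and skip the copy.
cache = ExpertCache(gpu_budget=2)
for eid in [0, 1, 0, 2, 0, 1]:
    cache.fetch(eid, load_from_cpu=lambda e: f"weights[{e}]")
print(cache.transfers)  # → 4 (two of the six lookups were cache hits)
```

Because consecutive tokens tend to reuse some of the same experts, even a small GPU-resident budget cuts a large fraction of the CPU-to-GPU traffic, which is what keeps inference speed reasonable.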

The tutorial section is also available as a notebook that you can find here:

Get the notebook (#37)
