Estimate the Memory Consumption of LLMs for Inference and Fine-tuning
A close look at the memory consumption of Command-R+, Mixtral-8x22B, and Llama 3 70B
With Command-R+, Mixtral-8x22B, and Llama 3 70B all released within a few weeks of each other, we now have LLMs that perform increasingly close to the best GPT-4 models. However, these models are huge. They all have more than 70 billion parameters:
Command-R+: A 104B parameter model
Mixtral-8x22B: A mixture-of-experts (MoE) model with 141B parameters
Llama 3 70B: A model with 70.6B parameters
Can you fine-tune and run these models on your computer?
In this article, I explain and analyze their memory consumption for inference and fine-tuning. The method I present applies to any transformer LLM and estimates its memory consumption without downloading the model. We will see that, while the memory consumption of Command-R+, Mixtral-8x22B, and Llama 3 70B is huge, there are several techniques to significantly reduce it, such as quantization and memory-efficient optimizers.
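As a first intuition before the detailed analysis, here is a minimal back-of-the-envelope sketch (not the notebook mentioned below) that estimates only the memory needed to hold the weights, assuming 16-bit parameters and ignoring the KV cache and activations:

```python
# Rough estimate of the memory needed just to load the weights.
# Assumption: float16/bfloat16 weights, i.e., 2 bytes per parameter.
# The KV cache, activations, and framework overhead come on top of this.

def weight_memory_gb(num_params: float, bytes_per_param: float = 2.0) -> float:
    """Memory for the model weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

# Parameter counts quoted above.
for name, n_params in [("Command-R+", 104e9),
                       ("Mixtral-8x22B", 141e9),
                       ("Llama 3 70B", 70.6e9)]:
    print(f"{name}: ~{weight_memory_gb(n_params):.0f} GB for the weights alone")
```

Even this lower bound (roughly 208 GB, 282 GB, and 141 GB, respectively) already exceeds the memory of any single consumer GPU, which is why the rest of this article looks at the full picture and at techniques to shrink it.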
I made a notebook that can automatically estimate the memory consumption of a transformer model for inference and fine-tuning. You can find it here: