Estimate the Memory Consumption of LLMs for Inference and Fine-tuning
A close look at the memory consumption of Command-R+, Mixtral-8x22B, and Llama 3 70B
With Command-R+, Mixtral-8x22B, and Llama 3 70B all released within a few weeks of each other, we now have LLMs that perform increasingly close to the best GPT-4 models. However, these models are huge. They all have more than 70 billion parameters:
Command-R+: A 104B parameter model
Mixtral-8x22B: A mixture-of-experts (MoE) model with 141B parameters
Llama 3 70B: A model with 70.6B parameters
Can you fine-tune and run these models on your computer?
In this article, I explain and analyze their memory consumption for inference and fine-tuning. The method I present applies to any transformer LLM and estimates its memory consumption without downloading the model. We will see that, while the memory consumption of Command-R+, Mixtral-8x22B, and Llama 3 70B is huge, there are several techniques to significantly reduce it, such as quantization and memory-efficient optimizers.
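As a first intuition before the detailed analysis, here is a minimal back-of-the-envelope sketch (not the notebook mentioned below) that estimates only the memory needed to hold the weights, assuming 16-bit parameters and ignoring the KV cache and activations:

```python
# Rough estimate of the memory needed just to load the weights.
# Assumption: float16/bfloat16 weights, i.e., 2 bytes per parameter.
# The KV cache, activations, and framework overhead come on top of this.

def weight_memory_gb(num_params: float, bytes_per_param: float = 2.0) -> float:
    """Memory for the model weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

# Parameter counts quoted above.
for name, n_params in [("Command-R+", 104e9),
                       ("Mixtral-8x22B", 141e9),
                       ("Llama 3 70B", 70.6e9)]:
    print(f"{name}: ~{weight_memory_gb(n_params):.0f} GB for the weights alone")
```

Even this lower bound (roughly 208 GB, 282 GB, and 141 GB, respectively) already exceeds the memory of any single consumer GPU, which is why the rest of this article looks at the full picture and at techniques to shrink it.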
I made a notebook that can automatically estimate the memory consumption of a transformer model for inference and fine-tuning. You can find it here: