Fine-tuning large language models requires a huge amount of GPU memory: we need enough to load the model and to store the optimizer states. For a 7 billion parameter model such as Mistral 7B, loading the model takes 14 GB of GPU memory, assuming 16-bit parameters. Moreover, for each parameter, the standard AdamW optimizer creates and stores two additional 32-bit values (the first and second moments of the gradients), i.e., we need an extra 56 GB of memory. That’s already a total of 70 GB!
This is without counting the gradients, the additional copies of the model made at various stages of fine-tuning, and the model’s activations, whose memory consumption is also significant but difficult to estimate since it depends on the hyperparameters (batch size, sequence length, etc.). A single 80 GB A100 or H100 GPU wouldn’t be enough to fully fine-tune a 7B model.
Storing the optimizer states is by far the most expensive part. Even if we quantize them to 8-bit, they would still require 14 GB of memory.
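To make this arithmetic explicit, here is a quick back-of-the-envelope estimate in Python. It assumes exactly 7 billion parameters and 1 GB = 10⁹ bytes, and it deliberately ignores gradients, activations, and temporary buffers:

```python
# Back-of-the-envelope GPU memory estimate for fully fine-tuning a 7B model
# with AdamW, matching the numbers quoted above (1 GB = 1e9 bytes here).
NUM_PARAMS = 7e9  # parameter count assumed for a "7B" model such as Mistral 7B

def to_gb(num_bytes: float) -> float:
    return num_bytes / 1e9

model_fp16 = NUM_PARAMS * 2       # 16-bit weights: 2 bytes per parameter
adamw_fp32 = NUM_PARAMS * 2 * 4   # 2 AdamW states per parameter, 4 bytes each
adamw_int8 = NUM_PARAMS * 2 * 1   # the same 2 states quantized to 8-bit

print(f"Model (16-bit):         {to_gb(model_fp16):.0f} GB")                # 14 GB
print(f"AdamW states (32-bit):  {to_gb(adamw_fp32):.0f} GB")                # 56 GB
print(f"Total (model + states): {to_gb(model_fp16 + adamw_fp32):.0f} GB")   # 70 GB
print(f"AdamW states (8-bit):   {to_gb(adamw_int8):.0f} GB")                # 14 GB
```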
GaLore significantly reduces this memory consumption, to the point where full fine-tuning of a 7B model becomes possible on a 24 GB consumer GPU. Moreover, unlike parameter-efficient fine-tuning (PEFT) methods such as LoRA, which often fall short of full fine-tuning, GaLore performs comparably to it.
In this article, I present GaLore: we will see how it works and why it is more memory-efficient. Then, I show how to use it to fully fine-tune Mistral 7B on consumer hardware. This tutorial can also be used to fully fine-tune most LLMs supported by Hugging Face Transformers.
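As a preview of what the tutorial covers, here is a minimal sketch of how full fine-tuning with GaLore can be launched through the Trainer. It assumes a recent version of Transformers with GaLore support and the galore-torch package installed; the dataset and hyperparameters below are placeholders, not the exact recipe used in the notebook:

```python
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Placeholder dataset: any instruction dataset with a "text" column works similarly.
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")
dataset = dataset.map(lambda x: tokenizer(x["text"], truncation=True, max_length=512),
                      batched=True, remove_columns=dataset.column_names)

training_args = TrainingArguments(
    output_dir="./mistral7b-galore",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=1e-5,
    num_train_epochs=1,
    bf16=True,
    optim="galore_adamw",                  # GaLore variant of AdamW (needs galore-torch)
    optim_target_modules=["attn", "mlp"],  # apply GaLore to the attention and MLP modules
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```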
The notebook implementing GaLore’s full fine-tuning for Mistral 7B is available here: