Mistral 7B: QLoRA Fine-tuning + 4-bit (Q4/NF4/INT4) VRAM Usage
VRAM requirements, LoRA target_modules, and 4-bit quantization (NF4/INT4)
Mistral 7B is now a very popular large language model (LLM). At its release, it outperformed the other pre-trained LLMs of similar size and even larger LLMs such as Llama 2 13B.
It is also very well optimized for fast decoding, especially for long contexts, thanks to its use of sliding window attention (SWA) and grouped-query attention (GQA). You can find more details in the arXiv paper presenting Mistral 7B.
Mistral 7B is good but small enough to run on affordable hardware (VRAM requirements: ~15 GB to load the model in half precision; ~4 GB with 4-bit (Q4/NF4) quantization).
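These VRAM figures can be checked with a back-of-envelope calculation: the weights alone take (number of parameters) × (bits per parameter) / 8 bytes, and activations, the KV cache, and any optimizer states come on top. A minimal sketch, assuming Mistral 7B's parameter count of roughly 7.24 billion:

```python
# Back-of-envelope estimate of the VRAM needed to hold the weights only.
# Activations, KV cache, and optimizer states add more on top of this.

def weight_vram_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate gigabytes needed just to store the model weights."""
    return n_params * bits_per_param / 8 / 1e9

N = 7.24e9  # Mistral 7B parameter count (approximate)

fp16_gb = weight_vram_gb(N, 16)  # half precision
nf4_gb = weight_vram_gb(N, 4)    # 4-bit NF4 (small extra overhead for quantization constants)

print(f"fp16: {fp16_gb:.1f} GB, 4-bit: {nf4_gb:.1f} GB")
```

This gives ~14.5 GB for fp16 and ~3.6 GB for 4-bit, consistent with the ~15 GB and ~4 GB figures above once quantization constants and runtime overhead are counted.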
In this article, I will show you how to fine-tune Mistral 7B with QLoRA. We will use the dataset “ultrachat” that I modified for this article. Ultrachat was used by Hugging Face to create the very good Zephyr 7B. We will also see how to quantize Mistral 7B with AutoGPTQ.
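To give an idea of what the QLoRA setup looks like before diving into the notebooks, here is a minimal sketch using Transformers and PEFT. The hyperparameters (rank, alpha, dropout) are illustrative assumptions, not necessarily the exact values used in the notebooks; the target_modules list covers all the linear layers of Mistral's decoder blocks.

```python
# Sketch of a QLoRA setup for Mistral 7B.
# Hyperparameters here are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NormalFloat4, as introduced by QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,       # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                # LoRA rank (assumed value)
    lora_alpha=16,
    lora_dropout=0.05,
    # All linear modules in Mistral's attention and MLP blocks
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```

Targeting all linear modules (rather than only q_proj and v_proj) generally yields better fine-tuning results with QLoRA, at the cost of slightly more trainable parameters.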
I wrote notebooks implementing all the sections. You can find them here:
Last update: March 8th, 2024


