Fine-Tune Llama 3.2 Vision, Pixtral, and Qwen2-VL on Your Computer with Unsloth
Step-by-step guide for memory-efficient fine-tuning
Vision-language models (VLMs) handle a wide range of visual tasks because they encode visual inputs jointly with text prompts.
In a previous article, we explored fine-tuning Florence 2, a compact VLM, for visual question answering tasks. Its small size made it possible to fine-tune the model on a single GPU with 24 GB of memory.
However, fine-tuning state-of-the-art VLMs like Qwen2-VL and Pixtral presents significant challenges. These models are substantially larger, and fine-tuning them with standard frameworks like Hugging Face Transformers requires professional-grade GPUs. Even with quantization and parameter-efficient techniques such as LoRA (QLoRA), the process remains computationally intensive and relatively slow. Yet VLMs do not have dramatically more parameters than their language-model counterparts: Qwen2-VL 7B has only about 670M more parameters than Qwen2 7B, which, by contrast, can easily be fine-tuned on consumer hardware.
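As a quick sanity check of this comparison, the parameter counts can be computed from the model configurations alone, without downloading any weights. The sketch below is my own illustration, not code from this article; it assumes recent versions of transformers (>= 4.45, for Qwen2-VL support) and accelerate, and uses the public Qwen repositories on the Hugging Face Hub.

```python
# Hedged sketch: compare parameter counts of Qwen2-VL 7B and Qwen2 7B
# by instantiating both models on the "meta" device (no real memory used,
# no weights downloaded; only the configs are fetched from the Hub).
from accelerate import init_empty_weights
from transformers import AutoConfig, Qwen2ForCausalLM, Qwen2VLForConditionalGeneration

def count_params(model):
    return sum(p.numel() for p in model.parameters())

with init_empty_weights():  # parameters are created as meta tensors
    vlm = Qwen2VLForConditionalGeneration(
        AutoConfig.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
    )
    llm = Qwen2ForCausalLM(AutoConfig.from_pretrained("Qwen/Qwen2-7B-Instruct"))

print(f"Qwen2-VL 7B: {count_params(vlm) / 1e9:.2f}B parameters")
print(f"Qwen2 7B:    {count_params(llm) / 1e9:.2f}B parameters")
print(f"Difference:  {(count_params(vlm) - count_params(llm)) / 1e6:.0f}M parameters")
```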
Why do VLMs require significantly more memory for fine-tuning than similarly sized LLMs?
In this article, we will first explore the factors that make fine-tuning VLMs more memory-intensive. Then, we will see how to fine-tune VLMs like Qwen2-VL on a single consumer GPU. The minimum GPU requirement is 12 GB of VRAM; for this tutorial, I used an RTX 4080 available through RunPod (referral link). The tutorial provides a step-by-step breakdown of the fine-tuning process with Unsloth, a toolkit designed to reduce memory use and accelerate fine-tuning.
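To give a sense of what the Unsloth setup looks like before we go through the notebook in detail, here is a minimal sketch of loading Qwen2-VL 7B in 4-bit and attaching LoRA adapters with Unsloth's FastVisionModel. Exact argument names, defaults, and the checkpoint ID may vary between Unsloth versions, so treat this as an outline rather than the notebook's exact code.

```python
# Minimal sketch (not the notebook's exact code): load Qwen2-VL 7B with Unsloth
# in 4-bit and attach LoRA adapters so only the small adapter matrices are trained.
from unsloth import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Qwen2-VL-7B-Instruct",        # assumed checkpoint ID on the Hub
    load_in_4bit=True,                      # 4-bit weights to fit in ~12 GB of VRAM
    use_gradient_checkpointing="unsloth",   # trade compute for activation memory
)

model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=True,     # also adapt the vision encoder
    finetune_language_layers=True,   # adapt the language model layers
    r=16,                            # LoRA rank
    lora_alpha=16,
)
```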
I included a detailed notebook that walks through the fine-tuning process for Qwen2-VL 7B, focusing on multimodal chat applications with images and text as inputs:
The same notebook can be used to fine-tune Pixtral and Llama 3.2 Vision.
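For reference, multimodal chat fine-tuning data for Qwen2-VL-style processors is usually structured as a list of messages whose content mixes image and text entries. The example below only illustrates that common Hugging Face chat-template convention; the exact schema used in the notebook may differ.

```python
# Illustrative only: a typical multimodal chat sample in the Hugging Face
# chat-template convention (not necessarily the notebook's exact schema).
sample_messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},  # the image itself is passed to the processor separately
            {"type": "text", "text": "Describe this image."},
        ],
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": "A cat is sitting on a windowsill."}],
    },
]
```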