Fine-Tune Llama 3.2 Vision, Pixtral, and Qwen2-VL on Your Computer with Unsloth

Step-by-step guide for memory-efficient fine-tuning

Benjamin Marie
Nov 28, 2024

Header image generated with Grok

Vision-language models (VLMs) handle a wide range of visual tasks well because they encode visual inputs jointly with text prompts.

In a previous article, we explored fine-tuning Florence 2, a compact VLM, for visual-question answering tasks. Its small size made it possible to fine-tune the model on a single GPU with 24 GB of memory.

Fine-tune a Multimodal Chat Model with Florence-2 on Your Computer (Benjamin Marie, July 8, 2024)

However, fine-tuning state-of-the-art VLMs like Qwen2-VL and Pixtral presents significant challenges. These models are substantially larger, and fine-tuning them with standard frameworks like Hugging Face Transformers requires professional-grade GPUs. Even with quantization and parameter-efficient techniques such as LoRA (QLoRA), the process remains computationally intensive and relatively slow. That said, VLMs do not have dramatically more parameters than their language-model counterparts. For instance, Qwen2-VL 7B has only a modest increase in parameters (+670M) over Qwen2 7B, which, by contrast, can easily be fine-tuned on consumer hardware.
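To put that number in perspective, a rough back-of-the-envelope calculation (mine, not a measurement from the notebook) shows that the extra 670M parameters add only a little over 1 GB of weights in bfloat16, and far less once quantized to 4-bit:

```python
# Rough estimate of the memory taken by the extra parameters of Qwen2-VL 7B
# (+670M vs. Qwen2 7B). Illustrative arithmetic only.
extra_params = 670e6

bytes_bf16 = extra_params * 2     # bfloat16: 2 bytes per parameter
bytes_4bit = extra_params * 0.5   # 4-bit quantization: ~0.5 byte per parameter

print(f"Extra weights in bf16:  {bytes_bf16 / 1024**3:.2f} GiB")  # ~1.25 GiB
print(f"Extra weights in 4-bit: {bytes_4bit / 1024**3:.2f} GiB")  # ~0.31 GiB
```

The weights alone clearly don't account for the much larger memory footprint observed in practice, which raises the question below.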


Why do VLMs require significantly more memory for fine-tuning than similarly sized LLMs?

In this article, we will first explore the factors that make fine-tuning VLMs more memory-intensive. Then, we will see how to fine-tune VLMs like Qwen2-VL on a single consumer GPU. The minimum GPU requirement is 12 GB; for this tutorial, I used an RTX 4080 available through RunPod (referral link). The tutorial provides a step-by-step breakdown of the fine-tuning process using Unsloth, a toolkit designed to optimize memory use and accelerate fine-tuning.
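As a preview of the approach, here is a minimal sketch of loading Qwen2-VL 7B for QLoRA fine-tuning with Unsloth's vision API. The model identifier and hyperparameter values are illustrative, not necessarily those used in the notebook:

```python
from unsloth import FastVisionModel

# Load the model in 4-bit so the weights fit in a 12 GB GPU.
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Qwen2-VL-7B-Instruct",        # illustrative model ID
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth",  # trades compute for activation memory
)

# Attach LoRA adapters (QLoRA) to both the vision and language components.
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=True,
    finetune_language_layers=True,
    r=16,
    lora_alpha=16,
)
```

The 4-bit loading and Unsloth's gradient checkpointing are the two main memory levers in this sketch; the notebook covers the full training configuration.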

I included a detailed notebook that walks through the fine-tuning process for Qwen2-VL 7B, focusing on multimodal chat applications with images and text as inputs:

Get the notebook (#125)

The same notebook can be used to fine-tune Pixtral and Llama 3.2 Vision.
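For the multimodal chat setup, each training example pairs an image with a user instruction and a target answer, expressed in the conversational format that these models' chat templates expect. The sketch below is generic; the field names "image", "question", and "answer" are placeholders for your dataset's columns:

```python
# Convert one raw dataset sample into the chat format used for VLM fine-tuning.
# "image", "question", and "answer" are placeholder column names.
def to_conversation(sample):
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": sample["image"]},
                    {"type": "text", "text": sample["question"]},
                ],
            },
            {
                "role": "assistant",
                "content": [{"type": "text", "text": sample["answer"]}],
            },
        ]
    }
```

Since the same notebook covers Pixtral and Llama 3.2 Vision, switching models should mostly be a matter of changing the model identifier passed to FastVisionModel.from_pretrained; the conversation format stays the same.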

Fine-Tuning VLMs: Why Do We Need So Much Memory?
