Fine-Tuning 2-Bit Qwen3 Models on Your Computer

Code and best practices

Benjamin Marie
Jun 09, 2025

QLoRA is a widely adopted method for fine-tuning quantized large language models (LLMs). Instead of updating the full model, QLoRA freezes the base model’s weights and trains a lightweight adapter: a small number of additional parameters inserted into key components such as the self-attention and MLP layers. This approach enables efficient fine-tuning with minimal memory and compute overhead.
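
To make this concrete, here is a minimal sketch of such an adapter setup with Hugging Face PEFT. The hyperparameters and target module names are typical values for Qwen-style architectures, not the exact configuration used later in this article, and for real QLoRA the base checkpoint would be a quantized model.

```python
# Minimal sketch: attach a LoRA adapter to the attention and MLP projections of a
# frozen base model. Values here are illustrative, not this article's settings.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-14B", device_map="auto")

lora_config = LoraConfig(
    r=16,                    # adapter rank: size of the low-rank update matrices
    lora_alpha=16,           # scaling factor applied to the adapter's output
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # self-attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    ],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)  # base weights are frozen automatically
model.print_trainable_parameters()          # typically well under 1% of all parameters
```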

Related: QLoRA: Fine-Tune a Large Language Model on Your GPU (Benjamin Marie, May 30, 2023)

This technique is most commonly used with bitsandbytes 4-bit quantization, which has been shown to produce stable and reasonably accurate results. However, as discussed in a previous article, bitsandbytes is not optimal for QLoRA: it lacks the accuracy and efficiency of more recent, state-of-the-art quantization methods.

Modern alternatives not only deliver higher accuracy but also enable faster fine-tuning, thanks to optimized CUDA kernels. These newer techniques also support lower-bit quantization, including 2-bit and 3-bit formats. That said, fine-tuning low-bit models remains challenging: they often suffer from significant accuracy degradation, which makes them difficult to train reliably. Nonetheless, fine-tuning an adapter, rather than the entire model, can act as a form of targeted “repair” while adapting the model to a specific task.
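
As an example of such an alternative, the sketch below quantizes Qwen3 to 2-bit with AutoRound and exports it in a GPTQ-compatible format. This is one possible route among several, an assumption on my part rather than necessarily the pipeline used for the checkpoints in this article, and the API is reproduced from the auto-round documentation to the best of my knowledge.

```python
# Hedged sketch (assumption, not necessarily this article's exact pipeline):
# quantize Qwen3 to 2-bit with AutoRound and save it in a GPTQ-compatible format
# that recent CUDA kernels can run efficiently.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_id = "Qwen/Qwen3-14B"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# At 2-bit, a small group size tends to preserve noticeably more accuracy.
autoround = AutoRound(model, tokenizer, bits=2, group_size=32, sym=True)
autoround.quantize()
autoround.save_quantized("Qwen3-14B-2bit-gptq", format="auto_gptq")
```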

In this article, we’ll explore the main challenges of fine-tuning adapters for low-bit models. At extreme compression levels, the model's initial accuracy may be so low that it is unrecoverable, or training may become so unstable that even tiny learning rates trigger gradient explosions. Proper adapter initialization, such as using EoRA, can help mitigate these issues, speeding up convergence and improving final performance.
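
To make the initialization idea concrete, here is a deliberately simplified sketch of error-compensating adapter initialization: the LoRA factors are seeded with a low-rank approximation of the quantization error instead of zeros, so the adapter starts by "repairing" part of the lost precision. The full EoRA method additionally projects this error into an eigenspace derived from calibration activations, which this toy version omits.

```python
# Simplified sketch (not the full EoRA algorithm): initialize the LoRA factors of one
# linear layer from a rank-r approximation of its quantization error.
import torch

def init_adapter_from_quant_error(w_full: torch.Tensor,
                                  w_quant: torch.Tensor,
                                  rank: int = 32):
    """Return LoRA factors (A, B) such that B @ A approximates W_full - W_quant."""
    error = (w_full - w_quant).float()                 # quantization error of the layer
    U, S, Vh = torch.linalg.svd(error, full_matrices=False)
    sqrt_s = torch.sqrt(S[:rank])
    B = U[:, :rank] * sqrt_s                           # (out_features, rank)
    A = sqrt_s[:, None] * Vh[:rank, :]                 # (rank, in_features)
    return A, B

# W_quant + B @ A is a better starting point than W_quant alone, which typically
# speeds up convergence and stabilizes low-bit fine-tuning.
```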

We'll walk through how to fine-tune a LoRA adapter for a 2-bit Qwen3-14B model using Transformers and TRL, all on a single 24 GB RTX 4090 GPU (via RunPod; referral link).
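
In compressed form, the training setup looks roughly like the sketch below. The checkpoint name, dataset, and hyperparameters are placeholders rather than the notebook's exact values; the complete, tested code is in the notebook linked below.

```python
# Hedged sketch of LoRA fine-tuning with TRL's SFTTrainer on a 2-bit base model.
# The checkpoint name, dataset, and hyperparameters below are placeholders.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

train_dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:1%]")

peft_config = LoraConfig(
    r=16, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="./qwen3-14b-2bit-lora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=1e-5,            # low-bit models often need conservative learning rates
    num_train_epochs=1,
    logging_steps=10,
    bf16=True,
    gradient_checkpointing=True,   # trades compute for memory on a 24 GB GPU
)

trainer = SFTTrainer(
    model="your-username/Qwen3-14B-2bit",  # hypothetical 2-bit quantized checkpoint
    args=training_args,
    train_dataset=train_dataset,
    peft_config=peft_config,
)
trainer.train()
```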

Here is the notebook containing the fine-tuning code:

Get the notebook (#170)

The Basics of QLoRA Fine-Tuning for 2-Bit LLMs

Choosing the Right Model: Degraded, Not Broken

A successful QLoRA fine-tuning process starts with a solid foundation. While low-bit quantization, especially at 2 bits, can significantly degrade a model’s performance, it should not render the model completely unusable. In other words, the quantized model must still be capable of generating coherent text.
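
One simple sanity check along these lines, sketched below with a placeholder repository name, is to prompt the quantized checkpoint before any fine-tuning and inspect the output.

```python
# Sanity check: confirm the 2-bit model still produces coherent text before fine-tuning.
# The repository name is a placeholder for whichever 2-bit Qwen3 checkpoint you use.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "your-username/Qwen3-14B-2bit"  # hypothetical 2-bit checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.float16
)

messages = [{"role": "user", "content": "Explain LoRA fine-tuning in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=100)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```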

Smaller LLMs are particularly vulnerable to collapsing under aggressive quantization, so careful model selection is crucial. In a previous article, we demonstrated how to accurately quantize Qwen3 models to both 4-bit and 2-bit precision, preserving baseline usability while enabling efficient fine-tuning.

Related: How Well Does Qwen3 Handle 4-bit and 2-bit Quantization? (Benjamin Marie, May 1, 2025)

We found that Qwen3 is quite robust to 2-bit quantization, with degradation that is limited overall but noticeable on tasks requiring good instruction-following capabilities.
