Fine-Tuning 2-Bit Qwen3 Models on Your Computer

Code and best practices

Benjamin Marie
Jun 09, 2025

QLoRA is a widely adopted method for fine-tuning quantized large language models (LLMs). Instead of updating the full model, QLoRA freezes the base model’s weights and trains a lightweight adapter: a small number of additional parameters inserted into key components such as the self-attention and MLP layers. This approach enables efficient fine-tuning with minimal memory and compute overhead.
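
To make this concrete, here is a minimal sketch of such an adapter setup with Hugging Face PEFT. The hyperparameters and target module names are typical values for Qwen-style architectures, not the exact configuration used later in this article, and for real QLoRA the base checkpoint would be a quantized model.

```python
# Minimal sketch: attach a LoRA adapter to the attention and MLP projections of a
# frozen base model. Values here are illustrative, not this article's settings.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-14B", device_map="auto")

lora_config = LoraConfig(
    r=16,                    # adapter rank: size of the low-rank update matrices
    lora_alpha=16,           # scaling factor applied to the adapter's output
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # self-attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    ],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)  # base weights are frozen automatically
model.print_trainable_parameters()          # typically well under 1% of all parameters
```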

Related: QLoRA: Fine-Tune a Large Language Model on Your GPU (Benjamin Marie, May 30, 2023)

This technique is most commonly used with bitsandbytes 4-bit quantization, which has been shown to produce stable and reasonably accurate results. However, as discussed in a previous article, bitsandbytes is not optimal for QLoRA: it lacks the accuracy and efficiency of more recent, state-of-the-art quantization methods.

Modern alternatives not only deliver higher accuracy but also enable faster fine-tuning, thanks to optimized CUDA kernels. These newer techniques also support lower-bit quantization, including 2-bit and 3-bit formats. That said, fine-tuning low-bit models remains challenging: they often suffer from significant accuracy degradation, which makes them difficult to train reliably. Nonetheless, fine-tuning an adapter, rather than the entire model, can act as a form of targeted “repair” while adapting the model to a specific task.
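
As an example of such an alternative, the sketch below quantizes Qwen3 to 2-bit with AutoRound and exports it in a GPTQ-compatible format. This is one possible route among several, an assumption on my part rather than necessarily the pipeline used for the checkpoints in this article, and the API is reproduced from the auto-round documentation to the best of my knowledge.

```python
# Hedged sketch (assumption, not necessarily this article's exact pipeline):
# quantize Qwen3 to 2-bit with AutoRound and save it in a GPTQ-compatible format
# that recent CUDA kernels can run efficiently.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_id = "Qwen/Qwen3-14B"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# At 2-bit, a small group size tends to preserve noticeably more accuracy.
autoround = AutoRound(model, tokenizer, bits=2, group_size=32, sym=True)
autoround.quantize()
autoround.save_quantized("Qwen3-14B-2bit-gptq", format="auto_gptq")
```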

In this article, we’ll explore the main challenges of fine-tuning adapters for low-bit models. At extreme compression levels, the model's initial accuracy may be so low that it is unrecoverable, or training may become so unstable that even tiny learning rates trigger gradient explosions. Proper adapter initialization, such as using EoRA, can help mitigate these issues, speeding up convergence and improving final performance.
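
To make the initialization idea concrete, here is a deliberately simplified sketch of error-compensating adapter initialization: the LoRA factors are seeded with a low-rank approximation of the quantization error instead of zeros, so the adapter starts by "repairing" part of the lost precision. The full EoRA method additionally projects this error into an eigenspace derived from calibration activations, which this toy version omits.

```python
# Simplified sketch (not the full EoRA algorithm): initialize the LoRA factors of one
# linear layer from a rank-r approximation of its quantization error.
import torch

def init_adapter_from_quant_error(w_full: torch.Tensor,
                                  w_quant: torch.Tensor,
                                  rank: int = 32):
    """Return LoRA factors (A, B) such that B @ A approximates W_full - W_quant."""
    error = (w_full - w_quant).float()                 # quantization error of the layer
    U, S, Vh = torch.linalg.svd(error, full_matrices=False)
    sqrt_s = torch.sqrt(S[:rank])
    B = U[:, :rank] * sqrt_s                           # (out_features, rank)
    A = sqrt_s[:, None] * Vh[:rank, :]                 # (rank, in_features)
    return A, B

# W_quant + B @ A is a better starting point than W_quant alone, which typically
# speeds up convergence and stabilizes low-bit fine-tuning.
```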

We'll walk through how to fine-tune a LoRA adapter for a 2-bit Qwen3-14B model using Transformers and TRL, all on a single 24 GB RTX 4090 GPU (via RunPod; referral link).
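
In compressed form, the training setup looks roughly like the sketch below. The checkpoint name, dataset, and hyperparameters are placeholders rather than the notebook's exact values; the complete, tested code is in the notebook linked below.

```python
# Hedged sketch of LoRA fine-tuning with TRL's SFTTrainer on a 2-bit base model.
# The checkpoint name, dataset, and hyperparameters below are placeholders.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

train_dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:1%]")

peft_config = LoraConfig(
    r=16, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="./qwen3-14b-2bit-lora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=1e-5,            # low-bit models often need conservative learning rates
    num_train_epochs=1,
    logging_steps=10,
    bf16=True,
    gradient_checkpointing=True,   # trades compute for memory on a 24 GB GPU
)

trainer = SFTTrainer(
    model="your-username/Qwen3-14B-2bit",  # hypothetical 2-bit quantized checkpoint
    args=training_args,
    train_dataset=train_dataset,
    peft_config=peft_config,
)
trainer.train()
```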

Here is the notebook containing the fine-tuning code:

Get the notebook (#170)

The Basics of QLoRA Fine-Tuning for 2-Bit LLMs

Choosing the Right Model: Degraded, Not Broken

A successful QLoRA fine-tuning process starts with a solid foundation. While low-bit quantization, especially at 2 bits, can significantly degrade a model’s performance, it should not render the model completely unusable. In other words, the quantized model must still be capable of generating coherent text.
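
One simple sanity check along these lines, sketched below with a placeholder repository name, is to prompt the quantized checkpoint before any fine-tuning and inspect the output.

```python
# Sanity check: confirm the 2-bit model still produces coherent text before fine-tuning.
# The repository name is a placeholder for whichever 2-bit Qwen3 checkpoint you use.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "your-username/Qwen3-14B-2bit"  # hypothetical 2-bit checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.float16
)

messages = [{"role": "user", "content": "Explain LoRA fine-tuning in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=100)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```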

Smaller LLMs are particularly vulnerable to collapsing under aggressive quantization, so careful model selection is crucial. In a previous article, we demonstrated how to accurately quantize Qwen3 models to both 4-bit and 2-bit precision, preserving baseline usability while enabling efficient fine-tuning.

Related: How Well Does Qwen3 Handle 4-bit and 2-bit Quantization? (Benjamin Marie, May 1, 2025)

We found that Qwen3 is quite robust to 2-bit quantization, with degradation that is limited overall but noticeable on tasks requiring good instruction-following capabilities.
