The Kaitchup – AI on a Budget

QLoRA: Fine-Tune a Large Language Model on Your GPU

Fine-tuning models with billions of parameters on consumer hardware

Benjamin Marie
May 30, 2023
Figure: Comparison between standard fine-tuning, LoRA, and QLoRA for fine-tuning an LLM.

Most large language models (LLMs) are far too large to fine-tune on consumer hardware. For example, fine-tuning a 70-billion-parameter model typically requires a multi-GPU node, such as 8 NVIDIA H100s, an extremely costly setup that can run into hundreds of thousands of dollars. In practice, this means relying on cloud computing, where costs can still escalate quickly to unaffordable levels.

With QLoRA (Dettmers et al., 2023), you can do it with a single GPU instead of eight.


In this article, we won’t be fine-tuning a massive 70B model, but rather a smaller one that would still require expensive GPUs, unless we use a parameter-efficient fine-tuning method like QLoRA. I’ll introduce QLoRA, explain how it works, and demonstrate how to use it to fine-tune a 4-billion-parameter Qwen3 base model directly on your GPU.
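To give you an idea of what this looks like in practice, here is a minimal sketch of a QLoRA setup using Hugging Face Transformers, bitsandbytes, and PEFT. The model ID, LoRA rank, and target modules below are illustrative assumptions, not the exact configuration used later in this article.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "Qwen/Qwen3-4B-Base"  # assumed model ID for a 4B Qwen3 base model

# Quantize the base model to 4-bit NF4 with double quantization,
# while computing in bfloat16 -- the core idea behind QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare the quantized model for training (casts norms to fp32,
# enables input gradients for gradient checkpointing).
model = prepare_model_for_kbit_training(model)

# Attach low-rank adapters; only these small matrices are trained.
lora_config = LoraConfig(
    r=16,            # adapter rank (assumption)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections (assumption)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the total parameters
```

Because only the adapter weights receive gradients, the 4-bit base model plus the optimizer states fit comfortably in the memory of a single consumer GPU.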

Note: I ran all the code in this post on an RTX 4090 from RunPod (referral link), but you can achieve the same results using a free Google Colab instance. If your GPU has less memory, you can simply choose a smaller LLM.

Get the notebook (#2)

Last update: August 10th, 2025

QLoRA: Quantized LLMs with Low-Rank Adapters
