The Kaitchup – AI on a Budget

Quantize and Fine-tune LLMs with GPTQ Using Transformers and TRL

GPTQ is now much easier to use

Benjamin Marie
Aug 30, 2023


Many large language models (LLMs) on the Hugging Face Hub are quantized with AutoGPTQ, an efficient and easy-to-use implementation of GPTQ.

GPTQ quantization has several advantages over other quantization methods such as bitsandbytes nf4. For instance, GPTQ yields faster models for inference and supports quantization to lower precisions, such as 3-bit and 2-bit. You will find a detailed comparison between GPTQ and bitsandbytes quantization in my previous article:

GPTQ or bitsandbytes: Which Quantization Method to Use for LLMs — Examples with Llama 2

Benjamin Marie, PhD · August 22, 2023

GPTQ models are now much easier to use since Hugging Face Transformers and TRL natively support them.

With Transformers and TRL, you can:

  • Quantize an LLM with GPTQ at 4-bit, 3-bit, or 2-bit precision (see the sketch after this list)

  • Load a GPTQ LLM from your computer or the HF hub

  • Serialize a GPTQ LLM

  • Fine-tune a GPTQ LLM
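
To give you an idea of what the first three points look like in practice, here is a minimal sketch of 4-bit GPTQ quantization, serialization, and reloading with Transformers. It assumes the optimum and auto-gptq packages are installed; the model ID and output directory are placeholders for illustration, not necessarily what the notebook uses.

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)

# bits can be 4, 3, or 2; GPTQ needs a calibration dataset (here "c4")
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Quantization runs while the model is loaded and requires a GPU
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", quantization_config=gptq_config
)

# Serialize the quantized model (push_to_hub works the same way for the HF hub)
model.save_pretrained("llama-2-7b-gptq-4bit")
tokenizer.save_pretrained("llama-2-7b-gptq-4bit")

# Reload the serialized GPTQ model from your computer
model = AutoModelForCausalLM.from_pretrained(
    "llama-2-7b-gptq-4bit", device_map="auto"
)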

In this article, I show you how to do all of this on consumer hardware (when possible…). I only use Llama 2 7B as an example, but you can apply GPTQ to most LLMs with an encoder-only or a decoder-only architecture. I also compare the fine-tuning speed and performance of Transformers GPTQ with bitsandbytes nf4.
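
For the fine-tuning part, a minimal sketch with a LoRA adapter and TRL's SFTTrainer (using the TRL API as it was at the time of writing) could look like the following. The dataset, LoRA targets, and hyperparameters are placeholders for illustration, not the exact configuration of the notebook.

from datasets import load_dataset
from peft import LoraConfig, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig, TrainingArguments
from trl import SFTTrainer

# The GPTQ model serialized above (placeholder path)
model_id = "llama-2-7b-gptq-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# The exllama kernels do not support training, so they are disabled for fine-tuning
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=GPTQConfig(bits=4, disable_exllama=True),
)
model = prepare_model_for_kbit_training(model)

# A LoRA adapter is trained on top of the frozen, quantized weights
peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)

dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=512,
    args=TrainingArguments(
        output_dir="./llama-2-7b-gptq-lora",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        max_steps=100,
        fp16=True,
    ),
)
trainer.train()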

If you are looking to fine-tune 2-bit or 3-bit models, I recommend more powerful quantization methods such as AQLM instead:

Fine-tune Mixtral-8x7B Quantized with AQLM (2-bit) on Your GPU

Benjamin Marie · March 14, 2024

The notebook showing how to quantize and fine-tune GPTQ models is here:

Get the notebook (#12)
