The Kaitchup – AI on a Budget

Quantization of Llama 2 with GPTQ for Fast Inference on Your Computer

Llama 2 but 75% smaller

Benjamin Marie
Jul 27, 2023
∙ Paid

Photo by Liudmila Shuvalova on Unsplash

In The Kaitchup, I have mainly discussed QLoRa as a way to run large language models (LLMs) on consumer hardware.

QLoRa: Fine-Tune a Large Language Model on Your GPU
Benjamin Marie · May 30, 2023

Fine-tuning models with billions of parameters is now possible on consumer hardware. Most large language models (LLMs) are too big to be fine-tuned on consumer hardware. For instance, fine-tuning a 65-billion-parameter model requires more than 780 GB of GPU memory, the equivalent of ten A100 80 GB GPUs. In other words, you w…

But QLoRa was proposed primarily to make fine-tuning faster and more affordable. It is not the best option for inference once your model is already fine-tuned. For that scenario, GPTQ is much more suitable.
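To see why quantization pays off at inference time, note that the "75% smaller" claim in the subtitle follows directly from the bit widths: 4-bit GPTQ weights versus 16-bit floats. A quick back-of-the-envelope check in plain Python (the helper name is mine, not from the article, and the estimate ignores activations and runtime overhead):

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (decimal), ignoring runtime overhead."""
    return n_params * bits_per_weight / 8 / 1e9

fp16_gb = model_size_gb(7e9, 16)   # Llama 2 7B in float16: 14.0 GB
gptq4_gb = model_size_gb(7e9, 4)   # the same weights in 4-bit: 3.5 GB
print(1 - gptq4_gb / fp16_gb)      # 0.75 -> the "75% smaller"
```

The same ratio holds for the 13B and 70B variants, since it depends only on bits per weight, not on the parameter count.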

I’m currently writing a complete article for The Kaitchup comparing QLoRa and GPTQ; each has its own pros and cons. Meanwhile, I can already share with you how to quantize Llama 2 with GPTQ and run it on your computer.
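The step-by-step procedure is in the paid section below. As a rough sketch of what GPTQ quantization of Llama 2 looks like with the auto-gptq library (the model ID, calibration text, and output directory here are illustrative placeholders, not the article's exact settings):

```python
def quantize_llama_gptq(model_id: str, out_dir: str, calibration_texts: list):
    """Sketch of 4-bit GPTQ quantization with the auto-gptq library.

    Assumes `auto-gptq` and `transformers` are installed and a CUDA GPU is
    available; the imports are deferred so this file stays importable without them.
    """
    from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
    from transformers import AutoTokenizer

    quantize_config = BaseQuantizeConfig(
        bits=4,          # 4-bit weights, the usual GPTQ setting
        group_size=128,  # quantize weights in groups of 128 columns
        desc_act=False,  # faster inference at a small accuracy cost
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
    model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)

    # GPTQ needs a few calibration samples to estimate quantization error.
    examples = [tokenizer(t, return_tensors="pt") for t in calibration_texts]
    model.quantize(examples)
    model.save_quantized(out_dir, use_safetensors=True)

# Hypothetical usage (downloads the model and requires a GPU):
# quantize_llama_gptq("meta-llama/Llama-2-7b-hf", "llama-2-7b-gptq",
#                     ["The quick brown fox jumps over the lazy dog."])
```

In practice you would pass a few hundred calibration samples drawn from a real corpus (e.g., C4) rather than a single sentence; the calibration data is what GPTQ minimizes quantization error against.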

This post is for paid subscribers

© 2025 The Kaitchup