The Kaitchup – AI on a Budget

Quantization of Llama 2 with GPTQ for Fast Inference on Your Computer

Llama 2 but 75% smaller

Benjamin Marie
Jul 27, 2023
Photo by Liudmila Shuvalova on Unsplash

In The Kaitchup, I have mainly discussed QLoRA as a way to run large language models (LLMs) on consumer hardware.

But QLoRA was mainly proposed to make fine-tuning faster and more affordable. It’s not the best option for inference if your model is already fine-tuned. For this scenario, GPTQ is much more suitable.

I’m currently writing a complete article for The Kaitchup comparing QLoRA and GPTQ. Each has its own pros and cons. In the meantime, I can already share how to quantize Llama 2 with GPTQ and run it on your computer.
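As a rough sanity check on the "75% smaller" figure, here is a back-of-the-envelope sketch. It assumes a 16-bit baseline, 4-bit GPTQ weights, and a round 7 billion parameters; none of these numbers are stated above, so treat them as illustrative. The helper name `weight_size_gb` is hypothetical.

```python
def weight_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumptions for illustration only: 7e9 parameters, 16-bit baseline, 4-bit quantized.
NUM_PARAMS = 7e9
fp16_gb = weight_size_gb(NUM_PARAMS, 16)
gptq_gb = weight_size_gb(NUM_PARAMS, 4)
reduction = 1 - gptq_gb / fp16_gb

print(f"16-bit: {fp16_gb:.1f} GB, 4-bit: {gptq_gb:.1f} GB, reduction: {reduction:.0%}")
```

This only counts the raw weights. A real GPTQ checkpoint is somewhat larger, since it also stores per-group scales and zero-points, and some layers are usually left unquantized.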
