
In The Kaitchup, I have mainly discussed QLoRa for running large language models (LLMs) on consumer hardware.
QLoRa: Fine-Tune a Large Language Model on Your GPU
But QLoRa was proposed mainly to make fine-tuning faster and more affordable. It’s not the best option for inference once your model is already fine-tuned. For that scenario, GPTQ is much more suitable.
I’m currently writing a complete article for The Kaitchup comparing QLoRa and GPTQ. Each has its own pros and cons. Meanwhile, I can already share with you how to quantize Llama 2 with GPTQ and run it on your computer.
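
To preview what that looks like, here is a minimal sketch using the AutoGPTQ library. The model name, output directory, and calibration sentence below are placeholders, and a real run would use a much larger calibration set (for example, a few hundred samples from C4):

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

# Placeholder names; adjust to the checkpoint and output directory you use
model_name = "meta-llama/Llama-2-7b-hf"
quantized_dir = "llama-2-7b-gptq"

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

# GPTQ calibrates on sample text; a single sentence is only a placeholder
examples = [tokenizer("GPTQ quantizes the model weights after training.")]

quantize_config = BaseQuantizeConfig(
    bits=4,          # quantize weights to 4-bit
    group_size=128,  # common trade-off between accuracy and model size
    desc_act=False,
)

# Load the full-precision model, quantize it, and save the result
model = AutoGPTQForCausalLM.from_pretrained(model_name, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_dir, use_safetensors=True)

# Reload the quantized model and generate with it
model = AutoGPTQForCausalLM.from_quantized(quantized_dir, device="cuda:0")
prompt = tokenizer("Quantization makes it possible to", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**prompt, max_new_tokens=50)[0]))
```

Note that group_size=128 is a commonly used setting: smaller groups preserve more accuracy but slightly increase the size of the quantized model.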