
In The Kaitchup, I have mainly discussed QLoRa for running large language models (LLMs) on consumer hardware.
QLoRa: Fine-Tune a Large Language Model on Your GPU
But QLoRa was proposed mainly to make fine-tuning faster and more affordable. It’s not the best option for inference once your model is already fine-tuned. For that scenario, GPTQ is much more suitable.
I’m currently writing a complete article for The Kaitchup comparing QLoRa and GPTQ. Each has its own pros and cons. Meanwhile, I can already share with you how to quantize Llama 2 with GPTQ and run it on your computer.
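
To preview what that looks like, here is a minimal sketch using the AutoGPTQ library. The model name, output directory, and calibration sentence below are placeholders, and a real run would use a much larger calibration set (for example, a few hundred samples from C4):

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

# Placeholder names; adjust to the checkpoint and output directory you use
model_name = "meta-llama/Llama-2-7b-hf"
quantized_dir = "llama-2-7b-gptq"

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

# GPTQ calibrates on sample text; a single sentence is only a placeholder
examples = [tokenizer("GPTQ quantizes the model weights after training.")]

quantize_config = BaseQuantizeConfig(
    bits=4,          # quantize weights to 4-bit
    group_size=128,  # common trade-off between accuracy and model size
    desc_act=False,
)

# Load the full-precision model, quantize it, and save the result
model = AutoGPTQForCausalLM.from_pretrained(model_name, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_dir, use_safetensors=True)

# Reload the quantized model and generate with it
model = AutoGPTQForCausalLM.from_quantized(quantized_dir, device="cuda:0")
prompt = tokenizer("Quantization makes it possible to", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**prompt, max_new_tokens=50)[0]))
```

Note that group_size=128 is a commonly used setting: smaller groups preserve more accuracy but slightly increase the size of the quantized model.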