In The Kaitchup, I have mainly discussed QLoRa as a way to run large language models (LLMs) on consumer hardware.
But QLoRa was primarily proposed to make fine-tuning faster and more affordable. It's not the best option for inference once your model is already fine-tuned. For that scenario, GPTQ is much more suitable: it is a post-training quantization method that compresses an already-trained model to 4-bit (or lower) precision using a small calibration set, which reduces memory usage at inference time.
I’m currently writing a complete article for The Kaitchup comparing QLoRa and GPTQ; each has its own pros and cons. In the meantime, I can already show you how to quantize Llama 2 with GPTQ and run it on your computer.
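As a preview, here is a minimal sketch of the quantization step, assuming the AutoGPTQ library (auto-gptq) is installed and you have access to the Llama 2 weights on the Hugging Face Hub; the model name, calibration sentence, and output directory are placeholders to adapt to your setup.

```python
# Minimal GPTQ quantization sketch with AutoGPTQ.
# Placeholders: model name, calibration text, and output directory.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_name = "meta-llama/Llama-2-7b-hf"  # assumes you have been granted access

# 4-bit quantization with a group size of 128 is a common GPTQ configuration
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoGPTQForCausalLM.from_pretrained(model_name, quantize_config)

# GPTQ calibrates the quantization on sample text;
# a real run would use many more calibration examples than this one sentence
examples = [tokenizer("GPTQ is a post-training quantization method for LLMs.")]
model.quantize(examples)

# Save the quantized model, then reload it for inference on the GPU
model.save_quantized("llama-2-7b-gptq")
model = AutoGPTQForCausalLM.from_quantized("llama-2-7b-gptq", device="cuda:0")
```

The full walkthrough, including the calibration data and how to run generation with the quantized model, follows below.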