Fast Inference with GGUF LoRA Adapters on Your CPU

GGUF your LoRA

Benjamin Marie
Nov 14, 2024

The GGUF format is probably the most widely used format for quantized models. It is a binary file format designed to store large language models (LLMs) efficiently and load them quickly with GGML, a C-based machine learning library. It packages everything needed for inference, such as the tokenizer and the model's metadata, into a single file. Models like Llama, Phi, and Qwen can be converted to GGUF, mainly for fast inference on CPUs with llama.cpp.
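
To make the workflow concrete, here is a minimal sketch of the usual two-step GGUF conversion with llama.cpp: convert the Hugging Face checkpoint to a 16-bit GGUF file, then quantize it. The model name and output file names are placeholders.

```bash
# Minimal sketch of a standard GGUF conversion with llama.cpp (paths are placeholders)
git clone https://github.com/ggerganov/llama.cpp
pip install -r llama.cpp/requirements.txt

# 1) Convert the Hugging Face model to a 16-bit GGUF file
python llama.cpp/convert_hf_to_gguf.py ./Llama-3.1-8B-Instruct \
  --outfile llama-3.1-8b-f16.gguf --outtype f16

# 2) Quantize it, e.g., to Q4_K_M (llama-quantize must be built first;
#    its exact location depends on how you built llama.cpp)
./llama.cpp/llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-Q4_K_M.gguf Q4_K_M
```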

In a previous article, we saw how to produce more accurate quantizations by leveraging an importance matrix (imatrix) during GGUF conversion.

GGUF Quantization with Imatrix and K-Quantization to Run LLMs on Your CPU
Benjamin Marie · September 9, 2024

Until recently, LoRA adapters had to be merged into the base model with a separate framework, such as Hugging Face PEFT, before GGUF conversion. This was quite inconvenient: with multiple adapters, we had to merge each one individually and create a separate GGUF file for each merged model. However, a recent update to llama.cpp introduced a script that converts a LoRA adapter directly to GGUF, so the adapter can then be loaded on top of the GGUF base model.
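
The script in question is convert_lora_to_gguf.py, which sits at the root of the llama.cpp repository next to convert_hf_to_gguf.py. Here is a minimal sketch of how it can be called; the adapter and base model paths are placeholders, and argument names may vary slightly between llama.cpp versions.

```bash
# Sketch: convert a PEFT LoRA adapter directly to GGUF (paths are placeholders)
# The base model is needed so the script can map the adapter's tensors
# onto the right architecture.
python llama.cpp/convert_lora_to_gguf.py ./my-lora-adapter \
  --base ./Llama-3.1-8B-Instruct \
  --outfile my-lora-adapter-f16.gguf --outtype f16
```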


In this article, we will experiment with this new functionality introduced in llama.cpp to GGUF LoRA adapters. We will compare it with a standard merge made with Hugging Face PEFT, followed by GGUF conversion, to find out whether it yields the same results. We will also see how to use a GGUF LoRA adapter with llama.cpp.
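
As a preview of that last point, llama.cpp can load a GGUF adapter on top of the base GGUF model at inference time with the --lora flag (or --lora-scaled to control the adapter's weight). The file names below are placeholders, following the sketches above.

```bash
# Sketch: run inference with a GGUF LoRA adapter loaded on top of the base model
./llama.cpp/llama-cli -m llama-3.1-8b-Q4_K_M.gguf \
  --lora my-lora-adapter-f16.gguf \
  -p "What is the GGUF format?" -n 128

# To down-weight the adapter's contribution, use --lora-scaled instead:
#   --lora-scaled my-lora-adapter-f16.gguf 0.8
```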

I made a notebook showing how to GGUF a LoRA adapter and how to use it with llama.cpp. The notebook also shows how to merge the adapter and then convert the model to GGUF.

Get the notebook (#121)
