The GGUF format is probably the most widely used format for quantized models. It is a binary file format designed to store large language models (LLMs) efficiently and load them quickly with GGML, a C-based machine learning library. It packages everything needed for model inference, such as the tokenizer and the model's configuration metadata, into a single file. Models like Llama, Phi, and Qwen can be converted to GGUF, and the resulting files are mainly used for fast inference on CPUs with llama.cpp.
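As a quick illustration, converting a Hugging Face model to GGUF typically looks like the following. This is only a sketch: it assumes llama.cpp has been cloned with its Python requirements installed, and the model directory and output file name are placeholders.

```bash
# Convert a Hugging Face model to a 16-bit GGUF file (paths are placeholders).
python llama.cpp/convert_hf_to_gguf.py ./Qwen2-1.5B-Instruct \
  --outfile qwen2-1.5b-instruct-f16.gguf \
  --outtype f16
```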
In a previous article, we saw how to produce a more accurate quantization by leveraging an importance matrix (imatrix) during GGUF conversion.
Until recently, LoRA adapters had to be merged into the base model with a separate framework, such as Hugging Face PEFT, before GGUF conversion. This was quite inconvenient: with multiple adapters, we had to merge each one individually and create a separate GGUF file for each merged model. However, a recent update to llama.cpp introduced a script that converts a LoRA adapter directly to GGUF, so the adapter can then be loaded on top of the GGUF base model.
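For example, assuming llama.cpp is cloned with its Python requirements installed and the adapter was saved with Hugging Face PEFT, the direct conversion can look roughly like this (the directory and file names below are placeholders):

```bash
# Convert a PEFT LoRA adapter directly to GGUF, without merging it first.
# --base points to the base model the adapter was fine-tuned from.
python llama.cpp/convert_lora_to_gguf.py ./my_lora_adapter \
  --base ./base_model \
  --outfile my_lora_adapter-f16.gguf \
  --outtype f16
```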
In this article, we will experiment with this new functionality introduced in llama.cpp to GGUF LoRA adapters. We will compare it with a standard merge performed with Hugging Face PEFT, followed by GGUF conversion, to find out whether both approaches yield the same results. We will also see how to use a GGUF LoRA adapter with llama.cpp.
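To give an idea of what this looks like in practice, a GGUF LoRA adapter can be applied at load time with llama.cpp's --lora flag. The sketch below assumes llama.cpp has been built and llama-cli is on the path; the file names are placeholders.

```bash
# Run the quantized GGUF base model with the GGUF LoRA adapter loaded on top.
llama-cli -m base_model-q4_k_m.gguf \
  --lora my_lora_adapter-f16.gguf \
  -p "What is GGUF?" -n 128
```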
I made a notebook showing how to GGUF a LoRA adapter and how to use it with llama.cpp. The notebook also shows how to merge the adapter and then convert the model to GGUF.