bitnet.cpp: Efficient Inference with 1-Bit LLMs on your CPU
How to run "1-bit" (but 1.58-bit) LLMs made of ternary weights packed to 2-bit
BitNet is a specialized transformer architecture developed by Microsoft Research in which each model parameter is restricted to one of three values: -1, 0, and 1. This drastically reduces memory requirements: a ternary weight carries only log2(3) ≈ 1.58 bits of information, versus the standard 16 bits per parameter. This is why Microsoft refers to these models as "1-bit LLMs."
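To make the representation concrete, here is a minimal NumPy sketch of the absmean quantization rule described in the BitNet b1.58 paper (the function name is mine; the paper uses one scale per weight matrix):

```python
import numpy as np

def absmean_ternary(w: np.ndarray, eps: float = 1e-5):
    """Quantize a weight matrix to {-1, 0, 1} with the absmean rule:
    scale by the mean absolute value, then round and clip."""
    gamma = np.abs(w).mean()                        # per-matrix scale
    w_q = np.clip(np.round(w / (gamma + eps)), -1, 1)
    return w_q.astype(np.int8), gamma               # W is approximated by gamma * w_q

w = np.random.randn(4, 4).astype(np.float32)
w_q, gamma = absmean_ternary(w)
print(w_q)                              # entries are only -1, 0, or 1
print(np.abs(w - gamma * w_q).mean())   # average reconstruction error
```

Only the ternary values (eventually packed to 2 bits each) and a single scalar per matrix need to be stored.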
These ternary LLMs are pre-trained from scratch, rather than quantized after training, using a procedure designed for low-precision parameters. Despite such limited precision, Microsoft has shown that these models achieve performance competitive with traditional higher-precision LLMs.
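How can gradient descent update weights that can only take three values? The BitNet papers keep full-precision latent weights and route gradients around the quantizer with a straight-through estimator. A minimal PyTorch-style sketch under that assumption (simplified to one per-matrix scale, no activation quantization):

```python
import torch

def ternary_ste(w: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Forward pass: absmean ternary quantization (rescaled by gamma).
    Backward pass: straight-through estimator, i.e. gradients reach the
    full-precision latent weights as if quantization were the identity."""
    gamma = w.abs().mean()
    w_q = (w / (gamma + eps)).round().clamp(-1, 1) * gamma
    return w + (w_q - w).detach()   # value of w_q, gradient of w

# Toy usage: the optimizer updates the full-precision latent weights,
# while the forward pass only ever sees ternary (scaled) values.
w = torch.randn(8, 8, requires_grad=True)
x = torch.randn(2, 8)
loss = (x @ ternary_ste(w)).pow(2).mean()
loss.backward()
print(w.grad is not None)   # True: gradients flowed through the quantizer
```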
Several of these ternary LLMs are now publicly available on the Hugging Face Hub, and they are light enough to run on consumer hardware.
To make these models even more practical, Microsoft released bitnet.cpp. This open-source software includes optimized kernels for efficient inference with ternary LLMs on standard CPUs. It works very similarly to llama.cpp and also uses the GGUF format.
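Since log2(3) ≈ 1.58 is not a whole number of bits, the stored format rounds up to 2 bits per weight, so four ternary values fit in one byte. Below is a sketch of that packing; the actual bit layouts used by bitnet.cpp's GGUF quantization types may differ, this only illustrates the idea:

```python
import numpy as np

def pack_ternary(w_q: np.ndarray) -> np.ndarray:
    """Pack ternary values {-1, 0, 1} into 2 bits each, four per byte."""
    flat = (w_q.reshape(-1) + 1).astype(np.uint8)   # map {-1,0,1} to {0,1,2}
    b = flat.reshape(-1, 4)                         # length must be a multiple of 4
    return b[:, 0] | (b[:, 1] << 2) | (b[:, 2] << 4) | (b[:, 3] << 6)

def unpack_ternary(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_ternary."""
    cols = [(packed >> s) & 0b11 for s in (0, 2, 4, 6)]
    return np.stack(cols, axis=1).reshape(-1).astype(np.int8) - 1

w_q = np.random.choice([-1, 0, 1], size=16).astype(np.int8)
assert (unpack_ternary(pack_ternary(w_q)) == w_q).all()
```

At 2 bits per weight, a 2B-parameter model needs about 0.5 GB for its weights, versus roughly 4 GB in 16-bit precision: an 8x reduction.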
In this article, we will first explore how these 1-bit LLMs work and then experiment with some of them using bitnet.cpp on a CPU.
I made a notebook showing how to use bitnet.cpp with 1-bit LLMs here: