bitnet.cpp: Efficient Inference with 1-Bit LLMs on your CPU
How to run "1-bit" (but 1.58-bit) LLMs made of ternary weights packed to 2-bit
BitNet is a specialized transformer architecture developed by Microsoft Research in which each model parameter is restricted to one of three values: -1, 0, and 1. This drastically reduces memory requirements: a ternary weight carries only log2(3) ≈ 1.58 bits of information, versus the standard 16 bits per parameter. This is why Microsoft refers to these models as "1-bit LLMs."
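To make the representation concrete, here is a minimal NumPy sketch of the absmean quantization rule described in the BitNet b1.58 paper (the function name is mine; the paper uses one scale per weight matrix):

```python
import numpy as np

def absmean_ternary(w: np.ndarray, eps: float = 1e-5):
    """Quantize a weight matrix to {-1, 0, 1} with the absmean rule:
    scale by the mean absolute value, then round and clip."""
    gamma = np.abs(w).mean()                        # per-matrix scale
    w_q = np.clip(np.round(w / (gamma + eps)), -1, 1)
    return w_q.astype(np.int8), gamma               # W is approximated by gamma * w_q

w = np.random.randn(4, 4).astype(np.float32)
w_q, gamma = absmean_ternary(w)
print(w_q)                              # entries are only -1, 0, or 1
print(np.abs(w - gamma * w_q).mean())   # average reconstruction error
```

Only the ternary values (eventually packed to 2 bits each) and a single scalar per matrix need to be stored.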
These ternary LLMs are pre-trained from scratch, rather than quantized after training, using a procedure designed for low-precision parameters. Despite such limited precision, Microsoft has shown that these models achieve performance competitive with traditional higher-precision LLMs.
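How can gradient descent update weights that can only take three values? The BitNet papers keep full-precision latent weights and route gradients around the quantizer with a straight-through estimator. A minimal PyTorch-style sketch under that assumption (simplified to one per-matrix scale, no activation quantization):

```python
import torch

def ternary_ste(w: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Forward pass: absmean ternary quantization (rescaled by gamma).
    Backward pass: straight-through estimator, i.e. gradients reach the
    full-precision latent weights as if quantization were the identity."""
    gamma = w.abs().mean()
    w_q = (w / (gamma + eps)).round().clamp(-1, 1) * gamma
    return w + (w_q - w).detach()   # value of w_q, gradient of w

# Toy usage: the optimizer updates the full-precision latent weights,
# while the forward pass only ever sees ternary (scaled) values.
w = torch.randn(8, 8, requires_grad=True)
x = torch.randn(2, 8)
loss = (x @ ternary_ste(w)).pow(2).mean()
loss.backward()
print(w.grad is not None)   # True: gradients flowed through the quantizer
```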
Several of these ternary LLMs are now publicly available on the Hugging Face Hub, and they are light enough to run on consumer hardware.
To make these models even more practical, Microsoft released bitnet.cpp. This open-source software includes optimized kernels for efficient inference with ternary LLMs on standard CPUs. It works very similarly to llama.cpp and also uses the GGUF format.
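Since log2(3) ≈ 1.58 is not a whole number of bits, the stored format rounds up to 2 bits per weight, so four ternary values fit in one byte. Below is a sketch of that packing; the actual bit layouts used by bitnet.cpp's GGUF quantization types may differ, this only illustrates the idea:

```python
import numpy as np

def pack_ternary(w_q: np.ndarray) -> np.ndarray:
    """Pack ternary values {-1, 0, 1} into 2 bits each, four per byte."""
    flat = (w_q.reshape(-1) + 1).astype(np.uint8)   # map {-1,0,1} to {0,1,2}
    b = flat.reshape(-1, 4)                         # length must be a multiple of 4
    return b[:, 0] | (b[:, 1] << 2) | (b[:, 2] << 4) | (b[:, 3] << 6)

def unpack_ternary(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_ternary."""
    cols = [(packed >> s) & 0b11 for s in (0, 2, 4, 6)]
    return np.stack(cols, axis=1).reshape(-1).astype(np.int8) - 1

w_q = np.random.choice([-1, 0, 1], size=16).astype(np.int8)
assert (unpack_ternary(pack_ternary(w_q)) == w_q).all()
```

At 2 bits per weight, a 2B-parameter model needs about 0.5 GB for its weights, versus roughly 4 GB in 16-bit precision: an 8x reduction.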
In this article, we will first explore how these 1-bit LLMs work and then experiment with some of them using bitnet.cpp on a CPU.
I made a notebook showing how to use bitnet.cpp with 1-bit LLMs here: