Neural Speed: Fast Inference on CPU for 4-bit Large Language Models
Up to 40x faster than llama.cpp?
Running large language models (LLMs) on consumer hardware can be challenging. If the LLM doesn’t fit in GPU memory, quantization is usually applied to reduce its size. However, even after quantization, the model might still be too large to fit on the GPU. An alternative is to run it in CPU RAM, using a framework optimized for CPU inference such as llama.cpp.
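For reference, CPU inference with a quantized model through llama.cpp typically looks like the minimal sketch below, here via the llama-cpp-python bindings. The GGUF file path and generation settings are placeholders, not a recommendation.

```python
# Minimal sketch: CPU inference with a 4-bit GGUF model via llama-cpp-python.
# The model path is hypothetical; any quantized GGUF checkpoint would do.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mistral-7b-instruct.Q4_K_M.gguf",  # hypothetical local file
    n_ctx=2048,    # context window
    n_threads=8,   # number of CPU threads used for inference
)

output = llm("Explain quantization in one sentence.", max_tokens=64)
print(output["choices"][0]["text"])
```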
Intel, inspired by llama.cpp, is also working on accelerating inference on the CPU. They propose a framework, Intel’s Extension for Transformers, built on top of Hugging Face Transformers and designed to be easy to use while exploiting the CPU. In a previous article, I tried it to fine-tune LLMs on the CPU. It works, but it’s slow:
With Neural Speed, which relies on Intel’s Extension for Transformers, Intel further accelerates inference for 4-bit LLMs on CPUs. According to Intel, using this framework can make inference up to 40x faster than llama.cpp.
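To give an idea of what this looks like in practice, here is a minimal sketch of 4-bit CPU inference through Intel’s Extension for Transformers, which dispatches to Neural Speed under the hood. The model checkpoint and prompt are placeholders; the full setup is shown later in the article and in the notebook.

```python
# Minimal sketch, assuming intel-extension-for-transformers and neural-speed are installed.
# The checkpoint and prompt are placeholders.
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # hypothetical example checkpoint
prompt = "Once upon a time,"

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)  # prints tokens as they are generated

# load_in_4bit=True quantizes the weights to 4-bit and runs inference on the CPU
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=100)
```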
In this article, I review the main optimizations Neural Speed brings. I show how to use it and benchmark the inference throughput. I also compare it with llama.cpp.
The notebook demonstrating how to use Neural Speed is available here: