The Kaitchup – AI on a Budget

Neural Speed: Fast Inference on CPU for 4-bit Large Language Models
Up to 40x faster than llama.cpp?

Benjamin Marie
Apr 15, 2024

Running large language models (LLMs) on consumer hardware can be challenging. If the LLM doesn’t fit in GPU memory, quantization is usually applied to reduce its size. However, even after quantization, the model might still be too large for the GPU. An alternative is to run it from CPU RAM with a framework optimized for CPU inference, such as llama.cpp.

GGUF Quantization for Fast and Memory-Efficient Inference on Your CPU
Benjamin Marie · February 29, 2024
Read full story
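For context, this is what CPU inference with llama.cpp typically looks like through the llama-cpp-python bindings. It is a minimal sketch, not the setup benchmarked later in this article; the GGUF file name, thread count, and prompt are placeholders.

```python
# Minimal CPU inference with llama.cpp via the llama-cpp-python bindings.
# The GGUF file name below is a placeholder for any 4-bit quantized model.
from llama_cpp import Llama

llm = Llama(
    model_path="./mistral-7b-instruct.Q4_K_M.gguf",  # hypothetical 4-bit GGUF
    n_threads=8,   # CPU threads used for generation
    n_ctx=2048,    # context window
)

output = llm("Q: Why quantize a language model? A:", max_tokens=64)
print(output["choices"][0]["text"])
```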

Intel, inspired by llama.cpp, is also working on accelerating inference on CPUs. It proposes the Intel Extension for Transformers, a framework built on top of Hugging Face Transformers that makes it easy to exploit the CPU. In a previous article, I tried it to fine-tune LLMs on the CPU. It works, but it’s slow:

Fine-tune LLMs on Your CPU with QLoRA
Benjamin Marie · January 4, 2024
Read full story

With Neural Speed, which relies on the Intel Extension for Transformers, Intel further accelerates inference for 4-bit LLMs on CPUs. According to Intel, this framework can make inference up to 40x faster than llama.cpp.
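As a preview of how it is used, Neural Speed is reached through the extension’s drop-in AutoModelForCausalLM: passing load_in_4bit=True quantizes the model to 4-bit and, on CPU, routes inference through the Neural Speed runtime. Below is a minimal sketch along those lines, based on Neural Speed’s documented usage; the checkpoint name is only an example, not necessarily the model benchmarked here.

```python
# Minimal sketch of 4-bit CPU inference with Neural Speed, reached through
# the Intel Extension for Transformers' drop-in AutoModelForCausalLM.
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "mistralai/Mistral-7B-v0.1"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
input_ids = tokenizer("Once upon a time,", return_tensors="pt").input_ids

# load_in_4bit=True triggers 4-bit quantization; on CPU, generation runs
# on the Neural Speed runtime.
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.batch_decode(outputs)[0])
```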


In this article, I review the main optimizations Neural Speed brings. I show how to use it and benchmark the inference throughput. I also compare it with llama.cpp.
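For throughput, a simple approach is to time model.generate and divide the number of newly generated tokens by the elapsed time. The sketch below, reusing the model and tokenizer from the previous snippet, illustrates the idea; it is a generic measurement, not necessarily the exact protocol used for the benchmark in this article.

```python
# Generic tokens-per-second measurement; not necessarily the exact
# benchmarking protocol used in the article.
import time

input_ids = tokenizer("Once upon a time,", return_tensors="pt").input_ids

start = time.time()
outputs = model.generate(input_ids, max_new_tokens=256)
elapsed = time.time() - start

# len() works whether generate returns tensors or lists of token ids.
new_tokens = len(outputs[0]) - len(input_ids[0])
print(f"{new_tokens / elapsed:.1f} tokens/s")
```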

The notebook demonstrating how to use Neural Speed is available here:

Get the notebook (#60)
