Neural Speed: Fast Inference on CPU for 4-bit Large Language Models
Up to 40x faster than llama.cpp?
Running large language models (LLMs) on consumer hardware can be challenging. If the LLM doesn’t fit in GPU memory, quantization is usually applied to reduce its size. However, even after quantization, the model might still be too large to fit on the GPU. An alternative is to run it in CPU RAM, using a framework optimized for CPU inference such as llama.cpp.
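For reference, CPU inference with a quantized model through llama.cpp typically looks like the minimal sketch below, here via the llama-cpp-python bindings. The GGUF file path and generation settings are placeholders, not a recommendation.

```python
# Minimal sketch: CPU inference with a 4-bit GGUF model via llama-cpp-python.
# The model path is hypothetical; any quantized GGUF checkpoint would do.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mistral-7b-instruct.Q4_K_M.gguf",  # hypothetical local file
    n_ctx=2048,    # context window
    n_threads=8,   # number of CPU threads used for inference
)

output = llm("Explain quantization in one sentence.", max_tokens=64)
print(output["choices"][0]["text"])
```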
Intel, inspired by llama.cpp, is also working on accelerating inference on the CPU. They propose a framework, Intel’s Extension for Transformers, built on top of Hugging Face Transformers and designed to be easy to use while exploiting the CPU. In a previous article, I tried it to fine-tune LLMs on the CPU. It works, but it’s slow:
With Neural Speed, which relies on Intel’s Extension for Transformers, Intel further accelerates inference for 4-bit LLMs on CPUs. According to Intel, using this framework can make inference up to 40x faster than llama.cpp.
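To give an idea of what this looks like in practice, here is a minimal sketch of 4-bit CPU inference through Intel’s Extension for Transformers, which dispatches to Neural Speed under the hood. The model checkpoint and prompt are placeholders; the full setup is shown later in the article and in the notebook.

```python
# Minimal sketch, assuming intel-extension-for-transformers and neural-speed are installed.
# The checkpoint and prompt are placeholders.
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # hypothetical example checkpoint
prompt = "Once upon a time,"

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)  # prints tokens as they are generated

# load_in_4bit=True quantizes the weights to 4-bit and runs inference on the CPU
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=100)
```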
In this article, I review the main optimizations Neural Speed brings. I show how to use it and benchmark the inference throughput. I also compare it with llama.cpp.
The notebook demonstrating how to use Neural Speed is available here: