Serve Large Language Models from Your Computer with Text Generation Inference
Examples with the instruct version of Falcon-7B
Running very large language models (LLMs) locally, on consumer hardware, is now possible thanks to quantization methods such as QLoRA and GPTQ.
Considering how long it takes to load an LLM, we may also want to keep the model in memory so that we can query it and get results instantly. With a standard inference pipeline, you must reload the model for each query, and if the model is very large, you may have to wait several minutes before it even produces an output.
There are various frameworks that can host LLMs on a server (locally or remotely). On my blog, I have already presented the Triton Inference Server, a highly optimized framework developed by NVIDIA to serve multiple LLMs and balance the load across GPUs. But if you have only one GPU and want to host the model on your own computer, Triton Inference Server may be unsuitable.
In this article, I present an alternative called Text Generation Inference, a more straightforward framework that implements the minimal features needed to run and serve LLMs on consumer hardware.
After reading this article, you will have a chat model/LLM deployed locally on your computer, waiting for your queries.
Text Generation Inference
Text Generation Inference (TGI) is a framework written in Rust and Python for deploying and serving LLMs. It is developed by Hugging Face and distributed under the HFOILv1.0 license. Hugging Face uses it in production to power its inference widgets.
Note: TGI was originally distributed with an Apache 2.0 License. Hugging Face recently switched to an HFOILv1.0 license. You can still use TGI for commercial purposes, but not if it’s for serving LLMs to your users/customers. In other words, Hugging Face doesn’t want to create competitors using its own open-source library.
Even though TGI has been optimized for A100 GPUs, I found TGI very suitable for self-hosted LLMs, on consumer hardware such as RTX GPUs, thanks to the support for quantization and paged attention. However, it requires a particular installation to support RTX GPUs, which I will detail later in this article.
Recently, I also found out that Hugging Face is optimizing some LLM architectures so that they run faster with TGI.
This is notably the case for the Falcon models which are relatively slow when run with a standard inference pipeline but much faster when run with TGI. One of the authors behind the Falcon models told me on Twitter that this is because they rushed the implementation of multi-query attention while Hugging Face optimized it to work with TGI.
Several LLM architectures are optimized this way to run faster with TGI: BLOOM, OPT, GPT-NeoX, etc. The full list is available and regularly updated on TGI’s GitHub.
Set up Text Generation Inference
Hardware and Software Requirements
I tested with an RTX 3060 12 GB. It should work with all RTX 30xx and 40xx GPUs, but note that TGI is specifically optimized for A100 GPUs.
To run the commands, you will need a UNIX OS. I used Ubuntu 20.04 through Windows WSL2.
It should also work without modification on macOS.
TGI requires Python ≥ 3.9.
I will first present how to install TGI from scratch, which I found not entirely straightforward. If you run into issues during the installation, you may need to run a Docker image instead. I’ll cover both scenarios.
Set up
TGI is written in Rust, so you need Rust installed. If you don’t have it, run the following command in your terminal:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
It should take less than 2 minutes. I recommend restarting your shell, e.g., opening a new terminal, to make sure that all your environment variables are correctly updated.
Then, we create a dedicated conda environment. This step is optional but I prefer to have a separate environment for each one of my projects.
conda create -n text-generation-inference python=3.9
conda activate text-generation-inference
We also have to install Protoc. Hugging Face currently recommends version 21.12. You will need sudo privileges for this.
PROTOC_ZIP=protoc-21.12-linux-x86_64.zip
curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP
sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
rm -f $PROTOC_ZIP
We have installed all the requirements. Now, we can install TGI.
First, clone the GitHub repository:
git clone https://github.com/huggingface/text-generation-inference.git
And then install TGI:
cd text-generation-inference/
BUILD_EXTENSIONS=False make install
Note: I set BUILD_EXTENSIONS to False to deactivate the custom CUDA kernels since I don’t have A100 GPUs.
It should install smoothly… On my computer, it didn’t: I had to run all the commands in the file server/Makefile by hand. I suspect my environment variables were not loaded properly because “make” switched to a different shell. You may have to do the same.
Note: If you fail to install it, don’t worry! Hugging Face has created a Docker image that you can launch to start the server, as we will see in the next section.
Launching a model with TGI
For the following examples, I use the instruct version of the Falcon-7B model, which is distributed under an Apache 2.0 license. If you want to know more about the Falcon models, I presented them in a previous article.
Without Docker
The installation created a new command “text-generation-launcher” that will start the TGI server.
text-generation-launcher --model-id tiiuae/falcon-7b-instruct --num-shard 1 --port 8080 --quantize bitsandbytes
model-id: The model name on the Hugging Face Hub.
num-shard: Set it to the number of GPUs you want to use.
port: The port on which you want the server to listen.
quantize: If you are using a GPU with less than 24 GB of VRAM, you will need to quantize the model to avoid running out of memory. Here I choose “bitsandbytes” for on-the-fly quantization. GPTQ (“gptq”) is also available but I’m less familiar with this algorithm.
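Once the server is up, a quick way to check that it responds is to call its /generate REST endpoint directly from Python (the same endpoint is queried with curl later in this article). Here is a minimal sketch using the requests library, assuming the server listens on port 8080 as configured above:

```python
# Quick sanity check: send a tiny generation request to the TGI server.
import requests

payload = {
    "inputs": "What is Deep Learning?",
    "parameters": {"max_new_tokens": 20},
}
response = requests.post("http://127.0.0.1:8080/generate", json=payload, timeout=120)
print(response.status_code)  # 200 if the server answered the request
print(response.json())       # JSON response containing the generated text
```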
With Docker (if the manual installation failed)
Note: if the Docker daemon is not running, and if you run Ubuntu through WSL, start the daemon in another terminal with “sudo dockerd”.
volume=$PWD/data
sudo docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.9 --model-id tiiuae/falcon-7b-instruct --num-shard 1 --quantize bitsandbytes
The parameters are almost the same as with text-generation-launcher. If you want to expose only your first GPU to the container, you can replace “all” with “device=0”.
Keep this Docker container running for as long as you want to use the server.
Querying a model with TGI
To query the model served by TGI with a Python script, you will have to install the following library:
pip install text-generation
Then in a Python script, write something like this:
from text_generation import Client
client = Client("http://127.0.0.1:8080")
print(client.generate("Translate the following into French: 'What is Deep Learning?'", max_new_tokens=500).generated_text)
It should print:
Qu'est-ce que la profondeur de l'apprentissage ?
This is a poor translation, but that is expected from a 7-billion-parameter model. It does slightly better at coding tasks:
from text_generation import Client
client = Client("http://127.0.0.1:8080")
print(client.generate("Code in Javascript a function to remove all spaces in a string and then print the string twice.", max_new_tokens=500).generated_text)
It generates:
Here is an example code snippet in JavaScript to remove all spaces in a string and then print the string twice:
```javascript
function removeSpaces(str) {
return str.replace(/\s+/g, '');
}
console.log(removeSpaces('Hello World'));
console.log(removeSpaces('Hello World'));
```
You can also query with curl instead of a Python script:
curl 127.0.0.1:8080/generate \
-X POST \
-d '{"inputs":"Code in Javascript a function to remove all spaces in a string and then print the string twice.","parameters":{"max_new_tokens":500}}' \
-H 'Content-Type: application/json'
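The text-generation client can also stream tokens as they are generated, which is convenient for chat-like interfaces. Below is a minimal sketch assuming the client’s generate_stream method; check the library’s documentation for the exact fields of the streamed responses:

```python
from text_generation import Client

client = Client("http://127.0.0.1:8080")

text = ""
for response in client.generate_stream(
    "Translate the following into French: 'What is Deep Learning?'",
    max_new_tokens=500,
):
    # Skip special tokens (e.g., end-of-sequence) and accumulate the rest.
    if not response.token.special:
        text += response.token.text
print(text)
```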
TGI is indeed fast. Generating an output with Falcon-7B and a maximum number of tokens set to 500 takes only a few seconds with my RTX 3060 GPU.
With the standard inference pipeline, it takes nearly 40 seconds, without even counting the time it takes to load the model.
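For reference, the “standard inference pipeline” I compare against is simply the transformers pipeline, which reloads the model on every run. Here is a rough sketch of what it looks like, assuming 8-bit loading with bitsandbytes so that Falcon-7B fits in 12 GB of VRAM (exact arguments may vary with your transformers version):

```python
# Baseline: the standard transformers pipeline, reloaded at every run.
from transformers import AutoTokenizer, pipeline

model_id = "tiiuae/falcon-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

pipe = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    device_map="auto",
    trust_remote_code=True,               # Falcon ships custom modeling code
    model_kwargs={"load_in_8bit": True},  # bitsandbytes quantization, as with --quantize bitsandbytes
)

output = pipe(
    "Translate the following into French: 'What is Deep Learning?'",
    max_new_tokens=500,
)
print(output[0]["generated_text"])
```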
Conclusion
Self-hosting a chat model (i.e., an instruct LLM) has many advantages. The main one is that you don’t send your data over the Internet. Another one is that you fully control the operating cost, which comes down to your electricity bill.
However, if you use a consumer GPU, you won’t be able to run state-of-the-art LLMs. Even smaller LLMs have to be quantized to run on GPUs equipped with less than 24 GB of VRAM, and quantization also reduces their accuracy.
Nonetheless, even small quantized LLMs can still be good for simple tasks: simple coding problems, binary classification, …
You can now do all these tasks on your computer, just by querying your self-hosted LLM.