The Kaitchup – AI on a Budget
Serve Multiple LoRA Adapters with vLLM


Without any increase in latency

Benjamin Marie
Aug 01, 2024

With a LoRA adapter, we can specialize a large language model (LLM) for a task or a domain. The adapter must be loaded on top of the LLM to be used for inference. For some applications, it might be useful to serve users with multiple adapters. For instance, one adapter could perform function calling and another could perform a very different task, such as classification, translation, or other language generation tasks.
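
To make this concrete, here is a minimal sketch of loading a single LoRA adapter on top of a base model with Hugging Face PEFT. The base model ID and adapter path are placeholders, not the ones used later in this article:

```python
# Load a base model, then apply one LoRA adapter on top of it for inference.
# The base model ID and adapter path below are placeholders (assumptions).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
base = AutoModelForCausalLM.from_pretrained(base_model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base_model_id)

# The adapter's low-rank weights are loaded on top of the frozen base model.
model = PeftModel.from_pretrained(base, "path/to/function_calling_adapter")
```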

However, to use multiple adapters, a standard inference framework would have to unload the current adapter and then load the new one. This unload/load sequence can take several seconds, which degrades the user experience.

Fortunately, some open source frameworks can serve multiple adapters at the same time without any noticeable delay when switching between them. For instance, vLLM, one of the most efficient open source inference frameworks, can easily run and serve multiple LoRA adapters simultaneously.
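
Below is a minimal sketch of offline inference with two adapters in vLLM; the base model, adapter names, and adapter paths are placeholders rather than the exact ones from the article's notebook:

```python
# A minimal sketch of offline inference with two LoRA adapters in vLLM.
# The base model and adapter paths are placeholders (assumptions).
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enable_lora=True,   # activate multi-LoRA support
    max_loras=2,        # number of adapters that can be active in a batch
    max_lora_rank=16,   # must be >= the rank of the largest adapter
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

# Each request can point to a different adapter; vLLM handles them
# without unloading/reloading weights between requests.
chat_out = llm.generate(
    ["Tell me a joke about GPUs."],
    sampling_params,
    lora_request=LoRARequest("chat", 1, "path/to/chat_adapter"),
)
func_out = llm.generate(
    ["What is the weather in Paris?"],
    sampling_params,
    lora_request=LoRARequest("function_calling", 2, "path/to/function_calling_adapter"),
)
```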

Related article: vLLM: Serve Fast Mistral 7B and Llama 2 Models from Your Computer (Benjamin Marie, February 15, 2024)


In this article, we will see how to use vLLM with multiple LoRA adapters. I explain how to use LoRA adapters for offline inference and how to serve several adapters to users for online inference. For the examples, I use Llama 3 with adapters for function calling and chat.
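
As a preview, here is a minimal sketch of what online serving can look like, assuming the vLLM OpenAI-compatible server was started with the adapters registered via --lora-modules. The adapter names and paths are placeholders, not the exact ones used in the notebook:

```python
# A minimal sketch of querying a vLLM server that exposes two LoRA adapters.
# Assumes the server was launched with something like:
#   vllm serve meta-llama/Meta-Llama-3-8B-Instruct --enable-lora \
#       --lora-modules chat=path/to/chat_adapter function_calling=path/to/function_calling_adapter
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# The adapter is selected by passing its registered name as the "model";
# vLLM routes the request to that LoRA without reloading any weights.
completion = client.chat.completions.create(
    model="function_calling",
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    max_tokens=128,
)
print(completion.choices[0].message.content)
```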

The following notebook implements the code for serving multiple LoRA adapters with vLLM:

Get the notebook (#91)
