The Kaitchup – AI on a Budget
Serve Multiple LoRA Adapters with vLLM


Without any increase in latency

Benjamin Marie
Aug 01, 2024

With a LoRA adapter, we can specialize a large language model (LLM) for a task or a domain. The adapter must be loaded on top of the LLM to be used for inference. For some applications, it might be useful to serve users with multiple adapters. For instance, one adapter could perform function calling and another could perform a very different task, such as classification, translation, or other language generation tasks.
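
To make this concrete, here is a minimal sketch of loading a single LoRA adapter on top of a base model with Hugging Face PEFT. The base model ID and adapter path are placeholders, not the ones used later in this article:

```python
# Load a base model, then apply one LoRA adapter on top of it for inference.
# The base model ID and adapter path below are placeholders (assumptions).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
base = AutoModelForCausalLM.from_pretrained(base_model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base_model_id)

# The adapter's low-rank weights are loaded on top of the frozen base model.
model = PeftModel.from_pretrained(base, "path/to/function_calling_adapter")
```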

However, to use multiple adapters, a standard inference framework would have to unload the current adapter and then load the new one. This unload/load sequence can take several seconds, which degrades the user experience.

Fortunately, some open source frameworks can serve multiple adapters at the same time without any noticeable delay when switching between them. For instance, vLLM, one of the most efficient open source inference frameworks, can easily run and serve multiple LoRA adapters simultaneously.
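
Below is a minimal sketch of offline inference with two adapters in vLLM; the base model, adapter names, and adapter paths are placeholders rather than the exact ones from the article's notebook:

```python
# A minimal sketch of offline inference with two LoRA adapters in vLLM.
# The base model and adapter paths are placeholders (assumptions).
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enable_lora=True,   # activate multi-LoRA support
    max_loras=2,        # number of adapters that can be active in a batch
    max_lora_rank=16,   # must be >= the rank of the largest adapter
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

# Each request can point to a different adapter; vLLM handles them
# without unloading/reloading weights between requests.
chat_out = llm.generate(
    ["Tell me a joke about GPUs."],
    sampling_params,
    lora_request=LoRARequest("chat", 1, "path/to/chat_adapter"),
)
func_out = llm.generate(
    ["What is the weather in Paris?"],
    sampling_params,
    lora_request=LoRARequest("function_calling", 2, "path/to/function_calling_adapter"),
)
```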

Related article: vLLM: Serve Fast Mistral 7B and Llama 2 Models from Your Computer (Benjamin Marie, February 15, 2024)


In this article, we will see how to use vLLM with multiple LoRA adapters. I explain how to use LoRA adapters for offline inference and how to serve several adapters to users for online inference. For the examples, I use Llama 3 with adapters for function calling and chat.
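
As a preview, here is a minimal sketch of what online serving can look like, assuming the vLLM OpenAI-compatible server was started with the adapters registered via --lora-modules. The adapter names and paths are placeholders, not the exact ones used in the notebook:

```python
# A minimal sketch of querying a vLLM server that exposes two LoRA adapters.
# Assumes the server was launched with something like:
#   vllm serve meta-llama/Meta-Llama-3-8B-Instruct --enable-lora \
#       --lora-modules chat=path/to/chat_adapter function_calling=path/to/function_calling_adapter
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# The adapter is selected by passing its registered name as the "model";
# vLLM routes the request to that LoRA without reloading any weights.
completion = client.chat.completions.create(
    model="function_calling",
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    max_tokens=128,
)
print(completion.choices[0].message.content)
```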

The following notebook implements the code for serving multiple LoRA adapters with vLLM:

Get the notebook (#91)
