With a LoRA adapter, we can specialize a large language model (LLM) for a task or a domain. For inference, the adapter must be loaded on top of the LLM. For some applications, it can be useful to serve users with multiple adapters. For instance, one adapter could perform function calling while another handles a very different task, such as classification, translation, or another language generation task.
However, to use multiple adapters, a standard inference framework would have to unload the current adapter and then load the new one. This unload/load sequence can take several seconds, which degrades the user experience.
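To make that cost concrete, here is a rough sketch of the slow pattern, assuming a Hugging Face Transformers + PEFT setup where only one adapter is applied at a time (the model name, adapter paths, and adapter names are placeholders; PEFT itself can also keep several adapters in memory, but the point is that reloading adapter weights on every switch adds latency):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model once.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Attach the first adapter (placeholder path).
model = PeftModel.from_pretrained(
    base, "/path/to/function_calling_adapter", adapter_name="function_calling"
)
# ... serve function-calling requests ...

# Switching tasks: drop the current adapter, then load and activate the next one.
# The load_adapter call reads the new adapter weights from disk, which is what
# introduces the multi-second pause described above.
model.delete_adapter("function_calling")
model.load_adapter("/path/to/chat_adapter", adapter_name="chat")
model.set_adapter("chat")
# ... serve chat requests ...
```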
Fortunately, there are open source frameworks that can serve multiple adapters at the same time, without any noticeable delay when switching between them. For instance, vLLM, one of the most efficient open source inference frameworks, can easily run and serve multiple LoRA adapters simultaneously.
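As a preview, here is a minimal sketch of what this looks like with vLLM's offline API; the model name and adapter paths are placeholders. Each adapter is attached per request with a LoRARequest, so consecutive requests can use different adapters without any reloading in between:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Enable LoRA support and allow two adapters to be resident at once.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enable_lora=True,
    max_loras=2,        # number of adapters kept in GPU memory simultaneously
    max_lora_rank=16,   # assumption: the adapters have rank <= 16
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

# Each LoRARequest pairs an adapter name, a unique integer id, and a local path.
function_calling_lora = LoRARequest("function_calling", 1, "/path/to/function_calling_adapter")
chat_lora = LoRARequest("chat", 2, "/path/to/chat_adapter")

# Both adapters stay loaded, so switching between them costs nothing noticeable.
out_fc = llm.generate(["<function-calling prompt>"], sampling_params, lora_request=function_calling_lora)
out_chat = llm.generate(["<chat prompt>"], sampling_params, lora_request=chat_lora)
```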
In this article, we will see how to use vLLM with multiple LoRA adapters. I explain how to use LoRA adapters with offline inference and how to serve several adapters to users for online inference. I use Llama 3 for the examples, with adapters for function calling and chat.
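For online inference, the same idea applies to vLLM's OpenAI-compatible server: adapters registered at launch can be selected per request by name. The sketch below assumes placeholder adapter paths and the default local endpoint:

```python
# Start the server with both adapters registered (shell command shown as a
# comment; model name and paths are placeholders):
#
#   vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
#     --enable-lora \
#     --lora-modules function_calling=/path/to/function_calling_adapter chat=/path/to/chat_adapter
#
# Each request then selects an adapter by passing its registered name as "model".
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="function_calling",            # adapter name registered at launch
    prompt="<function-calling prompt>",
    max_tokens=256,
    temperature=0.0,
)
print(response.choices[0].text)
```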
The following notebook implements the code for serving multiple LoRA adapters with vLLM: