Serve Multiple LoRA Adapters with vLLM and Custom Chat Templates
Swap adapters per request, reuse your chat template, and run offline or via an OpenAI-compatible server.
LoRA adapters let you specialize a base LLM for specific tasks or domains by attaching low-rank weight deltas to selected layers. At inference time, the adapter must be loaded alongside the base model, and many applications benefit from serving several adapters, e.g., one for function calling and others for classification, translation, or general generation.
In a standard setup, switching tasks means unloading one adapter and loading another, which can add seconds of latency. Modern open-source servers avoid this by keeping multiple LoRA adapters resident and selecting the appropriate one per request. For example, vLLM can host several adapters simultaneously and apply them on demand with negligible switch time, subject to GPU memory limits.
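To make that concrete, here is a minimal sketch of the idea using vLLM's offline Python API: LoRA support is enabled when the engine starts, and each call attaches a `LoRARequest` naming the adapter it wants. The base checkpoint, adapter names, paths, and rank below are placeholders, not the exact adapters used later in this guide.

```python
# Minimal sketch: two LoRA adapters resident in one vLLM engine,
# selected per request. Checkpoint, names, and paths are placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="Qwen/Qwen3-0.6B",  # base model (placeholder checkpoint)
    enable_lora=True,          # allow LoRA adapters on this engine
    max_loras=2,               # adapters kept resident at once
    max_lora_rank=16,          # must cover the adapters' LoRA rank
)

params = SamplingParams(temperature=0.0, max_tokens=128)

# Each request picks its adapter; no unload/reload between calls.
out_fr = llm.generate(
    "Translate to French: The weather is nice today.",
    params,
    lora_request=LoRARequest("french", 1, "/path/to/lora-french"),
)
out_ja = llm.generate(
    "Translate to Japanese: The weather is nice today.",
    params,
    lora_request=LoRARequest("japanese", 2, "/path/to/lora-japanese"),
)
```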
This guide shows how to run vLLM with multiple LoRA adapters and a custom chat template, both offline (Python API) and online (HTTP server). As a running example, it uses two adapters fine-tuned for French and Japanese translation on a Qwen3 base model, and keeps the exact same custom chat template used during fine-tuning.
We’ll load several adapters side by side, route each request to the desired adapter, and register a custom chat template so prompts match training. The accompanying notebook contains the full, runnable code for serving multiple LoRA adapters with vLLM.
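For the online path, the shape of the setup looks roughly like the sketch below: the OpenAI-compatible server is launched with LoRA enabled, the adapters registered by name, and the chat template passed in, and a client then routes a request to an adapter simply by using its registered name as the model. The checkpoint, adapter names, paths, and template filename are placeholders; the full, working commands are in the notebook.

```python
# Sketch of the online path. The server is started separately, e.g.:
#   vllm serve Qwen/Qwen3-0.6B \
#     --enable-lora \
#     --lora-modules french=/path/to/lora-french japanese=/path/to/lora-japanese \
#     --chat-template ./chat_template.jinja
# Checkpoint, adapter names, paths, and template file are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Selecting an adapter is just a matter of passing its registered name
# as the model for that request.
response = client.chat.completions.create(
    model="french",  # routes this request to the "french" LoRA adapter
    messages=[{"role": "user", "content": "Translate to French: Good morning!"}],
    temperature=0.0,
)
print(response.choices[0].message.content)
```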
If you want to know how I fine-tuned the translation adapters, I used the same code explained in this article: