vLLM v1 Engine: How Much Faster Is It for RTX and Mid-Range GPUs?
A better design and faster inference on H100s, but what about smaller GPUs?
vLLM, one of the fastest LLM inference frameworks, has introduced a new inference engine called “v1”, which is now enabled by default starting with version 0.8.0. If you're still using vLLM 0.7.x, you can manually enable the new engine by setting the following environment variable:
VLLM_USE_V1=1
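If you run vLLM through its Python API rather than from the command line, the same switch can be applied in code. The following is only a minimal sketch of that idea, using vLLM's offline LLM API; the model name is a placeholder, not one of the models benchmarked below.

import os

# Setting the variable before vLLM is imported is the safest option.
# It is only needed on vLLM 0.7.x; v1 is the default engine from 0.8.0 onward.
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model name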
The new v1 engine is a complete refactor of vLLM’s inference engine, bringing a lot of improvements. The implementation is now cleaner, more efficient, and more maintainable. Alongside the release, the vLLM team shared benchmark results demonstrating significant latency reductions (i.e., faster inference) on high-end GPUs such as the H100. They've also observed performance gains in multi-GPU configurations.
But what about Ada Lovelace GPUs (like the RTX 4090 and RTX 6000 Ada) or even older models like the RTX 3090? Does the new engine deliver meaningful performance improvements on consumer and mid-range GPUs?
In this article, we’ll start by exploring what’s new in the v1 engine. Then, I’ll walk you through my own benchmarking results comparing the v1 engine to the older “v0” engine. The tests were conducted across a range of LLM sizes, with and without quantization, on three different GPUs: RTX 3090, RTX 4090, and RTX 6000 Ada. Finally, I’ll wrap up with a brief overview of the issues I encountered while working with vLLM since v1 became the default.
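For context, the kind of measurement behind these comparisons can be sketched as follows: load a model with vLLM's offline LLM API, generate with fixed sampling settings, and time the generation. This is only an illustrative sketch, not the notebook used for this article; the model, prompts, batch size, and sampling settings are assumptions.

import time

from vllm import LLM, SamplingParams

# Illustrative setup; in practice, several model sizes and GPUs are compared.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", gpu_memory_utilization=0.90)
sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

# A small batch of identical prompts keeps the comparison simple.
prompts = ["Explain the difference between latency and throughput."] * 8

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{total_tokens} tokens generated in {elapsed:.2f} s "
      f"({total_tokens / elapsed:.1f} tokens/s)")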
The benchmarking was performed using the following notebook: