vLLM v1 Engine: How Much Faster Is It for RTX and Mid-Range GPUs?

Better design and faster inference on H100s, but what about smaller GPUs?

Benjamin Marie
Mar 31, 2025

vLLM, one of the fastest LLM inference frameworks, has introduced a new inference engine called “v1”, which is now enabled by default starting with version 0.8.0. If you're still using vLLM 0.7.x, you can manually enable the new engine by setting the following environment variable:

VLLM_USE_V1=1
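
Equivalently, you can set the variable from Python before vLLM is imported. Here is a minimal sketch assuming vLLM 0.7.x; the model name is a placeholder:

import os

# The variable must be set before vllm is imported
# (assumption: vLLM 0.7.x, where the v0 engine is still the default).
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM, SamplingParams

# Placeholder model; any model supported by vLLM works here.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)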

The new v1 engine is a complete refactor of vLLM’s inference engine: the implementation is cleaner, more efficient, and more maintainable. Alongside the release, the vLLM team shared benchmark results showing significant latency reductions (i.e., faster inference) on high-end GPUs such as the H100, as well as performance gains in multi-GPU configurations.


But what about Ada Lovelace GPUs (like the RTX 4090 and RTX 6000 Ada) or even older models like the RTX 3090? Does the new engine deliver meaningful performance improvements on consumer and mid-range GPUs?

In this article, we’ll start by exploring what’s new in the v1 engine. Then, I’ll walk you through my own benchmarking results comparing the v1 engine to the older “v0” engine. The tests were conducted across a range of LLM sizes, with and without quantization, on three different GPUs: RTX 3090, RTX 4090, and RTX 6000 Ada. Finally, I’ll wrap up with a brief overview of the issues I encountered while working with vLLM since v1 became the default.

The benchmarking was performed using the following notebook:

Get the notebook (#154)
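
If you don’t have the notebook at hand, the core of such a comparison can be sketched in a few lines. This is my own minimal example, not the notebook’s code; the model, prompt batch, and generation settings are placeholders. Since the VLLM_USE_V1 variable works in both directions, you can run the script once with VLLM_USE_V1=0 and once with VLLM_USE_V1=1 to get a rough v0-vs-v1 throughput comparison:

import time

from vllm import LLM, SamplingParams

# Placeholder settings; the article's benchmark covers several model sizes,
# with and without quantization, on three different GPUs.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", gpu_memory_utilization=0.9)
prompts = ["Explain post-training quantization in one paragraph."] * 64
params = SamplingParams(temperature=0.0, max_tokens=256)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

# Count generated tokens across the batch and report decoding throughput.
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} generated tokens/s over {elapsed:.1f} s")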
