Fine-Tuning and Inference with an RTX 5090
Is the RTX 5090 noticeably faster than the previous generation for LLMs?
NVIDIA released the RTX 5090 in January 2025. While it's still difficult to find one at a reasonable price, it's starting to become available on cloud platforms.
I use RunPod (referral link) extensively, and they've begun deploying the RTX 5090. This gave me the opportunity to benchmark it for two key use cases: fine-tuning (using Transformers with TRL) and inference (with vLLM).
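To give an idea of what the fine-tuning workload looks like, here is a minimal LoRA fine-tuning sketch with TRL's SFTTrainer. The model, dataset, LoRA rank, and training hyperparameters below are placeholders, not the exact configuration benchmarked in this article.

```python
# Minimal LoRA fine-tuning sketch with Transformers + TRL (illustrative settings only).
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Small public dataset with a plain "text" column, used here purely as an example.
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")

# LoRA adapter configuration (rank and target modules are illustrative).
peft_config = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear", task_type="CAUSAL_LM")

training_args = SFTConfig(
    output_dir="./rtx5090-lora-test",
    per_device_train_batch_size=8,   # batch size is one of the knobs varied in the benchmarks
    max_seq_length=1024,             # sequence length is the other knob varied in the benchmarks
    bf16=True,
    num_train_epochs=1,
    logging_steps=10,
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-7B",         # example model, not necessarily the one used in the article
    train_dataset=dataset,
    peft_config=peft_config,
    args=training_args,
)
trainer.train()
```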
Here are the main questions I wanted to answer:
Is the RTX 5090 fast for LoRA fine-tuning?
How does it perform for LLM inference, both with quantized and non-quantized models?
This article aims to answer those questions. I'll start with the most important (and often most frustrating) part: setting up the environment, including CUDA, PyTorch, and all the dependencies needed to run models on RTX 50xx GPUs.
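As a rough illustration of the end goal of that setup, the snippet below checks that the installed PyTorch build can actually use the RTX 5090's Blackwell architecture (compute capability 12.0, i.e. sm_120). The pip command in the comment is an assumption based on the CUDA 12.8 wheels, not necessarily the exact install path used here.

```python
# Sanity check that PyTorch sees the RTX 5090 and was built with Blackwell (sm_120) support.
# Assumes a PyTorch build with CUDA 12.8, e.g.:
#   pip install torch --index-url https://download.pytorch.org/whl/cu128
import torch

print("PyTorch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("GPU:", torch.cuda.get_device_name(0))            # should report an RTX 5090
major, minor = torch.cuda.get_device_capability(0)      # Blackwell consumer GPUs report 12.0
print(f"Compute capability: {major}.{minor}")
print("sm_120 supported:", "sm_120" in torch.cuda.get_arch_list())
```

If the last line prints False, PyTorch cannot run kernels on the RTX 5090 and you need a newer build with CUDA 12.8 support.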
For comparison, I also benchmarked three other GPUs alongside the RTX 5090: the RTX 3090, RTX 4090, and RTX 5000 Ada. This gives a clearer picture of both performance and cost-effectiveness.
Each test was run across a range of sequence lengths and batch sizes: some consuming the entire GPU memory, others mimicking more constrained scenarios with smaller batches. The results report fine-tuning time and inference throughput (tokens/sec) with vLLM.
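To make the inference metric concrete, here is a minimal sketch of how throughput in tokens/sec can be measured with vLLM. The model name, prompt, batch size, and generation length are placeholders rather than the exact benchmark configuration.

```python
# Minimal vLLM throughput measurement: generated tokens divided by wall-clock time.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", max_model_len=2048)      # example model
sampling_params = SamplingParams(temperature=0.0, max_tokens=512)

prompts = ["Explain what makes a GPU fast for LLM inference."] * 32  # example batch of requests

start = time.time()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.time() - start

generated_tokens = sum(len(out.outputs[0].token_ids) for out in outputs)
print(f"Throughput: {generated_tokens / elapsed:.1f} tokens/sec")
```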
I’ve also prepared a notebook that shows the environment setup for running vLLM and fine-tuning for the RTX 5090 and 5080, with code examples. It should work with the RTX 5070 as well, although I haven’t tested that one yet.
Using Transformers and vLLM with the RTX 5090 (and 5080)