Fine-Tuning and Inference with an RTX 5090

Is the RTX 5090 noticeably faster than the previous generation for LLMs?

Benjamin Marie
Mar 24, 2025
[Image: a cartoon llama standing guard over a herd of GPUs in a field. Generated with ChatGPT.]

NVIDIA released the RTX 5090 in January. While it's still difficult to find one at a reasonable price, it's starting to become available on cloud platforms.

I use RunPod (referral link) extensively, and they've begun deploying the RTX 5090. This gave me the opportunity to benchmark it for two key use cases: fine-tuning (using Transformers with TRL) and inference (with vLLM).

Here are the main questions I wanted to answer:

  • Is the RTX 5090 fast for LoRA fine-tuning? (A minimal sketch of this workload follows the list.)

  • How does it perform for LLM inference, with both quantized and non-quantized models?
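
To make the first workload concrete, here is a minimal sketch of a LoRA fine-tuning run with Transformers and TRL. The model name, dataset, and hyperparameters are illustrative placeholders, not the exact configuration benchmarked in this article.

```python
# Minimal LoRA fine-tuning sketch with Transformers + TRL.
# Model, dataset, and hyperparameters are placeholders, not the
# article's benchmark configuration.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")

peft_config = LoraConfig(
    r=16,                         # LoRA rank
    lora_alpha=32,
    target_modules="all-linear",  # attach adapters to all linear layers
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="./lora-out",
    per_device_train_batch_size=8,  # vary this and max_seq_length to probe memory limits
    max_seq_length=1024,
    bf16=True,                      # bfloat16 is supported on RTX 30xx and later
    num_train_epochs=1,
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B",  # placeholder model
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()
```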


This article aims to answer those questions. I’ll start by covering the most important, and often most frustrating, part: setting up the environment, including CUDA, PyTorch, and all the dependencies required to run RTX 50xx GPUs.
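
The core of the problem, in short: the RTX 5090 is a Blackwell GPU reporting compute capability sm_120, which the stable PyTorch builds available at launch were not compiled for. The sketch below is a sanity check assuming a PyTorch build against CUDA 12.8 (at the time of writing, a nightly wheel); the full, tested steps are in the notebook.

```python
# Assumed setup (as of early 2025): the RTX 5090 (Blackwell, sm_120) needs a
# PyTorch build compiled against CUDA 12.8, e.g. a nightly wheel:
#   pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128
import torch

assert torch.cuda.is_available(), "No usable CUDA device found"
print(torch.__version__, "CUDA", torch.version.cuda)
print(torch.cuda.get_device_name(0))             # should report an RTX 5090
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: sm_{major}{minor}")  # expected: sm_120
```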

For points of comparison, I also benchmarked three other GPUs alongside the RTX 5090: the RTX 3090, RTX 4090, and RTX 5000 Ada. This gives a clearer picture of both performance and cost-effectiveness.

Each test was run across a range of sequence lengths and batch sizes: some filling the entire GPU memory, others mimicking more constrained scenarios with smaller batches. The results report fine-tuning time and inference throughput (tokens/sec) with vLLM.
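
As an illustration of how such a throughput number can be measured, here is a minimal vLLM sketch; the model and sampling settings are placeholders, not the benchmark's exact setup.

```python
# Minimal vLLM throughput sketch: generate completions for a batch of
# prompts and report output tokens per second. Placeholders throughout.
import time
from vllm import LLM, SamplingParams

# For a quantized model, point this at a quantized checkpoint (e.g. an AWQ repo).
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
params = SamplingParams(max_tokens=256, temperature=0.0)

prompts = ["Explain LoRA fine-tuning in one paragraph."] * 64  # batch of prompts
start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} output tokens/sec")
```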

I’ve also prepared a notebook that walks through the environment setup for running vLLM and fine-tuning on the RTX 5090 and 5080, with code examples. It should also work with the RTX 5070, although I haven’t tested that GPU yet.

Get the notebook (#153)

Using Transformers and vLLM with the RTX 5090 (and 5080)
