The Kaitchup – AI on a Budget

Best GPUs Under $1,500 for AI: Should You Upgrade?

Comparing mid-tier consumer GPUs, from RTX 30xx to 50xx, for running and fine-tuning LLMs

Benjamin Marie
Nov 17, 2025

We often see inference throughput and fine-tuning stats for consumer GPUs, but they mostly focus on the high end (RTX 4090/5090). What about more affordable cards: are they simply too slow, or too memory-constrained to run and fine-tune LLMs?

To find out, I benchmarked GPUs across the last three NVIDIA RTX generations: 3080 Ti, 3090, 4070 Ti, 4080 Super, 4090, 5080, and 5090. With the exception of the xx90 cards, these GPUs offer only 12–16 GB of VRAM.

Using vLLM, I measured throughput when the model fully fits in GPU memory and when part of it must be offloaded to system RAM. For fine-tuning, I evaluated both LoRA and QLoRA on 1.7B and 8B LLMs.
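For intuition on why 12–16 GB cards push you toward QLoRA at the 8B scale, here is a back-of-envelope VRAM estimate for the model weights alone. These are illustrative numbers, not the benchmark's measurements: activations, KV cache, optimizer states, and framework overhead are ignored, so real usage is noticeably higher.

```python
def weight_vram_gb(n_params: float, bits_per_param: int) -> float:
    """Rough GiB of VRAM needed just to hold the model weights."""
    return n_params * bits_per_param / 8 / 2**30

# LoRA keeps the frozen base model in 16-bit; QLoRA quantizes it to 4-bit.
lora_8b = weight_vram_gb(8e9, 16)      # ~14.9 GiB: already near a 16 GB card's limit
qlora_8b = weight_vram_gb(8e9, 4)      # ~3.7 GiB: leaves room for activations
lora_1_7b = weight_vram_gb(1.7e9, 16)  # ~3.2 GiB: fits comfortably either way
```

This is why a 1.7B model trains with plain LoRA on any card in the lineup, while 8B in 16-bit leaves essentially no headroom on a 12–16 GB GPU before a single activation is allocated.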

Benchmark code and logs: Get the notebooks (#189)

I used GPUs from RunPod (referral link) and also report cost-efficiency based on their pricing.
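Cost-efficiency here just normalizes throughput by rental price. A simple way to express it is output tokens per dollar of GPU time; the hourly rates below are placeholders for illustration, not RunPod's actual prices.

```python
def tokens_per_dollar(throughput_tok_per_s: float, hourly_usd: float) -> float:
    """Output tokens generated per dollar of GPU rental time."""
    return throughput_tok_per_s * 3600 / hourly_usd

# Hypothetical rates: a slower, cheaper card can still win on cost.
fast = tokens_per_dollar(2000, 0.89)   # e.g. a 4090-class hourly rate
cheap = tokens_per_dollar(900, 0.22)   # e.g. a 3080 Ti-class hourly rate
```

With these made-up numbers, the cheaper card produces nearly twice as many tokens per dollar despite less than half the raw throughput, which is exactly the trade-off the cost-efficiency figures are meant to surface.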

Running LLMs without High-End GPUs

To benchmark GPUs for inference throughput, use the same stack you plan to deploy. It sounds obvious, but many popular (often marketing-driven) benchmarks don't resemble real inference frameworks, so they report speeds you'll never reach in your actual use case. If you run Ollama, benchmark with Ollama and GGUF models. If you use vanilla Hugging Face Transformers, benchmark with Transformers directly.
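Whatever stack you pick, the measurement itself is framework-agnostic: time a batched generation call and divide the total number of generated tokens by the wall-clock time. A minimal sketch, where `generate_fn` is a stand-in for a call into vLLM, Ollama, or Transformers (not an API from any of those libraries):

```python
import time
from typing import Callable, Sequence

def measure_throughput(
    generate_fn: Callable[[Sequence[str]], Sequence[Sequence[int]]],
    prompts: Sequence[str],
) -> float:
    """Return output tokens per second for one batched generation call."""
    start = time.perf_counter()
    outputs = generate_fn(prompts)  # each output is a list of generated token ids
    elapsed = time.perf_counter() - start
    total_tokens = sum(len(tokens) for tokens in outputs)
    return total_tokens / elapsed
```

In practice you would warm up the engine first and average over several batches, but the core metric is just this ratio, computed with the same library you intend to serve with.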

Different libraries ship different kernel implementations, each optimized to varying degrees for specific GPU generations.
