RTX 50 and DIGITS: What Does It Mean for Local AI?
Fine-tuning a 2-bit Llama 4 70B with a single consumer GPU
NVIDIA has officially unveiled the RTX 50 GPUs at CES 2025, and it’s big news for the AI community. NVIDIA GPUs have long been the go-to choice for running and fine-tuning large language models (LLMs), so every new release of GPUs is watched closely by AI enthusiasts and professionals alike.
What makes the RTX 50 series particularly exciting is their potential as "consumer" GPUs. They’re powerful, relatively affordable, and perfectly suited for local AI setups.
In a previous article, we benchmarked several NVIDIA GPUs and found the RTX 3090 and 4090 to be the most cost-effective for tasks like parameter-efficient fine-tuning (LoRA) and small-batch inference, especially for LLMs that fit within 24 GB of VRAM.
Will the RTX 5090 be even more cost-effective? And what new possibilities for local LLM applications could it unlock?
In this article, we’ll break down everything we know so far about the RTX 50 series, including what they make possible for local AI. We’ll also take a look at DIGITS, NVIDIA’s upcoming device designed for local LLM applications, featuring a massive 128 GB of memory.
An RTX 5090 with 32 GB and an RTX 5070 as fast as the RTX 4090!
As expected with consumer GPUs, Jensen Huang (NVIDIA's CEO) at CES 2025 and NVIDIA's official blog post focused heavily on the RTX 50 series' capabilities for video games and 3D rendering. And to be fair, it's impressive, especially with DLSS 4, whose AI frame generation produces most of the displayed frames from only a small fraction of natively rendered pixels. Because the GPU only has to render that fraction and lets the neural network fill in the rest, it handles complex scenes with intricate lighting far more efficiently.
That said, the RTX 5090 isn’t limited to gaming and rendering. Its increased power also makes it a strong option for running and fine-tuning LLMs locally.
The RTX 5090 represents a significant leap forward in power compared to the previous Ada architecture. While we don’t yet have all the details (or at least I couldn’t find them), NVIDIA has likely optimized it for low-precision computation formats like FP8, making it even more efficient for local AI tasks.
However, the most exciting feature is, in my opinion, its memory capacity. With 32 GB of VRAM, the RTX 5090 will be the first “consumer GPU” to offer this level of memory. While a 33% increase over the 24 GB of the RTX 3090 and 4090 might not sound like much, it’s a game-changer for those running LLMs locally on a single GPU.
For example, LoRA fine-tuning for 8B and 9B parameter LLMs with large vocabularies, such as Gemma and Qwen, is currently limited to short sequences and small batches on 24 GB GPUs. The additional 8 GB of VRAM on the RTX 5090 will enable longer context lengths and larger batch sizes, significantly speeding up the process.
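To make that concrete, here is a minimal LoRA fine-tuning sketch with Hugging Face transformers and peft. The model ID is just a placeholder for any 8B–9B model with a large vocabulary, and the sequence-length and batch-size figures in the final comment are my own rough assumptions, not benchmarks.

```python
# Minimal LoRA fine-tuning sketch (assumed setup, not a tuned recipe).
# On a 24 GB card, an 8B-9B model with a large vocabulary typically forces
# short sequences and a small batch; 32 GB should allow longer/larger ones.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "google/gemma-2-9b"  # placeholder: any 8B-9B model with a large vocabulary

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.gradient_checkpointing_enable()  # trades compute for memory

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Rough knobs that VRAM dictates (assumptions, not measurements):
#   24 GB: sequence length ~512-1024 tokens, per-device batch size 1-2
#   32 GB: roughly double the sequence length or the batch size
```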
Even more exciting is what this means for larger models. With 32 GB of VRAM, 13B and 14B parameter models will become the largest that can be fine-tuned with LoRA at 16-bit precision on a consumer GPU like the RTX 5090. But it doesn’t stop there. The RTX 5090’s expanded memory also opens up the possibility of fine-tuning LoRA adapters for 70B parameter models quantized to 2-bit precision, something currently impossible with 24 GB GPUs.
For context, most 70B models, even when quantized to 2-bit, consume nearly 20 GB of memory, leaving only 4 GB for buffers, activations, and other processing requirements. This is insufficient. The extra 8 GB on the RTX 5090 resolves this, allowing for efficient fine-tuning of large models like Qwen3 and Llama 4 ~70B, expected later this year, on a single consumer GPU.
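A back-of-the-envelope estimate makes the arithmetic explicit. All of the overhead figures below are rough assumptions; actual usage depends on the quantization scheme, the framework, and the sequence length.

```python
# Back-of-the-envelope VRAM estimate for LoRA fine-tuning a 2-bit 70B model.
# Every overhead term here is a rough assumption, not a measurement.
GiB = 1024**3

n_params = 70e9
weights_2bit = n_params * 2 / 8 / GiB      # ~16.3 GiB for the packed weights
quant_overhead = 3.0                       # assumed scales/zero-points, in GiB
lora_adapters = 0.5                        # 16-bit adapters + optimizer states (assumed)
activations_and_buffers = 6.0              # assumed for a modest batch and sequence

total = weights_2bit + quant_overhead + lora_adapters + activations_and_buffers
print(f"Estimated total: {total:.1f} GiB")
# ~26 GiB: more than a 24 GB card can hold, comfortably within 32 GB.
```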
In short, the RTX 5090 could redefine what’s possible for local AI development. Fine-tuning massive models like a 2-bit quantized 70B LLM with a single GPU will soon be within reach.
Will the RTX 5090 be Cost-Effective for LLMs?
Yes! It will launch at $2,000, which is not far from the current street price of the RTX 4090. It’s unclear how the RTX 5090 will perform on LLM workloads, but even if it’s only 20% faster, it will be more cost-effective than the RTX 3090/4090.
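As a rough illustration of that break-even logic, here is the arithmetic with an assumed RTX 4090 street price and the hypothetical 20% speedup from above; neither number is a measurement.

```python
# Rough price/performance comparison (illustrative assumptions only).
# Throughput is normalized to the RTX 4090 = 1.0; 1.2 is the hypothetical
# "only 20% faster" scenario, not a benchmark result.
gpus = {
    "RTX 4090 (assumed street price)": {"price_usd": 1900, "relative_throughput": 1.0},
    "RTX 5090 (MSRP)": {"price_usd": 1999, "relative_throughput": 1.2},
}

for name, g in gpus.items():
    cost_per_unit = g["price_usd"] / g["relative_throughput"]
    print(f"{name}: ${cost_per_unit:.0f} per unit of throughput")
# ~$1900 vs ~$1666 per unit of throughput: at these assumed prices, even a
# modest 20% speedup makes the RTX 5090 the cheaper card per unit of work.
```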
However, we will have to be patient for that cost-effectiveness to materialize. The RTX 5090 goes on sale on January 30th, but the initial stock will almost certainly sell out instantly, followed by a shortage that pushes prices well above MSRP, probably for several months. Based on what happened with the RTX 4090, I wouldn’t expect it to be affordable again before the second half of 2025 at the earliest.
You can check The Weekly Kaitchup to follow how the prices of the RTX cards evolve.
The main downside of the RTX 5090 is its power consumption. At 575 W, it draws 125 W more than the RTX 4090. Putting four of these cards in one local machine will be challenging.
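To see why, here is a quick tally of board power alone for a hypothetical four-GPU build; CPU, storage, and PSU efficiency headroom come on top of this.

```python
# Board power for a hypothetical multi-GPU workstation (GPUs only).
tdp_w = {"RTX 4090": 450, "RTX 5090": 575}
n_gpus = 4

for name, watts in tdp_w.items():
    print(f"{n_gpus}x {name}: {n_gpus * watts} W")
# 4x RTX 4090: 1800 W vs 4x RTX 5090: 2300 W -- the latter already exceeds
# what a typical 15 A / 120 V household circuit (~1800 W) can deliver.
```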
The Mid-Range RTX 50s Show Promise Too!
NVIDIA also announced 3 other cards: the RTX 5080, the RTX 5070 Ti, and the RTX 5070.
It’s difficult to know how they will perform, but they certainly look like good deals. The RTX 5080 will launch at the current price of the RTX 4080.
The RTX 5070 Ti looks particularly good, and I expect it to enable very fast fine-tuning and inference for small language models. NVIDIA even claims the RTX 5070 matches the performance of the RTX 4090, thanks largely to DLSS 4.
DIGITS: A Tiny DGX
Jensen Huang introduced the DIGITS project as a tiny DGX, a very powerful machine for AI. It looks good and affordable.
For $3,000, this machine offers a GPU with 128 GB of unified LPDDR5X (low-power) memory. While it’s not as fast as dedicated GPU memory (GDDR), it is significantly faster than offloading a model to CPU RAM.
Although 128 GB isn’t sufficient to fully load a 70B model in a 16-bit format, an FP8 version would fit comfortably. NVIDIA has also revealed that it will be possible to stack multiple units of this machine, though the maximum number of stackable units has not been specified yet.
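A quick weights-only check, ignoring the KV cache and activations, shows why FP8 is the natural fit for a 70B model on this machine.

```python
# Does a 70B model fit in DIGITS' 128 GB of unified memory?
# Weights only; the KV cache and activations need additional headroom.
GiB = 1024**3
n_params = 70e9
bytes_per_param_by_format = {"FP16/BF16": 2, "FP8": 1}

for fmt, bytes_per_param in bytes_per_param_by_format.items():
    print(f"{fmt}: {n_params * bytes_per_param / GiB:.0f} GiB")
# FP16/BF16: ~130 GiB -> more than the ~119 GiB that 128 GB provides.
# FP8: ~65 GiB -> fits comfortably, with room left for the KV cache.
```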
With a single machine like this, it will also be possible to fully fine-tune LLMs with up to 40B parameters, making it a very capable option for many AI practitioners.
The machine is set to be available starting in May.
Conclusion
I plan to benchmark all the RTX 50 cards for parameter-efficient fine-tuning (PEFT), quantization, and inference as soon as they become available on RunPod, which I anticipate will happen in February.
As for DIGITS, I’ll aim to get my hands on one as early as possible to create a hands-on video review for The Kaitchup.
With these devices offering more memory while remaining relatively affordable, there’s a strong possibility that LLM makers will start scaling up their mid-sized models. We might see the release of more LLMs in the 10B to 15B parameter range, filling a currently underserved segment in the market.