Nvidia Vera Rubin: 5x Faster Than Blackwell and 10x Cheaper Per Token

CNBC got an exclusive first look at Nvidia's Vera Rubin system at the company's Santa Clara headquarters. The next-generation AI platform, slated to reach customers in the second half of 2026, represents a generational leap over Blackwell in both raw performance and cost efficiency. The numbers are significant enough to reshape how AI infrastructure gets planned and deployed.
The Performance Numbers
The Rubin GPU delivers up to 50 PFLOPs of NVFP4 inference and 35 PFLOPs of training, roughly 5x and 3.5x higher than Blackwell respectively. At the system level, the Vera Rubin NVL72 configuration (72 GPUs and 36 CPUs connected through NVLink 6) reaches 3.6 EFLOPS of inference and 2.5 EFLOPS of training.
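The system-level figures are, to a close approximation, the per-GPU numbers scaled by the rack's 72 GPUs. A quick sanity check, using only the figures quoted above:

```python
# Scale the per-GPU figures to the 72-GPU NVL72 rack (1 EFLOPS = 1000 PFLOPs).
GPUS_PER_RACK = 72
inference_pflops_per_gpu = 50   # NVFP4 inference, per the article
training_pflops_per_gpu = 35    # NVFP4 training, per the article

rack_inference_eflops = GPUS_PER_RACK * inference_pflops_per_gpu / 1000
rack_training_eflops = GPUS_PER_RACK * training_pflops_per_gpu / 1000

print(rack_inference_eflops)  # 3.6  -> matches the quoted 3.6 EFLOPS
print(rack_training_eflops)   # 2.52 -> quoted as 2.5 EFLOPS
```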
But the raw FLOPS figures matter less than the efficiency gains. Nvidia claims a 10x reduction in inference token cost and a 4x reduction in the number of GPUs needed to train mixture-of-experts (MoE) models compared to Blackwell. For organizations running large language models at scale, that cost reduction directly translates to either lower operating expenses or the ability to serve significantly more users on the same hardware budget.
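To make the 10x claim concrete, here is an illustrative serving-cost sketch. The baseline dollar figure and monthly token volume are hypothetical placeholders chosen for the example, not Nvidia numbers; only the 10x ratio comes from the announcement.

```python
# Hypothetical example: the effect of a 10x cut in per-token inference cost.
baseline_cost_per_m_tokens = 2.00               # placeholder Blackwell-era cost, USD per 1M tokens
rubin_cost_per_m_tokens = baseline_cost_per_m_tokens / 10   # Nvidia's claimed 10x reduction

monthly_tokens = 500e9                          # placeholder workload: 500B tokens/month
baseline_monthly_usd = monthly_tokens / 1e6 * baseline_cost_per_m_tokens
rubin_monthly_usd = monthly_tokens / 1e6 * rubin_cost_per_m_tokens

print(baseline_monthly_usd, rubin_monthly_usd)  # 1000000.0 100000.0
```

The same budget could instead serve ten times the traffic, which is the trade-off described above.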
Architecture: Six Chips, One System
Vera Rubin is not just a new GPU. It is a co-designed platform built around six new chips: the Rubin GPU, Vera CPU, NVLink 6 Switch, ConnectX-9 SuperNIC, BlueField-4 DPU, and Spectrum-6 Ethernet Switch. Every component was designed to work together, which is where the efficiency gains come from.
The Rubin GPU itself is built on TSMC's 3nm process and packs 336 billion transistors across two reticle-sized dies. It moves to HBM4 memory with up to 288GB per GPU and 22 TB/s of bandwidth, nearly triple Blackwell's. NVLink 6 delivers 3.6 TB/s of bidirectional GPU-to-GPU bandwidth, double the previous generation.
The Vera CPU
On the CPU side, the Vera processor uses custom Arm-based "Olympus" cores with 88 cores and 176 threads via Nvidia's Spatial Multi-Threading technology. It supports up to 1.5TB of LPDDR5x memory with 1.2 TB/s of bandwidth. The full NVL72 system combines 20.7TB of HBM4 capacity with 54TB of LPDDR5x, totaling over 74TB of accessible memory.
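Those system totals follow from the per-device figures. A quick cross-check, using decimal terabytes as spec sheets do:

```python
# Cross-check the NVL72 memory totals from the per-device figures above.
gpus, cpus = 72, 36
hbm4_per_gpu_gb = 288          # per Rubin GPU
lpddr5x_per_cpu_tb = 1.5       # per Vera CPU

hbm4_total_tb = gpus * hbm4_per_gpu_gb / 1000
lpddr5x_total_tb = cpus * lpddr5x_per_cpu_tb
total_tb = hbm4_total_tb + lpddr5x_total_tb

print(hbm4_total_tb, lpddr5x_total_tb, round(total_tb, 3))
# 20.736 54.0 74.736 -> quoted as 20.7TB of HBM4, 54TB of LPDDR5x, "over 74TB"
```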
Power and Efficiency
Vera Rubin uses roughly twice the power of a Blackwell system but delivers 10x more performance per watt. This efficiency gain is critical because power availability is becoming the primary constraint on AI infrastructure expansion. Data centers are increasingly limited not by floor space or capital, but by how many megawatts they can draw from the grid.
A 10x improvement in performance per watt means organizations can get dramatically more compute from their existing power allocation, or achieve the same throughput with a fraction of the energy consumption.
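Taken together, the two power claims imply roughly 10x total throughput from a fixed grid allocation: each Rubin rack draws about twice the power but does about 2 x 10 = 20x the work, so a site fits half as many racks yet delivers ten times the compute. A back-of-envelope sketch, where the 20 MW site budget and 125 kW Blackwell rack draw are hypothetical placeholders rather than quoted figures:

```python
# Hypothetical fixed power budget; only the 2x power and 10x perf/watt
# ratios come from the article.
site_power_kw = 20_000.0
blackwell_rack_kw = 125.0
rubin_rack_kw = 2 * blackwell_rack_kw           # ~2x power per system

blackwell_rack_perf = 1.0                       # normalized throughput per rack
rubin_rack_perf = 2 * 10 * blackwell_rack_perf  # 2x power at 10x perf/watt

blackwell_site = (site_power_kw / blackwell_rack_kw) * blackwell_rack_perf
rubin_site = (site_power_kw / rubin_rack_kw) * rubin_rack_perf

print(blackwell_site, rubin_site)  # 160.0 1600.0 -> 10x from the same grid draw
```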
Supply Chain and Manufacturing
The Vera Rubin superchip, which combines two Rubin GPUs and one Vera CPU, contains roughly 17,000 components sourced from over 80 suppliers across at least 20 countries. TSMC fabricates the primary silicon, but the full system depends on a global supply chain spanning liquid cooling, power delivery, and high-bandwidth interconnects.
Availability
Nvidia CEO Jensen Huang confirmed the system is in full production. The first cloud providers to deploy Vera Rubin instances will be AWS, Google Cloud, Microsoft Azure, and Oracle Cloud Infrastructure, along with Nvidia Cloud Partners CoreWeave, Lambda, Nebius, and Nscale. General availability is expected in the second half of 2026.
Why This Matters for AI
The 10x reduction in cost per token is the most consequential number in the entire announcement. Training and inference costs are the primary bottleneck limiting how many organizations can build and deploy large AI models. When the cost of running a model drops by an order of magnitude, applications that were previously economically unviable become feasible.
This has downstream effects across every AI application category: video generation gets cheaper per frame, language models can serve more concurrent users, and multimodal systems that combine text, image, and video processing become more practical at scale. The hardware improvements in Vera Rubin do not just make existing workloads faster. They expand the set of workloads that are economically possible.


