NVIDIA surpasses 1,000 TPS/user with Llama 4 Maverick and Blackwell GPUs

Lawrence Zenga
May 23, 2025 02:10

NVIDIA uses Blackwell GPUs and Llama 4 Maverick to achieve a world-record inference speed of over 1,000 tokens per second (TPS) per user, setting a new standard for AI model performance.

NVIDIA has set a new benchmark in AI performance, using the Llama 4 Maverick model and Blackwell GPUs to break the barrier of 1,000 tokens per second (TPS) per user. The achievement was independently verified by the AI benchmarking service Artificial Analysis and marks a significant milestone in large language model (LLM) inference speed.

Technological advancement

The breakthrough was achieved on a single NVIDIA DGX B200 node equipped with eight NVIDIA Blackwell GPUs, which delivered more than 1,000 TPS per user on Llama 4 Maverick, a 400-billion-parameter model. This performance makes Blackwell optimal hardware for deploying Llama 4, whether the goal is to maximize throughput or to minimize latency.
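
To put the headline figure in concrete terms, the short Python sketch below converts a per-user decode rate into the average gap between tokens and the time needed to stream a full response. The helper names, the 250 TPS/user comparison point, and the 2,000-token response length are illustrative assumptions, not figures from NVIDIA; the calculation also ignores prefill (time to first token) and network overhead.

```python
def per_token_latency_ms(tps_per_user: float) -> float:
    """Average time between consecutive output tokens for one user, in milliseconds."""
    return 1000.0 / tps_per_user


def response_time_s(num_tokens: int, tps_per_user: float) -> float:
    """Decode time for a response of num_tokens at a steady per-user rate.

    Ignores prefill (time to first token) and network overhead.
    """
    return num_tokens / tps_per_user


if __name__ == "__main__":
    # 1,000 TPS/user is the reported figure; 250 TPS/user is a hypothetical comparison.
    for rate in (1000.0, 250.0):
        print(f"{rate:>6.0f} TPS/user: {per_token_latency_ms(rate):.1f} ms/token, "
              f"{response_time_s(2000, rate):.1f} s for 2,000 tokens")
```

At 1,000 TPS/user each token arrives about 1 ms after the previous one, so a 2,000-token answer takes roughly 2 seconds of decode time, compared with 8 seconds at the hypothetical 250 TPS/user.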

Optimization

To fully utilize the Blackwell GPUs, NVIDIA implemented extensive software optimizations using TensorRT-LLM. The company also trained a speculative decoding draft model using EAGLE-3 techniques, achieving a fourfold speedup over the previous baseline. These improvements raise performance while preserving response accuracy: FP8 data types are used for GEMMs and Mixture of Experts (MoE) layers, keeping accuracy comparable to BF16 metrics.
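
The article does not describe how the FP8 path in TensorRT-LLM is implemented. As a rough illustration of why FP8 can stay close to higher-precision results, the following Python sketch simulates per-tensor scaling into the FP8 E4M3 range (largest finite value 448) followed by rounding to a 3-bit mantissa. Per-tensor scaling is an assumption made for simplicity, subnormals are ignored, and the function names are hypothetical.

```python
import math

E4M3_MAX = 448.0  # largest finite magnitude in the FP8 E4M3 format


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest value with 4 significant bits (1 implicit + 3 mantissa bits).

    Simplification: subnormals and the exponent-range floor are ignored.
    """
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)    # x == m * 2**e with 0.5 <= |m| < 1
    m = round(m * 16) / 16  # keep 4 significant bits, round half to even
    return math.ldexp(m, e)


def quantize_fp8(values):
    """Per-tensor scaling (assumed): map the largest |value| onto E4M3_MAX, then round and clamp."""
    amax = max(abs(v) for v in values)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    q = [max(-E4M3_MAX, min(E4M3_MAX, round_to_e4m3(v / scale))) for v in values]
    return q, scale


def dequantize_fp8(q, scale):
    """Recover approximate higher-precision values from the simulated FP8 codes and the scale."""
    return [v * scale for v in q]


if __name__ == "__main__":
    weights = [0.1, -2.5, 3.3, 100.0]  # hypothetical tensor values
    q, scale = quantize_fp8(weights)
    restored = dequantize_fp8(q, scale)
    for w, r in zip(weights, restored):
        print(f"{w:>8.3f} -> {r:>8.3f}  (relative error {abs(r - w) / abs(w):.2%})")
```

Under these simplifications every restored value is within 1/16 (6.25%) of the original, because each one keeps four significant bits after scaling.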

The importance of low latency

In generative AI applications, balancing throughput and latency is critical. For applications that require rapid decision-making, NVIDIA's Blackwell GPUs excel at minimizing latency, as the TPS/user record demonstrates. Hardware that delivers both high throughput and low latency is well suited to a wide range of AI tasks.

CUDA kernels and speculative decoding

NVIDIA optimized CUDA kernels for GEMM, MoE, and attention operations to maximize performance, using spatial partitioning and efficient memory data loading. Speculative decoding was used to accelerate LLM inference: a smaller, faster draft model proposes tokens that are then verified by the larger target LLM. This approach yields significant speedups, particularly when the draft model's predictions are accurate.
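
The draft-then-verify loop can be sketched in Python as follows. This is a simplified greedy variant of speculative decoding, not EAGLE-3 itself (EAGLE-3 drafts from the target model's internal features rather than from a fully separate model), and the toy `toy_draft` and `toy_target` functions are hypothetical stand-ins for real models. In a real system the target scores all drafted positions in a single forward pass; here it is called once per position for readability. Either way, the output is identical to plain greedy decoding with the target alone.

```python
from typing import Callable, List, Tuple

# Hypothetical model interface: given a token context, return the greedy next token.
Model = Callable[[List[int]], int]


def speculative_step(context: List[int], draft: Model, target: Model, k: int) -> List[int]:
    """One draft-then-verify round of greedy speculative decoding."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposal, ctx = [], list(context)
    for _ in range(k):
        token = draft(ctx)
        proposal.append(token)
        ctx.append(token)

    # 2. The target model checks each drafted position (one batched pass in practice).
    accepted, ctx = [], list(context)
    for token in proposal:
        expected = target(ctx)
        if token != expected:
            accepted.append(expected)  # first mismatch: keep the target's token and stop
            return accepted
        accepted.append(token)
        ctx.append(token)

    # 3. Every draft token was accepted, so the same target pass yields one bonus token.
    accepted.append(target(ctx))
    return accepted


def generate(prompt: List[int], draft: Model, target: Model, k: int,
             max_new: int) -> Tuple[List[int], int]:
    """Generate max_new tokens; also return how many verification rounds were needed."""
    out, rounds = list(prompt), 0
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(out, draft, target, k))
        rounds += 1
    return out[: len(prompt) + max_new], rounds


# Toy stand-ins: the target counts modulo 10; the draft agrees except after token 4.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    tokens, rounds = generate([0], toy_draft, toy_target, k=5, max_new=12)
    print(tokens[1:], rounds)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2] 3
```

Here the target needs only 3 verification rounds to produce 12 tokens instead of 12 sequential steps; the saving shrinks as the draft's guesses get worse, which is why draft accuracy matters.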

Programmatic dependent launch

To further improve performance, NVIDIA used Programmatic Dependent Launch (PDL) to reduce GPU idle time between consecutive CUDA kernels. PDL allows a kernel to begin launching before the preceding kernel has finished, improving GPU utilization and closing performance gaps.
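
PDL itself is a CUDA launch feature: a kernel launched with the `cudaLaunchAttributeProgrammaticStreamSerialization` attribute may start early and calls `cudaGridDependencySynchronize()` before reading its predecessor's output. The Python model that follows does not use that API. It is a toy timeline with hypothetical durations that shows where the saving comes from, under the assumptions that each kernel's independent prologue overlaps the previous kernel and that the launch gap is hidden entirely.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Kernel:
    prologue_us: float  # setup that does not read the previous kernel's output
    main_us: float      # work that must wait for the previous kernel to finish


def serial_time(kernels: List[Kernel], launch_gap_us: float) -> float:
    """Without PDL: each kernel starts after its predecessor finishes, plus an idle gap."""
    total = 0.0
    for i, k in enumerate(kernels):
        if i > 0:
            total += launch_gap_us
        total += k.prologue_us + k.main_us
    return total


def overlapped_time(kernels: List[Kernel]) -> float:
    """PDL-style model: kernel i+1 launches early, runs its prologue while kernel i
    is still in its main phase, and waits only before its own dependent main phase."""
    prev_main_start = 0.0
    prev_finish = 0.0
    for i, k in enumerate(kernels):
        prologue_start = 0.0 if i == 0 else prev_main_start
        main_start = max(prologue_start + k.prologue_us, prev_finish)
        prev_main_start, prev_finish = main_start, main_start + k.main_us
    return prev_finish


if __name__ == "__main__":
    chain = [Kernel(prologue_us=2.0, main_us=10.0) for _ in range(3)]  # hypothetical durations
    print(serial_time(chain, launch_gap_us=3.0))  # 42.0
    print(overlapped_time(chain))                 # 32.0
```

In this toy chain the idle gaps and all but the first prologue drop off the critical path, leaving one prologue plus the three dependent main phases.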

NVIDIA's achievement underscores its leadership in AI infrastructure and data center technology, setting a new standard for speed and efficiency in AI model deployment. Innovations in the Blackwell architecture, combined with software optimization, continue to push the boundaries of AI performance, enabling real-time user experiences and powerful AI applications.

For more information, visit the NVIDIA official blog.

Image source: Shutterstock
