NVIDIA NVLink and NVSwitch Improve Large-Scale Language Model Inference

Felix Pinkston
13 Aug 2024 07:49

NVIDIA’s NVLink and NVSwitch technologies enhance large-scale language model inference, enabling faster and more efficient multi-GPU processing.

As large language models (LLMs) grow in size, so does the computing power needed to serve inference requests. According to the NVIDIA Technical Blog, multi-GPU computing is essential to meet real-time latency requirements and serve a growing number of users.

Benefits of Multi-GPU Computing

Even if a large model fits in the memory of a single modern GPU, the speed at which tokens are generated depends on the total compute power available. Combining multiple cutting-edge GPUs makes real-time user experiences possible. Techniques such as tensor parallelism (TP) speed up the processing of each inference request, and choosing the number of GPUs for each model carefully lets operators balance user experience against cost.

Multi-GPU Inference: Communication Intensive

Multi-GPU TP inference splits the computation of each model layer across multiple GPUs. Before moving on to the next layer, the GPUs must exchange their partial results, which requires extensive communication. The speed of this communication matters because Tensor Cores often sit idle while waiting for data. For example, a single query on Llama 3.1 70B can require up to 20 GB of data transfer per GPU, highlighting the need for a high-bandwidth interconnect.
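
To make the communication step concrete, here is a minimal pure-Python sketch of one tensor-parallel layer. It is an illustration under stated assumptions, not NVIDIA's implementation: the "GPUs" are simulated as list slices, the weight matrix is sharded by columns, and the helper names (matmul, split_columns, tensor_parallel_layer) are hypothetical.

```python
def matmul(x, w):
    """Multiply a vector x (length d_in) by a matrix w (d_in rows, d_out columns)."""
    d_out = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(d_out)]


def split_columns(w, num_gpus):
    """Shard w column-wise into num_gpus equal slices, one per simulated GPU."""
    d_out = len(w[0])
    assert d_out % num_gpus == 0, "columns must divide evenly across GPUs"
    width = d_out // num_gpus
    return [[row[g * width:(g + 1) * width] for row in w] for g in range(num_gpus)]


def tensor_parallel_layer(x, w, num_gpus):
    """Compute x @ w with the weight columns spread over num_gpus simulated GPUs."""
    shards = split_columns(w, num_gpus)
    # Each simulated GPU multiplies the input by its own slice of the weights.
    # On real hardware these multiplications would run at the same time.
    partial_outputs = [matmul(x, shard) for shard in shards]
    # Communication step (an all-gather): every GPU needs the complete output
    # before the next layer can start, so the slices must cross the interconnect.
    return [value for part in partial_outputs for value in part]


# Toy input: 4 features in, 8 features out, sharded across 4 simulated GPUs.
x = [1, 2, 3, 4]
w = [[i + j for j in range(8)] for i in range(4)]

reference = matmul(x, w)
sharded = tensor_parallel_layer(x, w, num_gpus=4)
print("outputs match:", reference == sharded)
```

The sharded result equals the single-device result, but only after the gather step. In a real model that exchange happens at every layer, which is why interconnect bandwidth limits how busy the Tensor Cores can stay.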

NVSwitch: The Key to Fast Multi-GPU LLM Inference

Effective multi-GPU scaling requires high interconnect bandwidth per GPU and fast connectivity between all GPUs. NVIDIA Hopper architecture GPUs with fourth-generation NVLink can communicate at 900 GB/s. When combined with NVSwitch, all GPUs in a server can communicate at this speed simultaneously, so communication is non-blocking. Systems such as the NVIDIA HGX H100 and HGX H200, which include multiple NVSwitch chips, make this bandwidth available across the whole server, improving overall performance.

Performance Comparison

Without NVSwitch, each GPU has to split its bandwidth across multiple point-to-point connections, so the bandwidth available between any given pair shrinks as more GPUs are involved. For example, a point-to-point architecture provides only 128 GB/s of bandwidth between two GPUs, while NVSwitch provides 900 GB/s. This difference has a significant impact on overall inference throughput and user experience. A table in the original NVIDIA blog post shows the bandwidth and throughput advantages of NVSwitch over point-to-point connections.
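
As a rough back-of-the-envelope check, the figures quoted in this article can be combined to estimate how long the per-GPU transfer would take on each interconnect. The sketch below is an idealized calculation, not a benchmark: it assumes the full 20 GB moves at the quoted peak bandwidth and ignores latency, protocol overhead, and any overlap of communication with computation. The function name transfer_seconds is hypothetical.

```python
def transfer_seconds(data_gb, bandwidth_gb_per_s):
    """Ideal time to move data_gb gigabytes at a sustained bandwidth in GB/s."""
    return data_gb / bandwidth_gb_per_s


# Figures quoted in the article.
DATA_PER_GPU_GB = 20.0        # up to 20 GB per GPU for one Llama 3.1 70B query
POINT_TO_POINT_GB_S = 128.0   # bandwidth between two GPUs without NVSwitch
NVSWITCH_GB_S = 900.0         # bandwidth between two GPUs with NVSwitch

p2p_time = transfer_seconds(DATA_PER_GPU_GB, POINT_TO_POINT_GB_S)
switch_time = transfer_seconds(DATA_PER_GPU_GB, NVSWITCH_GB_S)

print(f"point-to-point: {p2p_time * 1000:.1f} ms")
print(f"NVSwitch:       {switch_time * 1000:.1f} ms")
print(f"ratio:          {p2p_time / switch_time:.2f}x")
```

Under these assumptions the transfer takes roughly 156 ms over the point-to-point link and roughly 22 ms through NVSwitch, a gap of about 7x in communication time alone. Measured end-to-end throughput differs from this ideal, so the table in the original blog remains the reference for real numbers.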

Future Innovation

NVIDIA continues to push the boundaries of real-time inference performance with its NVLink and NVSwitch technologies. The upcoming NVIDIA Blackwell architecture features fifth-generation NVLink, which doubles per-GPU bandwidth to 1,800 GB/s. Additionally, new NVSwitch chips and NVLink switch trays support larger NVLink domains, further improving performance on trillion-parameter models.

The NVIDIA GB200 NVL72 system, which combines 36 NVIDIA Grace CPUs and 72 NVIDIA Blackwell GPUs, is a good example of this advancement. It allows all 72 GPUs to operate as a single device, achieving real-time trillion-parameter inference that is 30x faster than the previous generation.
