
NVIDIA Improves Llama 3.1 405B Performance with TensorRT Model Optimizations


Lawrence Jengar
29 Aug 2024 16:10

NVIDIA’s TensorRT Model Optimizer significantly improves the performance of Meta’s Llama 3.1 405B large language model on H200 GPUs.





According to the NVIDIA Tech Blog, Meta’s Llama 3.1 405B large language model (LLM) is reaching new levels of performance thanks to NVIDIA’s TensorRT Model Optimizer. These improvements result in up to a 1.44x increase in throughput when running on NVIDIA H200 GPUs.

Outstanding Llama 3.1 405B inference throughput using TensorRT-LLM

TensorRT-LLM has already delivered impressive inference throughput for Llama 3.1 405B since the model was released. This was achieved through a variety of optimizations, including in-flight batching, KV caching, and optimized attention kernels. These techniques accelerate inference while leveraging lower-precision compute.
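
In practice, these runtime optimizations are applied automatically once a model is served through TensorRT-LLM; the developer mostly interacts with a high-level generate-style interface. The sketch below is illustrative only: it assumes the Python LLM API shipped in recent TensorRT-LLM releases, and the model path, parallelism setting, and attribute names are assumptions that may differ by version.

```python
# Hedged sketch: serving a Llama 3.1 checkpoint through TensorRT-LLM's
# high-level LLM API. In-flight batching, KV caching, and fused attention
# kernels are handled by the runtime; nothing below enables them explicitly.
# Class and argument names are assumed from recent releases and may differ.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-405B-Instruct",  # assumed HF ID or local path
    tensor_parallel_size=8,  # 405B requires multi-GPU tensor parallelism
)

params = SamplingParams(max_tokens=128, temperature=0.7)
outputs = llm.generate(["Explain in-flight batching in one sentence."], params)
print(outputs[0].outputs[0].text)  # first generation for the first prompt
```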

TensorRT-LLM adds support for the official Llama FP8 quantization recipe, which computes static and dynamic scaling factors to maintain maximum accuracy. Additionally, custom kernels, such as matrix multiplication in FBGEMM, are optimized via plugins inserted into the network graph at compile time.
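
As a rough illustration of what those scaling factors are (this is not Meta’s or NVIDIA’s exact recipe), a per-tensor FP8 E4M3 scale can be derived from the largest magnitude the tensor is expected to hold: statically from calibration data, or dynamically from the tensor itself at runtime.

```python
# Illustrative sketch of per-tensor FP8 (E4M3) scaling factors; not the
# official Llama FP8 recipe, just the underlying idea.
import torch

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def static_scale(calibration_batches) -> float:
    # Fixed ahead of time from the max magnitude seen during calibration,
    # then reused unchanged at inference time.
    amax = max(batch.abs().max().item() for batch in calibration_batches)
    return amax / E4M3_MAX

def dynamic_scale(x: torch.Tensor) -> float:
    # Recomputed on the fly from the current tensor.
    return x.abs().max().item() / E4M3_MAX

def to_fp8(x: torch.Tensor, scale: float) -> torch.Tensor:
    # Scale into the representable range and cast; dequantize later by
    # multiplying the FP8 values back by `scale`.
    return (x / scale).to(torch.float8_e4m3fn)
```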

Up to 1.44x performance improvement with TensorRT Model Optimizer

NVIDIA’s custom FP8 post-training quantization (PTQ) recipe, available via the TensorRT Model Optimizer library, improves Llama 3.1 405B throughput and reduces latency without sacrificing accuracy. The recipe reduces inference compute overhead by integrating FP8 KV cache quantization with self-attention static quantization.
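
For orientation, a post-training quantization flow with the Model Optimizer library (nvidia-modelopt) generally looks like the sketch below: load the model, run a small calibration loop so scaling statistics can be collected, then quantize and export for TensorRT-LLM engine building. The config name, calibration prompts, and KV-cache handling shown here are assumptions and should be checked against the modelopt documentation for your version.

```python
# Hedged sketch of FP8 post-training quantization with TensorRT Model
# Optimizer (nvidia-modelopt). Names and options are assumptions; consult
# the modelopt docs for the exact recipe, including FP8 KV-cache settings.
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-405B-Instruct"  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

def calibrate(m):
    # Push a small calibration set through the model so amax statistics
    # (and therefore static scaling factors) can be collected.
    for prompt in ["The capital of France is", "FP8 inference reduces"]:
        m(**tokenizer(prompt, return_tensors="pt").to(m.device))

# Quantize weights/activations to FP8 with the library's default FP8 config;
# the calibrated model is then exported as a TensorRT-LLM checkpoint.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop=calibrate)
```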

Table 1 shows the peak throughput performance, with significant improvements across a range of input and output sequence lengths on an 8-GPU HGX H200 system. The system features eight NVIDIA H200 Tensor Core GPUs, each with 141 GB of HBM3e memory, and four NVLink switches providing 900 GB/s of GPU-to-GPU bandwidth.








Maximum throughput performance – output tokens/sec
8 NVIDIA H200 Tensor Core GPUs

Input | Output Sequence Length      2,048 | 128    32,768 | 2,048    120,000 | 2,048
TensorRT Model Optimizer FP8        463.1          320.1             71.5
Official Llama FP8 Recipe           399.9          230.8             49.6
Speedup                             1.16x          1.39x             1.44x

Table 1. Maximum throughput performance of Llama 3.1 405B based on NVIDIA internal measurements.

Similarly, Table 2 shows the minimum latency performance using the same input and output sequence lengths.








Batch size = 1 performance – output tokens/sec
8 NVIDIA H200 Tensor Core GPUs

Input | Output Sequence Length      2,048 | 128    32,768 | 2,048    120,000 | 2,048
TensorRT Model Optimizer FP8        49.6           44.2              27.2
Official Llama FP8 Recipe           37.4           33.1              22.8
Speedup                             1.33x          1.33x             1.19x

Table 2. Minimum latency performance of Llama 3.1 405B based on NVIDIA internal measurements.

These results demonstrate that H200 GPUs with TensorRT-LLM and TensorRT Model Optimizer deliver superior performance in both latency-optimized and throughput-optimized scenarios. The TensorRT Model Optimizer FP8 recipe also achieves accuracy comparable to the official Llama 3.1 FP8 recipe on the Massive Multitask Language Understanding (MMLU) and MT-Bench benchmarks.

Fitting Llama 3.1 405B on two H200 GPUs with INT4 AWQ

For developers with hardware resource constraints, the INT4 AWQ technique in the TensorRT Model Optimizer compresses the model so that Llama 3.1 405B fits on just two H200 GPUs. The method compresses weights to 4-bit integers while keeping activations in FP16, significantly reducing the required memory footprint.
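
To see why two GPUs are enough, a back-of-the-envelope sketch helps: counting only weight storage (ignoring KV cache, activations, and runtime buffers), 405 billion parameters at 4 bits each need roughly 203 GB, which fits within the roughly 282 GB of combined HBM3e on two H200 GPUs. The figures below are weight-only estimates, not measured memory usage.

```python
# Weight-only memory estimate for Llama 3.1 405B at different precisions.
# KV cache, activations, and runtime buffers add memory on top of this.
PARAMS = 405e9          # parameter count
H200_HBM_GB = 141       # HBM3e capacity per H200 GPU
GB = 1e9

for precision, bytes_per_weight in [("FP16", 2.0), ("FP8", 1.0), ("INT4", 0.5)]:
    weight_gb = PARAMS * bytes_per_weight / GB
    gpus = weight_gb / H200_HBM_GB
    print(f"{precision}: ~{weight_gb:.0f} GB of weights "
          f"(~{gpus:.1f} H200s for weights alone)")

# INT4: ~203 GB of weights vs. ~282 GB of combined HBM3e on two H200s,
# leaving headroom for the FP16 activations and the KV cache.
```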

Tables 4 and 5 show the maximum throughput and minimum latency performance measurements. The INT4 AWQ recipe also provides accuracy scores similar to Meta’s official Llama 3.1 FP8 recipe.






Maximum throughput performance – output tokens/sec
2x NVIDIA H200 Tensor Core GPUs

Input | Output Sequence Length        2,048 | 128    32,768 | 2,048    60,000 | 2,048
TensorRT Model Optimizer INT4 AWQ     75.6           28.7              16.2

Table 4. Maximum throughput performance of Llama 3.1 405B based on NVIDIA internal measurements.






Batch size = 1 performance – output tokens/sec
2x NVIDIA H200 Tensor Core GPUs

Input | Output Sequence Length        2,048 | 128    32,768 | 2,048    60,000 | 2,048
TensorRT Model Optimizer INT4 AWQ     21.6           18.7              12.8

Table 5. Minimum latency performance of Llama 3.1 405B based on NVIDIA internal measurements.

Advances in NVIDIA’s TensorRT Model Optimizer and TensorRT-LLM pave the way for improved performance and efficiency in running large language models such as Llama 3.1 405B. These improvements provide developers with more flexibility and cost-effectiveness, whether they have extensive hardware resources or more limited environments.

Image source: Shutterstock

