NVIDIA launches Nemotron-CC, a large-scale dataset for LLM pre-training

Iris Coleman
January 10, 2025 14:13

NVIDIA has launched Nemotron-CC, a 6.3-trillion-token English dataset for pre-training large language models, built with new data curation methods.

NVIDIA has announced the launch of Nemotron-CC, a 6.3-trillion-token English-language dataset designed to improve the pre-training of large language models (LLMs). Derived from Common Crawl, the dataset aims to increase the accuracy and efficiency of LLMs through new data curation techniques, including the use of 1.9 trillion tokens of synthetically generated data, according to NVIDIA.

Enhancing LLM pre-training

NVIDIA’s initiative addresses a critical need in LLM training, where the quality of the pre-training dataset plays a pivotal role. Recent models, such as Meta’s Llama series, have been trained on datasets of up to 15 trillion tokens, but the exact composition of these datasets is largely unknown. Nemotron-CC seeks to fill this gap by giving the wider community a high-quality dataset that supports training over both short and long token horizons.

Existing datasets often discard up to 90% of their data to improve benchmark accuracy, which limits their usefulness for long training runs. Nemotron-CC, by contrast, shows how methods such as classifier ensembling and synthetic data rephrasing can turn Common Crawl data into a superior dataset, one on which a trained model outperforms even Llama 3.1 8B.

Key results

The efficacy of Nemotron-CC is demonstrated by its performance on a range of benchmarks. When an 8B-parameter model is trained on 1 trillion tokens, the high-quality subset, Nemotron-CC-HQ, outperforms leading datasets such as DCLM, raising the MMLU score by 5.6 points. The full 6.3-trillion-token dataset matches DCLM on MMLU while providing four times more unique real tokens. This enables effective training over long token horizons: a model trained on Nemotron-CC outperforms Llama 3.1 8B on several metrics, including a 5-point gain on MMLU and a 3.1-point gain on ARC-Challenge.

Innovative data curation techniques

The development of Nemotron-CC rested on several key insights. By ensembling different model-based classifiers, NVIDIA was able to select a broader and more diverse set of high-quality tokens. Rephrasing techniques also reduce noise and errors and produce diverse, valuable variants of the data. Disabling traditional heuristic filters further improved the quality of the dataset without compromising accuracy.
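
To illustrate why combining classifiers can widen the selection, here is a minimal sketch in Python. It is not NVIDIA's implementation: the two scorers are toy stand-ins for model-based quality classifiers, and combining their scores by taking the maximum is an assumed rule chosen for illustration. All function names and the threshold are hypothetical.

```python
from typing import Callable, List

# A scorer maps a document to a quality score in [0, 1].
# Both scorers below are toy stand-ins, not real quality classifiers.
Scorer = Callable[[str], float]


def length_scorer(text: str) -> float:
    """Toy scorer: longer documents score higher, capped at 1.0."""
    return min(len(text.split()) / 50.0, 1.0)


def vocabulary_scorer(text: str) -> float:
    """Toy scorer: fraction of distinct words in the document."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0


def ensemble_score(text: str, scorers: List[Scorer]) -> float:
    """Combine scores by taking the maximum (an assumed ensembling rule)."""
    return max(scorer(text) for scorer in scorers)


def select_high_quality(
    docs: List[str], scorers: List[Scorer], threshold: float = 0.8
) -> List[str]:
    """Keep documents that at least one scorer rates at or above the threshold."""
    return [doc for doc in docs if ensemble_score(doc, scorers) >= threshold]


if __name__ == "__main__":
    docs = ["spam spam spam spam", "a short but varied sentence", ""]
    print(select_high_quality(docs, [length_scorer, vocabulary_scorer]))
```

Because a document passes when any one scorer rates it highly, the ensemble keeps everything each individual classifier would keep, which is one way a combination can yield more selected tokens than a single classifier.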

NVIDIA leveraged the NeMo Curator tool to extract and refine data from Common Crawl, applying filters for language, deduplication, and quality classification. This process was complemented by synthetic data generation, contributing approximately 2 trillion tokens to the dataset.
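
The three filtering stages named above can be sketched in plain Python as follows. This is a simplified illustration and does not use or reproduce the NeMo Curator API: the ASCII-ratio check stands in for a real language-identification model, the hash-based step covers only exact duplicates, and the word-count rule stands in for a trained quality classifier. Every function name and threshold here is an assumption.

```python
import hashlib
from typing import Iterable, Iterator, List, Tuple


def looks_english(text: str, min_ascii_ratio: float = 0.9) -> bool:
    """Crude stand-in for language identification: mostly ASCII characters."""
    if not text:
        return False
    ascii_chars = sum(1 for ch in text if ch.isascii())
    return ascii_chars / len(text) >= min_ascii_ratio


def dedup_exact(docs: Iterable[str]) -> Iterator[str]:
    """Drop exact duplicates after lowercasing and whitespace normalization."""
    seen = set()
    for doc in docs:
        normalized = " ".join(doc.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


def quality_label(text: str, min_words: int = 8) -> str:
    """Toy quality classifier: label by word count only."""
    return "high" if len(text.split()) >= min_words else "low"


def curate(docs: Iterable[str]) -> List[Tuple[str, str]]:
    """Run language filtering, deduplication, then quality labeling."""
    english = (doc for doc in docs if looks_english(doc))
    unique = dedup_exact(english)
    return [(doc, quality_label(doc)) for doc in unique]


if __name__ == "__main__":
    raw = [
        "The quick brown fox jumps over the lazy dog today.",
        "the quick  brown fox jumps over the lazy dog today.",
        "Too short.",
    ]
    for doc, label in curate(raw):
        print(label, "|", doc)
```

Ordering the stages from cheapest to most expensive is a common design choice, since each stage shrinks the set the next one has to process.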

Future prospects

Nemotron-CC is positioned as a valuable resource for pre-training state-of-the-art LLMs across a range of token horizons. To further enhance LLM capabilities, NVIDIA plans to expand its offerings by releasing more specialized datasets, including ones focused on specific domains such as mathematics.

