Crypto Flexs
ADOPTION NEWS

NVIDIA launches Nemotron-CC, a large-scale dataset for LLM pre-training

By Crypto Flexs | January 10, 2025 | 3 Mins Read

Iris Coleman
January 10, 2025 14:13

NVIDIA has launched Nemotron-CC, a 6.3-trillion-token English dataset built with new data curation methods to support pre-training of large language models.





NVIDIA has announced the launch of Nemotron-CC, a 6.3-trillion-token English-language dataset designed to improve the pre-training of large language models (LLMs). The dataset, derived from Common Crawl, aims to increase the accuracy and efficiency of LLMs through new data curation techniques, including the use of 1.9 trillion tokens of synthetically generated data, according to NVIDIA.

Enhancing LLM pre-training

NVIDIA’s initiative addresses a critical need in LLM training, where the quality of the pre-training dataset plays a pivotal role. Recent models, such as Meta’s Llama series, have been trained on datasets of up to 15 trillion tokens, but the exact composition of those datasets is largely undisclosed. Nemotron-CC seeks to fill this gap by giving the wider community a high-quality dataset that can support both short and long token horizon training.

Existing datasets often discard up to 90% of their data to improve benchmark accuracy, which limits their usefulness for training over long token horizons. Nemotron-CC, by contrast, demonstrates how methods such as classifier ensembles and synthetic data rephrasing can turn Common Crawl data into a dataset on which trained models outperform even Llama 3.1 8B.

Key results

The efficacy of Nemotron-CC is demonstrated by its performance on a variety of benchmarks. When an 8B-parameter model is trained on 1 trillion tokens, the high-quality subset, Nemotron-CC-HQ, outperforms leading datasets such as DCLM, raising the MMLU score by 5.6 points. The full 6.3-trillion-token dataset matches DCLM on MMLU while providing four times more unique real tokens. This enables effective training over long token horizons: models trained on Nemotron-CC outperform Llama 3.1 8B on several metrics, including a 5-point gain on MMLU and a 3.1-point gain on ARC-Challenge.

Innovative data curation techniques

The development of Nemotron-CC involved several key insights. By combining different model-based classifiers in an ensemble, NVIDIA was able to select a wider range of high-quality tokens. Rephrasing techniques also reduce noise and errors, producing diverse and valuable variants of the data. The decision to disable traditional heuristic filters further improved the quality of the dataset without compromising accuracy.
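To make the ensemble idea concrete, here is a minimal Python sketch of how combining several quality scorers can admit documents that any single scorer would reject. The two scorers, the max-based combination rule, and the threshold are all illustrative assumptions; the article does not describe NVIDIA's actual classifiers or how their scores are combined.

```python
from typing import Callable, List

Scorer = Callable[[str], float]

# Hypothetical stand-ins for model-based quality classifiers.
# Each returns a score in [0, 1]; real classifiers would be trained models.
def length_score(text: str) -> float:
    """Toy scorer: documents score higher as they approach 50 words."""
    return min(len(text.split()) / 50.0, 1.0)

def vocabulary_score(text: str) -> float:
    """Toy scorer: fraction of distinct words in the document."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0

def ensemble_score(text: str, scorers: List[Scorer]) -> float:
    """Assumed combination rule: keep the most favorable score."""
    return max(scorer(text) for scorer in scorers)

def select_high_quality(docs: List[str], scorers: List[Scorer],
                        threshold: float) -> List[str]:
    """Keep documents whose ensemble score reaches the threshold."""
    return [d for d in docs if ensemble_score(d, scorers) >= threshold]

docs = ["spam spam spam spam", "a short but varied sentence"]
scorers = [length_score, vocabulary_score]
kept = select_high_quality(docs, scorers, threshold=0.5)
```

In this toy run, `length_score` alone would reject both documents, while the ensemble keeps the second one because `vocabulary_score` rates it highly. That is the sense in which an ensemble can select a wider range of tokens than a single classifier.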

NVIDIA leveraged the NeMo Curator tool to extract and refine data from Common Crawl, applying filters for language, deduplication, and quality classification. This process was complemented by synthetic data generation, contributing approximately 2 trillion tokens to the dataset.
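The three filtering stages named above (language, deduplication, quality classification) can be sketched in plain Python. This is not the NeMo Curator API: every function name here is hypothetical, the language check is a crude ASCII heuristic standing in for a language-identification model, deduplication is exact-match only, and the quality function is a toy placeholder.

```python
import hashlib
from typing import Callable, Iterable, Iterator

def is_probably_english(text: str) -> bool:
    """Placeholder language filter: at least 90% of letters are ASCII."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    ascii_letters = sum(1 for c in letters if c.isascii())
    return ascii_letters / len(letters) >= 0.9

def normalized_hash(text: str) -> str:
    """Hash of the lowercased, whitespace-collapsed text for exact dedup."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def toy_quality(text: str) -> float:
    """Placeholder quality score: saturates at five words."""
    return min(len(text.split()) / 5.0, 1.0)

def curate(docs: Iterable[str], quality_fn: Callable[[str], float],
           min_quality: float) -> Iterator[str]:
    """Apply language filter, then deduplication, then quality threshold."""
    seen = set()
    for doc in docs:
        if not is_probably_english(doc):
            continue  # stage 1: language filter
        key = normalized_hash(doc)
        if key in seen:
            continue  # stage 2: drop exact duplicates
        seen.add(key)
        if quality_fn(doc) >= min_quality:
            yield doc  # stage 3: quality classification

raw = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the lazy dog.",  # duplicate after normalization
    "Быстрая коричневая лиса прыгает через собаку.",  # not English
    "Too short.",                                     # fails the quality threshold
]
curated = list(curate(raw, toy_quality, min_quality=0.5))
```

Only the first document survives all three stages. A production pipeline would typically add fuzzy deduplication and trained classifiers, which this sketch leaves out.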

Future prospects

Nemotron-CC is positioned as a key resource for pre-training state-of-the-art LLMs across a range of token horizons. To further enhance LLM capabilities, NVIDIA plans to expand its offerings by releasing more specialized datasets, including ones focused on specific domains such as mathematics.

Image source: Shutterstock

