Optimizing Parquet String Data Compression with RAPIDS

By Crypto Flexs, July 17, 2024
Jessie A Ellis
17 Jul 2024 17:53

Learn how to tune the encoding and compression of Parquet string data with RAPIDS to reduce file size and improve read and write performance.





Parquet writers offer a variety of encoding and compression options that are turned off by default. Enabling them can yield better lossless compression, but it is important to understand which options suit your data, according to the NVIDIA Tech Blog.

Understanding Parquet Encoding and Compression

Parquet’s encoding phase reorganizes the data to reduce its size while preserving access to each data point. The compression phase further reduces the total size in bytes, but requires decompression before the data can be accessed again. The Parquet format includes two delta encodings designed to optimize the storage of string data: DELTA_LENGTH_BYTE_ARRAY (DLBA) and DELTA_BYTE_ARRAY (DBA).
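To make the two delta encodings concrete, here is a minimal pure-Python sketch of the idea behind each. It is an illustration only, not the libcudf or arrow-cpp implementation: the function names are hypothetical, and the integer lists are left as plain Python lists rather than being delta-encoded and bit-packed as a real Parquet writer would do. DLBA stores all string lengths together followed by the concatenated bytes; DBA stores, for each value, the length of the prefix it shares with the previous value plus the remaining suffix.

```python
def dlba_encode(values):
    """DELTA_LENGTH_BYTE_ARRAY idea: lengths in one list, bytes in one buffer."""
    lengths = [len(v) for v in values]
    return lengths, b"".join(values)


def dlba_decode(lengths, data):
    """Rebuild the values by slicing the buffer with the stored lengths."""
    out, pos = [], 0
    for n in lengths:
        out.append(data[pos:pos + n])
        pos += n
    return out


def dba_encode(values):
    """DELTA_BYTE_ARRAY idea: shared-prefix length with the previous value, plus suffix."""
    prefix_lengths, suffixes = [], []
    prev = b""
    for v in values:
        n = 0
        limit = min(len(prev), len(v))
        while n < limit and prev[n] == v[n]:
            n += 1
        prefix_lengths.append(n)
        suffixes.append(v[n:])
        prev = v
    return prefix_lengths, suffixes


def dba_decode(prefix_lengths, suffixes):
    """Rebuild each value from the previous value's prefix and its own suffix."""
    out, prev = [], b""
    for n, s in zip(prefix_lengths, suffixes):
        prev = prev[:n] + s
        out.append(prev)
    return out


words = [b"apple", b"applesauce", b"apply", b"banana"]
lengths, data = dlba_encode(words)
prefix_lengths, suffixes = dba_encode(words)
print(lengths, data)
print(prefix_lengths, suffixes)
```

Both encodings are lossless: decoding returns the original list, and every value remains individually addressable once the lengths are known.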

RAPIDS libcudf and cudf.pandas

RAPIDS is a collection of open source accelerated data science libraries. In this context, libcudf is a CUDA C++ library for tabular data processing. It supports GPU-accelerated readers, writers, relational algebra functions, and columnar transformations. The Python cudf.pandas library accelerates existing pandas code by up to 150x.

Benchmarking using Kaggle String Data

The NVIDIA study compared encoding and compression methods using a dataset of 149 string columns containing 12 billion total characters, with a total file size of 4.6 GB. It found that the encoded size difference between libcudf and arrow-cpp was less than 1%, and that file size was 3-8% larger with the ZSTD implementation in nvCOMP 3.0.6 than with libzstd 1.4.8+dfsg-3build1.

String encoding in Parquet

Parquet string data is represented using a byte array physical type. Most writers default to RLE_DICTIONARY encoding for string data, which uses dictionary pages to map string values to integers. When dictionary pages become too large, writers fall back to PLAIN encoding.
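The dictionary-with-fallback behavior can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: `dictionary_encode` and `max_dict_bytes` are hypothetical names, the size limit is counted in raw string bytes, and the sketch abandons the dictionary for the whole column, whereas real writers apply their own page-level limits and also run-length encode the resulting indices.

```python
def dictionary_encode(values, max_dict_bytes):
    """Map each distinct string to an integer index.

    Returns (dictionary, indices), or None when the dictionary would exceed
    max_dict_bytes, signalling that the caller should store the values plainly.
    """
    dictionary = []   # distinct values, in first-seen order
    index_of = {}     # value -> position in dictionary
    indices = []      # one integer per input value
    dict_bytes = 0
    for v in values:
        if v not in index_of:
            dict_bytes += len(v)
            if dict_bytes > max_dict_bytes:
                return None  # dictionary too large: fall back to plain storage
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])
    return dictionary, indices


cities = [b"Oslo", b"Lima", b"Oslo", b"Oslo", b"Lima"]
encoded = dictionary_encode(cities, max_dict_bytes=1024)
too_small = dictionary_encode(cities, max_dict_bytes=4)
print(encoded)
print(too_small)
```

Low-cardinality columns benefit most, since a handful of dictionary entries replaces many repeated strings; high-cardinality columns tend to hit the size limit and lose that benefit.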

Total file size based on encoding and compression

For the 149 string columns in the dataset, the default settings of dictionary encoding and SNAPPY compression produce a total file size of 4.6 GB. ZSTD compression yields smaller files than SNAPPY, and both yield smaller files than the uncompressed option. The best single setting for the dataset is default (dictionary) encoding with ZSTD compression, and delta encoding can provide additional reduction under certain conditions.

When to choose delta encoding

Delta encoding is useful for data with high cardinality or long string lengths, and typically achieves smaller file sizes in those cases. For string columns with fewer than 50 characters, DBA encoding can provide significant file size reductions, especially for sorted or semi-sorted data.
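The effect of sort order on prefix-based encoding can be illustrated with a small, self-contained Python experiment. It is a rough sketch, not a measurement of any Parquet writer: the data is synthetic, the helper names are hypothetical, and the count covers only the suffix bytes that remain after removing each value's shared prefix with its predecessor, ignoring the cost of storing the prefix and suffix lengths.

```python
import random


def common_prefix_len(a, b):
    """Number of leading bytes that a and b have in common."""
    n = 0
    limit = min(len(a), len(b))
    while n < limit and a[n] == b[n]:
        n += 1
    return n


def suffix_bytes(values):
    """Bytes left to store after dropping each value's prefix shared with the previous one."""
    total, prev = 0, b""
    for v in values:
        total += len(v) - common_prefix_len(prev, v)
        prev = v
    return total


# Synthetic high-cardinality column of short strings (an assumption for illustration).
ids = [f"user-{i:04d}".encode() for i in range(1000)]
shuffled = ids[:]
random.Random(0).shuffle(shuffled)

plain_bytes = sum(len(v) for v in ids)
sorted_suffix_bytes = suffix_bytes(ids)
shuffled_suffix_bytes = suffix_bytes(shuffled)
print(plain_bytes, sorted_suffix_bytes, shuffled_suffix_bytes)
```

In the sorted order, neighboring values share long prefixes, so far fewer suffix bytes remain than in the shuffled order, which is consistent with the observation that DBA helps most on sorted or semi-sorted data.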

Reader and writer performance

The GPU-accelerated cudf.pandas library showed impressive performance compared to pandas, with Parquet reads being 17-25x faster. Using cudf.pandas with an RMM pool further improved throughput to 552 MB/s read and 263 MB/s write.

Conclusion

RAPIDS libcudf provides flexible GPU-accelerated tools for reading and writing columnar data in formats such as Parquet, ORC, JSON, and CSV. For those looking to leverage GPU acceleration for Parquet processing, RAPIDS cudf.pandas and libcudf offer significant performance benefits.
