Timothy Morano
May 20, 2025 04:25
Anyscale introduces a hash-based shuffle backend in Ray Data to improve the performance of joins, repartitioning, and aggregations. Discover the developments in the Ray 2.46 release.
According to Anyscale, the company has announced significant improvements to Ray Data, highlighted by the introduction of a hash-based shuffle backend. This new feature, part of the Ray 2.46 release, aims to reduce memory pressure while improving the performance of joins, repartitioning, and aggregations.
Improvements in Ray Data
The latest release brings several new features, including native join support through the ds.join() API, key-based repartitioning, and a simplified custom aggregation API, AggregateFnV2. In addition, the performance of large-scale sorts is improved, thanks to enhancements to the range-partitioning shuffle.
The newly introduced hash-based shuffle backend addresses the scalability limitations of the range-based shuffle approach. In previous versions, shuffling relied on range partitioning, which was resource-intensive and prone to skew. The new method hash-partitions the data on a tuple of key values, splitting incoming data blocks and directing the pieces to the corresponding aggregator actors for efficient processing.
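The routing step described above can be illustrated with a minimal, standard-library sketch. It is not Ray's implementation: the function names, the use of CRC32 as the hash, and the plain lists standing in for aggregator actors are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict


def partition_id(row, key_columns, num_partitions):
    """Map a row to a partition using the hash of its key-value tuple."""
    key = tuple(row[col] for col in key_columns)
    # CRC32 is stable across processes, unlike the built-in hash() for strings.
    digest = zlib.crc32(repr(key).encode("utf-8"))
    return digest % num_partitions


def shuffle_blocks(blocks, key_columns, num_partitions):
    """Split each incoming block and route its rows to the owning aggregator."""
    aggregators = defaultdict(list)  # partition id -> rows received so far
    for block in blocks:
        for row in block:
            pid = partition_id(row, key_columns, num_partitions)
            aggregators[pid].append(row)
    return dict(aggregators)


# Two incoming blocks; rows sharing a "user" key must meet in one partition.
blocks = [
    [{"user": "a", "amount": 1}, {"user": "b", "amount": 2}],
    [{"user": "a", "amount": 3}, {"user": "c", "amount": 4}],
]
partitions = shuffle_blocks(blocks, key_columns=("user",), num_partitions=4)
```

Because the partition depends only on the key tuple, no sampling of the data is needed to choose partition boundaries, which is the main contrast with range partitioning.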
Implementing hash shuffle and joins
Ray 2.46 introduces support for various join types, including inner, left and right outer, and full outer joins. The hash shuffle backend optimizes performance by co-locating records that share the same key. This approach uses Apache Arrow's Acero engine through PyArrow's native Table.join operation, which can be memory-intensive but is effective.
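To see why co-locating records with the same key matters, consider the following hedged sketch: once both inputs are hash-partitioned on the join key, each partition can be joined independently. The code uses plain Python dictionaries rather than Arrow tables, and the function names and join-type strings are hypothetical, chosen only for this illustration.

```python
import zlib
from collections import defaultdict


def partition_of(key, num_partitions):
    """Stable hash partitioning of a single join key."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_partitions


def co_partition(rows, on, num_partitions):
    """Group rows by the partition that owns their join key."""
    parts = defaultdict(list)
    for row in rows:
        parts[partition_of(row[on], num_partitions)].append(row)
    return parts


def join_partition(left, right, on, join_type):
    """Join two co-located partitions.

    Unmatched rows are emitted without the other side's columns; a real
    engine would fill those columns with nulls instead.
    """
    right_by_key = defaultdict(list)
    for rrow in right:
        right_by_key[rrow[on]].append(rrow)
    out, matched_keys = [], set()
    for lrow in left:
        matches = right_by_key.get(lrow[on], [])
        if matches:
            matched_keys.add(lrow[on])
            out.extend({**lrow, **rrow} for rrow in matches)
        elif join_type in ("left_outer", "full_outer"):
            out.append(dict(lrow))
    if join_type in ("right_outer", "full_outer"):
        out.extend(dict(r) for r in right if r[on] not in matched_keys)
    return out


def hash_join(left, right, on, join_type="inner", num_partitions=4):
    """Partition both sides on the key, then join partition by partition."""
    lparts = co_partition(left, on, num_partitions)
    rparts = co_partition(right, on, num_partitions)
    result = []
    for pid in range(num_partitions):
        result.extend(
            join_partition(lparts.get(pid, []), rparts.get(pid, []), on, join_type)
        )
    return result


orders = [{"id": 1, "item": "pen"}, {"id": 2, "item": "ink"}, {"id": 4, "item": "cap"}]
users = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}, {"id": 3, "name": "Cy"}]
inner = hash_join(orders, users, on="id", join_type="inner")
full = hash_join(orders, users, on="id", join_type="full_outer")
```

Each per-partition join only ever sees the rows for its own keys, so no partition needs the full dataset in memory at once.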
Benchmarking performance
Performance benchmarks show significant improvements across multiple workloads. Tests performed on a cluster of m7i.4xlarge and m7i.16xlarge instances show a 3.3x to 5.6x performance gain when using the hash-based shuffle compared to the previous version. In particular, the TPCH-Q1-SF1000 workload, which previously could not be completed, is now feasible with the new backend.
Additional tests show that range-partitioning shuffles have also improved, with runtime improvements ranging from 1.6x to 4.3x. Importantly, the hash shuffle backend greatly reduces peak memory usage, by up to 3.9x.
Future development
Looking ahead, Anyscale plans to expand its support for additional join types and to implement logical plan optimizations. Further enhancements to data preprocessors are also expected.
These developments in Ray Data are set to give developers more efficient data processing capabilities. For more insights, visit the official Anyscale blog.
Image source: Shutterstock