Jessie A Ellis
 Sep 07, 2024 08:39
NVIDIA’s NVSHMEM 3.0 improves GPU communication with multi-node, multi-interconnect support, host-device ABI backward compatibility, and CPU-assisted InfiniBand GPUDirect Async.
NVIDIA has announced the release of NVSHMEM 3.0, the latest version of its parallel programming interface designed to facilitate efficient and scalable communication across NVIDIA GPU clusters. Part of NVIDIA Magnum IO and based on OpenSHMEM, the update aims to improve application portability and compatibility across platforms, according to the NVIDIA Technical Blog.
Support for new features and interfaces
NVSHMEM 3.0 introduces several new features, including multi-node, multi-interconnect support, host-device ABI backward compatibility, and CPU-assisted InfiniBand GPUDirect Async (IBGDA).
Multi-node, multi-interconnect support
The new version supports communication between multiple GPUs within a node over P2P interconnects such as NVIDIA NVLink and PCIe, and communication between nodes over RDMA interconnects such as InfiniBand and RDMA over Converged Ethernet (RoCE). This enhancement includes platform support for multiple racks of NVIDIA GB200 NVL72 systems connected over RDMA networks.
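The split between intra-node and inter-node paths can be pictured with a small sketch. The code below is illustrative only: `Transport`, `PeInfo`, `select_transport`, and `pe_on_dense_layout` are hypothetical names, not NVSHMEM identifiers, and the rule "same node means P2P, different node means RDMA" is the simplified model described above rather than the library's actual selection logic.

```cpp
#include <cassert>

// Hypothetical transport categories, named after the interconnect classes
// in the article; these are not NVSHMEM identifiers.
enum class Transport { Local, P2P, RDMA };

// Hypothetical description of a processing element (PE): one GPU in the job.
struct PeInfo {
    int pe;    // global PE index
    int node;  // index of the node hosting this PE
};

// Assumption: PEs are laid out densely, with a fixed number of GPUs per node.
inline PeInfo pe_on_dense_layout(int pe, int gpus_per_node) {
    assert(pe >= 0 && gpus_per_node > 0);
    return PeInfo{pe, pe / gpus_per_node};
}

// Pick the interconnect class for a transfer between two PEs:
// same PE -> local memory, same node -> NVLink/PCIe P2P, otherwise RDMA
// (InfiniBand or RoCE).
inline Transport select_transport(const PeInfo& src, const PeInfo& dst) {
    if (src.pe == dst.pe) return Transport::Local;
    if (src.node == dst.node) return Transport::P2P;
    return Transport::RDMA;
}
```

With eight GPUs per node, for example, PE 0 and PE 7 would fall on the P2P path in this model, while PE 0 and PE 8 would go over RDMA.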
Host-device ABI backward compatibility
NVSHMEM 3.0 introduces backward compatibility between minor versions, allowing applications linked against an older version of NVSHMEM to run on systems where a newer version is installed. This makes updates smoother and reduces the need to recompile applications for each new release.
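A compatibility rule of this kind is often expressed as a version check. The sketch below is a minimal illustration, not NVSHMEM code: `Version` and `abi_compatible` are hypothetical names, and the rule it encodes (same major version, installed minor version at least as new as the one the application was built against) is an assumption about how "compatibility between minor versions" would typically be enforced.

```cpp
#include <cassert>

// Hypothetical version pair; not an NVSHMEM type.
struct Version {
    int major_ver;
    int minor_ver;
};

// Assumed rule: an application built against `built` can run with the
// `installed` library when the major versions match and the installed
// minor version is the same or newer.
inline bool abi_compatible(const Version& built, const Version& installed) {
    assert(built.major_ver >= 0 && installed.major_ver >= 0);
    return built.major_ver == installed.major_ver &&
           installed.minor_ver >= built.minor_ver;
}
```

Under this assumed rule, a binary built against 3.0 would load a 3.1 library, while one built against 3.1 would not load 3.0, and nothing is promised across a major-version boundary.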
CPU-assisted InfiniBand GPUDirect Async
The latest release also supports CPU-assisted IBGDA, which splits control plane responsibilities between the GPU and the CPU. This approach helps improve IBGDA adoption on non-coherent platforms and relaxes administrator-level configuration constraints in large clusters.
Non-interface support and minor improvements
NVSHMEM 3.0 includes the following minor improvements and non-interface support:
Object-oriented programming framework for symmetric heaps
This version introduces an object-oriented programming (OOP) framework that manages various types of symmetric heaps, including static and dynamic device memory. The OOP framework simplifies extensions to advanced features and improves data encapsulation.
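To give a sense of what such a design can look like, here is a minimal sketch of a heap hierarchy. It is not NVSHMEM's internal code: the class names `SymmetricHeap`, `StaticDeviceHeap`, and `DynamicDeviceHeap`, the offset-based bookkeeping, and the chunked growth policy are all assumptions made for illustration, and no device memory is actually allocated.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Hypothetical base class: every heap kind hands out offsets into a
// symmetric region, so the same offset refers to the same object on each PE.
class SymmetricHeap {
public:
    virtual ~SymmetricHeap() = default;

    // Reserve `bytes` and return the offset of the reservation.
    std::size_t allocate(std::size_t bytes) {
        if (used_ + bytes > capacity_) grow(used_ + bytes);
        std::size_t offset = used_;
        used_ += bytes;
        return offset;
    }

    std::size_t capacity() const { return capacity_; }
    std::size_t used() const { return used_; }

protected:
    explicit SymmetricHeap(std::size_t capacity) : capacity_(capacity) {}

    // Called when a request does not fit; each subclass sets the policy.
    virtual void grow(std::size_t required) = 0;

    std::size_t capacity_;
    std::size_t used_ = 0;
};

// Fixed-size heap: capacity is chosen up front and never changes.
class StaticDeviceHeap : public SymmetricHeap {
public:
    explicit StaticDeviceHeap(std::size_t capacity) : SymmetricHeap(capacity) {}

protected:
    void grow(std::size_t) override { throw std::bad_alloc(); }
};

// Growable heap: extends capacity in fixed-size chunks on demand.
class DynamicDeviceHeap : public SymmetricHeap {
public:
    DynamicDeviceHeap(std::size_t initial, std::size_t chunk)
        : SymmetricHeap(initial), chunk_(chunk) {
        assert(chunk > 0);
    }

protected:
    void grow(std::size_t required) override {
        while (capacity_ < required) capacity_ += chunk_;
    }

private:
    std::size_t chunk_;
};
```

The shared `allocate` logic lives in the base class and only the growth policy differs per heap type, which is the kind of encapsulation and extensibility the framework is described as providing.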
Performance improvements and bug fixes
NVSHMEM 3.0 provides performance improvements and bug fixes across several components, including IBGDA setup, block-scoped on-device reductions, system-scoped atomic memory operations (AMO), and team management.
Summary
The release of NVSHMEM 3.0 represents a significant upgrade to NVIDIA’s parallel programming interface. Key features such as multi-node, multi-interconnect support, host-device ABI backward compatibility, and CPU-assisted IBGDA are aimed at improving GPU communication and application portability. Administrators and developers should be able to update to newer versions of NVSHMEM with less disruption to existing applications, easing the transition on large-scale GPU clusters.

