Caroline Bishop
Sep 21, 2024 13:38
NVIDIA has unveiled NCCL 2.22, emphasizing memory optimization, quicker initialization, and cost estimation to enhance HPC and AI applications.
NVIDIA has released the newest version of the NVIDIA Collective Communications Library (NCCL), 2.22, which introduces important upgrades focused on improving memory efficiency, speeding up initialization, and adding a cost estimation API. These advancements are vital for high-performance computing (HPC) and artificial intelligence (AI) applications, as outlined on the NVIDIA Technical Blog.
Release Highlights
NVIDIA Magnum IO NCCL is tailored for optimizing inter-GPU and multi-node communications, which are crucial for effective parallel computing. The NCCL 2.22 update boasts the following key features:
- Lazy Connection Establishment: This feature postpones connection creation until necessary, thus significantly lowering GPU memory consumption.
- New API for Cost Estimation: A novel API aids in optimizing compute and communication overlap or exploring the NCCL cost model.
- Enhancements for ncclCommInitRank: Elimination of redundant topology queries, resulting in up to 90% faster initialization for applications that create multiple communicators.
- Support for Multiple Subnets via IB Router: Enables communication for jobs spanning multiple InfiniBand subnets, facilitating larger deep learning training jobs.
Features Explained
Lazy Connection Establishment
NCCL 2.22 introduces lazy connection establishment, which significantly reduces GPU memory usage by deferring connection creation until it is actually needed. This is especially beneficial for applications with a narrow usage pattern, such as those that repeatedly execute the same algorithm. The feature is enabled by default and can be disabled by setting NCCL_RUNTIME_CONNECT=0.
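If the previous eager behavior is needed, for example for benchmarking, the variable must be set before the communicator is created. The minimal sketch below is purely illustrative; in practice the variable is usually exported in the job launch script rather than set in code.

```c
#include <stdlib.h>
#include <nccl.h>

/* Illustrative only: force eager connection establishment for this process.
 * NCCL_RUNTIME_CONNECT must be set before the communicator is created;
 * exporting it in the job script is the more common approach. */
int main(void) {
    setenv("NCCL_RUNTIME_CONNECT", "0", 1);   /* disable lazy connections */
    /* ... ncclCommInitRank(...) and collective calls follow as usual ... */
    return 0;
}
```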
New Cost Model API
The newly introduced ncclGroupSimulateEnd API enables developers to estimate the time required for NCCL operations, helping them optimize the overlap of compute and communication. Although the estimates may not always match actual performance, they serve as a useful reference for fine-tuning.
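As a rough illustration, the sketch below queues an allreduce inside a group and asks NCCL for an estimate instead of launching it. It assumes the ncclSimInfo_t structure, the NCCL_SIM_INFO_INITIALIZER macro, and the estimatedTime field (in microseconds) as documented for 2.22, and that the communicator, buffers, and stream were created elsewhere.

```c
#include <stdio.h>
#include <cuda_runtime.h>
#include <nccl.h>

/* Sketch: estimate the cost of an allreduce instead of launching it.
 * Assumes `comm`, device buffers `sendbuf`/`recvbuf`, and `stream` were
 * created elsewhere, and that ncclSimInfo_t / NCCL_SIM_INFO_INITIALIZER
 * are available as in NCCL 2.22 or newer. */
void estimate_allreduce_cost(ncclComm_t comm, const float *sendbuf,
                             float *recvbuf, size_t count,
                             cudaStream_t stream) {
    ncclSimInfo_t sim = NCCL_SIM_INFO_INITIALIZER;

    ncclGroupStart();
    /* Queue the collective as usual; nothing runs on the GPU yet. */
    ncclAllReduce(sendbuf, recvbuf, count, ncclFloat, ncclSum, comm, stream);
    /* Instead of ncclGroupEnd(), ask NCCL for a cost estimate.
     * The queued operations are simulated rather than executed. */
    ncclGroupSimulateEnd(&sim);

    printf("Estimated allreduce time: %.1f us\n", sim.estimatedTime);
}
```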
Initialization Enhancements
To reduce initialization overhead, the NCCL team has rolled out several enhancements, including lazy connection establishment and intra-node topology fusion. These improvements can cut ncclCommInitRank execution time by as much as 90%, making startup significantly quicker for applications that create multiple communicators.
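For context, the sketch below shows a typical initialization path that benefits from these changes; it uses MPI only to distribute the NCCL unique ID, which is one common choice of out-of-band transport.

```c
#include <mpi.h>
#include <cuda_runtime.h>
#include <nccl.h>

/* Typical single-communicator setup whose startup path benefits from the
 * 2.22 initialization improvements. MPI is used here only to distribute
 * the NCCL unique ID; any out-of-band mechanism works. */
int main(int argc, char **argv) {
    int rank, nranks;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    ncclUniqueId id;
    if (rank == 0) ncclGetUniqueId(&id);              /* one rank creates the ID */
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

    cudaSetDevice(rank % 8);                          /* assumes up to 8 GPUs per node */

    ncclComm_t comm;
    ncclCommInitRank(&comm, nranks, id, rank);        /* the call sped up in 2.22 */

    /* ... collectives and compute ... */

    ncclCommDestroy(comm);
    MPI_Finalize();
    return 0;
}
```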
New Tuner Plugin Interface
The new tuner plugin interface (version 3) features a per-collective 2D cost table that indicates the estimated time required for operations. This allows external tuners to refine algorithm and protocol combinations for superior performance.
Static Plugin Linking
To improve convenience and avoid plugin-loading issues, NCCL 2.22 supports statically linking the network or tuner plugin. Applications can opt in by setting NCCL_NET_PLUGIN or NCCL_TUNER_PLUGIN to STATIC_PLUGIN.
Group Semantics for Abort or Destroy
NCCL 2.22 implements group semantics for ncclCommDestroy and ncclCommAbort, allowing multiple communicators to be destroyed or aborted together. This functionality helps avoid deadlocks and improves the user experience.
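A minimal sketch of the new grouped teardown, assuming the communicators were created earlier in the application:

```c
#include <nccl.h>

/* Sketch: tear down several communicators as one grouped operation using
 * the group semantics added in 2.22. `comms` and `n` are assumed to come
 * from earlier setup code. */
void destroy_all(ncclComm_t *comms, int n) {
    ncclGroupStart();
    for (int i = 0; i < n; i++) {
        ncclCommDestroy(comms[i]);   /* queued; completed when the group ends */
    }
    ncclGroupEnd();
}
```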
IB Router Support
This release enables NCCL to function across different InfiniBand subnets, improving communication capabilities for larger networks. The library autonomously identifies and establishes connections between endpoints on various subnets, utilizing FLID for enhanced performance and adaptive routing.
Bug Fixes and Minor Changes
The NCCL 2.22 release also encompasses several bug fixes and minor modifications:
- Enabled support for the allreduce tree algorithm on DGX Google Cloud.
- Logged NIC names in IB async errors.
- Enhanced performance of registered send and receive operations.
- Incorporated infrastructure code for NVIDIA Trusted Computing Solutions.
- Isolated traffic class for IB and RoCE control messages to facilitate advanced QoS.
- Facilitated support for PCI peer-to-peer communications across partitioned Broadcom PCI switches.
Conclusion
The NCCL 2.22 release introduces numerous vital features and optimizations aimed at boosting performance and efficiency for HPC and AI applications. Notable improvements include a new tuner plugin interface, support for static linking of plugins, and refined group semantics to prevent deadlocks.