Caroline Bishop
Sep 21, 2024 13:38
NVIDIA has unveiled NCCL 2.22, emphasizing memory optimization, quicker initialization, and cost estimation to enhance HPC and AI applications.
NVIDIA has released the newest version of the NVIDIA Collective Communications Library (NCCL), 2.22, which introduces important upgrades focused on improving memory efficiency, speeding up initialization, and adding a cost estimation API. These advancements are vital for high-performance computing (HPC) and artificial intelligence (AI) applications, as outlined on the NVIDIA Technical Blog.
Release Highlights
NVIDIA Magnum IO NCCL is tailored for optimizing inter-GPU and multi-node communications, which are crucial for effective parallel computing. The NCCL 2.22 update boasts the following key features:
- Lazy Connection Establishment: This feature postpones connection creation until necessary, thus significantly lowering GPU memory consumption.
- New API for Cost Estimation: A novel API aids in optimizing compute and communication overlap or exploring the NCCL cost model.
- Enhancements for ncclCommInitRank: Elimination of redundant topology queries, resulting in up to 90% faster initialization for applications that create multiple communicators.
- Support for Multiple Subnets via IB Router: Enables communication for jobs spanning multiple InfiniBand subnets, facilitating larger deep learning training jobs.
Features Explained
Lazy Connection Establishment
NCCL 2.22 introduces lazy connection establishment, which significantly reduces GPU memory usage by deferring connection creation until it is actually needed. This is especially beneficial for applications with a narrow usage pattern, such as those that repeatedly execute the same algorithm. The feature is enabled by default and can be disabled by setting NCCL_RUNTIME_CONNECT=0.
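If the previous eager behavior is needed, for example for benchmarking, the variable must be set before the communicator is created. The minimal sketch below is purely illustrative; in practice the variable is usually exported in the job launch script rather than set in code.

```c
#include <stdlib.h>
#include <nccl.h>

/* Illustrative only: force eager connection establishment for this process.
 * NCCL_RUNTIME_CONNECT must be set before the communicator is created;
 * exporting it in the job script is the more common approach. */
int main(void) {
    setenv("NCCL_RUNTIME_CONNECT", "0", 1);   /* disable lazy connections */
    /* ... ncclCommInitRank(...) and collective calls follow as usual ... */
    return 0;
}
```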
New Cost Model API
The newly introduced ncclGroupSimulateEnd API enables developers to estimate the time required for NCCL operations, helping them optimize the overlap of compute and communication. Although the estimates may not always match actual performance, they serve as a useful reference for fine-tuning.
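As a rough illustration, the sketch below queues an allreduce inside a group and asks NCCL for an estimate instead of launching it. It assumes the ncclSimInfo_t structure, the NCCL_SIM_INFO_INITIALIZER macro, and the estimatedTime field (in microseconds) as documented for 2.22, and that the communicator, buffers, and stream were created elsewhere.

```c
#include <stdio.h>
#include <cuda_runtime.h>
#include <nccl.h>

/* Sketch: estimate the cost of an allreduce instead of launching it.
 * Assumes `comm`, device buffers `sendbuf`/`recvbuf`, and `stream` were
 * created elsewhere, and that ncclSimInfo_t / NCCL_SIM_INFO_INITIALIZER
 * are available as in NCCL 2.22 or newer. */
void estimate_allreduce_cost(ncclComm_t comm, const float *sendbuf,
                             float *recvbuf, size_t count,
                             cudaStream_t stream) {
    ncclSimInfo_t sim = NCCL_SIM_INFO_INITIALIZER;

    ncclGroupStart();
    /* Queue the collective as usual; nothing runs on the GPU yet. */
    ncclAllReduce(sendbuf, recvbuf, count, ncclFloat, ncclSum, comm, stream);
    /* Instead of ncclGroupEnd(), ask NCCL for a cost estimate.
     * The queued operations are simulated rather than executed. */
    ncclGroupSimulateEnd(&sim);

    printf("Estimated allreduce time: %.1f us\n", sim.estimatedTime);
}
```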
Initialization Enhancements
To reduce initialization overhead, the NCCL team has rolled out several enhancements, including lazy connection establishment and intra-node topology fusion. These improvements can cut ncclCommInitRank execution time by as much as 90%, making startup significantly quicker for applications that create multiple communicators.
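For context, the sketch below shows a typical initialization path that benefits from these changes; it uses MPI only to distribute the NCCL unique ID, which is one common choice of out-of-band transport.

```c
#include <mpi.h>
#include <cuda_runtime.h>
#include <nccl.h>

/* Typical single-communicator setup whose startup path benefits from the
 * 2.22 initialization improvements. MPI is used here only to distribute
 * the NCCL unique ID; any out-of-band mechanism works. */
int main(int argc, char **argv) {
    int rank, nranks;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    ncclUniqueId id;
    if (rank == 0) ncclGetUniqueId(&id);              /* one rank creates the ID */
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

    cudaSetDevice(rank % 8);                          /* assumes up to 8 GPUs per node */

    ncclComm_t comm;
    ncclCommInitRank(&comm, nranks, id, rank);        /* the call sped up in 2.22 */

    /* ... collectives and compute ... */

    ncclCommDestroy(comm);
    MPI_Finalize();
    return 0;
}
```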
New Tuner Plugin Interface
The new tuner plugin interface (version 3) features a per-collective 2D cost table that indicates the estimated time required for operations. This allows external tuners to refine algorithm and protocol combinations for superior performance.
Static Plugin Linking
To improve convenience and avoid plugin-loading issues, NCCL 2.22 supports statically linking the network or tuner plugin. Applications can opt in by setting NCCL_NET_PLUGIN or NCCL_TUNER_PLUGIN to STATIC_PLUGIN.
Group Semantics for Abort or Destroy
NCCL 2.22 implements group semantics for ncclCommDestroy and ncclCommAbort, allowing multiple communicators to be destroyed or aborted together. This functionality helps avoid deadlocks and improves the user experience.
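A minimal sketch of the new grouped teardown, assuming the communicators were created earlier in the application:

```c
#include <nccl.h>

/* Sketch: tear down several communicators as one grouped operation using
 * the group semantics added in 2.22. `comms` and `n` are assumed to come
 * from earlier setup code. */
void destroy_all(ncclComm_t *comms, int n) {
    ncclGroupStart();
    for (int i = 0; i < n; i++) {
        ncclCommDestroy(comms[i]);   /* queued; completed when the group ends */
    }
    ncclGroupEnd();
}
```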
IB Router Support
This release enables NCCL to function across different InfiniBand subnets, improving communication capabilities for larger networks. The library autonomously identifies and establishes connections between endpoints on various subnets, utilizing FLID for enhanced performance and adaptive routing.
Bug Fixes and Minor Changes
The NCCL 2.22 release also encompasses several bug fixes and minor modifications:
- Enabled support for the allreduce tree algorithm on DGX Google Cloud.
- Logged NIC names in IB async errors.
- Enhanced performance of registered send and receive operations.
- Incorporated infrastructure code for NVIDIA Trusted Computing Solutions.
- Isolated traffic class for IB and RoCE control messages to facilitate advanced QoS.
- Facilitated support for PCI peer-to-peer communications across partitioned Broadcom PCI switches.
Conclusion
The NCCL 2.22 release introduces numerous vital features and optimizations aimed at boosting performance and efficiency for HPC and AI applications. Notable improvements include a new tuner plugin interface, support for static linking of plugins, and refined group semantics to prevent deadlocks.