Numbast Links CUDA C++ with Python for Enhanced Performance

Numbast Bridges CUDA C++ and Python Ecosystems

Numbast, introduced on the NVIDIA Technical Blog, narrows the gap between Python developers and the CUDA C++ ecosystem. The tool automatically generates Numba bindings for CUDA C++ APIs, giving Python developers access to GPU libraries that previously could only be used from C++.

Closing the Divide

Numba already lets Python developers write CUDA kernels in Python, using a programming model that closely follows CUDA C++. However, many libraries exclusive to CUDA C++, such as the CUDA Core Compute Libraries (CCCL) and cuRAND, have remained inaccessible to Python users. Binding each library to Python by hand is laborious and error-prone.

Meet Numbast

Numbast addresses this with an automated pipeline that reads top-level declarations from CUDA C++ header files, serializes them, and generates Numba extensions. Because the bindings are generated rather than handwritten, they stay consistent with the headers and can be regenerated when the CUDA libraries change.
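The three stages described above (read declarations, serialize, generate) can be illustrated with a toy sketch in plain Python. This is not Numbast's code or API: the header text, the regular expressions, and the function names parse_declarations and generate_stub are all hypothetical, and a real pipeline would use a proper C++ parser and emit Numba extensions rather than placeholder stubs.

```python
import json
import re

# Hypothetical header text, used only for illustration.
HEADER = """
struct myfloat16 {
    unsigned short data;
};
float to_float(myfloat16 x);
myfloat16 from_float(float x);
"""

# Toy patterns; real C++ cannot be parsed reliably with regular expressions.
STRUCT_RE = re.compile(r"struct\s+(\w+)\s*\{(.*?)\};", re.DOTALL)
FUNC_RE = re.compile(r"^\s*([\w ]+?)\s+(\w+)\s*\(([^)]*)\)\s*;", re.MULTILINE)


def parse_declarations(header: str) -> dict:
    """Stage 1: collect top-level structs and function declarations."""
    structs = []
    for name, body in STRUCT_RE.findall(header):
        fields = []
        for stmt in body.split(";"):
            stmt = stmt.strip()
            if stmt:
                ftype, fname = stmt.rsplit(None, 1)
                fields.append({"type": ftype, "name": fname})
        structs.append({"name": name, "fields": fields})
    # Drop struct bodies so field lines are not mistaken for functions.
    rest = STRUCT_RE.sub("", header)
    functions = []
    for ret, name, params in FUNC_RE.findall(rest):
        param_types = [
            p.strip().rsplit(None, 1)[0] for p in params.split(",") if p.strip()
        ]
        functions.append(
            {"name": name, "returns": ret.strip(), "params": param_types}
        )
    return {"structs": structs, "functions": functions}


def generate_stub(serialized: str) -> str:
    """Stage 3: turn the serialized declarations into Python stub source."""
    decls = json.loads(serialized)
    lines = []
    for s in decls["structs"]:
        lines.append(f"class {s['name']}:")
        lines.append(f'    """Stub for C++ struct {s["name"]}."""')
        for f in s["fields"]:
            lines.append(f"    {f['name']}: object  # C++ type: {f['type']}")
    for fn in decls["functions"]:
        args = ", ".join(f"arg{i}" for i in range(len(fn["params"])))
        sig = f"({', '.join(fn['params'])}) -> {fn['returns']}"
        lines.append(f"def {fn['name']}({args}):  # C++ signature: {sig}")
        lines.append("    raise NotImplementedError('placeholder binding')")
    return "\n".join(lines)


# Stage 2: serialize the parsed declarations (JSON is an arbitrary choice here).
serialized = json.dumps(parse_declarations(HEADER), indent=2)
print(generate_stub(serialized))
```

Keeping a serialized form between parsing and generation means the generator can be rerun without reparsing the header, which is one plausible reason for a pipeline to separate the two steps.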

Showcasing Numbast’s Features

A practical demonstration of Numbast’s capabilities generates Numba bindings for a simple myfloat16 struct, modeled on CUDA’s float16 header. The example shows how C++ declarations are converted into bindings that can be called from Python, so developers can use CUDA functionality without leaving Python.

Real-World Use

One of the first bindings produced with Numbast is the bfloat16 data type, which interoperates with PyTorch’s torch.bfloat16. This makes it possible to write custom compute kernels that operate on bfloat16 data and use CUDA intrinsics for efficient processing.
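For readers unfamiliar with the format, bfloat16 keeps the sign bit, the 8 exponent bits, and the top 7 mantissa bits of a 32-bit float, which is why the same 16-bit values can be shared between libraries. The sketch below illustrates that layout in plain Python. It is not Numbast or PyTorch code, the function names are hypothetical, and it truncates the low bits for simplicity, whereas real conversions typically round to nearest.

```python
import struct


def float_to_bfloat16_bits(x: float) -> int:
    """Return the upper 16 bits of x's float32 encoding (truncating)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16


def bfloat16_bits_to_float(b: int) -> float:
    """Expand a 16-bit bfloat16 pattern back to a Python float."""
    (value,) = struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))
    return value


bits = float_to_bfloat16_bits(3.14159)
print(hex(bits), bfloat16_bits_to_float(bits))
```

The round trip loses precision (only about three significant decimal digits survive) but preserves the full float32 exponent range.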

Structure and Operation

Numbast consists of two principal components: AST_Canopy, which parses and serializes C++ headers, and the Numbast layer itself, which generates Numba bindings from the serialized output. AST_Canopy detects the environment at runtime and can parse headers for different compute capabilities, while the Numbast layer acts as the intermediary between C++ and Python.

Performance and Future Potential

The bindings Numbast generates reach the underlying C++ code through foreign function invocation. Future releases are expected to further narrow the performance gap between Numba kernels and native CUDA C++ implementations. Upcoming versions are also expected to add more bindings, including NVSHMEM and CCCL, broadening the tool’s applicability.

For further details, visit the NVIDIA Technical Blog.
