diff --git a/CHANGELOG.md b/CHANGELOG.md
index 381798b6e..b0d92e8f6 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -2,6 +2,15 @@
 
 Full documentation for RCCL is available at [https://rccl.readthedocs.io](https://rccl.readthedocs.io)
 
+## RCCL 2.20.5 for ROCm 6.2.1
+### Fixed
+- GDR support flag now set with DMABUF
+### Known issues
+- On systems running Linux kernel 6.8.0, such as Ubuntu 24.04, Direct Memory Access (DMA) transfers between the GPU and NIC are disabled and impacts multi-node RCCL performance.
+    - This issue was reproduced with RCCL 2.20.5 (ROCm 6.2.0 and 6.2.1) on systems with Broadcom Thor-2 NICs and affects other systems with RoCE networks using Linux 6.8.0 or newer.
+    - Older RCCL versions are also impacted.
+    - This issue will be addressed in a future ROCm release.
+
 ## RCCL 2.20.5 for ROCm 6.2.0
 ### Changed
 - Compatibility with NCCL 2.20.5
@@ -24,12 +33,15 @@ Full documentation for RCCL is available at [https://rccl.readthedocs.io](https:
 - New unit test for main kernel stack size
 - New -n option for topo_expl to override # of nodes
 - Improved debug messages of memory allocations
-- Channel shuffling for IB systems
+- Channel shuffling for multi-node MI300X systems
 ### Fixed
 - Bug when configuring RCCL for only LL128 protocol
 - Scratch memory allocation after API change for MSCCL
 - Incorrect minNchannels in multi-node
-- GDR support flag now set with DMABUF
+
+## RCCL 2.18.6 for ROCm 6.1.2
+### Changed
+- Reduced NCCL_TOPO_MAX_NODES to limit stack usage and avoid overflow
 
 ## RCCL 2.18.6 for ROCm 6.1.0
 ### Changed