From 612add2b066b7e222ea4d0ff7d6382ed2bcdbaf2 Mon Sep 17 00:00:00 2001
From: Nilesh M Negi
Date: Wed, 23 Oct 2024 14:35:05 -0500
Subject: [PATCH] Update CHANGELOG for ROCm 6.2.x (#1343)

* [CHANGELOG] Update for ROCm 6.2.1

Signed-off-by: nileshnegi

* [CHANGELOG] Add Known issues for ROCm 6.2.1

Signed-off-by: nileshnegi

* Updated 6.2.1 known issues to match the content in develop.

---------

Signed-off-by: nileshnegi
Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>
---
 CHANGELOG.md | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 381798b6e..b0d92e8f6 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -2,6 +2,15 @@
 
 Full documentation for RCCL is available at [https://rccl.readthedocs.io](https://rccl.readthedocs.io)
 
+## RCCL 2.20.5 for ROCm 6.2.1
+### Fixed
+- GDR support flag now set with DMABUF
+### Known issues
+- On systems running Linux kernel 6.8.0, such as Ubuntu 24.04, Direct Memory Access (DMA) transfers between the GPU and NIC are disabled and impacts multi-node RCCL performance.
+  - This issue was reproduced with RCCL 2.20.5 (ROCm 6.2.0 and 6.2.1) on systems with Broadcom Thor-2 NICs and affects other systems with RoCE networks using Linux 6.8.0 or newer.
+  - Older RCCL versions are also impacted.
+  - This issue will be addressed in a future ROCm release.
+
 ## RCCL 2.20.5 for ROCm 6.2.0
 ### Changed
 - Compatibility with NCCL 2.20.5
@@ -24,12 +33,15 @@ Full documentation for RCCL is available at [https://rccl.readthedocs.io](https:
 - New unit test for main kernel stack size
 - New -n option for topo_expl to override # of nodes
 - Improved debug messages of memory allocations
-- Channel shuffling for IB systems
+- Channel shuffling for multi-node MI300X systems
 ### Fixed
 - Bug when configuring RCCL for only LL128 protocol
 - Scratch memory allocation after API change for MSCCL
 - Incorrect minNchannels in multi-node
-- GDR support flag now set with DMABUF
+
+## RCCL 2.18.6 for ROCm 6.1.2
+### Changed
+- Reduced NCCL_TOPO_MAX_NODES to limit stack usage and avoid overflow
 
 ## RCCL 2.18.6 for ROCm 6.1.0
 ### Changed