
Fix wmma api parity #6

Merged: 2 commits merged into rocm_enabled on Feb 19, 2024

Conversation

@Lzy17 commented Feb 6, 2024

Hipify the wmma API calls with rocwmma.

@pnunna93 (Collaborator) left a comment

Please change the datatypes and enums to rocwmma equivalents as well. I have commented on one of them above.

csrc/kernels.hip Outdated
wmma::fragment<wmma::matrix_b, 8, 32, 16, half, wmma::col_major> b_frag;
wmma::fragment<wmma::accumulator, 8, 32, 16, half> c_frag;
wmma::fill_fragment(c_frag, 0.0f);
rocwmma::fragment<wmma::matrix_a, 8, 32, 16, half, wmma::row_major> a_frag;
@pnunna93 (Collaborator):

Replace matrix_a and row_major with rocwmma equivalents
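A minimal sketch of what the fully converted declarations could look like, assuming rocwmma mirrors the nvcuda::wmma fragment API one-to-one (rocwmma::matrix_a / matrix_b / accumulator and rocwmma::row_major / col_major as the type and enum equivalents); illustrative only, not the exact change that was pushed:

// Sketch: every wmma identifier swapped for its rocwmma counterpart
#include <rocwmma/rocwmma.hpp>
rocwmma::fragment<rocwmma::matrix_a, 8, 32, 16, half, rocwmma::row_major> a_frag;
rocwmma::fragment<rocwmma::matrix_b, 8, 32, 16, half, rocwmma::col_major> b_frag;
rocwmma::fragment<rocwmma::accumulator, 8, 32, 16, half> c_frag;
rocwmma::fill_fragment(c_frag, 0.0f);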

@Lzy17 (Author):

Fixed!

@seungduk-yanolja commented

Hello, I am sorry to disturb you, but I cannot find any place where I can report an issue related to this library.
I am testing the new MI300X machine and having trouble importing bitsandbytes after a successful installation.
I am unsure whether I can disclose the error message here, so I am waiting for a response from a developer on AMD's side.
Please reach out to me at seungduk.kim@yanolja.com.
I am a software engineer, and I can explain what I am experiencing.
Thanks!

@pnunna93 (Collaborator) commented Feb 7, 2024


Hi @seungduk-yanolja, please try installing it from the rocm_enabled branch; the instructions are on that page. Please be aware that full enablement is still pending. You can report any future issues on the https://github.com/ROCm/rocm repo.

@seungduk-yanolja commented Feb 8, 2024


Reported the issue here: ROCm/ROCm#2885

Hi @pnunna93, yes, I installed it from the rocm_enabled branch because I saw PRs were merged into this branch.
ChatGPT said this:

The backtrace provided indicates that the core dump resulted from a segmentation fault (SIGABRT) triggered within the Python process. Specifically, the crash occurs during the dynamic loading of a shared library related to the torch package, more precisely within libhipblaslt.so, which is part of the ROCm platform for AMD GPUs. This suggests an issue related to the HIP/ROCm ecosystem, possibly due to an incompatibility or a bug in the library or its dependencies.

The key points in the backtrace indicating the source of the issue are:

  • The termination happens after an attempt to load ExtOpMasterLibrary from libhipblaslt.so, which is part of the ROCm software stack.
  • The crash is preceded by a std::runtime_error, indicating that an exception was thrown within the C++ standard library, leading to a call to std::terminate(), which then causes the process to abort.

Given the complexity of debugging segmentation faults in dynamically loaded libraries, especially within the context of GPU computing, resolving such issues can sometimes require deep technical knowledge of the libraries and the underlying hardware. Collaboration with the community or seeking support from the developers of the libraries involved may be necessary.

@pnunna93 (Collaborator) commented Feb 8, 2024


Hi @seungduk-yanolja, please reinstall hipblaslt with these steps:
git clone --recurse https://github.com/ROCmSoftwarePlatform/hipBLASLt
cd hipBLASLt
git checkout 4b3b34405e7e25cff404f69bfd0a832644430477
./install.sh -idc

You may need to copy and relink the hipblaslt .so files from the build directory to /opt/rocm/lib/ if they are not automatically replaced after the build.
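For reference, a minimal sketch of that manual step, assuming the built libraries end up under hipBLASLt's build/release/ tree (the exact path is an assumption; adjust it to your actual build output):

# Copy the freshly built hipBLASLt shared objects over the system copies (assumed path)
sudo cp build/release/hipblaslt-install/lib/libhipblaslt.so* /opt/rocm/lib/
# Refresh the dynamic linker cache so the new libraries are picked up
sudo ldconfig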

@seungduk-yanolja commented Feb 9, 2024


These look like the same commands as described in the README.md of the rocm_enabled branch. I already used them, but let me retry.

Update: I tried to install hipBLASLt again but there was an error (invalid memory access) and the whole filesystem became read-only. I rebooted the machine and then it did not correctly recognize the GPUs. I rebooted the IPMI and then it became normal. At this moment, what I can do with this machine (MI300X) is run vLLM with 4 out of 8 GPUs because the output became so weird when I used all 8 GPUs. Will try and explore more what I can do.

@Titus-von-Koeller commented

Hey all! I'm Titus, one of the bitsandbytes maintainers. We currently have a strong push underway to officially support hardware backends other than CUDA in BNB. Would you be willing to help us get the AMD part right and consolidate the code bases?

@amathews-amd (Collaborator) commented

Hi @Titus-von-Koeller, sure! We were planning to reach out to you once we closed out some internal dependencies. Is there a forum where we can discuss this?

@pnunna93 (Collaborator) commented

Hi @seungduk-yanolja, it sounds like there is an issue with the hipblaslt build or linking. The version I pointed to has the ExtOpMasterLibrary class, but something else is going wrong in the build. Please check back on the ROCm issue; they should be able to help. Thanks.

@seungduk-yanolja commented Feb 19, 2024

Thank you all. I no longer have access to the machine since it was a short-term PoC. Another PoC is scheduled for next month, so I will try again. Thanks again.

@Lzy17 merged commit 2b77380 into rocm_enabled on Feb 19, 2024
1 check passed