[release/2.4] [ROCm][layer_norm] Use __builtin_amdgcn_rcpf(x) instead of 1.f/x #1688
Jenkins build for 86e72f59ea71082a578ecaef6717677856675fb4 commit finished as FAILURE Detected error during Pytorch building:
Replace (more) exact calculation with hardware approximation.

Benefits:
Reduced code size.
Improved performance for certain scenarios.
Experiments show low reduction in precision.
Experiments show no significant performance regressions.
bfloat16 as well as float16 related calculations may benefit largely from this change.

vectorized_layer_norm_kernel:
Gains performance esp. for the following tensor shapes.
Lower values for dim1 do not change performance significantly.
dim1 = 8k-65k may gain considerable performance, but decline gradually with size.

dim0    dim1
----    ------
1024    8192
1024    16384
1024    32768
1024    65536
1024    131072
1024    262144
1024    524288

---------

Co-authored-by: Hashem Hashemi <hashem.hashemi@amd.com>
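For readers unfamiliar with the change, the substitution named in the title can be sketched roughly as follows. This is an illustration under assumptions, not the PR's diff: `fast_rcp`, `Welford`, `welford_step`, `welford_row`, and the `HD` macro are hypothetical names, and the Welford running-mean update is assumed here as a typical place where a layer norm kernel divides by a count. `__builtin_amdgcn_rcpf` is used only when compiling device code with hipcc; with any other compiler the sketch falls back to `1.0f / x`, so it builds and runs on the host.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical macro: mark functions host+device under hipcc, plain otherwise.
#if defined(__HIPCC__)
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical helper: hardware reciprocal approximation in AMD device code,
// exact division everywhere else (assumption: not the PR's actual code).
HD inline float fast_rcp(float x) {
#if defined(__HIP_DEVICE_COMPILE__) && __HIP_DEVICE_COMPILE__
  return __builtin_amdgcn_rcpf(x);  // approximate 1/x on AMD GPUs
#else
  return 1.0f / x;                  // exact fallback for host builds
#endif
}

// Running statistics for Welford's online algorithm.
struct Welford {
  float mean;   // running mean
  float m2;     // running sum of squared deviations from the mean
  float count;  // number of values seen so far
};

// One Welford update; the division by the new count becomes a
// multiplication by its reciprocal.
HD inline Welford welford_step(float val, Welford s) {
  const float delta = val - s.mean;
  const float new_count = s.count + 1.0f;
  const float new_mean = s.mean + delta * fast_rcp(new_count);
  return {new_mean, s.m2 + delta * (val - new_mean), new_count};
}

// Host-side driver: accumulate the statistics of one row of length n.
inline Welford welford_row(const float* row, std::size_t n) {
  assert(n > 0);
  Welford s{0.0f, 0.0f, 0.0f};
  for (std::size_t i = 0; i < n; ++i) {
    s = welford_step(row[i], s);
  }
  return s;
}
```

On the host, `welford_row` applied to a row gives its mean in `mean` and its biased variance as `m2 / count`. On an AMD GPU the same update would use the approximate reciprocal, which is where the small precision loss mentioned in the commit message would come from.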
86e72f5 to 1cd0f36
Jenkins build for 1cd0f364e5e7681a3f71ff81ceb2b6fd5eefab81 commit finished as FAILURE Detected error during Pytorch building:
How do we fix the CI? Do we want to just try an upstream draft PR so we can get a CI signal?
Can you also run the PyTorch UTs that use layer norm? After that, you can also try running the full PyTorch UT suite.
I agree with @jeffdaily's suggestion.
LGTM.
If there is no regression in the UTs, this is good to merge.
The upstream PR passed CI. See pytorch#141309.