
[Bug Report] CPU and NPU produce different BFP8_B values from the same BF16 tensor. #14032

Closed
namhyeong-kim opened this issue Oct 21, 2024 · 24 comments
Labels: blackhole, bug (Something isn't working), LLK, moreh (moreh contribution), P0, WH

@namhyeong-kim
Contributor

Describe the bug
The CPU and NPU produce different BFP8_B values from the same BF16 tensor.

To Reproduce
Steps to reproduce the behavior:

  1. go to branch namhyeong/.... Link to branch
  2. run tests/ttnn/unit_tests/test_bfp8_bf16_conversion.py Link to test file
    1. test_typecast_bf16_to_bfp8_b fails, while test_typecast_bfp8_b_to_bf16 passes.

Expected behavior
The CPU and NPU should produce the same BFP8_B values.
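For orientation, the comparison exercised by the reproduction steps above has roughly the shape sketched below. This is only a sketch, not the actual test file; the function name, tensor shape, and exact ttnn arguments are illustrative assumptions.

import torch
import ttnn

def compare_host_vs_device_bfp8_b(device, shape=(1, 1, 32, 32)):
    # Hypothetical helper, for illustration only.
    torch_input = torch.rand(shape, dtype=torch.bfloat16)

    # CPU path: let the host runtime pack the BF16 tensor into BFP8_B.
    host_bfp8 = ttnn.from_torch(torch_input, dtype=ttnn.bfloat8_b, layout=ttnn.TILE_LAYOUT)

    # NPU path: move the BF16 tensor to the device and typecast it there.
    device_bf16 = ttnn.from_torch(
        torch_input, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device
    )
    device_bfp8 = ttnn.typecast(device_bf16, ttnn.bfloat8_b)

    # Read both results back and compare element-wise; the bug is that they differ.
    return torch.equal(ttnn.to_torch(host_bfp8), ttnn.to_torch(device_bfp8))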

Screenshots
[Screenshot attached: 2024-10-21 6:28:49 PM]

Please complete the following environment information:

  • OS: Ubuntu 20.04

Additional context
No additional context

@namhyeong-kim namhyeong-kim added the bug and moreh labels Oct 21, 2024
@razorback3 razorback3 added the P1 label Oct 21, 2024
@rdjogoTT
Contributor

I have started investigating and should have an update by tomorrow.

@rdjogoTT
Contributor

I've managed to trace the data through the HW path it takes, and it looks like the issue occurs in the Packer. I will be investigating this further.

@rdjogoTT
Contributor

The issue comes from how the packer handles bfp8_b datums: it rounds the mantissa once, then shifts it and rounds again. I'm looking into the best course of action now.
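To illustrate the effect described above with a toy example: rounding the mantissa to an intermediate precision and then rounding again after the shift can land one LSB away from shifting first and rounding only once. The round-half-up mode below is an assumption for illustration; the actual gasket/packer rounding modes may differ.

import math

def round_half_up(x):
    return math.floor(x + 0.5)

# True mantissa value, expressed in units of the final BFP8_B mantissa LSB
# (i.e. after alignment to the block's shared exponent).
true_mantissa = 2.375

# Single rounding: align to the shared exponent first, then round once.
single = round_half_up(true_mantissa)                # -> 2

# Double rounding: round to an intermediate precision (one extra bit kept here),
# then shift and round again.
intermediate = round_half_up(true_mantissa * 2) / 2  # -> 2.5
double = round_half_up(intermediate)                 # -> 3

print(single, double)  # 2 3: the two paths disagree by one LSB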

Can you please provide some context on why the values need to match exactly? Is this just a bug report, or is the accuracy needed somewhere specific? One potential fix would come at the expense of perf, due to not using the full packer bandwidth, so it would help to know the application.

@namhyeong-kim
Contributor Author

  1. We will use BFP8_B for training large language models. I can't fully anticipate the model-level side effects caused by conversion inconsistencies between the CPU and NPU. Could you implement an option for precise conversion, even if it slightly reduces packer bandwidth?
  2. The CPU converts correctly, while the NPU converts approximately. Is that correct?

@rdjogoTT
Contributor

Yes, the CPU is producing the more accurate values in this case. OK, I will look into how that can be done.

@jvasilje
Collaborator

@rdjogoTT any updates on this P0 bug?

@rdjogoTT
Contributor

rdjogoTT commented Oct 30, 2024

I have pushed a branch origin/rd/bfp8b_pack with the workaround we found. It involves changing the pack src format for bfp8b to fp16b instead, which would cut packer bandwidth in half for WHB0 (but shouldn't affect BH).

This raises the question of how to handle this tradeoff. The way to decide would be to try training with bfp8b and see whether the accuracy is needed and the tradeoff is worth it.

@namhyeong-kim
Contributor Author

It looks like I cannot enable the accurate conversion at the program-object level. Could you add an option that toggles this feature?

@rdjogoTT
Contributor

rdjogoTT commented Nov 5, 2024

Ok, I will add an option ASAP.

@rdjogoTT
Contributor

rdjogoTT commented Nov 6, 2024

#14822 contains the new flag bfp8_pack_precise which allows you to toggle the feature. Setting it to true in the ComputeConfig will result in the test above passing. I will work to merge this change quickly.

In the case of typecast, this is how it can be set to true:

auto eltwise_unary_kernel_group_1_id = tt::tt_metal::CreateKernel(
        program,
        "tt_metal/kernels/compute/eltwise_sfpu.cpp",
        core_group_1,
        tt::tt_metal::ComputeConfig{
            .math_fidelity = MathFidelity::HiFi4,
            .fp32_dest_acc_en = args.fp32_dest_acc_en,
            .unpack_to_dest_mode = unpack_to_dest_mode,
            .bfp8_pack_precise = true, // <----------- add this line
            .math_approx_mode = math_approx_mode,
            .compile_args = compute_kernel_args_group_1,
            .defines = unary_defines});

@namhyeong-kim
Contributor Author

Thank you! Could I test the feature with various inputs?
The test test_typecast_bf16_to_bfp8_b only demonstrates the existence of the bug; it doesn't have sufficient coverage.

@rdjogoTT
Contributor

rdjogoTT commented Nov 7, 2024

The feature should enable fp16b or fp32 dest to be packed out accurately. Yes, please test it with various inputs.

@namhyeong-kim
Contributor Author

namhyeong-kim commented Nov 7, 2024

What do you mean by enabling fp16b?

@rdjogoTT
Contributor

rdjogoTT commented Nov 7, 2024

Sorry, I should have clarified: typecast works by unpacking the fp16b into Dest and using the packer to convert to bfp8b; the same applies to the fp32-to-bfp8b typecast. This bug was caused by that packer conversion, so this feature should enable accurate conversion now.

@rdjogoTT
Contributor

rdjogoTT commented Nov 7, 2024

Hi @namhyeong-kim, can you please share your tests with me when you make them? I would like to include them with my PR #14822 before merging to main.

@namhyeong-kim
Contributor Author

OK, I will share them.

@namhyeong-kim
Contributor Author

namhyeong-kim commented Nov 8, 2024

I pushed the tests to my branch, which is rebased on your branch.
Every test in test_typecast_bf16_to_bfp8_b_various_input passes.
But when insert_nan is true in test_typecast_bf16_to_bfp8_b_with_inf_nan, the test fails:
the CPU produces NaN but the NPU produces -inf.

idx=(tensor(0), tensor(1)) cpu=nan npu=-inf
idx=(tensor(0), tensor(4)) cpu=nan npu=-inf
idx=(tensor(0), tensor(18)) cpu=nan npu=-inf
idx=(tensor(0), tensor(29)) cpu=nan npu=-inf
idx=(tensor(0), tensor(35)) cpu=nan npu=-inf

Could you check these failed tests?

@rdjogoTT
Contributor

rdjogoTT commented Nov 8, 2024

I've found that the issue is again related to the packer; I will post further updates soon.

@rdjogoTT
Contributor

rdjogoTT commented Nov 8, 2024

The Tensix team has confirmed that we have had issues with these special values in the past as well. Unfortunately, nothing can be done from the HW/LLK side to remedy this.

The LLK recommendation is to make the checks for special values aware that a NaN in Bfp8_b may be converted to -Inf. If there is a case where you must know whether a NaN is present, maybe check before converting to Bfp8_b?
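A minimal sketch of that pre-check on the host side, assuming a torch tensor and the same ttnn host conversion as in the sketch near the top of the issue (the helper name is hypothetical):

import torch
import ttnn

def to_bfp8_b_with_nan_check(torch_input):
    # Detect NaN while the data is still BF16/FP32 on the host, because after
    # conversion to BFP8_B a NaN may come back as -inf instead.
    if torch.isnan(torch_input).any():
        raise ValueError("input contains NaN; handle it before converting to bfp8_b")
    return ttnn.from_torch(torch_input, dtype=ttnn.bfloat8_b, layout=ttnn.TILE_LAYOUT)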

@namhyeong-kim
Contributor Author

namhyeong-kim commented Nov 9, 2024

I understand the limitation. If NaN needs to be handled, I will check for it before converting to bfp8_b, as you recommend.
Lastly, please double-check that the tests I provided have enough coverage.

@jvasilje
Collaborator

@namhyeong-kim can we close this P0 issue?

@rdjogoTT
Contributor

@jvasilje I am working on adding the tests and merging the PR today.

@zzigler-tt

@rdjogoTT Please advise when this is fully complete so we can close out this issue, thanks.

rdjogoTT added a commit that referenced this issue Nov 13, 2024
### Ticket
[Link to Github Issue](#14032)

### Problem description
bfp8 packing is inaccurate when pack_src_format is also bfp8, since this
results in double rounding in the HW. First the gasket rounds to 7 bits,
then rounding occurs again when the mantissas are shifted in order to
share a common exponent.

### What's changed
Add a flag to the compute config called `bfp8_pack_precise` which toggles the
pack_src_format to either fp16 or fp32 (depending on fp32_mode_en) in
order to get more accurate output. This, however, will halve the packer
bandwidth in the case of fp16, and reduce it to one quarter in the case
of fp32.
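In pseudocode, the selection described above amounts to roughly the following (a sketch of the behaviour described in this commit message, not the actual C++ implementation; the format names mirror the commit text):

def pack_src_format_for_bfp8_output(bfp8_pack_precise: bool, fp32_mode_en: bool) -> str:
    if not bfp8_pack_precise:
        return "bfp8"   # full packer bandwidth, but double rounding (the bug above)
    # Precise mode: pack from a wider source format so only one rounding occurs.
    return "fp32" if fp32_mode_en else "fp16"   # 1/4 or 1/2 of the bandwidth, respectively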
@rdjogoTT
Copy link
Contributor

This has been resolved with PR #14822. It can now be closed.
