
TTNN pow - fail when input tensor BFLOAT8_B and exponent is scalar float #8593

Closed · Tracked by #13795
Labels: bug (Something isn't working), GS, op_cat: eltwise, WH

npetrovic-tenstorrent opened this issue May 17, 2024 · 4 comments

npetrovic-tenstorrent (Contributor) commented May 17, 2024

The ttnn.pow operation fails when the first argument is a BFLOAT8_B tensor and the second is a float scalar:

TT_FATAL @ ../tt_eager/tt_dnn/op_library/bcast/bcast_op.cpp:94: input_tensor_a.get_dtype() == input_tensor_b.get_dtype()
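
A minimal repro sketch of the failing combination (the shape, exponent value, and device setup below are illustrative assumptions, not taken from the unit test):

import torch
import ttnn

device = ttnn.open_device(device_id=0)

# any tile-sized input will do; the key point is the BFLOAT8_B dtype
x = torch.rand((1, 1, 32, 32), dtype=torch.bfloat16)
tt_x = ttnn.from_torch(x, dtype=ttnn.bfloat8_b, layout=ttnn.TILE_LAYOUT, device=device)

# float scalar exponent -> hits the TT_FATAL dtype check in bcast_op.cpp
out = ttnn.pow(tt_x, 2.5)

ttnn.close_device(device)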

To Reproduce
Steps to reproduce the behavior:
Check out the main branch and run the unit test test_eltwise_pow_float.py (or others) using this command pattern:

pytest tests/ttnn/python_api_testing/non_working_unit_tests/grayskull/test_eltwise_pow_float.py

Expected behavior
There are three test cases in the unit test tests/tt_eager/python_api_testing/non_working_unit_tests/grayskull/test_eltwise_pow_float.py. The two cases where the input tensor is of BFLOAT8_B type fail, and they are expected to fail with the error:

../tt_eager/tt_dnn/op_library/bcast/bcast_op.cpp:94: input_tensor_a.get_dtype() == input_tensor_b.get_dtype()

Getting additional info for the operation under test and its behavior
To get additional information and results for the different combinations of input shapes, types, layouts, and memory configs for which this operation was tested, you can also run the sweep test locally:

tests/ttnn/python_api_testing/sweep_tests/test_configs/ci_sweep_tests_broken/grayskull/ttnn_eltwise_pow_float_test.yaml

To do this you should:

  1. Follow the Getting Started page to set up the repo, environment variables, and python-env
  2. Activate the Python environment: source build/python_env/bin/activate
  3. Run the sweeps: pytest tests/ttnn/python_api_testing/sweep_tests/run_sweep_test.py --input-path tests/ttnn/python_api_testing/sweep_tests/test_configs/ci_sweep_tests_broken/grayskull/ttnn_eltwise_pow_float_test.yaml --input-method cli --cli-input results_pow_float_broken
  4. After the run completes, all sweep results should be available inside the specified output directory (in this case ./result-sweeps). There you will find a .csv that holds all executed sweeps, including the ones that failed and were recreated by the unit test, which you can locate by searching for the unique data_seed field.

umadevimcw commented Oct 16, 2024

@npetrovic-tenstorrent @eyonland I tested this on the recent main; it is passing on Grayskull and WHB0.

[image]

When I repeated the test n times, I observed inconsistencies: out of 100 tests, sometimes 3 tests fail, while other times only 1 fails. To analyze this, I’ve hardcoded specific input and scale values that cause the failures in the code below for testing purposes.

import torch
import ttnn
from loguru import logger

# NOTE: assert_with_pcc and ttnn_ops are the repo's own test helpers; adjust the
# import paths if they differ in your checkout.
from tests.ttnn.utils_for_testing import assert_with_pcc
from tests.ttnn.python_api_testing.sweep_tests import ttnn_ops


def run_pow_tests(input_shape, dtype, dlayout, in_mem_config, output_mem_config, data_seed, device):
    torch.manual_seed(data_seed)

    x = torch.Tensor(size=input_shape[0]).uniform_(-100, 100).to(torch.bfloat16)
    y = 7.003021497060542  # random.uniform(0, 10)

    # hardcode the failing input value for analysis
    x.fill_(-59.50000)

    print("===================================================>>>>>>>>> scale  >>>>>>>>>>>>>>>>>>>>>>>...", y)  ########## print 0
    print("Torch result......", torch.pow(torch.tensor(-59.50000), y))  ########## print 1
    try:
        # get ref result
        ref_value = torch.pow(x, y)

        x = ttnn_ops.setup_ttnn_tensor(x, device, dlayout[0], in_mem_config, dtype[0])

        tt_result = ttnn.pow(x, y)
        tt_result = ttnn_ops.ttnn_tensor_to_torch(tt_result, output_mem_config)

    except Exception as e:
        logger.warning("Operation execution crashed")
        raise e

    assert len(tt_result.shape) == len(ref_value.shape)
    assert tt_result.shape == ref_value.shape
    torch.set_printoptions(sci_mode=False)
    print("reference torch result", ref_value)  ########## print 2
    print("TT result ", tt_result)  ########## print 3
    print("input.....", x)  ########## print 4
    assert_with_pcc(ref_value, tt_result, 0.99)

During my analysis, I noticed that the Torch results are inconsistent across different runs, while the TT results are as expected. Raising a negative base to this exponent produces a NaN in the TT results, but in the Torch reference values we get an unexpectedly large value. Please see the attached image for reference.

[image]

Note: In the image above, you can correlate the ########## print x markers in the code with the -------> print x markers in the image for clarity.

Even though ref_value = torch.pow(x, y) and torch.pow(torch.tensor(-59.50000), y) perform the same operation on the same values, the results differ. The TT result returns NaN for negative bases (which is the expected result), and this contributes to the drop in PCC.

This is observed on both WHB0 and GS.


umadevimcw commented Oct 16, 2024

@eyonland @npetrovic-tenstorrent

The reason for the undefined behaviour is the bfloat16 conversion that we are doing for data generation.

Please see the image below for reference.

[image]
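
A minimal sketch of the effect (my reading of the comment above; the exponent comes from the hardcoded test value, the rest is illustrative): casting the exponent to bfloat16 rounds it to exactly 7.0, which turns a NaN-producing non-integer power of a negative base into a finite integer power.

import torch

y = 7.003021497060542
# bfloat16 keeps only 8 mantissa bits, so the exponent rounds to exactly 7.0
print(torch.tensor(y, dtype=torch.bfloat16).item())  # 7.0
# negative base with a non-integer exponent -> NaN
print(torch.pow(torch.tensor(-59.5), y))             # tensor(nan)
# negative base with the rounded, integer exponent -> large finite value
print(torch.pow(torch.tensor(-59.5), 7.0))           # finite (about -2.6e12)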

umadevimcw commented

#13874 - Please find the PR here

umadevimcw commented

The PR has been merged to main, hence closing this issue.
