Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix [group] attn matmul to not throw circular buffer size exception in debug build #15458

Open
bbradelTT opened this issue Nov 26, 2024 · 1 comment
Assignees
Labels
bug Something isn't working op_cat: mm P1

Comments

@bbradelTT
Copy link
Contributor

7a15bc9#diff-1838d88a8afd70b0cb4ec91cfa4f7af1426019d7e1feaefbdbe27709e9f387aaR466 has added a check in tt_metal/impl/buffers/circular_buffer_types.cpp that fails for debug builds when running an attn matmul test:

tests/tt_eager/python_api_testing/unit_testing/misc/test_attn_matmul.py::test_group_attn_matmul_with_program_cache[True-5-in0_dtype0-in1_dtype0-output_dtype0-True] ...
Cannot set to globally allocated buffer. Circular buffer size 6144 B exceeds allocated L1 buffer bank size of 4096 B

This needs to be fixed.

Repro steps:

./build_metal.sh  -b Debug
pytest tests/tt_eager/python_api_testing/unit_testing/misc/test_attn_matmul.py

Expected:

  • Everything should pass

Actual:

  • Exception is thrown

Note: was on BH, with skips removed, but expect failures for other other arch as well.

@bbradelTT bbradelTT added bug Something isn't working op_cat: mm P1 labels Nov 26, 2024
@bbradelTT bbradelTT self-assigned this Nov 26, 2024
@bbradelTT
Copy link
Contributor Author

Details about how to fix this issue from @tt-aho

This is the commit that fixes 2 ops: e620b63
One is concat, where the shard size of the first input tensor was used to size the cbs of all the other inputs, even when they could be bigger/smaller.
Other was ssm, it was a RM layout shard, but was being sized to what it would be in tiles.
Could be different reasons for why the total size set is incorrect, would be an op by op basis.
Neither of these actually result in l1 corruption I think since they’re used for reading input only, but we want the check to always be enabled to prevent any issues since there are no guarantees on how the user might use the cb.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working op_cat: mm P1
Projects
None yet
Development

No branches or pull requests

1 participant