[Bug Report] Stable Diffusion interactive demo is hanging on N150 #15436

Open
dvartaniansTT opened this issue Nov 25, 2024 · 11 comments

@dvartaniansTT
Contributor

dvartaniansTT commented Nov 25, 2024

Describe the bug
The Stable Diffusion interactive demo (pytest models/demos/wormhole/stable_diffusion/demo/demo.py::test_interactive_demo) hangs.

I have tested with both firmware (FW) versions: 80.13.2.0 and 80.10.0.0.

At best it runs up to 98%, then crashes (see attached).

To Reproduce
Steps to reproduce the behavior:

  1. Go to main branch
  2. Follow the build instructions from main and build metal.
  3. run: pytest models/demos/wormhole/stable_diffusion/demo/demo.py::test_interactive_demo
  4. The demo hangs and I lose connection to the VM.

Expected behavior
The interactive demo is also used for the web demo. We need this issue resolved to enable the SD web demo.

Screenshots
Image

Please complete the following environment information:

  • OS: [e.g. Ubuntu 20.04]
@dvartaniansTT
Contributor Author

On my latest test from main (commit: 78075c6):

It did not hang but failed with an error at 12%.


                 Always | WARNING  | Circular buffer size 262144 B exceeds allocated L1 buffer bank size of 245760 B. This may allow this circular buffer to write outside the allocated buffer space.
                 Always | WARNING  | Circular buffer size 262144 B exceeds allocated L1 buffer bank size of 245760 B. This may allow this circular buffer to write outside the allocated buffer space.
                 Always | FATAL    | Statically allocated circular buffers in program 503 clash with L1 buffers on core range [(x=0,y=0) - (x=7,y=7)]. L1 buffer allocated at 941952 and static circular buffer region ends at 942848
 12%|████████████▌                                                                                              | 6/51 [02:50<21:20, 28.45s/it]
malloc(): unsorted double linked list corrupted
Fatal Python error: Aborted
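
For readers following the numbers in the log above, here is a minimal arithmetic sketch of the two reported conditions. The values are copied from the log; the function and variable names are hypothetical and are not part of tt-metal. It assumes the clash is simply the static circular buffer region extending past the start address of the L1 buffer.

```python
# Values copied from the log above. All names here are illustrative
# assumptions, not tt-metal identifiers.
CB_SIZE = 262144           # circular buffer size (B)
L1_BANK_SIZE = 245760      # allocated L1 buffer bank size (B)
L1_BUFFER_START = 941952   # address where the L1 buffer is allocated
STATIC_CB_END = 942848     # address where the static CB region ends


def overflow_bytes(cb_size: int, bank_size: int) -> int:
    """Bytes by which the circular buffer exceeds the bank allocation."""
    return max(0, cb_size - bank_size)


def overlap_bytes(static_cb_end: int, l1_buffer_start: int) -> int:
    """Bytes by which the static CB region runs past the L1 buffer start."""
    return max(0, static_cb_end - l1_buffer_start)


print(overflow_bytes(CB_SIZE, L1_BANK_SIZE))          # 16384
print(overlap_bytes(STATIC_CB_END, L1_BUFFER_START))  # 896
```

Under these assumptions, the warning corresponds to a 16384 B overrun of the bank, and the fatal error to an 896 B overlap between the static CB region and the L1 buffer.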

@smehtaTT

@mywoodstock to help with first level triage. Removing other owners.

@mywoodstock
Contributor

mywoodstock commented Nov 26, 2024

@dvartaniansTT @mbahnasTT Is this issue non-deterministic in where it hangs/fails? Since this is a regression, we should identify the commit that broke it and revert it.

@esmalTT
Contributor

esmalTT commented Nov 26, 2024

@dvartaniansTT @mbahnasTT Is this issue non-deterministic in where it hangs/fails? Since this is a regression, we should identify the commit that broke it and revert it.

@mywoodstock @dvartaniansTT Looking at the last 2 weeks of pipelines, it seems the failure is present about 50% of the time. I'm going to try to determine whether it is machine-dependent or truly non-deterministic (ND).

@mywoodstock
Contributor

I am getting a deterministic failure right at the beginning (after the prompt). Narrowed down this regression to this PR: #15394
The commit before this PR was merged is good: 69870bdeaf1c9270e325810249def6a3e9f38fb4
After this PR we see the failing case. The new/updated condition was added in the above PR, so either the check is buggy or there was an op bug that was previously hidden. cc: @tt-aho

@mywoodstock
Contributor

@esmalTT re-assigning to you :)

mywoodstock assigned esmalTT and unassigned mywoodstock Nov 26, 2024

@tt-aho
Contributor

tt-aho commented Nov 26, 2024

The PR was made of 2 commits. Do you know if it was the first or second commit? The first commit enables correct asserts/warnings but doesn't change the functionality of any op/infra.

The second commit fixes 2 ops to properly size CBs. It could be that something was undersized previously (concat?).

@tt-aho
Contributor

tt-aho commented Nov 26, 2024

@mywoodstock what is the deterministic failure you see? Is it the same as the one mentioned in this issue?

@mywoodstock
Contributor

mywoodstock commented Nov 26, 2024

@tt-aho The failure is different from the one mentioned in this issue:

                     Op | DEBUG    | shard_shape: (1, 16)
                     Op | DEBUG    | CB 0 :: npages = 4, pagesize = 2048
                     Op | DEBUG    | CB 16 :: npages = 4, pagesize = 2048
                     Op | DEBUG    | CB 17 :: npages = 264, pagesize = 64
                     Op | DEBUG    | CB 1 :: npages = 1, pagesize = 64
                 Always | FATAL    | Cannot set to globally allocated buffer. Circular buffer size 1536 B exceeds allocated L1 buffer bank size of 32 B
...
E       RuntimeError: TT_ASSERT @ /localdev/asarje/tt-metal.wh/tt_metal/impl/buffers/circular_buffer_types.cpp:104: false
E       info:
E       Cannot set to globally allocated buffer. Circular buffer size 1536 B exceeds allocated L1 buffer bank size of 32 B
E       backtrace:
E        --- /localdev/asarje/tt-metal.wh/ttnn/ttnn/_ttnn.so(+0x19b080b) [0x77d0c2fb080b]
E        --- void tt::assert::tt_assert<unsigned int, unsigned int>(char const*, int, char const*, bool, char const*, fmt::v11::basic_format_string<char, fmt::v11::type_identity<unsigned int const&>::type, fmt::v11::type_identity<unsigned int const&>::type>, unsigned int const&, unsigned int const&)
E        --- tt::tt_metal::v0::CircularBufferConfig::set_globally_allocated_address(tt::tt_metal::v0::Buffer const&)
E        --- ttnn::operations::data_movement::detail::untilize_with_halo_multi_core_v2(tt::tt_metal::v0::Program&, tt::tt_metal::Tensor const&, unsigned int, unsigned int, unsigned int, tt::tt_metal::Tensor const&, tt::tt_metal::Tensor const&, tt::tt_metal::Tensor const&, bool, bool, tt::tt_metal::Tensor&)
E        --- ttnn::operations::sliding_window::halo::HaloDeviceOperation::create_program(std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> const&, std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>&) const

This happens for the config tensor: its shape is [64, 24], with alignment [64, 32], height sharded across 64 cores.

@mywoodstock
Contributor

@tt-aho
Contributor

tt-aho commented Nov 26, 2024

Yes, it should be size per core. Right now it only asserts in debug mode, and will log a warning in release. Will be changed to assert in release once existing tests are fixed.
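
A rough sketch of the behavior described above, as a reading aid only: the check compares a circular buffer size against the per-core bank size, raising in debug mode and only warning in release. The function name and the `debug` flag are hypothetical. This is not the tt-metal implementation, which lives in C++.

```python
import warnings


def check_cb_fits(cb_size: int, bank_size_per_core: int, debug: bool) -> bool:
    """Hypothetical stand-in for the size check discussed in this thread.

    Returns True when the circular buffer fits in the per-core bank.
    Otherwise raises in debug mode, or warns and returns False in release.
    """
    if cb_size <= bank_size_per_core:
        return True
    msg = (
        f"Circular buffer size {cb_size} B exceeds allocated "
        f"L1 buffer bank size of {bank_size_per_core} B"
    )
    if debug:
        # Debug build: treated as a hard failure (an assert in the real code).
        raise RuntimeError(msg)
    # Release build: logged as a warning only, per the comment above.
    warnings.warn(msg)
    return False


# Release-mode behavior with the sizes from the earlier warning in this issue.
print(check_cb_fits(262144, 245760, debug=False))  # False, plus a warning
```

Calling the same function with `debug=True` and the sizes from the halo failure (1536 and 32) would raise instead of warning, which mirrors the difference between the two reports in this thread.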
