
#13401: Add data parallel support for Bert-Tiny model #14033

Open
wants to merge 1 commit into
base: main

Conversation

vigneshkeerthivasanx
Contributor

Ticket

Link to Github Issue

Problem description

Add data parallel support for the BERT-Tiny model on n300.

What's changed

Data parallel support is enabled for the BERT-Tiny model, along with the demo, e2e, and device perf tests.

Checklist

  • Post commit CI passes
  • Blackhole Post commit (if applicable)
  • Model regression CI testing passes (if applicable)
  • Device performance regression CI testing passes (if applicable)
  • New/Existing tests provide coverage for changes

@vigneshkeerthivasanx vigneshkeerthivasanx changed the title #13401: Add data parallel support for Bert model #13401: Add data parallel support for Bert-Tiny model Nov 7, 2024
@uaydonat
Contributor

@skhorasganiTT can you review the data parallel implementation and testing (assuming the single-chip model is correct)?

@vigneshkeerthivasanx vigneshkeerthivasanx force-pushed the vignesh/ttnn_bert_tiny_data_parallel branch 2 times, most recently from dbc55f6 to 051d6a4 Compare November 18, 2024 09:38
@tt-rkim
Collaborator

tt-rkim commented Nov 18, 2024

You ran no CI at all on this branch?

@vigneshkeerthivasanx
Contributor Author

You ran no CI at all on this branch?

We were waiting for #13471 to pass and merge to main first. Will trigger CIs for this PR as well.

@tt-rkim
Collaborator

tt-rkim commented Nov 19, 2024

@bkeith-TT @zzigler-tt Per @vigneshkeerthivasanx 's comments:

We were waiting for #13471 to pass and merge to main first. Will trigger CIs for this PR as well.

This is blocked on another MCW PR, which they say is itself waiting on CI to pass (though they have not posted results for that yet). They are completely unblocked from running CI.

If we put this down as a PR that is blocked on review, that's wrong.

@skip_for_grayskull()
@pytest.mark.models_performance_bare_metal
@pytest.mark.parametrize("device_params", [{"l1_small_size": 24576}], indirect=True)
@pytest.mark.parametrize("batch_size", [8])
Contributor

It looks like when this test is run on n300 the batch-per-device=4 and when it's run on n150 the batch-per-device=8. It would be good to test the same batch-per-device for any number of devices, potentially by treating the batch_size parameter as the per-device batch and multiplying by the number of devices when creating the inputs (and adding more options for batch size if desired).
Same comment for the tests in demo.py and test_bert_tiny_wh.py.
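The reviewer's suggestion above can be sketched as follows. This is a minimal illustration with hypothetical names (`batch_per_device`, `total_batch_size`, and the `mesh_device` fixture usage are assumptions, not the PR's actual code): parametrize the per-device batch and scale the total input batch by the number of devices, so n150 (1 chip) and n300 (2 chips) exercise the same per-device load.

```python
# Hypothetical sketch of the suggested change, not the PR's implementation.

def total_batch_size(batch_per_device: int, num_devices: int) -> int:
    """Total input batch to construct for a data-parallel run:
    the same per-device batch replicated across every device."""
    return batch_per_device * num_devices

# In the pytest test, something along these lines (assumed fixture/API names):
#
#   @pytest.mark.parametrize("batch_per_device", [8])
#   def test_bert_tiny(mesh_device, batch_per_device):
#       batch = total_batch_size(batch_per_device, mesh_device.get_num_devices())
#       # build inputs with leading dimension `batch` and shard them
#       # across the mesh, so each device sees `batch_per_device` samples
```

With this shape, adding more parametrize options for `batch_per_device` covers additional batch sizes uniformly across device counts.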

Contributor

Good point. The reason we are running the model data-parallel is so we can increase the total batch size.

Contributor Author

Updated the pipeline to run the test for batch-per-device=8 for n300 and n150.

@vigneshkeerthivasanx vigneshkeerthivasanx force-pushed the vignesh/ttnn_bert_tiny_data_parallel branch 2 times, most recently from 7d65817 to 8454f8c Compare November 20, 2024 09:31
@vigneshkeerthivasanx
Contributor Author

CI links
All post-commit tests: Link (all tests passing)
Nightly fast dispatch tests: n150, n300, e150 (Bert-tiny data parallel passing)
(Single-card) Demo tests: n150, n300 (passing)
(Single-card) Device perf regressions: Link (in progress)
(Single-card) Model perf tests: Link (in progress)

@tt-rkim
Collaborator

tt-rkim commented Nov 20, 2024

Looks like:

  • device perf has a bad threshold
  • model perf e2e failed
  • the rest look OK

Take a look?

@vigneshkeerthivasanx vigneshkeerthivasanx force-pushed the vignesh/ttnn_bert_tiny_data_parallel branch 3 times, most recently from 3c8e86b to ec627ba Compare November 25, 2024 11:55