#13401: Add data parallel support for Bert-Tiny model #14033
base: main
Conversation
Force-pushed f68b6ff to 1b1ecec.
Force-pushed 1b1ecec to 302d0ba.
Force-pushed 302d0ba to 5802c00.
@skhorasganiTT can you review the data parallel implementation and testing (assuming the single-chip model is correct)?
Force-pushed dbc55f6 to 051d6a4.
You ran no CI at all on this branch?
We were waiting for #13471 to pass and merge to main first. Will trigger CI for this PR as well.
Force-pushed 051d6a4 to fd97b5a.
@bkeith-TT @zzigler-tt Per @vigneshkeerthivasanx's comments: this is blocked on another MCW PR, which they say is itself waiting on passing CI (results they haven't posted yet). They are completely unblocked from running that CI. Listing this as a PR blocked on review would be wrong.
@skip_for_grayskull()
@pytest.mark.models_performance_bare_metal
@pytest.mark.parametrize("device_params", [{"l1_small_size": 24576}], indirect=True)
@pytest.mark.parametrize("batch_size", [8])
It looks like when this test is run on n300 the batch-per-device=4 and when it's run on n150 the batch-per-device=8. It would be good to test the same batch-per-device for any number of devices, potentially by treating the batch_size parameter as the per-device batch and multiplying by the number of devices when creating the inputs (and adding more options for batch size if desired).
Same comment for the tests in demo.py and test_bert_tiny_wh.py.
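A minimal sketch of that suggestion, assuming a pytest `mesh_device` fixture that exposes `get_num_devices()` (the fixture and helper names here are illustrative, not the PR's actual code):

```python
import pytest
import torch

@pytest.mark.parametrize("batch_size", [8])  # interpreted as batch-per-device
def test_bert_tiny(mesh_device, batch_size):
    # Scale the total batch by the device count so each chip always sees the
    # same per-device batch: 8 on n150 (1 chip), 16 on n300 (2 chips).
    num_devices = mesh_device.get_num_devices()
    total_batch = batch_size * num_devices
    # Hypothetical inputs: (total_batch, seq_len) token ids.
    input_ids = torch.randint(0, 30522, (total_batch, 128))
    assert input_ids.shape[0] == batch_size * num_devices
```

This keeps the per-device workload constant as the test scales across device counts, so perf numbers stay comparable between n150 and n300.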
Good point. The reason we are running the model data-parallel is to increase the total batch size.
Updated the pipeline to run the test with batch-per-device=8 on both n300 and n150.
Force-pushed 7d65817 to 8454f8c.
CI links
Looks like:
Take a look?
Force-pushed 3c8e86b to ec627ba.
Force-pushed ec627ba to 9a52f3d.
Ticket
Link to Github Issue
Problem description
Add data parallel support for the BERT-Tiny model on n300.
What's changed
Data parallel support is enabled for the BERT-Tiny model, along with updates to the demo and the e2e/device perf tests.
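For context, a minimal sketch of the data-parallel pattern this PR describes, assuming tt-metal's ttnn mesh-mapper API (`ShardTensorToMesh` / `ConcatMeshToTensor`); exact names, signatures, and layout arguments may differ from the PR's actual implementation:

```python
import torch
import ttnn  # assumption: tt-metal's ttnn Python API

def run_data_parallel(mesh_device, model, torch_input):
    # Shard the batch dimension (dim=0) across the mesh, so each of the
    # two n300 chips receives half of the total batch.
    tt_input = ttnn.from_torch(
        torch_input,
        dtype=ttnn.bfloat16,
        device=mesh_device,
        mesh_mapper=ttnn.ShardTensorToMesh(mesh_device, dim=0),
    )
    tt_output = model(tt_input)  # hypothetical model callable
    # Gather the per-device outputs back into a single torch tensor.
    return ttnn.to_torch(
        tt_output,
        mesh_composer=ttnn.ConcatMeshToTensor(mesh_device, dim=0),
    )
```

Because each chip runs the same single-chip model on its own batch shard, total throughput scales with device count without changing the model itself.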
Checklist