
#13401: Add data parallel support for Bert-Tiny model #14033

Open
wants to merge 1 commit into
base: main

Conversation

vigneshkeerthivasanx
Contributor

Ticket

Link to Github Issue

Problem description

Add data parallel support for the BERT-Tiny model on n300.

What's changed

Data parallel support is enabled for the BERT-Tiny model, along with the demo, e2e, and device perf tests.

Checklist

  • Post commit CI passes
  • Blackhole Post commit (if applicable)
  • Model regression CI testing passes (if applicable)
  • Device performance regression CI testing passes (if applicable)
  • New/Existing tests provide coverage for changes

@vigneshkeerthivasanx vigneshkeerthivasanx changed the title #13401: Add data parallel support for Bert model #13401: Add data parallel support for Bert-Tiny model Nov 7, 2024
@uaydonat
Contributor

@skhorasganiTT can you review the data parallel implementation and testing (assuming the single-chip model is correct)?

@vigneshkeerthivasanx vigneshkeerthivasanx force-pushed the vignesh/ttnn_bert_tiny_data_parallel branch 2 times, most recently from dbc55f6 to 051d6a4 Compare November 18, 2024 09:38
@tt-rkim
Collaborator

tt-rkim commented Nov 18, 2024

You ran no CI at all on this branch?

@vigneshkeerthivasanx
Contributor Author

You ran no CI at all on this branch?

We were waiting for #13471 to pass and merge to main first. Will trigger CIs for this PR as well.

@tt-rkim
Collaborator

tt-rkim commented Nov 19, 2024

@bkeith-TT @zzigler-tt Per @vigneshkeerthivasanx 's comments:

We were waiting for #13471 to pass and merge to main first. Will trigger CIs for this PR as well.

This is blocked on another MCW PR, which they say is itself waiting on CI to pass (though they have not posted results for that yet). They are completely unblocked from running CI.

If we put this down as a PR that is blocked on review, that's wrong.

@skip_for_grayskull()
@pytest.mark.models_performance_bare_metal
@pytest.mark.parametrize("device_params", [{"l1_small_size": 24576}], indirect=True)
@pytest.mark.parametrize("batch_size", [8])
Contributor

It looks like when this test is run on n300 the batch-per-device=4 and when it's run on n150 the batch-per-device=8. It would be good to test the same batch-per-device for any number of devices, potentially by treating the batch_size parameter as the per-device batch and multiplying by the number of devices when creating the inputs (and adding more options for batch size if desired).
Same comment for the tests in demo.py and test_bert_tiny_wh.py.
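The reviewer's suggestion above can be sketched as follows. This is a minimal illustration with hypothetical names (`batch_per_device`, `total_batch_size`, and the `mesh_device` fixture usage are assumptions, not the PR's actual code): parametrize the per-device batch and scale the total input batch by the number of devices, so n150 (1 chip) and n300 (2 chips) exercise the same per-device load.

```python
# Hypothetical sketch of the suggested change, not the PR's implementation.

def total_batch_size(batch_per_device: int, num_devices: int) -> int:
    """Total input batch to construct for a data-parallel run:
    the same per-device batch replicated across every device."""
    return batch_per_device * num_devices

# In the pytest test, something along these lines (assumed fixture/API names):
#
#   @pytest.mark.parametrize("batch_per_device", [8])
#   def test_bert_tiny(mesh_device, batch_per_device):
#       batch = total_batch_size(batch_per_device, mesh_device.get_num_devices())
#       # build inputs with leading dimension `batch` and shard them
#       # across the mesh, so each device sees `batch_per_device` samples
```

With this shape, adding more parametrize options for `batch_per_device` covers additional batch sizes uniformly across device counts.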

Contributor

Good point. The reason we are running the model data-parallel is so we can increase the total batch size.

Contributor Author

Updated the pipeline to run the test for batch-per-device=8 for n300 and n150.

@vigneshkeerthivasanx vigneshkeerthivasanx force-pushed the vignesh/ttnn_bert_tiny_data_parallel branch 2 times, most recently from 7d65817 to 8454f8c Compare November 20, 2024 09:31
@vigneshkeerthivasanx
Contributor Author

CI links
All post-commit tests: Link (all tests passing)
Nightly fast dispatch tests: n150, n300, e150 (Bert-tiny data parallel passing)
(Single-card) Demo tests: n150, n300 (passing)
(Single-card) Device perf regressions: Link (in progress)
(Single-card) Model perf tests: Link (in progress)

@tt-rkim
Collaborator

tt-rkim commented Nov 20, 2024

Looks like:

  • device perf has a bad threshold
  • model perf e2e failed
  • the rest look OK

Take a look?

@vigneshkeerthivasanx vigneshkeerthivasanx force-pushed the vignesh/ttnn_bert_tiny_data_parallel branch 3 times, most recently from 3c8e86b to ec627ba Compare November 25, 2024 11:55