Merging the new op: nlp_kv_cache_unpad_to_sharded into main #8459
Perhaps a more generic name
I think this is a great name! @cglagovichTT what do you think?
...ry/nlp_tms/kernels/dataflow/reader_unary_unpad_dims_interleaved_start_id_shard_optimized.cpp
...ry/nlp_tms/kernels/dataflow/reader_unary_unpad_dims_interleaved_start_id_shard_optimized.cpp
tt_eager/tt_dnn/op_library/nlp_tms/nlp_kv_cache_unpad_to_sharded.cpp
56333e8 to 6e96e86
tests/tt_eager/python_api_testing/unit_testing/misc/test_nlp_kv_cache_load_slice.py
Looks good!
All post-commit runs passed. Good to merge. Could you approve it? @TT-BrianLiu @tt-aho
7d0a8d7 to 945104e
945104e to 0c3704d
Approved. Please address comments
tests/tt_eager/python_api_testing/unit_testing/misc/test_nlp_kv_cache_load_slice.py
tests/tt_eager/python_api_testing/unit_testing/misc/test_nlp_kv_cache_load_slice.py
tests/tt_eager/python_api_testing/unit_testing/misc/test_nlp_kv_cache_load_slice.py
…ng on long seqlen; removed debug printout; added bfp8 test cases
…unnecessary variables and conditions in host code
…leaned up host side code
677b273 to 17ae8f4
All post-commit tests passed. Rebased onto main and merged.
Issue: #7379
This new op fuses the unpadding step and the interleaved-to-sharded (i2s) op that follows it, with optimized noc_async_read_barrier placement, to speed up the llama3 model's kv cache unpadding by 5-6x. The same op can be applied to many other transformer models.
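
To make the fusion concrete, here is a rough host-side model in plain Python, not the actual kernel. It assumes a cache laid out as [heads][sequence][head_dim] and a sharding scheme that gives each core a contiguous group of heads. Both are assumptions for illustration, and every function name below is hypothetical. The sketch only shows that reading each core's slice directly from the padded cache yields the same shards as unpadding first and sharding second, without materializing the intermediate unpadded tensor.

```python
# Illustrative model only: nested Python lists stand in for device tensors.
# All names (make_cache, unpad, shard_by_head, unpad_to_sharded) are hypothetical
# and are not part of the tt-metal API.

def make_cache(num_heads, max_seq_len, head_dim):
    """Build a padded cache [num_heads][max_seq_len][head_dim] with unique values."""
    return [
        [[(h * max_seq_len + s) * head_dim + d for d in range(head_dim)]
         for s in range(max_seq_len)]
        for h in range(num_heads)
    ]

def unpad(cache, start, end):
    """Unfused step 1: keep only the sequence rows in [start, end)."""
    return [head[start:end] for head in cache]

def shard_by_head(tensor, num_cores):
    """Unfused step 2: give each core a contiguous group of heads (assumed layout)."""
    heads_per_core = len(tensor) // num_cores
    return [tensor[c * heads_per_core:(c + 1) * heads_per_core]
            for c in range(num_cores)]

def unpad_to_sharded(cache, start, end, num_cores):
    """Fused path: each core reads its slice straight from the padded cache."""
    heads_per_core = len(cache) // num_cores
    shards = []
    for core in range(num_cores):
        first = core * heads_per_core
        shards.append([cache[h][start:end]
                       for h in range(first, first + heads_per_core)])
    return shards

cache = make_cache(num_heads=8, max_seq_len=16, head_dim=4)
two_step = shard_by_head(unpad(cache, 0, 5), num_cores=4)
fused = unpad_to_sharded(cache, 0, 5, num_cores=4)
assert fused == two_step
```

The model says nothing about the reported speedup, which comes from the device-side data movement and barrier placement.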