
Merging the new op: nlp_kv_cache_unpad_to_sharded into main #8459

Merged 9 commits into main on May 16, 2024

Conversation

caixunshiren (Contributor)

Issue: #7379

This new op fuses the unpadding and the following i2s (interleaved-to-sharded) op, with optimized noc_async_read_barrier placement, speeding up the llama3 model's KV cache unpadding by 5-6x. The same op can be applied to many other transformer models.
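
For readers unfamiliar with the op, here is a minimal NumPy sketch of its logical semantics: unpad (slice) the cache along the sequence dimension, then shard the result across cores. The fused kernel does this on-device in a single pass instead of materializing an intermediate tensor. The function name, shapes, and sharding scheme below are illustrative assumptions, not the actual tt-metal API.

```python
# Logical semantics of the fused op, sketched in NumPy. The real kernel
# performs the slice and the interleaved-to-sharded conversion together
# on-device; everything here is an illustration, not the tt-metal binding.
import numpy as np

def kv_cache_load_slice(kv_cache, seq_len, num_cores):
    """Unpad a KV cache along the sequence dim, then shard it across cores.

    kv_cache: [batch, num_heads, max_seq_len, head_dim] padded cache
    seq_len:  number of valid (unpadded) positions
    returns:  list of per-core shards, split along the batch*heads dim
    """
    batch, num_heads, _, head_dim = kv_cache.shape
    # Step 1 (unpad): slice away the padded tail of the sequence dimension.
    unpadded = kv_cache[:, :, :seq_len, :]
    # Step 2 (shard): flatten batch and heads, then split across cores,
    # mirroring a height-sharded layout (assumed here for illustration).
    flat = unpadded.reshape(batch * num_heads, seq_len, head_dim)
    return np.array_split(flat, num_cores, axis=0)

# Example: slice the first 128 valid positions of a padded 2048-entry cache.
cache = np.zeros((1, 8, 2048, 128), dtype=np.float16)
shards = kv_cache_load_slice(cache, seq_len=128, num_cores=8)
print([s.shape for s in shards])  # 8 shards of shape (1, 128, 128)
```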

davorchap (Collaborator)

Perhaps a more generic name: kv_cache_load_slice?

caixunshiren self-assigned this on May 14, 2024
caixunshiren (Contributor, Author)

> Perhaps a more generic name: kv_cache_load_slice?

I think this is a great name! @cglagovichTT what do you think?

cglagovichTT (Contributor)

Looks good!

caixunshiren (Contributor, Author)

All post-commit runs passed. Good to merge. Could you approve it, @TT-BrianLiu @tt-aho?

TT-BrianLiu (Contributor) left a comment


Approved. Please address comments

caixunshiren merged commit 2399c77 into main on May 16, 2024. 77 checks passed.
caixunshiren (Contributor, Author)

All post-commit tests passed. Rebased onto main and merged.
