Flash attention 2 on Turing / Triton? #473
dvianisoho started this conversation in General
-
First, thank you for this fantastic library! I've written my own "dynamic batching" (you could barely call it that) version of a generator based on the old multi-cache example and got a fairly significant speedup (~2-3x) on the kinds of batch jobs I run, despite the overhead and headaches of single-threaded prompt processing and manual memory management.

I was extremely excited to see proper support for batching and paged attention land recently (and it works great in testing on my 'fancy' GPU), but alas, I run my real workloads on Turing-architecture GPUs, which the 'official' flash attention 2 implementation does not support. Based on the comments here I'm not going to hold my breath. I don't know how complex it would be, but have you considered using the Triton implementation? Do you have any other suggestions if I'm stuck with Turing for the foreseeable future? My workload is offline batch processing of long prompts (~4k tokens) with relatively short output (1-500 tokens).
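For anyone landing here with the same question, here is a quick sanity check (generic PyTorch, not part of ExLlamaV2; the helper name is made up) that tells you whether the official flash-attn 2 kernels can run on your card at all. Turing reports compute capability 7.5, while flash-attn 2 needs 8.0 (Ampere) or newer:

```python
# Illustrative only: decide whether the official flash-attn 2 kernels are
# usable on the current GPU. Turing is compute capability 7.5; flash-attn 2
# requires Ampere (8.0) or newer.
import torch


def flash_attn_2_usable(device_index: int = 0) -> bool:
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability(device_index)
    if major < 8:  # Turing (7.5) and older are not supported by flash-attn 2
        return False
    try:
        import flash_attn  # noqa: F401  # only checks that the package imports
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    print("flash-attn 2 usable:", flash_attn_2_usable())
```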
-
I don't know what kind of performance you could expect from Triton in this case, but that implementation doesn't seem to support paging, which is essential for dynamic batching to work. I'm not even sure it has lower-right aligned causal masking, which is needed for a number of other reasons. I do have some ideas for making paged attention work without flash-attn 2.5.7+ as a prerequisite. It probably won't perform as well, but it should still be an improvement over the old multi-cache hack. Stay tuned.
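To illustrate the masking point (a generic PyTorch sketch, not ExLlamaV2 or the Triton kernel itself): with lower-right alignment, a short chunk of new query tokens appended to a longer cached key sequence still attends to all of the past, whereas the usual upper-left alignment is only correct when q_len == kv_len.

```python
# Illustrative only: boolean causal masks for q_len != kv_len, e.g. a chunk of
# new query tokens attending to a longer cached key sequence.
import torch


def causal_mask(q_len: int, kv_len: int, lower_right: bool = True) -> torch.Tensor:
    """True = key position may be attended to by that query row."""
    q_idx = torch.arange(q_len).unsqueeze(1)   # (q_len, 1)
    k_idx = torch.arange(kv_len).unsqueeze(0)  # (1, kv_len)
    if lower_right:
        # Diagonal aligned to the bottom-right corner: query i sits at absolute
        # position (kv_len - q_len + i), so it sees all cached keys plus the
        # new keys up to and including itself.
        return k_idx <= q_idx + (kv_len - q_len)
    # Upper-left alignment: query i only sees keys 0..i, which is wrong when the
    # first (kv_len - q_len) keys are past context the query should attend to.
    return k_idx <= q_idx


print(causal_mask(2, 5, lower_right=True).int())
# tensor([[1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1]], dtype=torch.int32)
print(causal_mask(2, 5, lower_right=False).int())
# tensor([[1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0]], dtype=torch.int32)
```

The lower-right form matters as soon as cached tokens are involved, which is why it keeps coming up alongside paged and dynamic batching.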
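And for context on what "paging" means here, a loose sketch of the general idea behind paged KV caches (vLLM-style block tables; purely illustrative, not what ExLlamaV2 actually does): each sequence in the dynamic batch maps fixed-size logical pages onto physical blocks from a shared pool, so sequences of different lengths can join and leave the batch without copying or re-allocating contiguous cache space.

```python
# Illustrative only: a block-table view of a paged KV cache.
PAGE_SIZE = 256  # tokens per physical cache block (made-up value)


class PagedCachePool:
    """Fixed pool of physical KV-cache blocks shared by all sequences."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        return self.free_blocks.pop()

    def release(self, block: int) -> None:
        self.free_blocks.append(block)


class Sequence:
    """One request in the batch; owns a block table, not a contiguous cache."""

    def __init__(self):
        self.block_table: list[int] = []  # logical page index -> physical block
        self.length = 0                   # tokens currently held in the cache

    def append_tokens(self, n: int, pool: PagedCachePool) -> None:
        pages_needed = -(-(self.length + n) // PAGE_SIZE)  # ceil division
        while len(self.block_table) < pages_needed:
            self.block_table.append(pool.allocate())
        self.length += n


pool = PagedCachePool(num_blocks=64)
seq = Sequence()
seq.append_tokens(4000, pool)          # ~4k-token prompt -> 16 pages
print(seq.length, seq.block_table)

# When a sequence finishes, its blocks go straight back to the pool:
for block in seq.block_table:
    pool.release(block)
```

The attention kernel then has to gather keys and values through that block table, which is the capability the reply above refers to when it mentions flash-attn 2.5.7+ as the usual prerequisite.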