Flash attention 2 on Turing / Triton? #473
dvianisoho started this conversation in General
-
First, thank you for this fantastic library! I've written my own "dynamic batching" (you could barely call it that) version of a generator based on the old multi-cache example and got a fairly significant speedup (~2-3x) on the kinds of batch jobs I run, despite the overhead and headaches of single-threaded prompt processing and manual memory management.

I was extremely excited to see proper support for batching and paged attention land recently (and it works great in testing on my 'fancy' GPU), but alas, I run my real workloads on Turing-architecture GPUs, which the 'official' flash attention 2 implementation does not support. Based on the comments here I'm not going to hold my breath. I don't know how complex it would be, but have you considered using the Triton implementation? Do you have any other suggestions if I'm stuck with Turing for the foreseeable future? My workload is offline batch processing of long prompts (~4k tokens) with relatively short output (1-500 tokens).
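For anyone landing here with the same question, here is a quick sanity check (generic PyTorch, not part of ExLlamaV2; the helper name is made up) that tells you whether the official flash-attn 2 kernels can run on your card at all. Turing reports compute capability 7.5, while flash-attn 2 needs 8.0 (Ampere) or newer:

```python
# Illustrative only: decide whether the official flash-attn 2 kernels are
# usable on the current GPU. Turing is compute capability 7.5; flash-attn 2
# requires Ampere (8.0) or newer.
import torch


def flash_attn_2_usable(device_index: int = 0) -> bool:
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability(device_index)
    if major < 8:  # Turing (7.5) and older are not supported by flash-attn 2
        return False
    try:
        import flash_attn  # noqa: F401  # only checks that the package imports
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    print("flash-attn 2 usable:", flash_attn_2_usable())
```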
-
I don't know what kind of performance you could expect from Triton in this case, but that implementation doesn't seem to support paging, which is essential for dynamic batching to work. I'm not even sure it has lower-right aligned causal masking, which is needed for a number of other reasons. I do have some ideas for making paged attention work without flash-attn 2.5.7+ as a prerequisite. It probably won't perform as well, but it should still be an improvement over the old multi-cache hack. Stay tuned.
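To illustrate the masking point (a generic PyTorch sketch, not ExLlamaV2 or the Triton kernel itself): with lower-right alignment, a short chunk of new query tokens appended to a longer cached key sequence still attends to all of the past, whereas the usual upper-left alignment is only correct when q_len == kv_len.

```python
# Illustrative only: boolean causal masks for q_len != kv_len, e.g. a chunk of
# new query tokens attending to a longer cached key sequence.
import torch


def causal_mask(q_len: int, kv_len: int, lower_right: bool = True) -> torch.Tensor:
    """True = key position may be attended to by that query row."""
    q_idx = torch.arange(q_len).unsqueeze(1)   # (q_len, 1)
    k_idx = torch.arange(kv_len).unsqueeze(0)  # (1, kv_len)
    if lower_right:
        # Diagonal aligned to the bottom-right corner: query i sits at absolute
        # position (kv_len - q_len + i), so it sees all cached keys plus the
        # new keys up to and including itself.
        return k_idx <= q_idx + (kv_len - q_len)
    # Upper-left alignment: query i only sees keys 0..i, which is wrong when the
    # first (kv_len - q_len) keys are past context the query should attend to.
    return k_idx <= q_idx


print(causal_mask(2, 5, lower_right=True).int())
# tensor([[1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1]], dtype=torch.int32)
print(causal_mask(2, 5, lower_right=False).int())
# tensor([[1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0]], dtype=torch.int32)
```

The lower-right form matters as soon as cached tokens are involved, which is why it keeps coming up alongside paged and dynamic batching.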
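And for context on what "paging" means here, a loose sketch of the general idea behind paged KV caches (vLLM-style block tables; purely illustrative, not what ExLlamaV2 actually does): each sequence in the dynamic batch maps fixed-size logical pages onto physical blocks from a shared pool, so sequences of different lengths can join and leave the batch without copying or re-allocating contiguous cache space.

```python
# Illustrative only: a block-table view of a paged KV cache.
PAGE_SIZE = 256  # tokens per physical cache block (made-up value)


class PagedCachePool:
    """Fixed pool of physical KV-cache blocks shared by all sequences."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        return self.free_blocks.pop()

    def release(self, block: int) -> None:
        self.free_blocks.append(block)


class Sequence:
    """One request in the batch; owns a block table, not a contiguous cache."""

    def __init__(self):
        self.block_table: list[int] = []  # logical page index -> physical block
        self.length = 0                   # tokens currently held in the cache

    def append_tokens(self, n: int, pool: PagedCachePool) -> None:
        pages_needed = -(-(self.length + n) // PAGE_SIZE)  # ceil division
        while len(self.block_table) < pages_needed:
            self.block_table.append(pool.allocate())
        self.length += n


pool = PagedCachePool(num_blocks=64)
seq = Sequence()
seq.append_tokens(4000, pool)          # ~4k-token prompt -> 16 pages
print(seq.length, seq.block_table)

# When a sequence finishes, its blocks go straight back to the pool:
for block in seq.block_table:
    pool.release(block)
```

The attention kernel then has to gather keys and values through that block table, which is the capability the reply above refers to when it mentions flash-attn 2.5.7+ as the usual prerequisite.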