Add matmul optimize #4

Byeong-Chan · 2024-04-05T13:21:18Z

Description

This PR implements a matmul-optimized forward pass for flash attention (~300 lines).

I got the following results on my RTX 3060 (sm_80 or higher is required):

Float, minimal version:

=== profiling manual attention ===
...
Self CPU time total: 834.368ms
Self CUDA time total: 835.075ms

=== profiling minimal flash attention === 
...
Self CPU time total: 668.000us
Self CUDA time total: 687.000us

attn values sanity check: True

Half, matmul-optimized version:

=== profiling manual attention ===
...
Self CPU time total: 849.544ms
Self CUDA time total: 849.698ms

=== profiling minimal flash attention ===
...
Self CPU time total: 89.000us
Self CUDA time total: 93.000us

attn values sanity check: True
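
For convenience, the speedups implied by the Self CUDA totals above can be worked out with a few lines of Python. This is only a sketch: the numbers are copied from the profiler output in this description, the names (`CUDA_TIME_US`, `speedup`) are made up for illustration and are not part of the PR, and the float and half runs use different dtypes, so the last ratio is indicative rather than a strict like-for-like comparison.

```python
# Self CUDA time totals copied from the profiler output above, in microseconds.
# The dictionary layout and names are illustrative, not part of the PR.
CUDA_TIME_US = {
    "float_minimal": {"manual": 835_075.0, "flash": 687.0},
    "half_matmul_opt": {"manual": 849_698.0, "flash": 93.0},
}


def speedup(baseline_us: float, optimized_us: float) -> float:
    """Return how many times faster the optimized run is than the baseline."""
    if optimized_us <= 0:
        raise ValueError("optimized time must be positive")
    return baseline_us / optimized_us


def report() -> dict:
    """Compute the ratios implied by the reported totals."""
    ratios = {
        name: speedup(times["manual"], times["flash"])
        for name, times in CUDA_TIME_US.items()
    }
    # Flash kernel vs. flash kernel: float minimal against half matmul-optimized.
    ratios["half_vs_float_flash"] = speedup(
        CUDA_TIME_US["float_minimal"]["flash"],
        CUDA_TIME_US["half_matmul_opt"]["flash"],
    )
    return ratios


if __name__ == "__main__":
    for name, ratio in report().items():
        print(f"{name}: {ratio:.2f}x")
```

Taking the reported totals at face value, the half matmul-optimized kernel comes out roughly 7.4x faster than the float minimal kernel (687 us vs. 93 us).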

Reference

Byeong-Chan added 8 commits April 5, 2024 22:14

Optimize matmul

6caf31b

new line

cc43607

add benchmark fp16

e603afa

modify threadblock comment

be8c11d

space

499871f

Br,Bc template

9c9efda

correct dim

ba92f3d

support d 128

35e6602
