Flash Attention Problem #18

Open
yyjpro opened this issue Nov 26, 2024 · 0 comments

yyjpro commented Nov 26, 2024

Compared with the algorithm in the FA2 paper, the project seems to account for the online softmax in less detail.

Analyzing each step of the inner loop:

  1. S = QK^T: 2 * d * Bc * Br OPs
  2. The max operation costs no FLOPs. Computing P needs 2 * Br * Bc, and computing l needs (2Br + Br + Br * Bc), where 2Br + Br covers the exp rescaling in the online softmax process and Br * Bc is the rowsum over P.
  3. Computing O needs (2Br + Br * d + 2 * d * Bc * Br) in total. It could instead be (2Br + Br * d * Br + 2 * d * Bc * Br) if the kernel performed a full diagonal-matrix multiplication, but the diagonal matrix is mostly zeros, so in practice that extra work can be eliminated.

After all inner-loop iterations finish, the total inner OPs are: (N / Bc) * (4 * d * Bc * Br + 5Br + 3Br * Bc + Br * d)

Adding the final calculation: Br * d OPs

For the outer loop, multiplying by N / Br gives: 4 * d * N^2 + 5N^2/Bc + 3N^2 + d * N^2/Bc + Nd
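The tally above can be checked numerically. The following is a minimal sketch that only re-adds the per-step counts listed in this issue and compares them with the closed form; the function names are hypothetical and are not taken from the project or from the FA2 code, and N is assumed to be divisible by both Br and Bc.

```python
def inner_step_ops(Br, Bc, d):
    """Per-inner-iteration operation counts, following the tally in this issue."""
    s = 2 * d * Bc * Br                     # S = Q K^T
    p = 2 * Br * Bc                         # P = exp(S - m): subtract + exp
    l = 2 * Br + Br + Br * Bc               # rescale old l + rowsum(P)
    o = 2 * Br + Br * d + 2 * d * Bc * Br   # rescale old O + P V
    return s + p + l + o


def fa2_ops_by_loops(N, Br, Bc, d):
    """Sum the counts by walking the outer (N / Br) and inner (N / Bc) loops."""
    assert N % Br == 0 and N % Bc == 0      # assumption: block sizes divide N
    total = 0
    for _ in range(N // Br):
        for _ in range(N // Bc):
            total += inner_step_ops(Br, Bc, d)
        total += Br * d                     # the "last calculation" term
    return total


def fa2_ops_closed_form(N, Bc, d):
    """Closed form from the issue: 4dN^2 + 5N^2/Bc + 3N^2 + dN^2/Bc + Nd."""
    return 4 * d * N**2 + 5 * N**2 // Bc + 3 * N**2 + d * N**2 // Bc + N * d
```

Under these assumptions the loop-based sum and the closed form agree, and Br drops out of the final expression.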

In the original project, 4 * d * N^2 corresponds exactly (up to the num_attention_heads * batchsize factor) to the prefill-stage qk and sv operations: qk_matmul_OPs = seqlen * seqlen * head_size * num_attention_heads * batchsize * 2, sv_matmul_OPs = seqlen * head_size * seqlen * num_attention_heads * batchsize * 2
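The correspondence can be written out as a small check. This sketch copies the two formulas quoted above into a hypothetical helper (`prefill_matmul_ops` is an illustrative name, not a function in the project) and assumes N = seqlen and d = head_size.

```python
def prefill_matmul_ops(seqlen, head_size, num_attention_heads, batchsize):
    """The two prefill matmul counts, as quoted in this issue."""
    qk_matmul_OPs = seqlen * seqlen * head_size * num_attention_heads * batchsize * 2
    sv_matmul_OPs = seqlen * head_size * seqlen * num_attention_heads * batchsize * 2
    return qk_matmul_OPs, sv_matmul_OPs


# Example values are assumptions chosen only for illustration.
N, d, heads, batch = 1024, 128, 32, 1
qk, sv = prefill_matmul_ops(N, d, heads, batch)

# Per head and per batch element, qk + sv equals the 4 * d * N^2 term.
assert qk + sv == 4 * d * N**2 * heads * batch
```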

But the softmax part cannot be matched. Should the project account for a larger OPs count for online softmax? Additionally, the inference time seems, in theory, to depend only on the formula OPs/bandwidth. If my analysis is reasonable, an increase in actual OPs would increase FA's inference time (making it larger than normal attention), which clearly does not match what is observed in practice. How should this be reconciled theoretically?
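One way to gauge how much the extra terms matter is to compare them with the 4 * d * N^2 matmul term. The sketch below does only that, using the closed form from this issue; `softmax_overhead_ratio` is a hypothetical helper, and N = 4096, Bc = 64, d = 128 are assumed example values. It compares operation counts only and says nothing about measured runtime or memory traffic.

```python
def softmax_overhead_ratio(N, Bc, d):
    """Ratio of the non-matmul terms to the 4 * d * N^2 matmul term."""
    matmul = 4 * d * N**2
    extra = 5 * N**2 / Bc + 3 * N**2 + d * N**2 / Bc + N * d
    return extra / matmul


# Assumed example sizes, not taken from the project.
ratio = softmax_overhead_ratio(N=4096, Bc=64, d=128)
print(f"extra OPs relative to matmul OPs: {ratio:.4%}")
```

For these example sizes the extra terms come to roughly 1% of the matmul count, so under this tally the additional OPs are small next to the 4 * d * N^2 term.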
