Proposal Summary
Hi, it is great to see the speed improvements for LLaMA 2 using the GroupQueryAttention (GQA) operator.
It seems that HuggingFace's implementation of Whisper can now use Flash Attention v2, giving a massive (~3x) speedup over the baseline. Would it be possible to apply a similar optimization to Whisper (using GQA) in onnxruntime/Olive? If so, what would the process look like? (Does Olive offer functionality for this kind of operator substitution?)
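For context, my rough mental model (which may be wrong) is that ONNX Runtime's transformers optimizer can already fuse Whisper's attention subgraphs into the com.microsoft MultiHeadAttention contrib op, which I understand can dispatch to Flash Attention kernels on recent CUDA GPUs when the model is fp16. A minimal sketch of that fusion step, where the model path is a placeholder and the num_heads/hidden_size values are my reading of the Whisper large config:

```python
# Sketch only: fuse Whisper attention into ORT's MultiHeadAttention contrib op.
# The model path is a placeholder; num_heads=20 / hidden_size=1280 are the
# Whisper large dimensions as far as I can tell -- verify against your export.
from onnxruntime.transformers import optimizer
from onnxruntime.transformers.fusion_options import FusionOptions

opts = FusionOptions("bart")  # Whisper goes through the bart-style fusions
opts.use_multi_head_attention = True

m = optimizer.optimize_model(
    "whisper_decoder.onnx",  # placeholder path
    model_type="bart",
    num_heads=20,
    hidden_size=1280,
    optimization_options=opts,
    use_gpu=True,
)
m.convert_float_to_float16(keep_io_types=True)  # fp16 is needed for the fused kernels
m.save_model_to_file("whisper_decoder_fp16.onnx")
```

If that mental model is right, the remaining question is what it would take for that fused op (or a GQA substitute) to hit the Flash Attention v2 path the way the LLaMA 2 work does.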
Along the same lines, any pointers to documentation on how GQA/Flash Attention was applied to LLaMA 2 would be much appreciated.
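The closest thing I have found so far is the llama2 example in the Olive repo, which appears to enable GQA through the OrtTransformersOptimization pass. A trimmed sketch of how I read it; the input_model layout and the use_gqa flag are copied from my reading of that example and may not match the current schema:

```python
# Sketch based on my reading of Olive's llama2 example: the
# OrtTransformersOptimization pass appears to swap the fused attention op for
# GroupQueryAttention when "use_gqa" is set. Field names and schema here are
# assumptions; please check the example in the Olive repo before relying on it.
from olive.workflows import run as olive_run

config = {
    "input_model": {
        "type": "PyTorchModel",
        "config": {
            "hf_config": {
                "model_name": "meta-llama/Llama-2-7b-hf",
                "task": "text-generation",
            }
        },
    },
    "passes": {
        "conversion": {"type": "OnnxConversion", "config": {"target_opset": 17}},
        "transformers_optimization": {
            "type": "OrtTransformersOptimization",
            "config": {
                "model_type": "gpt2",  # the llama2 example uses gpt2-style fusions
                "opt_level": 0,
                "float16": True,
                "use_gqa": True,  # as I understand it, substitutes GroupQueryAttention
            },
        },
    },
    "engine": {"output_dir": "llama2_gqa"},  # placeholder
}

olive_run(config)
```

If a similar pass option (or a custom pass) could perform the same substitution on Whisper's attention, that would answer my question; I just haven't found documentation describing how the LLaMA 2 path was built.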
What component(s) does this request affect?
OliveModels
OliveSystems
OliveEvaluator
Metrics
Engine
Passes
Other