
[FR]: FlashAttention support for Whisper #1065

Open
1 of 7 tasks
k2m5t2 opened this issue Apr 9, 2024 · 1 comment
Labels
enhancement New feature or request

Comments


k2m5t2 commented Apr 9, 2024

Proposal Summary

Hi, it is great to see the speed improvements for LLaMA 2 using the GroupQueryAttention (GQA) operator.

It seems that HuggingFace's implementation of Whisper can now utilize Flash Attention v2, resulting in a massive (~3x) speedup over the baseline. Would it be possible to apply a similar optimization to Whisper (using GQA) in onnxruntime/Olive? If so, what would the process look like? (Does Olive offer functionality to apply this kind of operator substitution?)

Along those lines, any pointers to documentation on how GQA/Flash Attention was applied to LLaMA 2 would be really appreciated.

What component(s) does this request affect?

  • OliveModels
  • OliveSystems
  • OliveEvaluator
  • Metrics
  • Engine
  • Passes
  • Other
@k2m5t2 k2m5t2 added the enhancement New feature or request label Apr 9, 2024
trajepl (Contributor) commented Apr 11, 2024

Try this: Olive has an option to enable use_gqa in OrtTransformersOptimization:
https://microsoft.github.io/Olive/api/passes.html#cmdoption-arg-use_gpu

Also, here is the corresponding case for llama2. You can try writing similar configs for Whisper.
https://github.com/microsoft/Olive/blob/main/examples/llama2/llama2_template.json#L182
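For reference, a Whisper pass configuration along those lines might look like the sketch below. This is modeled loosely on the llama2 template; the option names (in particular model_type, use_gqa, and float16) are assumptions that should be checked against the Olive passes documentation before use:

```json
{
  "passes": {
    "transformers_optimization": {
      "type": "OrtTransformersOptimization",
      "config": {
        "model_type": "whisper",
        "use_gpu": true,
        "use_gqa": true,
        "float16": true
      }
    }
  }
}
```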
