Proposal Summary
Hi, it is great to see the speed improvements for LLaMA 2 using the GroupQueryAttention (GQA) operator.
It seems that HuggingFace's implementation of Whisper can now use Flash Attention v2, giving a massive (~3x) speedup over the baseline. Would it be possible to apply a similar optimization to Whisper (using GQA) in onnxruntime/Olive? If so, what would the process look like? (Does Olive offer functionality for this kind of operator substitution?)
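For context, my rough mental model (which may be wrong) is that ONNX Runtime's transformers optimizer can already fuse Whisper's attention subgraphs into the com.microsoft MultiHeadAttention contrib op, which I understand can dispatch to Flash Attention kernels on recent CUDA GPUs when the model is fp16. A minimal sketch of that fusion step, where the model path is a placeholder and the num_heads/hidden_size values are my reading of the Whisper large config:

```python
# Sketch only: fuse Whisper attention into ORT's MultiHeadAttention contrib op.
# The model path is a placeholder; num_heads=20 / hidden_size=1280 are the
# Whisper large dimensions as far as I can tell -- verify against your export.
from onnxruntime.transformers import optimizer
from onnxruntime.transformers.fusion_options import FusionOptions

opts = FusionOptions("bart")  # Whisper goes through the bart-style fusions
opts.use_multi_head_attention = True

m = optimizer.optimize_model(
    "whisper_decoder.onnx",  # placeholder path
    model_type="bart",
    num_heads=20,
    hidden_size=1280,
    optimization_options=opts,
    use_gpu=True,
)
m.convert_float_to_float16(keep_io_types=True)  # fp16 is needed for the fused kernels
m.save_model_to_file("whisper_decoder_fp16.onnx")
```

If that mental model is right, the remaining question is what it would take for that fused op (or a GQA substitute) to hit the Flash Attention v2 path the way the LLaMA 2 work does.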
Along the same lines, any pointers to documentation on how GQA/Flash Attention was applied to LLaMA 2 would be much appreciated.
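The closest thing I have found so far is the llama2 example in the Olive repo, which appears to enable GQA through the OrtTransformersOptimization pass. A trimmed sketch of how I read it; the input_model layout and the use_gqa flag are copied from my reading of that example and may not match the current schema:

```python
# Sketch based on my reading of Olive's llama2 example: the
# OrtTransformersOptimization pass appears to swap the fused attention op for
# GroupQueryAttention when "use_gqa" is set. Field names and schema here are
# assumptions; please check the example in the Olive repo before relying on it.
from olive.workflows import run as olive_run

config = {
    "input_model": {
        "type": "PyTorchModel",
        "config": {
            "hf_config": {
                "model_name": "meta-llama/Llama-2-7b-hf",
                "task": "text-generation",
            }
        },
    },
    "passes": {
        "conversion": {"type": "OnnxConversion", "config": {"target_opset": 17}},
        "transformers_optimization": {
            "type": "OrtTransformersOptimization",
            "config": {
                "model_type": "gpt2",  # the llama2 example uses gpt2-style fusions
                "opt_level": 0,
                "float16": True,
                "use_gqa": True,  # as I understand it, substitutes GroupQueryAttention
            },
        },
    },
    "engine": {"output_dir": "llama2_gqa"},  # placeholder
}

olive_run(config)
```

If a similar pass option (or a custom pass) could perform the same substitution on Whisper's attention, that would answer my question; I just haven't found documentation describing how the LLaMA 2 path was built.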
What component(s) does this request affect?
OliveModels
OliveSystems
OliveEvaluator
Metrics
Engine
Passes
Other