No need to rebuild the models. Note that if you use Core ML + Flash Attention, then only the decoder will utilize the FA kernels. The encoder would run as usual (i.e. without FA).
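In other words, Flash Attention is a run-time switch, not a model-format change. A minimal sketch of what that looks like in practice (the model and audio paths below are assumptions for illustration; `--flash-attn` and its short form `-fa` are the whisper.cpp v1.6.0 flags):

```shell
# Sketch: enabling Flash Attention in whisper.cpp at run time.
# The existing .bin (and .mlmodelc, if built with Core ML) files are
# reused unchanged; only the command-line invocation changes.

MODEL=models/ggml-base.en.bin   # hypothetical path to an existing model
AUDIO=samples/jfk.wav           # hypothetical input file

# Same binary, same models -- just add --flash-attn (or -fa):
CMD="./main -m $MODEL -f $AUDIO --flash-attn"
echo "$CMD"
```

With Core ML in the picture, the encoder still runs through the `.mlmodelc` as before; only the decoder path picks up the FA kernels.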
Do the models (`.bin` for Metal and `.mlmodelc` for Core ML) have to be rebuilt to enable Flash Attention (`--flash-attn`) in whisper.cpp v1.6.0? If so, do both the `.bin` and the `.mlmodelc` have to be rebuilt? I'm assuming the answer can be inferred from a better understanding of the architecture and the relationship between these components, but I'm not well versed in that (my apologies).