
ThunderKittens: a simple yet faster FlashAttention alternative #462

Open
sorasoras opened this issue May 14, 2024 · 1 comment

Comments

@sorasoras

ThunderKittens is an embedded domain-specific language (DSL) within CUDA designed to simplify the development of high-performance AI kernels on GPUs. It provides abstractions for working with small tiles (e.g., 16x16) of data, which aligns well with the capabilities of modern GPU architectures and tensor cores.
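
For a sense of what that tile abstraction is replacing, here is a minimal, illustrative plain-CUDA sketch (not ThunderKittens code) of a single 16x16 tensor-core tile multiply-accumulate using NVIDIA's standard wmma API. This is the kind of per-tile bookkeeping that ThunderKittens wraps behind its own tile types and operations, per the blog post.

```cuda
// Illustrative only: one 16x16x16 tensor-core tile MMA in plain CUDA (wmma).
// Launch with a single warp (32 threads) for this one-tile example.
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void tile_matmul_16x16(const half* A, const half* B, float* C) {
    // One warp cooperatively owns one 16x16 output tile.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);                // zero the accumulator tile
    wmma::load_matrix_sync(a_frag, A, 16);            // load a 16x16 tile of A (leading dim 16)
    wmma::load_matrix_sync(b_frag, B, 16);            // load a 16x16 tile of B
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);   // tensor-core tile multiply-accumulate
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```

In ThunderKittens, the loads, multiplies, and reductions over such tiles become single calls on its tile types (as described in the blog post), which is where the "just a few lines of code" claim below comes from.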

Performance: Despite its simplicity, kernels written in ThunderKittens can match or outperform hand-written CUDA kernels. For example, on the H100 GPU, a ThunderKittens implementation of the forward flash attention kernel outperforms FlashAttention-2 by around 30%.

On 4090s and A100s, TK matches FA2 performance in just a few lines of code.

On H100s, TK is faster than FA2 in both the forward and backward passes by quite a bit, so there is no tradeoff of cleanliness versus speed (in this case!)

Tiles Seem Pretty General

Coming soon: ThunderKittens on AMD hardware!

https://hazyresearch.stanford.edu/blog/2024-05-12-tk

https://github.com/HazyResearch/ThunderKittens


This could be an alternative to FA2. AMD support would come later as well.

@shimmyshimmer
Collaborator

Yes, thanks for being on the lookout! We will most likely be implementing this pretty soon!
