You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
ThunderKittens is an embedded domain-specific language (DSL) within CUDA designed to simplify the development of high-performance AI kernels on GPUs. It provides abstractions for working with small tiles (e.g., 16x16) of data, which aligns well with the capabilities of modern GPU architectures and tensor cores.
Performance: Despite its simplicity, kernels written in ThunderKittens can match or outperform hand-written CUDA kernels. For example, on the H100 GPU, a ThunderKittens implementation of the forward flash attention kernel outperforms FlashAttention-2 by around 30%.
On 4090s and A100s, TK matches FA2 performance in just a few lines of code.
On H100s, TK is faster forward and backward than FA2 by quite a bit -- so there is no tradeoff of clean versus speed (in this case!)
Tiles Seem Pretty General
Coming soon --
ThunderKittens on AMD hardware!
ThunderKittens is an embedded domain-specific language (DSL) within CUDA designed to simplify the development of high-performance AI kernels on GPUs. It provides abstractions for working with small tiles (e.g., 16x16) of data, which aligns well with the capabilities of modern GPU architectures and tensor cores.
Performance: Despite its simplicity, kernels written in ThunderKittens can match or outperform hand-written CUDA kernels. For example, on the H100 GPU, a ThunderKittens implementation of the forward flash attention kernel outperforms FlashAttention-2 by around 30%.
On 4090s and A100s, TK matches FA2 performance in just a few lines of code.
On H100s, TK is faster forward and backward than FA2 by quite a bit -- so there is no tradeoff of clean versus speed (in this case!)
Tiles Seem Pretty General
Coming soon --
ThunderKittens on AMD hardware!
https://hazyresearch.stanford.edu/blog/2024-05-12-tk
https://github.com/HazyResearch/ThunderKittens
This could be alternative to FA2
AMD would have support latter as well.
The text was updated successfully, but these errors were encountered: