All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Expose bias flag for feedforwards, same default as Timm [#220]
- Update eps value for layernorm, same default as torch [#221]
- PreNorm bugfix, only one input was normalized [#233]
- Add DeepNet (DeepNorm) residual path and init [#227]
- Compositional Attention [#41]
- Experimental Ragged attention [#189]
- Mixture of Experts [#181]
- BlockSparseTensor [#202]
- nd-tensor support for Triton softmax [#210]
- Favor bugfix, single feature map [#183]
- Sanity check for blocksparse settings [#207]
- Fixed some picklability issues [#204]
- Much faster fused dropout [#164]
- Fused dropout repeatability [#173]
- Embedding weight tying option [#172]
- Dropout setting not properly passed in many attentions [#123]
- Fix self attention optimization not being triggered, broken residual path [#119]
- Improve speed by not using contiguous Tensors when not needed [#119]
- Attention mask wrapper [#113]
- ViT comparison benchmark [#117]
- Homogenizing the masks, additive or bool [#79][#85][#86]
- Fix causality flag not being respected [#103]
- Enabling FusedLayerNorm by default in the factory if Triton is available
- Fixing Favor with fp16
- Fixing Favor trainability
- Fused dropout/bias/activation layer [#58]
- Fused layernorm used by default in the factory [#92]
- Nystrom causal attention [#75]
- More robust blocksparse [#24]
- Rotary embeddings [#32]
- More flexible layernorm [#50]