
Transformers from scratch

This repository implements the Transformer architecture from scratch in PyTorch, as described in "Attention Is All You Need".
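
As a reference point, here is a minimal sketch of the scaled dot-product attention at the heart of that paper. The function and argument names are illustrative and not necessarily those used in this repository.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, heads, seq_len, d_k) tensors."""
    d_k = q.size(-1)
    # Similarity scores, scaled by sqrt(d_k) to keep softmax gradients stable
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        # Masked positions (e.g. future tokens in the decoder) score -inf
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v
```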

The Vision Transformer (ViT) model, described in "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", extends the base Transformer architecture implemented here.
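
The ViT-specific step is splitting each image into fixed-size patches and linearly embedding each one. Below is a minimal sketch of such a patch embedding; the class name, default sizes, and the convolution-based projection are assumptions for illustration, not necessarily this repository's exact implementation.

```python
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into patch_size x patch_size patches and linearly
    embed each one, as in the ViT paper."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to a per-patch linear projection
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, C, H, W)
        x = self.proj(x)                     # (B, embed_dim, H/P, W/P)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, embed_dim)
```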

For the ViT variant, I placed layer normalization inside the residual connections (i.e. before the attention and feed-forward sub-layers) instead of between residual blocks, which is the arrangement used in the original Transformer paper. This Pre-LN technique is analyzed in "On Layer Normalization in the Transformer Architecture", which shows that it yields smaller, better-behaved gradients at initialization.
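
A sketch of a Pre-LN encoder block follows, with the Post-LN alternative noted in comments. It uses PyTorch's built-in nn.MultiheadAttention for brevity, whereas this repository implements attention from scratch, so the names and default sizes here are assumptions.

```python
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Pre-LN encoder block: LayerNorm sits inside the residual branch,
    before attention and the MLP."""
    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):
        # Pre-LN: normalize first, then add the residual.
        # Post-LN, as in the original paper, would instead compute
        # x = norm(x + sublayer(x)).
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x
```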
