
Transformers from scratch

This repository implements the Transformer architecture from scratch in PyTorch, as described in "Attention Is All You Need".
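
As a reference point, here is a minimal sketch of the scaled dot-product attention at the heart of that paper. The function and argument names are illustrative and not necessarily those used in this repository.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, heads, seq_len, d_k) tensors."""
    d_k = q.size(-1)
    # Similarity scores, scaled by sqrt(d_k) to keep softmax gradients stable
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        # Masked positions (e.g. future tokens in the decoder) score -inf
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v
```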

The Vision Transformer (ViT) model, described in "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", extends the base Transformer architecture implemented here.
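
The ViT-specific step is splitting each image into fixed-size patches and linearly embedding each one. Below is a minimal sketch of such a patch embedding; the class name, default sizes, and the convolution-based projection are assumptions for illustration, not necessarily this repository's exact implementation.

```python
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into patch_size x patch_size patches and linearly
    embed each one, as in the ViT paper."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to a per-patch linear projection
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, C, H, W)
        x = self.proj(x)                     # (B, embed_dim, H/P, W/P)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, embed_dim)
```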

For the ViT variant, I placed layer normalization inside the residual connections (i.e. before the attention and feed-forward sub-layers) instead of between residual blocks, which is the arrangement used in the original Transformer paper. This Pre-LN technique is analyzed in "On Layer Normalization in the Transformer Architecture", which shows that it yields smaller, better-behaved gradients at initialization.
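
A sketch of a Pre-LN encoder block follows, with the Post-LN alternative noted in comments. It uses PyTorch's built-in nn.MultiheadAttention for brevity, whereas this repository implements attention from scratch, so the names and default sizes here are assumptions.

```python
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Pre-LN encoder block: LayerNorm sits inside the residual branch,
    before attention and the MLP."""
    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):
        # Pre-LN: normalize first, then add the residual.
        # Post-LN, as in the original paper, would instead compute
        # x = norm(x + sublayer(x)).
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x
```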
