# GEMM optimization resources

Mamy Ratsimbazafy edited this page Nov 10, 2018 · 4 revisions
TODO: this page is very much a work in progress.
- There is no batched GEMM in OpenBLAS.
- The strided GEMM in BLIS and cuBLAS supports arbitrary strides, which avoids a copy to a contiguous tensor; neither MKL nor OpenBLAS supports this. Arbitrary strides are resolved during packing.
- In deep learning, GEMM and convolutions (which often use GEMM) are always followed by a non-linear activation, which is memory-bound. Allowing fusion of the non-linearity into the GEMM would probably increase throughput tremendously.
- https://github.com/hfp/libxsmm
- Paper: High-Performance Matrix-Matrix Multiplications of Very Small Matrices
- Paper: BLASFEO: basic linear algebra subroutines for embedded optimization
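The strided-GEMM point above can be sketched in pure Python: a toy GEMM that packs panels out of arbitrarily strided flat buffers, so arbitrary strides are resolved once during packing and no contiguous copy of the whole tensor is needed. All function names here (`pack_panel`, `gemm_strided`) are illustrative, not the API of BLIS or any other library.

```python
def pack_panel(buf, rs, cs, rows, cols):
    # Gather a rows x cols panel from a flat buffer with arbitrary
    # (row, col) strides into a contiguous row-major list. This is where
    # a real BLAS-like kernel would resolve the strides during packing.
    return [buf[i * rs + j * cs] for i in range(rows) for j in range(cols)]

def gemm_strided(m, n, k, a, rs_a, cs_a, b, rs_b, cs_b, c):
    # Toy strided GEMM: C += A @ B, where A (m x k) and B (k x n) are flat
    # buffers with arbitrary strides, and C is a row-major m x n buffer.
    pa = pack_panel(a, rs_a, cs_a, m, k)  # contiguous m x k panel
    pb = pack_panel(b, rs_b, cs_b, k, n)  # contiguous k x n panel
    for i in range(m):
        for j in range(n):
            c[i * n + j] += sum(pa[i * k + p] * pb[p * n + j]
                                for p in range(k))
```

For example, a 2x3 row-major matrix can be used as its 3x2 transpose simply by swapping its strides (rs=1, cs=3), with no transposition copy before the call:

```python
a = [1, 2, 3, 4, 5, 6]          # 2x3 row-major; transposed view is 3x2
b = [1, 0, 0, 1]                # 2x2 identity
c = [0] * 6
gemm_strided(3, 2, 2, a, 1, 3, b, 2, 1, c)
# c now holds the 3x2 transpose of a: [1, 4, 2, 5, 3, 6]
```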
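The fusion point above can also be sketched in pure Python: a toy GEMM whose epilogue applies the activation (ReLU here) while each freshly computed element of C is still at hand, instead of writing C out and re-reading it in a second memory-bound pass. This is a minimal illustration of the idea, not any library's implementation.

```python
def relu(x):
    # Memory-bound when run as a separate elementwise pass over C;
    # nearly free when fused into the GEMM epilogue.
    return x if x > 0.0 else 0.0

def gemm_fused_relu(m, n, k, a, b, c):
    # Toy fused GEMM: C = relu(A @ B), all row-major flat lists,
    # A is m x k, B is k x n, C is m x n.
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i * k + p] * b[p * n + j]
            # Fused epilogue: apply the non-linearity before storing,
            # so C is swept through memory exactly once.
            c[i * n + j] = relu(acc)
```

A separate activation pass would instead stream the entire C tensor through memory a second time, which is exactly the traffic that fusion eliminates.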