GEMM optimization resources

Mamy Ratsimbazafy edited this page Nov 14, 2018 · 4 revisions

TODO: this page is very much a work in progress.

Challenges

  1. There is no batched GEMM in OpenBLAS
  2. The strided GEMM in BLIS and cuBLAS supports arbitrary strides, which avoids a copy to a contiguous tensor; neither MKL nor OpenBLAS supports this. Arbitrary strides are resolved during packing.
  3. In deep learning, GEMMs and convolutions (which are often implemented via GEMM) are always followed by a non-linear activation, which is memory-bound. Fusing the non-linearity into the GEMM would likely increase throughput substantially.
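The three points above can be illustrated with a minimal pure-Python sketch. None of the names below are real BLAS/BLIS/cuBLAS symbols; `strided_gemm` and `batched_gemm` are hypothetical functions written only to show (1) batching as a hand-rolled loop of GEMMs, (2) arbitrary row/column strides consuming a transposed view in place without a copy, and (3) applying an activation in the GEMM epilogue, while each output element is still hot, instead of in a second pass over memory:

```python
def strided_gemm(A, B, m, n, k, a_rs, a_cs, b_rs, b_cs, activation=None):
    """C[i][j] = act(sum_p A[i*a_rs + p*a_cs] * B[p*b_rs + j*b_cs]).

    A and B are flat buffers; arbitrary row/column strides (a_rs, a_cs,
    b_rs, b_cs) mean a transposed or sliced view is read in place, with
    no copy to a contiguous buffer. An optional activation is fused into
    the epilogue, saving a second full pass over C.
    """
    C = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = sum(A[i * a_rs + p * a_cs] * B[p * b_rs + j * b_cs]
                      for p in range(k))
            C[i][j] = activation(acc) if activation else acc
    return C

def batched_gemm(As, Bs, m, n, k):
    """What callers must hand-roll when the BLAS has no batched entry
    point: one independent GEMM per batch element (row-major strides)."""
    return [strided_gemm(A, B, m, n, k, k, 1, n, 1) for A, B in zip(As, Bs)]

# Contiguous 2x2 @ 2x2 (row-major strides: row stride = width, col stride = 1)
C = strided_gemm([1, 2, 3, 4], [5, 6, 7, 8], 2, 2, 2, 2, 1, 2, 1)
# Same buffer read as A^T, just by swapping A's strides -- no copy needed
CT = strided_gemm([1, 2, 3, 4], [5, 6, 7, 8], 2, 2, 2, 1, 2, 2, 1)
# Fused ReLU epilogue
relu = lambda x: max(0, x)
CR = strided_gemm([1, -1], [2, 5], 1, 1, 2, 2, 1, 1, 1, activation=relu)
```

A real implementation would of course tile and pack the operands; the stride arguments matter because the packing routine can read through them directly, which is how BLIS and cuBLAS resolve arbitrary strides without a user-visible copy.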

Small GEMMs

Links