Several common matrix multiplication methods are implemented on CPU and NVIDIA GPU using C++11 and CUDA, and the performance of each optimization method is measured with a simple benchmark.

CPU:
- naive
- reordering
- tiling
- strassen
- coppersmith-winograd

GPU:
- cublas
- naive
- kahan
- shared_memory

Build requirements:
- OS: Linux
- CMake Version: >= 3.8
- GCC Version: >= 4.8
- CUDA Version: 11.4 (recommended)
- CUDA Driver Version: 470.129.06 (recommended)
git clone https://github.com/Bruce-Lee-LY/matrix_multiply.git
cd matrix_multiply
./build.sh -t Release -b OFF
./build.sh -t Debug -b ON
./run_sample.sh

Test environment:
- OS: Ubuntu 20.04.4
- CPU: i5-9400F
- GPU: NVIDIA GeForce GTX 1080 Ti
- CUDA Version: 11.4
- CUDA Driver Version: 470.129.06
- Matrix (float): A (512 * 512) * B (512 * 512) = C (512 * 512)

CPU:

Method | Cost / ms |
---|---|
naive | 1238.647 |
reordering | 984.445 |
tiling | 1000.095 |
strassen | 57429.407 |
coppersmith-winograd | 77668.238 |
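
The costs in the table above can be turned into a throughput figure, which makes runs with different matrix sizes comparable. The sketch below assumes the conventional count of 2 * M * N * K floating-point operations for a classical multiplication; the helper names `matmul_flops` and `gflops` are hypothetical and not part of the repository.

```cpp
#include <cstddef>

// Classical operation count for C (m x n) = A (m x k) * B (k x n):
// one multiply and one add per inner-product step.
// Note: strassen-style methods execute fewer operations than this,
// so for them the result is an "effective" rate only.
double matmul_flops(std::size_t m, std::size_t n, std::size_t k) {
    return 2.0 * static_cast<double>(m) * static_cast<double>(n) *
           static_cast<double>(k);
}

// Convert a measured cost in milliseconds into GFLOP/s.
double gflops(double flops, double cost_ms) {
    return flops / (cost_ms * 1e-3) / 1e9;
}
```

For the 512 * 512 case, `matmul_flops(512, 512, 512)` is 268435456, so the naive CPU cost of 1238.647 ms corresponds to roughly 0.217 GFLOP/s.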

GPU:

Method | Cost / ms |
---|---|
cublas | 0.100 |
naive | 0.613 |
kahan | 0.616 |
shared_memory | 0.153 |
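
In the table above, kahan costs almost the same as naive (0.616 ms against 0.613 ms). Assuming the kahan method refers to Kahan (compensated) summation applied to each inner product, the idea can be sketched on the CPU as follows. This is an illustration in plain C++, not the repository's CUDA kernel, and the function names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Plain accumulation: low-order bits of small terms are lost once the
// running sum becomes large.
float dot_naive(const std::vector<float> &x, const std::vector<float> &y) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}

// Kahan summation: `comp` carries the rounding error of the previous
// addition and feeds it back into the next term.
// Caveat: flags such as -ffast-math may let the compiler remove the
// compensation, so build this without them.
float dot_kahan(const std::vector<float> &x, const std::vector<float> &y) {
    float sum = 0.0f;
    float comp = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const float term = x[i] * y[i] - comp;
        const float t = sum + term;
        comp = (t - sum) - term;  // rounding error introduced by this add
        sum = t;
    }
    return sum;
}
```

The extra cost is a few additions per term, which fits the small gap between the two rows; the benefit is accuracy rather than speed.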