Created as a part of the CSC586C (Data Management on Modern Computer Architectures) course at University of Victoria under supervision of Sean Chester.
Contributors: Ze Shi Li, Rohith Pudari, Haris Smajlovic.
Repository contains performance optimizations for Linear gap Smith-Waterman algorithm.
Note: Check if your CPU supports SSE2/4, AVX2 and/or AVX-512 first. Otherwise the SIMD benchmark will still run, but with no valid results.
So far we have a baseline
, bithacked
, bithacked-striped
, multicore-windowed
, windowed
, simd-alpern
and multicore-alpern
version of the very same algorithm for a CPU, and cuda-alpern
, cuda-antidiagonal
, cuda-hypothetical
, and cuda-windowed
for a GPU:
- Baseline: A straight forward baseline version of the SW algorithm.
- Bithacked: Baseline version with heavy branching replaced with bithacks.
- Bithacked-striped: Bithacked version with an access pattern that is more L1 cache friendly.
- Windowed: For a scenario in which traceback is not needed but only the best match score.
- Multicore windowed: Using the technique above, spreads across multiple CPU cores.
- SIMDed (Alpern technique): A SIMDed baseline using widest registers that your CPU supports and inter-alignment technique from Alpern et al.
- Multicore (Alpern technique): Just a SIMDed technique above spread accross multiple CPU cores.
- CUDA (Alpern technique): A SIMTed baseline using inter-alignment technique akin to SIMD Alpern technique above.
- CUDA windowed: A SIMT implementation of a windowed version above.
- CUDA antidiagonal: A 2-dimentional parallelisation exposing both inter and intra alignment parallelism.
- CUDA hypothetical: A 3-dimentional parallelisation in which all data dependency is ignored and parallelization utilized to full extent.
In order to benchmark the CPU solutions use perf
(for now -- sorry non-linux users). So just compile benchmark.cpp
and then run perf
on the executable.
For testing the GPU solutions just compile and run benchmark.cu
.
Don't forget to provide version string as a CLI argument.
- For baseline set
version=base
in your bash - For bithacked set
version=bithacked
in your bash - For bithacked-striped set
version=bithacked-striped
in your bash - For windowed set
version=windowed
in your bash - For multicore-windowed set
version=multicore-windowed
in your bash - For SIMDed (Alpern technique) set
version=simd-alpern
in your bash - For multicore set
version=multicore-alpern
in your bash
and then do
exe_path=benchmark_${version}.out && \
g++ -D THRD_CNT=2 -march=native -fopenmp -Wall -Og -std=c++17 -o $exe_path benchmark.cpp && \
perf stat -e cycles:u,instructions:u ./$exe_path $version
- For CUDA (Aplern technique) set
version=cuda-alpern
in your bash - For CUDA windowed set
version=cuda-windowed
in your bash - For CUDA antidiagonal set
version=cuda-antidiagonal
in your bash - For CUDA hypothetical set
version=cuda-hypothetical
in your bash
and then do
exe_path=benchmark_${version}.out && \
nvcc -O3 \
-D QUANTITY_SCALE=13 \
-D SIZE_SCALE=10 \
-D XBLOCK_SIZE_SCALE=5 \
-D YBLOCK_SIZE_SCALE=3 \
-D ZBLOCK_SIZE_SCALE=2 \
-D WINDOW_SIZE_SCALE=7 \
-o $exe_path benchmark.cu && \
./$exe_path $version
Version | insn per cycle | Seconds |
---|---|---|
Bithacked-striped (optimised-base) | 2.77 | 41.50 |
SIMD-alpern | 2.95 | 3.76 |
multicore-alpern | 2.26 | 2.45 |