Simplicity done right for SIMDified query processing on CPU and FPGA

This is the corresponding prototype to our SiMoD 2023 submission showing a simple but effective solution idea to port SIMDified query processing code to Intel FPGA cards for acceleration. The main advantage of our approach is that it seamlessly integrates with existing SIMD abstraction libraries that are originally developed to overcome SIMD heterogeneity on x86-processors. Moreover, our approach can be straightforwardly implemented in C++ without the necessity of complex FPGA-specific programming. Our initial results are very promising, demonstrating a novel approach to comprehensively integrate Intel FPGAs into the prevailing SIMDified processing on the CPU with reasonable effort.

Prototype description

build/ (folder for build files)
libs/ (folder contains relevant parts of the SIMD abstraction library TVL with an FPGA backend)
src/ (folder contains evaluation code including SIMD kernels using TVL)
Makefile (basic building infrastructure)

SIMD abstraction library (TVL)

This prototype contains only the relevant primitives for the evaluated SIMD kernels. The actual TVL is to be generated from https://github.com/db-tu-dresden/TVLGen.

For the full library, checkout the generator:

git clone --recurse-submodules [email protected]:db-tu-dresden/TVL.git

The following command generates the library with C++17 support (no concepts):

python3 main.py --targets fpga $(LANG=en;lscpu | grep -i flags | tr ' ' '\n' | egrep -v '^Flags:|^$' | sort -d | tr '\n' ' ') --no-concepts

SIMD kernels

As described in our paper, we conducted two core evaluations: (i) FPGA performance and (ii) acceleration.

FPGA performance

To demonstrate the viability of our approach with templated C++ kernels, we developed two examples. One is a very basic aggregation SIMD-kernel (see src/eval_agg_kernel.cpp), i.e., streaming over the data and adding every element. The second is a filter-count SIMD-kernel (see src/eval_filter_kernel.cpp), which counts the amount of values in a given range.

Aceeleration evaluation

Our next experiment evaluates the achieved bandwidth for concurrent calculations on both the host CPU and the FPGA (see eval/eval_acceleration.cpp). Here, we vary the amount of concurrent threads on the host processor and additionally dispatch the same SIMD-kernel to the FPGA. The host code is executed using 512 bit sized AVX512 registers and the FPGA kernel is also parameterized to use 512 bit for its register width.

Compile and Execute

Possible with the help of Intel DevCloud for oneAPI (https://devcloud.intel.com/oneapi/). Access is free of charge and corresponding Makefile is available. Environment variables may still need to be set.

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
build		build
libs/include		libs/include
src		src
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Simplicity done right for SIMDified query processing on CPU and FPGA

Prototype description

SIMD abstraction library (TVL)

SIMD kernels

FPGA performance

Aceeleration evaluation

Compile and Execute

About

Releases

Packages

Contributors 2

Languages

License

db-tu-dresden/SiMoD23-SIMD-FPGA

Folders and files

Latest commit

History

Repository files navigation

Simplicity done right for SIMDified query processing on CPU and FPGA

Prototype description

SIMD abstraction library (TVL)

SIMD kernels

FPGA performance

Aceeleration evaluation

Compile and Execute

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages