[discussion] CPU vs GPU Thread Dispatcher #1471

jjfumero · 2024-05-17T13:45:50Z


Hi all,
I noticed something interesting about thread dispatch when running TornadoVM with PoCL on the CPU, and I wanted to share it with the PoCL contributors. Note that the behaviour I am going to describe does not mean PoCL is incorrect; it is just something I noticed compared to other OpenCL runtimes.

Some Background

TornadoVM specialises the generated code and the runtime per device (e.g., GPU vs CPU). For the thread dispatch, for example, TornadoVM selects a different global thread size for GPUs and CPUs:

For GPUs: it uses the loop bound as the global size. This is probably the most common strategy for OpenCL. For instance, when processing a parallel loop of N iterations, TornadoVM selects a global size of N.
For CPUs: by default it uses the maximum number of cores as the global size, independently of the loop bound. The TornadoVM runtime and the JIT compiler control the block partitioning.
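The two strategies above can be sketched as follows. This is only an illustration, not TornadoVM's actual code: the function names are hypothetical, and the contiguous, ceiling-divided block partitioning is an assumption, since the post does not say how the runtime and JIT compiler split the iterations.

```python
import os


def gpu_global_size(n_iterations):
    """GPU strategy: one work-item per loop iteration."""
    return n_iterations


def cpu_global_size(n_cores=None):
    """Coarse CPU strategy: one work-item per core, independent of the loop bound."""
    return n_cores if n_cores is not None else (os.cpu_count() or 1)


def block_range(thread_id, n_threads, n_iterations):
    """Iterations [start, end) owned by one work-item.

    Assumption: contiguous blocks of ceil(n_iterations / n_threads) iterations.
    """
    chunk = (n_iterations + n_threads - 1) // n_threads
    start = min(thread_id * chunk, n_iterations)
    end = min(start + chunk, n_iterations)
    return start, end


if __name__ == "__main__":
    n = 10
    threads = cpu_global_size(n_cores=4)
    for tid in range(threads):
        print(tid, block_range(tid, threads, n))
```

With the coarse CPU strategy, each work-item loops over its own block inside the kernel, so the OpenCL runtime only ever sees as many work-items as there are cores.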

So, what's the "issue"?

For some applications, such as 2D kernels for image processing, TornadoVM's CPU strategy is very slow with PoCL. You can find all the details here: beehive-lab/TornadoVM#410

It is actually slower than sequential Java. What I noticed was that the application was running on a single core with PoCL, whereas it ran fast with the oneAPI CPU Runtime. To put this into perspective: an image processing application working on 3888x5184 pixels took ~5 seconds with the Intel oneAPI CPU Runtime and ~46 seconds with PoCL. The number of threads deployed was 20 (20 cores).

How was it fixed?

TornadoVM has this general assumption that controlling the thread-block for CPUs can lead to faster performance. What I noticed is that, with modern OpenCL runtimes, this is not the case. I changed the thread-dispatch for CPUs in TornadoVM, using a similar strategy to the GPUs.

Results:

The same application with PoCL took ~2.9 seconds instead of ~46 seconds.
The Intel CPU Runtime also shows some speedups (~1.5x-2x faster) compared to the previous CPU dispatcher.
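To put the reported numbers side by side, here is a small back-of-the-envelope calculation. It assumes one loop iteration per pixel and, for the new dispatch, one work-item per pixel; neither is stated explicitly in the post, and the timings are the approximate figures quoted above.

```python
# Figures taken from the post; variable names are illustrative only.
DIM0, DIM1 = 3888, 5184      # image dimensions in pixels
CORES = 20                   # threads deployed with the coarse CPU dispatch

pixels = DIM0 * DIM1

# Coarse dispatch: one work-item per core, each looping over a block of pixels.
old_work_items = CORES
pixels_per_old_work_item = -(-pixels // old_work_items)  # ceiling division

# Assumption: fine-grained dispatch launches one work-item per pixel.
new_work_items = pixels

# Approximate PoCL timings in seconds, before and after the change.
pocl_before, pocl_after = 46.0, 2.9
pocl_speedup = pocl_before / pocl_after

if __name__ == "__main__":
    print(f"pixels: {pixels}")
    print(f"pixels per work-item (coarse): {pixels_per_old_work_item}")
    print(f"work-items (fine-grained): {new_work_items}")
    print(f"PoCL speedup: ~{pocl_speedup:.1f}x")
```

Under these assumptions the runtime goes from seeing 20 work-items to roughly 20 million, which gives it far more freedom to choose work-group sizes and spread the work across cores.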

Conclusions

This case might be tied to the TornadoVM compiler. I just wondered whether PoCL always expects a high number of threads (which makes sense) to get performance, and whether the thread dispatcher is otherwise tied to a single core.

I updated TornadoVM to use what we call the fine-grained scheduler (the same as for GPUs), so this is no longer a big deal, but I thought it was worth sharing.

pjaaskel · 2024-05-17T14:26:32Z

Maintainer

Indeed, leaving the local NDRange dimension decision to the runtime, when it's not bound by the application logic (basically sync needs through local mem and barriers), is one of the few places where OpenCL runtimes can apply a bit of performance portability.

Good to hear PoCL could decide sensibly here and get good perf. We are tuning the vectorizer so it could improve more. Also it would be interesting to hear if you get better perf. using the TBB or OpenMP CPU drivers in PoCL.

1 reply

jjfumero May 17, 2024
Author

Running via TBB and OpenMP could be very interesting. I will put this in my TODO list.
