Indeed, leaving the choice of local NDRange dimensions to the runtime, when the application logic does not constrain it (basically, synchronization needs through local memory and barriers), is one of the few places where OpenCL runtimes can provide a bit of performance portability. Good to hear that PoCL made a sensible choice here and delivered good performance. We are tuning the vectorizer, so it could improve further. It would also be interesting to hear whether you get better performance with the TBB or OpenMP CPU drivers in PoCL.
-
Hi all,
I noticed an interesting thing regarding thread dispatch with TornadoVM running on PoCL on the CPU, and I wanted to share what I found with the PoCL contributors. Note that the behaviour I am going to describe does not mean PoCL is incorrect; it is just something I noticed in comparison with other OpenCL runtimes.
Some Background
TornadoVM specialises the generated code and the runtime per device (e.g., GPU vs CPU). For the thread block, for example, TornadoVM selects a different global thread size for GPUs and CPUs:
N iterations, then TornadoVM selects a global size of N.
So, what's the "issue"?
For some applications, such as 2D kernels for Image Processing, the strategy employed by TornadoVM is very slow on CPUs with PoCL. You can find all details here: beehive-lab/TornadoVM#410
It is actually slower than sequential Java. What I noticed was that the application was running on a single core with PoCL, whereas it ran fast with the oneAPI CPU Runtime. To put this into perspective: an image processing application working on a 3888x5184-pixel image took ~5 seconds with the Intel oneAPI CPU Runtime and ~46 seconds with PoCL. The number of threads deployed was 20 (20 cores).
How was it fixed?
TornadoVM has this general assumption that controlling the thread block for CPUs leads to better performance. What I noticed is that, with modern OpenCL runtimes, this is not the case. I changed the thread dispatch for CPUs in TornadoVM to use a strategy similar to the one used for GPUs.
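To make the difference between the two dispatch strategies concrete, here is a minimal sketch in Python. It is not TornadoVM's actual code: the function names are hypothetical, the 2D image is flattened to a single dimension for simplicity, and the coarse-grained CPU scheme (one work-item per core, each looping over a contiguous block of iterations) is an assumption suggested by the 20 threads on 20 cores mentioned above.

```python
def coarse_grained_ranges(n_iterations, n_cores):
    """Assumed CPU strategy: global size == n_cores.

    Each work-item loops over one contiguous block of iterations.
    Returns one (start, end) half-open range per work-item.
    """
    block = -(-n_iterations // n_cores)  # ceiling division
    ranges = []
    for tid in range(n_cores):
        start = min(tid * block, n_iterations)
        end = min(start + block, n_iterations)
        ranges.append((start, end))
    return ranges


def fine_grained_global_size(n_iterations):
    """GPU-style strategy: one work-item per iteration.

    The OpenCL runtime is then free to decide how work-items
    are grouped and mapped onto cores.
    """
    return n_iterations


# Image size taken from the measurement above, flattened to 1D.
n = 3888 * 5184
coarse = coarse_grained_ranges(n, n_cores=20)

print("coarse-grained global size:", len(coarse))
print("fine-grained global size:  ", fine_grained_global_size(n))
```

With the coarse-grained scheme the runtime only ever sees 20 work-items, so it has very little to distribute or vectorize; with the fine-grained scheme it sees one work-item per pixel and can pick the grouping itself.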
Results:
Conclusions
This case might be tied to the TornadoVM compiler. I just wondered whether PoCL always expects a high number of threads (which makes sense) to get performance, and otherwise the thread dispatcher is tied to a single core.
I updated TornadoVM to use what we call the fine-grained scheduler (the same as for GPUs), so this is no longer a big deal, but I thought it was worth sharing.