[PERF] cuda.parallel: Cache intermediate results to improve performance of cudax.reduce_into
#3001
Conversation
Welcome to the team 🎉
/ok to test
🟩 CI finished in 15m 30s: Pass: 100%/1 | Total: 15m 30s | Avg: 15m 30s | Max: 15m 30s
Looks like the build docs CI job is using Python=3.7, which is missing the
🟩 CI finished in 14m 53s: Pass: 100%/1 | Total: 14m 53s | Avg: 14m 53s | Max: 14m 53s
Punting on this and just using
```diff
@@ -194,14 +198,15 @@ def _dtype_validation(dt1, dt2):


 class _Reduce:
     def __init__(self, d_in, d_out, op, init):
         # TODO: constructor shouldn't require concrete `d_in`, `d_out`:
```
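As a rough illustration of the caching idea discussed here (not the PR's actual code: `make_reduce` and the `op_name` parameter are hypothetical stand-ins), a dtype-keyed cache around the constructor could look like this:

```python
import functools


class _Reduce:
    """Stand-in for the PR's reduction object (illustrative only)."""

    def __init__(self, in_dtype, out_dtype, op_name):
        # In the real code this does expensive setup (building the
        # C-layer reducer); here we only record the cache key.
        self.key = (in_dtype, out_dtype, op_name)


@functools.lru_cache(maxsize=None)
def make_reduce(in_dtype, out_dtype, op_name):
    # The cache key is the (hashable) dtypes, not the concrete arrays,
    # so repeated reductions over same-typed arrays reuse one _Reduce.
    return _Reduce(in_dtype, out_dtype, op_name)


r1 = make_reduce("int32", "int32", "plus")
r2 = make_reduce("int32", "int32", "plus")
assert r1 is r2  # second call hits the cache
```

Keying on dtypes rather than arrays is what would let the constructor eventually drop the concrete `d_in`/`d_out` arguments, as the TODO suggests.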
That's what I was wondering about, but I didn't get to drilling down.
It might be useful to work together on getting this TODO done.
This PR will already have complicated merge conflicts with my #2788, so it might be best to team up on both.
> It might be useful to work together on getting this TODO done.
Definitely! I opened #3008 to track this. Let's tackle it as a follow up to this PR.
Description

This PR uses caching on the Python side to improve the performance of `cuda.experimental.reduce_into`. Specifically:

- Cache `_Reduce` objects. The cache key used here is the `dtype` of the input arrays rather than the arrays themselves. I think this is safe to do, and longer term, I'd like to avoid passing the concrete arrays to the `_Reduce` constructor.
- Cache the result of the utility function `_type_to_info`.

Before this PR:

After this PR:
Additional context
This came up in the initial investigations for #2958.
Checklist