
[PERF] cuda.parallel: Cache intermediate results to improve performance of cudax.reduce_into #3001

Open · wants to merge 3 commits into base: main
Conversation

@shwina commented Dec 2, 2024

Description

This PR uses caching on the Python side to improve the performance of cuda.experimental.reduce_into. Specifically:

  • cache _Reduce objects. The cache key used here is the dtype of the input arrays rather than the arrays themselves. I think this is safe to do, and longer term, I'd like to avoid passing the concrete arrays to the _Reduce constructor.

  • cache the result of the utility function _type_to_info.

Before this PR:

In [4]: d_in = cuda.device_array(1, "int64")

In [5]: d_out = cuda.device_array(1, "int64")

In [6]: h_init = np.asarray([0], "int64")

In [7]: def op(x, y): return x + y

In [8]: %timeit cudax.reduce_into(d_in, d_out, op, h_init)
781 ms ± 42.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [9]: %timeit cudax._type_to_info(np.int32)
71.8 μs ± 111 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

After this PR:

In [8]: %timeit cudax.reduce_into(d_in, d_out, op, h_init)
The slowest run took 17.54 times longer than the fastest. This could mean that an intermediate result is being cached.
910 ns ± 1.36 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [9]: %timeit cudax._type_to_info(np.int32)
64.8 ns ± 0.0356 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

Additional context

This came up in the initial investigations for #2958.

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@shwina shwina requested a review from a team as a code owner December 2, 2024 14:55

copy-pr-bot bot commented Dec 2, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


@miscco (Collaborator) left a comment:

Welcome to the team 🎉

@miscco (Collaborator) commented Dec 2, 2024

/ok to test

@shwina force-pushed the cuda-parallel-cache-reducer-and-type-info branch 3 times, most recently from 31ee5a5 to 46e75e9 on December 2, 2024 16:39
@shwina force-pushed the cuda-parallel-cache-reducer-and-type-info branch from 46e75e9 to ee7fcc9 on December 2, 2024 16:41

github-actions bot commented Dec 2, 2024

🟩 CI finished in 15m 30s: Pass: 100%/1 | Total: 15m 30s | Avg: 15m 30s | Max: 15m 30s
  • 🟩 python: Pass: 100%/1 | Total: 15m 30s | Avg: 15m 30s | Max: 15m 30s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 15m 30s | Avg: 15m 30s | Max: 15m 30s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 15m 30s | Avg: 15m 30s | Max: 15m 30s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 15m 30s | Avg: 15m 30s | Max: 15m 30s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 15m 30s | Avg: 15m 30s | Max: 15m 30s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 15m 30s | Avg: 15m 30s | Max: 15m 30s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 15m 30s | Avg: 15m 30s | Max: 15m 30s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 15m 30s | Avg: 15m 30s | Max: 15m 30s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 15m 30s | Avg: 15m 30s | Max: 15m 30s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
+/- python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
+/- python
CCCL C Parallel Library
Catch2Helper

🏃‍ Runner counts (total jobs: 1)

# Runner
1 linux-amd64-gpu-v100-latest-1

@shwina (Author) commented Dec 2, 2024

Looks like the build-docs CI job is using Python 3.7, which is missing the functools.cache function. Looking into it.

@shwina force-pushed the cuda-parallel-cache-reducer-and-type-info branch from ee7fcc9 to b350574 on December 2, 2024 19:47

github-actions bot commented Dec 2, 2024

🟩 CI finished in 14m 53s: Pass: 100%/1 | Total: 14m 53s | Avg: 14m 53s | Max: 14m 53s
  • 🟩 python: Pass: 100%/1 | Total: 14m 53s | Avg: 14m 53s | Max: 14m 53s

    (full per-axis breakdown, Inspect Changes table, and runner counts identical to the previous CI report)

@shwina (Author) commented Dec 2, 2024

> Looks like the build docs CI job is using Python=3.7 which is missing the functools.cache function. Looking into it.

Punting on this and just using @functools.lru_cache instead, which is supported on Python 3.7.
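For reference, `functools.lru_cache(maxsize=None)` gives the same unbounded memoization as `functools.cache` (which was only added in Python 3.9) and works on Python 3.7. A minimal sketch, with `expensive` standing in for a function like `_type_to_info`:

```python
import functools

# functools.cache is new in Python 3.9; lru_cache(maxsize=None) provides
# the same unbounded memoization and is available on Python 3.7.
@functools.lru_cache(maxsize=None)
def expensive(x):
    # stand-in for an expensive computation such as _type_to_info
    return x * x
```

After two calls with the same argument, `expensive.cache_info().hits` reports one hit, confirming the second call never ran the body.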

@shwina mentioned this pull request Dec 2, 2024
```
@@ -194,14 +198,15 @@ def _dtype_validation(dt1, dt2):
class _Reduce:
    def __init__(self, d_in, d_out, op, init):
        # TODO: constructor shouldn't require concrete `d_in`, `d_out`:
```
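The TODO above suggests a constructor keyed on type information rather than concrete device arrays. One hypothetical shape that could look like is sketched below; `DtypeReduce` and its methods are illustrative names, not the real `_Reduce` API, and the loop is a stand-in for launching the compiled reduction.

```python
import numpy as np

# Hypothetical sketch: build the reducer from dtypes only, so it can be
# constructed (and cached) once; concrete arrays are supplied at call time.
class DtypeReduce:
    def __init__(self, in_dtype, out_dtype, op):
        self.in_dtype = np.dtype(in_dtype)
        self.out_dtype = np.dtype(out_dtype)
        self.op = op

    def __call__(self, d_in, d_out, h_init):
        # The concrete arrays only need to match the dtypes used to build.
        assert d_in.dtype == self.in_dtype
        assert d_out.dtype == self.out_dtype
        # stand-in for the real device-side reduction launch
        acc = h_init[0]
        for x in d_in:
            acc = self.op(acc, x)
        d_out[0] = acc
```

This separation is what makes the dtype-keyed cache in this PR safe: nothing about the built reducer depends on the array contents or addresses.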
A reviewer (Contributor) commented:

That's what I was wondering about, but I didn't get to drilling down.

It might be useful to work together on getting this TODO done.

Already, this PR will have complicated merge conflicts with my #2788, i.e. it might be best to team up working on both.

@shwina (Author) replied:

> It might be useful to work together on getting this TODO done.

Definitely! I opened #3008 to track this. Let's tackle it as a follow up to this PR.

Labels: none yet
Projects: Status: In Review
3 participants