
[PERF] cuda.parallel: Cache intermediate results to improve performance of cudax.reduce_into #3001

Open · wants to merge 3 commits into base: main
Conversation

@shwina commented Dec 2, 2024

Description

This PR uses caching on the Python side to improve the performance of cuda.experimental.reduce_into. Specifically:

  • cache _Reduce objects. The cache key used here is the dtype of the input arrays rather than the arrays themselves. I think this is safe to do, and longer term, I'd like to avoid passing the concrete arrays to the _Reduce constructor.

  • cache the result of the utility function _type_to_info.

Before this PR:

In [4]: d_in = cuda.device_array(1, "int64")

In [5]: d_out = cuda.device_array(1, "int64")

In [6]: h_init = np.asarray([0], "int64")

In [7]: def op(x, y): return x + y

In [8]: %timeit cudax.reduce_into(d_in, d_out, op, h_init)
781 ms ± 42.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [9]: %timeit cudax._type_to_info(np.int32)
71.8 μs ± 111 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

After this PR:

In [8]: %timeit cudax.reduce_into(d_in, d_out, op, h_init)
The slowest run took 17.54 times longer than the fastest. This could mean that an intermediate result is being cached.
910 ns ± 1.36 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [9]: %timeit cudax._type_to_info(np.int32)
64.8 ns ± 0.0356 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

Additional context

This came up in the initial investigations for #2958.

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@shwina shwina requested a review from a team as a code owner December 2, 2024 14:55

copy-pr-bot bot commented Dec 2, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


@miscco (Collaborator) left a comment:

Welcome to the team 🎉

@miscco (Collaborator) commented Dec 2, 2024

/ok to test

@shwina force-pushed the cuda-parallel-cache-reducer-and-type-info branch 3 times, most recently from 31ee5a5 to 46e75e9 on December 2, 2024 16:39
@shwina force-pushed the cuda-parallel-cache-reducer-and-type-info branch from 46e75e9 to ee7fcc9 on December 2, 2024 16:41

github-actions bot commented Dec 2, 2024

🟩 CI finished in 15m 30s: Pass: 100%/1 | Total: 15m 30s | Avg: 15m 30s | Max: 15m 30s
  • 🟩 python: Pass: 100%/1 | Total: 15m 30s | Avg: 15m 30s | Max: 15m 30s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 15m 30s | Avg: 15m 30s | Max: 15m 30s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 15m 30s | Avg: 15m 30s | Max: 15m 30s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 15m 30s | Avg: 15m 30s | Max: 15m 30s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 15m 30s | Avg: 15m 30s | Max: 15m 30s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 15m 30s | Avg: 15m 30s | Max: 15m 30s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 15m 30s | Avg: 15m 30s | Max: 15m 30s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 15m 30s | Avg: 15m 30s | Max: 15m 30s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 15m 30s | Avg: 15m 30s | Max: 15m 30s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
+/- python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
+/- python
CCCL C Parallel Library
Catch2Helper

🏃‍ Runner counts (total jobs: 1)

# Runner
1 linux-amd64-gpu-v100-latest-1

@shwina (Author) commented Dec 2, 2024

Looks like the build-docs CI job is using Python 3.7, which is missing the functools.cache function. Looking into it.

@shwina force-pushed the cuda-parallel-cache-reducer-and-type-info branch from ee7fcc9 to b350574 on December 2, 2024 19:47

github-actions bot commented Dec 2, 2024

🟩 CI finished in 14m 53s: Pass: 100%/1 | Total: 14m 53s | Avg: 14m 53s | Max: 14m 53s
  • 🟩 python: Pass: 100%/1 | Total: 14m 53s | Avg: 14m 53s | Max: 14m 53s

    (full per-axis breakdown, Inspect Changes table, and runner counts identical to the previous CI report)

@shwina (Author) commented Dec 2, 2024

> Looks like the build docs CI job is using Python=3.7 which is missing the functools.cache function. Looking into it.

Punting on this and just using @functools.lru_cache instead, which is supported on Python 3.7.
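For reference, `functools.lru_cache(maxsize=None)` gives the same unbounded memoization as `functools.cache` (which was only added in Python 3.9) and works on Python 3.7. A minimal sketch, with `expensive` standing in for a function like `_type_to_info`:

```python
import functools

# functools.cache is new in Python 3.9; lru_cache(maxsize=None) provides
# the same unbounded memoization and is available on Python 3.7.
@functools.lru_cache(maxsize=None)
def expensive(x):
    # stand-in for an expensive computation such as _type_to_info
    return x * x
```

After two calls with the same argument, `expensive.cache_info().hits` reports one hit, confirming the second call never ran the body.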

@shwina mentioned this pull request Dec 2, 2024
```
@@ -194,14 +198,15 @@ def _dtype_validation(dt1, dt2):
class _Reduce:
    def __init__(self, d_in, d_out, op, init):
        # TODO: constructor shouldn't require concrete `d_in`, `d_out`:
```
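The TODO above suggests a constructor keyed on type information rather than concrete device arrays. One hypothetical shape that could look like is sketched below; `DtypeReduce` and its methods are illustrative names, not the real `_Reduce` API, and the loop is a stand-in for launching the compiled reduction.

```python
import numpy as np

# Hypothetical sketch: build the reducer from dtypes only, so it can be
# constructed (and cached) once; concrete arrays are supplied at call time.
class DtypeReduce:
    def __init__(self, in_dtype, out_dtype, op):
        self.in_dtype = np.dtype(in_dtype)
        self.out_dtype = np.dtype(out_dtype)
        self.op = op

    def __call__(self, d_in, d_out, h_init):
        # The concrete arrays only need to match the dtypes used to build.
        assert d_in.dtype == self.in_dtype
        assert d_out.dtype == self.out_dtype
        # stand-in for the real device-side reduction launch
        acc = h_init[0]
        for x in d_in:
            acc = self.op(acc, x)
        d_out[0] = acc
```

This separation is what makes the dtype-keyed cache in this PR safe: nothing about the built reducer depends on the array contents or addresses.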
A reviewer (Contributor) commented:

That's what I was wondering about, but I didn't get to drilling down.

It might be useful to work together on getting this TODO done.

Already, this PR will have complicated merge conflicts with my #2788, i.e. it might be best to team up working on both.

@shwina (Author) replied:

> It might be useful to work together on getting this TODO done.

Definitely! I opened #3008 to track this. Let's tackle it as a follow up to this PR.

Labels: none yet
Projects: Status: In Review
3 participants