[STF] reduce access mode #2830

Draft · caugonnet wants to merge 53 commits into main

Conversation

caugonnet (Contributor)

Description

closes

This PR intends to introduce a reduction access mode to make it much easier to write parallel_for kernels that also perform reductions into a logical data.
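
A minimal usage sketch of what this enables, modeled on the review snippets further down (lx and lsum are illustrative logical data handles; their creation is not shown here):

// Hypothetical example: accumulate a sum over lx while traversing it.
// reducer::sum and .reduce() follow the syntax quoted in the review below.
ctx.parallel_for(lx.shape(), lx.read(), lsum.reduce(reducer::sum<int>{}))
    ->*[] _CCCL_DEVICE(size_t i, auto x, int& s) {
  s += x(i);
};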

Checklist

  • [x] New or existing tests cover these changes.
  • [ ] The documentation is up to date with these changes.

copy-pr-bot bot commented Nov 15, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@caugonnet (Contributor Author)

/ok to test

@@ -423,7 +452,7 @@ public:
   Fun&& f = mv(::std::get<2>(*p));
   const sub_shape_t& shape = ::std::get<3>(*p);

-  auto explode_coords = [&](size_t i, deps_t... data) {
+  auto explode_coords = [&](size_t i, typename deps_ops_t::first_type... data) {
Contributor Author

This needs a comment saying what that type is (we can't use an alias because it's a pack, not a tuple).
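
One possible shape for that comment, as a sketch (describing deps_ops_t as a pack of pair-like (data, reduction op) entries is an assumption inferred from the ::first_type access, not something stated in the PR):

// deps_ops_t... is a parameter pack with one entry per dependency, so we
// cannot introduce an alias for it; ::first_type extracts the data instance
// type from each pair-like (data, reduction op) entry.
auto explode_coords = [&](size_t i, typename deps_ops_t::first_type... data) {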

@caugonnet caugonnet added the stf Sequential Task Flow programming model label Nov 21, 2024
@caugonnet (Contributor Author)

/ok to test

@caugonnet (Contributor Author)

/ok to test

@caugonnet (Contributor Author)

/ok to test

@caugonnet (Contributor Author)

/ok to test

// Write the block's result to the output array
if (tid == 0)
{
tuple_set_op<tuple_ops>(redux_buffer[blockIdx.x], per_block_redux_buffer[0].get());
Contributor Author

specialize if only one block...
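
A rough sketch of that specialization, purely illustrative (final_result is a hypothetical destination; only redux_buffer and per_block_redux_buffer come from the snippet above):

// Hypothetical single-block fast path: when gridDim.x == 1, the per-block
// result is already the final value, so the second reduction pass over
// redux_buffer could be skipped entirely.
if (tid == 0)
{
  if (gridDim.x == 1)
  {
    tuple_set_op<tuple_ops>(final_result, per_block_redux_buffer[0].get()); // hypothetical destination
  }
  else
  {
    tuple_set_op<tuple_ops>(redux_buffer[blockIdx.x], per_block_redux_buffer[0].get());
  }
}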

Contributor

🟩 CI finished in 36m 00s: Pass: 100%/54 | Total: 11h 41m | Avg: 12m 59s | Max: 20m 16s | Hits: 90%/246
  • 🟩 cudax: Pass: 100%/54 | Total: 11h 41m | Avg: 12m 59s | Max: 20m 16s | Hits: 90%/246

    🟩 cpu
      🟩 amd64              Pass: 100%/50  | Total: 10h 52m | Avg: 13m 03s | Max: 20m 16s | Hits:  90%/246   
      🟩 arm64              Pass: 100%/4   | Total: 48m 31s | Avg: 12m 07s | Max: 13m 07s
    🟩 ctk
      🟩 12.0               Pass: 100%/19  | Total:  3h 57m | Avg: 12m 31s | Max: 20m 16s | Hits:  90%/123   
      🟩 12.5               Pass: 100%/2   | Total: 14m 41s | Avg:  7m 20s | Max:  7m 57s
      🟩 12.6               Pass: 100%/33  | Total:  7h 28m | Avg: 13m 36s | Max: 18m 08s | Hits:  90%/123   
    🟩 cudacxx
      🟩 nvcc12.0           Pass: 100%/19  | Total:  3h 57m | Avg: 12m 31s | Max: 20m 16s | Hits:  90%/123   
      🟩 nvcc12.5           Pass: 100%/2   | Total: 14m 41s | Avg:  7m 20s | Max:  7m 57s
      🟩 nvcc12.6           Pass: 100%/33  | Total:  7h 28m | Avg: 13m 36s | Max: 18m 08s | Hits:  90%/123   
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/54  | Total: 11h 41m | Avg: 12m 59s | Max: 20m 16s | Hits:  90%/246   
    🟩 cxx
      🟩 Clang9             Pass: 100%/2   | Total: 24m 34s | Avg: 12m 17s | Max: 12m 24s
      🟩 Clang10            Pass: 100%/2   | Total: 24m 30s | Avg: 12m 15s | Max: 12m 34s
      🟩 Clang11            Pass: 100%/4   | Total: 47m 46s | Avg: 11m 56s | Max: 13m 16s
      🟩 Clang12            Pass: 100%/4   | Total: 50m 54s | Avg: 12m 43s | Max: 15m 16s
      🟩 Clang13            Pass: 100%/4   | Total: 49m 17s | Avg: 12m 19s | Max: 13m 20s
      🟩 Clang14            Pass: 100%/4   | Total: 55m 42s | Avg: 13m 55s | Max: 15m 51s
      🟩 Clang15            Pass: 100%/2   | Total: 30m 58s | Avg: 15m 29s | Max: 18m 08s
      🟩 Clang16            Pass: 100%/4   | Total: 52m 30s | Avg: 13m 07s | Max: 14m 45s
      🟩 Clang17            Pass: 100%/2   | Total: 29m 42s | Avg: 14m 51s | Max: 16m 58s
      🟩 Clang18            Pass: 100%/2   | Total: 30m 01s | Avg: 15m 00s | Max: 15m 35s
      🟩 GCC9               Pass: 100%/2   | Total: 25m 47s | Avg: 12m 53s | Max: 13m 24s
      🟩 GCC10              Pass: 100%/4   | Total: 53m 10s | Avg: 13m 17s | Max: 14m 08s
      🟩 GCC11              Pass: 100%/4   | Total: 52m 54s | Avg: 13m 13s | Max: 14m 18s
      🟩 GCC12              Pass: 100%/7   | Total:  1h 47m | Avg: 15m 20s | Max: 20m 16s
      🟩 GCC13              Pass: 100%/3   | Total: 34m 21s | Avg: 11m 27s | Max: 13m 07s
      🟩 MSVC14.36          Pass: 100%/1   | Total:  8m 05s | Avg:  8m 05s | Max:  8m 05s | Hits:  90%/123   
      🟩 MSVC14.39          Pass: 100%/1   | Total:  9m 04s | Avg:  9m 04s | Max:  9m 04s | Hits:  90%/123   
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 14m 41s | Avg:  7m 20s | Max:  7m 57s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/30  | Total:  6h 35m | Avg: 13m 11s | Max: 18m 08s
      🟩 GCC                Pass: 100%/20  | Total:  4h 33m | Avg: 13m 40s | Max: 20m 16s
      🟩 MSVC               Pass: 100%/2   | Total: 17m 09s | Avg:  8m 34s | Max:  9m 04s | Hits:  90%/246   
      🟩 NVHPC              Pass: 100%/2   | Total: 14m 41s | Avg:  7m 20s | Max:  7m 57s
    🟩 gpu
      🟩 v100               Pass: 100%/54  | Total: 11h 41m | Avg: 12m 59s | Max: 20m 16s | Hits:  90%/246   
    🟩 jobs
      🟩 Build              Pass: 100%/49  | Total: 10h 17m | Avg: 12m 36s | Max: 18m 08s | Hits:  90%/246   
      🟩 Test               Pass: 100%/5   | Total:  1h 23m | Avg: 16m 42s | Max: 20m 16s
    🟩 sm
      🟩 90                 Pass: 100%/1   | Total:  9m 09s | Avg:  9m 09s | Max:  9m 09s
      🟩 90a                Pass: 100%/1   | Total:  9m 43s | Avg:  9m 43s | Max:  9m 43s
    🟩 std
      🟩 17                 Pass: 100%/29  | Total:  6h 11m | Avg: 12m 48s | Max: 20m 16s
      🟩 20                 Pass: 100%/25  | Total:  5h 30m | Avg: 13m 12s | Max: 18m 08s | Hits:  90%/246   
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
+/- CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
+/- CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

🏃‍ Runner counts (total jobs: 54)

# Runner
43 linux-amd64-cpu16
5 linux-amd64-gpu-v100-latest-1
4 linux-arm64-cpu16
2 windows-amd64-cpu16


double X0(int i)
{
return sin((double) i);
Contributor

Suggested change
- return sin((double) i);
+ return sin(double(i));

In fact I think you don't even need the explicit conversion.

Contributor Author

It’s probably necessary due to nasty compilers (one day we will support MSVC…)


double Y0(int i)
{
return cos((double) i);
Contributor

Suggested change
- return cos((double) i);
+ return cos(double(i));


int main()
{
context ctx;
Contributor

Move this down to just before its first use.

Comment on lines 19 to 21
01-axpy-parallel_for.cu
01-axpy-reduce.cu
01-axpy-launch.cu
Contributor

Suggested change
- 01-axpy-parallel_for.cu
- 01-axpy-reduce.cu
- 01-axpy-launch.cu
+ 01-axpy-launch.cu
+ 01-axpy-parallel_for.cu
+ 01-axpy-reduce.cu

i.e. keep the list sorted

// determines whether the character is alphabetical
__host__ __device__ bool is_alpha(const char c)
{
return (c >= 'A' && c <= 'z');
Contributor

Actually that's incorrect, because there are a few non-alphabetic characters between 'Z' and 'a'.

Suggested change
- return (c >= 'A' && c <= 'z');
+ return c >= 'A' && c <= 'Z' || c >= 'a' && c <= 'z';

Contributor Author

I will check what the Thrust example does …

Contributor

Whoa chatgpt just suggested this clever thing:

Suggested change
- return (c >= 'A' && c <= 'z');
+ return ((c | 0x20) - 'a') < 26;
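
One caveat with that trick: it only works if the comparison is done on an unsigned value; with a plain (possibly signed) char, characters below 'A' produce a negative difference that still compares less than 26. A minimal sketch of a safe variant, assuming ASCII:

// Assumes ASCII. The unsigned arithmetic keeps non-letters below 'A' (digits,
// punctuation) from wrapping to a negative difference that would pass the test.
__host__ __device__ bool is_alpha(const char c)
{
  return (static_cast<unsigned char>(c) | 0x20u) - static_cast<unsigned>('a') < 26u;
}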

ctx.parallel_for(ltext.shape(), ltext.read(), lcnt.reduce(reducer::sum<int>{}))
->*[] _CCCL_DEVICE(size_t i, auto text, int& s) {
/* When we have the beginning of a new word, increment the counter */
if (!is_alpha(text(i)) && is_alpha(text(i + 1)))
Contributor

Careful here with out-of-bounds accesses.

Suggested change
- if (!is_alpha(text(i)) && is_alpha(text(i + 1)))
+ if (is_alpha(text(i)) && (i == 0 || !is_alpha(text(i - 1))))

int ref_cnt = 0;
for (size_t i = 0; i < sizeof(raw_input) - 1; i++)
{
if (!is_alpha(raw_input[i]) && is_alpha(raw_input[i + 1]))
@andralex (Contributor) commented Nov 26, 2024

Suggested change
- if (!is_alpha(raw_input[i]) && is_alpha(raw_input[i + 1]))
+ if (is_alpha(raw_input[i]) && (i == 0 || !is_alpha(raw_input[i - 1])))

Contributor Author

This calls for a better choice of the shape instead of the test

@@ -517,6 +517,23 @@ public:
cudaGraphDebugDotPrint(get_graph(), filename.c_str(), flags);
}

template <typename T>
auto transfer_host(cuda::experimental::stf::logical_data<T>& ldata)
Contributor

As discussed, this would be best renamed wait or host_wait. There should be clear messaging that this is a blocking function (unlike most of CUDASTF).

Contributor Author

I disagree: wait suggests the main goal is to synchronize with the data, while the goal here is to get its content; synchronization is a necessary byproduct.

We could also have a wait method on logical data; we have missed that for ages.
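
A minimal usage sketch of the function under discussion, assuming lsum is a logical data whose owning container type is int (the name is illustrative):

// Blocking call: waits until the content of lsum is valid on the host,
// then returns a copy of that content.
int result = ctx.transfer_host(lsum);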

Comment on lines +523 to +524
using valT = typename owning_container_of<T>::type;
valT out;
Contributor

No need for the using - you're using the name just once.

Suggested change
- using valT = typename owning_container_of<T>::type;
- valT out;
+ typename owning_container_of<T>::type out;

Comment on lines 43 to 44
reduce_no_init = 8,
reduce_do_init = 16,
Contributor

Suggested change
- reduce_no_init = 8,
- reduce_do_init = 16,
+ reduce_init = 8,
+ reduce = 16,

I mean the implicit operation is "reduce" and there's a special thing "reduce initialize".

Contributor Author

I guess the default should be reduce, and the less common case would be reduce_no_init, so I'll do that instead.

Comment on lines 82 to 85
case access_mode::reduce_no_init:
return "reduce (no init)"; // op ?
case access_mode::reduce_do_init:
return "reduce (do init)"; // op ?
Contributor

Suggested change
- case access_mode::reduce_no_init:
-   return "reduce (no init)"; // op ?
- case access_mode::reduce_do_init:
-   return "reduce (do init)"; // op ?
+ case access_mode::reduce:
+   return "reduce"; // op ?
+ case access_mode::reduce_init:
+   return "reduce (initialization stage)"; // op ?

@caugonnet (Contributor Author)

/ok to test

Contributor Author

This example should be reworked as a dot product instead, to make it look more useful.
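
A rough sketch of what that rework could look like, reusing the reduce()/reducer::sum syntax from this PR (lx, ly and lsum are illustrative logical data handles, not code from the branch):

// Hypothetical dot-product example built on the new reduce access mode:
// each index contributes x(i) * y(i) to the sum accumulated into lsum.
ctx.parallel_for(lx.shape(), lx.read(), ly.read(), lsum.reduce(reducer::sum<double>{}))
    ->*[] _CCCL_DEVICE(size_t i, auto x, auto y, double& s) {
  s += x(i) * y(i);
};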

default:
assert(false);
abort();
}
}

/**
Contributor Author

Does this tag type belong here?

size_t input_cnt) override
{
cudaGraphNode_t dummy;
cuda_safe_call(cudaGraphAddEmptyNode(&dummy, graph, input_nodes, input_cnt));
Contributor Author

This is not implemented... Perhaps we need to add helpers?
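
A minimal sketch of the kind of helper that comment hints at (the name add_empty_node and where it would live are assumptions, not part of this PR):

// Wraps cudaGraphAddEmptyNode plus error checking into a single call that
// returns the newly created node.
inline cudaGraphNode_t add_empty_node(cudaGraph_t graph, const cudaGraphNode_t* input_nodes, size_t input_cnt)
{
  cudaGraphNode_t node;
  cuda_safe_call(cudaGraphAddEmptyNode(&node, graph, input_nodes, input_cnt));
  return node;
}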

Contributor

🟩 CI finished in 32m 26s: Pass: 100%/54 | Total: 12h 23m | Avg: 13m 46s | Max: 19m 46s | Hits: 90%/256
  • 🟩 cudax: Pass: 100%/54 | Total: 12h 23m | Avg: 13m 46s | Max: 19m 46s | Hits: 90%/256

    🟩 cpu
      🟩 amd64              Pass: 100%/50  | Total: 11h 28m | Avg: 13m 45s | Max: 19m 46s | Hits:  90%/256   
      🟩 arm64              Pass: 100%/4   | Total: 55m 48s | Avg: 13m 57s | Max: 15m 39s
    🟩 ctk
      🟩 12.0               Pass: 100%/19  | Total:  4h 10m | Avg: 13m 11s | Max: 19m 46s | Hits:  90%/128   
      🟩 12.5               Pass: 100%/2   | Total: 17m 22s | Avg:  8m 41s | Max:  9m 07s
      🟩 12.6               Pass: 100%/33  | Total:  7h 55m | Avg: 14m 25s | Max: 16m 28s | Hits:  90%/128   
    🟩 cudacxx
      🟩 nvcc12.0           Pass: 100%/19  | Total:  4h 10m | Avg: 13m 11s | Max: 19m 46s | Hits:  90%/128   
      🟩 nvcc12.5           Pass: 100%/2   | Total: 17m 22s | Avg:  8m 41s | Max:  9m 07s
      🟩 nvcc12.6           Pass: 100%/33  | Total:  7h 55m | Avg: 14m 25s | Max: 16m 28s | Hits:  90%/128   
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/54  | Total: 12h 23m | Avg: 13m 46s | Max: 19m 46s | Hits:  90%/256   
    🟩 cxx
      🟩 Clang9             Pass: 100%/2   | Total: 26m 02s | Avg: 13m 01s | Max: 13m 45s
      🟩 Clang10            Pass: 100%/2   | Total: 27m 41s | Avg: 13m 50s | Max: 14m 58s
      🟩 Clang11            Pass: 100%/4   | Total: 51m 34s | Avg: 12m 53s | Max: 13m 11s
      🟩 Clang12            Pass: 100%/4   | Total: 53m 56s | Avg: 13m 29s | Max: 14m 48s
      🟩 Clang13            Pass: 100%/4   | Total: 52m 02s | Avg: 13m 00s | Max: 13m 47s
      🟩 Clang14            Pass: 100%/4   | Total: 57m 33s | Avg: 14m 23s | Max: 14m 54s
      🟩 Clang15            Pass: 100%/2   | Total: 29m 18s | Avg: 14m 39s | Max: 14m 59s
      🟩 Clang16            Pass: 100%/4   | Total: 56m 29s | Avg: 14m 07s | Max: 15m 49s
      🟩 Clang17            Pass: 100%/2   | Total: 31m 48s | Avg: 15m 54s | Max: 16m 28s
      🟩 Clang18            Pass: 100%/2   | Total: 30m 32s | Avg: 15m 16s | Max: 15m 44s
      🟩 GCC9               Pass: 100%/2   | Total: 28m 01s | Avg: 14m 00s | Max: 14m 53s
      🟩 GCC10              Pass: 100%/4   | Total: 55m 08s | Avg: 13m 47s | Max: 15m 19s
      🟩 GCC11              Pass: 100%/4   | Total:  1h 02m | Avg: 15m 38s | Max: 19m 46s
      🟩 GCC12              Pass: 100%/7   | Total:  1h 42m | Avg: 14m 36s | Max: 16m 07s
      🟩 GCC13              Pass: 100%/3   | Total: 40m 10s | Avg: 13m 23s | Max: 15m 39s
      🟩 MSVC14.36          Pass: 100%/1   | Total:  8m 53s | Avg:  8m 53s | Max:  8m 53s | Hits:  90%/128   
      🟩 MSVC14.39          Pass: 100%/1   | Total: 12m 27s | Avg: 12m 27s | Max: 12m 27s | Hits:  90%/128   
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 17m 22s | Avg:  8m 41s | Max:  9m 07s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/30  | Total:  6h 56m | Avg: 13m 53s | Max: 16m 28s
      🟩 GCC                Pass: 100%/20  | Total:  4h 48m | Avg: 14m 24s | Max: 19m 46s
      🟩 MSVC               Pass: 100%/2   | Total: 21m 20s | Avg: 10m 40s | Max: 12m 27s | Hits:  90%/256   
      🟩 NVHPC              Pass: 100%/2   | Total: 17m 22s | Avg:  8m 41s | Max:  9m 07s
    🟩 gpu
      🟩 v100               Pass: 100%/54  | Total: 12h 23m | Avg: 13m 46s | Max: 19m 46s | Hits:  90%/256   
    🟩 jobs
      🟩 Build              Pass: 100%/49  | Total: 11h 06m | Avg: 13m 35s | Max: 19m 46s | Hits:  90%/256   
      🟩 Test               Pass: 100%/5   | Total:  1h 17m | Avg: 15m 33s | Max: 16m 07s
    🟩 sm
      🟩 90                 Pass: 100%/1   | Total:  9m 32s | Avg:  9m 32s | Max:  9m 32s
      🟩 90a                Pass: 100%/1   | Total: 10m 19s | Avg: 10m 19s | Max: 10m 19s
    🟩 std
      🟩 17                 Pass: 100%/29  | Total:  6h 29m | Avg: 13m 26s | Max: 15m 49s
      🟩 20                 Pass: 100%/25  | Total:  5h 54m | Avg: 14m 09s | Max: 19m 46s | Hits:  90%/256   
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
+/- CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
+/- CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

🏃‍ Runner counts (total jobs: 54)

# Runner
43 linux-amd64-cpu16
5 linux-amd64-gpu-v100-latest-1
4 linux-arm64-cpu16
2 windows-amd64-cpu16

Labels
stf Sequential Task Flow programming model
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

2 participants