[STF] reduce access mode #2830
base: main
Conversation
/ok to test
@@ -423,7 +452,7 @@ public:
    Fun&& f = mv(::std::get<2>(*p));
    const sub_shape_t& shape = ::std::get<3>(*p);

-   auto explode_coords = [&](size_t i, deps_t... data) {
+   auto explode_coords = [&](size_t i, typename deps_ops_t::first_type... data) {
Need a comment to say what that type is (we can't use an alias because it's a pack, not a tuple)
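A small self-contained illustration (standard C++, independent of the STF internals) of why no alias can be introduced here: a pack of pair-like types can be expanded member-wise, but cannot be named as a whole the way a tuple can.

```cpp
#include <utility>

// Each element of P is assumed to be pair-like; ::first_type can be expanded
// over the pack, but there is no way to bind the expanded list to a single alias.
template <typename... P>
void explode(typename P::first_type... data)
{
  // using firsts_t = typename P::first_type...;  // ill-formed: cannot alias a pack
  ((void) data, ...); // consume the arguments; placeholder for real work
}

int main()
{
  explode<std::pair<int, float>, std::pair<double, char>>(1, 2.0);
}
```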
…n access mode, and start to implement all the mechanisms for reductions in parallel_for
… to cuda::std::tuple
/ok to test
cudax/include/cuda/experimental/__stf/internal/parallel_for_scope.cuh
    // Write the block's result to the output array
    if (tid == 0)
    {
      tuple_set_op<tuple_ops>(redux_buffer[blockIdx.x], per_block_redux_buffer[0].get());
specialize if only one block...
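A hedged sketch of that specialization, written against the names visible in the snippet above; final_result is a hypothetical stand-in for wherever the single-pass result ultimately goes.

```cpp
// Sketch only: with a single block, per_block_redux_buffer[0] already holds the
// final value, so the staging into redux_buffer (and the later pass that folds
// the per-block results together) could be skipped.
if (tid == 0)
{
  if (gridDim.x == 1)
  {
    // hypothetical destination of the final value
    tuple_set_op<tuple_ops>(final_result, per_block_redux_buffer[0].get());
  }
  else
  {
    // Write the block's result to the output array (as in the current code)
    tuple_set_op<tuple_ops>(redux_buffer[blockIdx.x], per_block_redux_buffer[0].get());
  }
}
```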
cudax/include/cuda/experimental/__stf/internal/parallel_for_scope.cuh
🟩 CI finished in 36m 00s: Pass: 100%/54 | Total: 11h 41m | Avg: 12m 59s | Max: 20m 16s | Hits: 90%/246
Projects: CCCL Infrastructure, libcu++, CUB, Thrust, CUDA Experimental (+/-), python, CCCL C Parallel Library, Catch2Helper

Modifications in project or dependencies? Only CUDA Experimental (+/-).
🏃 Runner counts (total jobs: 54)
# | Runner |
---|---|
43 | linux-amd64-cpu16 |
5 | linux-amd64-gpu-v100-latest-1 |
4 | linux-arm64-cpu16 |
2 | windows-amd64-cpu16 |
double X0(int i)
{
  return sin((double) i);
Suggested change:
-  return sin((double) i);
+  return sin(double(i));
In fact I think you don't even need the explicit conversion.
It’s probably necessary due to nasty compilers (one day we will support MSVC…)
double Y0(int i)
{
  return cos((double) i);
Suggested change:
-  return cos((double) i);
+  return cos(double(i));
int main()
{
  context ctx;
move down just before its first use
cudax/examples/stf/CMakeLists.txt
    01-axpy-parallel_for.cu
    01-axpy-reduce.cu
    01-axpy-launch.cu
Suggested change:
-    01-axpy-parallel_for.cu
-    01-axpy-reduce.cu
-    01-axpy-launch.cu
+    01-axpy-launch.cu
+    01-axpy-parallel_for.cu
+    01-axpy-reduce.cu
i.e. keep the list sorted
// determines whether the character is alphabetical
__host__ __device__ bool is_alpha(const char c)
{
  return (c >= 'A' && c <= 'z');
Actually that's incorrect because there are a few non-alpha characters between 'Z' and 'a'.
Suggested change:
-  return (c >= 'A' && c <= 'z');
+  return c >= 'A' && c <= 'Z' || c >= 'a' && c <= 'z';
I will check what the Thrust example does …
Whoa chatgpt just suggested this clever thing:
Suggested change:
-  return (c >= 'A' && c <= 'z');
+  return ((c | 0x20) - 'a') < 26;
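A side note on that trick: the subtraction has to be carried out in unsigned (or truncated to a byte) to be safe, otherwise characters below 'a' after OR-ing in 0x20 give a negative difference that still compares less than 26. A hedged sketch of a safe variant, assuming plain ASCII input:

```cpp
// Sketch only: branchless ASCII alpha test. OR-ing in bit 0x20 lower-cases an
// ASCII letter; the cast to unsigned char keeps characters such as '0' (whose
// difference is negative) from slipping under the < 26 comparison.
__host__ __device__ bool is_alpha(const char c)
{
  return static_cast<unsigned char>((c | 0x20) - 'a') < 26;
}
```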
ctx.parallel_for(ltext.shape(), ltext.read(), lcnt.reduce(reducer::sum<int>{}))
    ->*[] _CCCL_DEVICE(size_t i, auto text, int& s) {
          /* When we have the beginning of a new word, increment the counter */
          if (!is_alpha(text(i)) && is_alpha(text(i + 1)))
Careful here with out of bounds accesses.
Suggested change:
-          if (!is_alpha(text(i)) && is_alpha(text(i + 1)))
+          if (is_alpha(text(i)) && (i == 0 || !is_alpha(text(i - 1))))
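Putting the bounds-safe predicate back into the snippet above gives a sketch like the following (ltext, lcnt and the reducer::sum reduction are taken from the original code; the rest is illustrative):

```cpp
// Sketch only: a word begins at i when text(i) is alphabetical and either i is
// the very first character or the previous character is not alphabetical, so no
// access past the end of the text is needed.
ctx.parallel_for(ltext.shape(), ltext.read(), lcnt.reduce(reducer::sum<int>{}))
    ->*[] _CCCL_DEVICE(size_t i, auto text, int& s) {
          if (is_alpha(text(i)) && (i == 0 || !is_alpha(text(i - 1))))
          {
            s++;
          }
        };
```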
int ref_cnt = 0;
for (size_t i = 0; i < sizeof(raw_input) - 1; i++)
{
  if (!is_alpha(raw_input[i]) && is_alpha(raw_input[i + 1]))
Suggested change:
-  if (!is_alpha(raw_input[i]) && is_alpha(raw_input[i + 1]))
+  if (is_alpha(raw_input[i]) && (i == 0 || !is_alpha(raw_input[i - 1])))
This calls for a better choice of the shape instead of the test
@@ -517,6 +517,23 @@ public:
    cudaGraphDebugDotPrint(get_graph(), filename.c_str(), flags);
  }

  template <typename T>
  auto transfer_host(cuda::experimental::stf::logical_data<T>& ldata)
As discussed, this would be best renamed as wait or host_wait. There should be clear messaging that this is a blocking function (unlike most of CUDASTF).
I disagree: wait suggests the main goal is to synchronize with the data, while the goal is to get its content; synchronization is a necessary byproduct.
We could also have a wait method on logical data, we have missed that for ages.
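For context, a hedged usage sketch of the accessor under discussion, keeping the transfer_host spelling from the diff above; the scalar logical data and its element type are illustrative assumptions:

```cpp
// Sketch only: transfer_host blocks until all pending work on lsum has
// completed, then returns its content as a host-side owning value
// (typename owning_container_of<T>::type, as in the snippet discussed below).
auto lsum = ctx.logical_data(shape_of<scalar_view<double>>()); // assumed scalar logical data
// ... submit tasks that produce lsum ...
double result = ctx.transfer_host(lsum); // blocking call: the value is ready on return
```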
using valT = typename owning_container_of<T>::type;
valT out;
No need for the using - you're using the name just once.
Suggested change:
-    using valT = typename owning_container_of<T>::type;
-    valT out;
+    typename owning_container_of<T>::type out;
reduce_no_init = 8,
reduce_do_init = 16,
Suggested change:
-  reduce_no_init = 8,
-  reduce_do_init = 16,
+  reduce_init = 8,
+  reduce = 16,
I mean the implicit operation is "reduce" and there's a special thing "reduce initialize".
I guess the default should be reduce, and the less common case would be reduce_no_init, so I'll do that instead.
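A hedged sketch of how that choice could read for users; lX, lY, lsum and the no_init tag are assumptions (the tag is hinted at by the commit message below), while the reduce()/reducer::sum spelling follows the word-count snippet above:

```cpp
// Sketch only: common case, the accumulator is (re)initialized by the runtime
// before the reduction (the access mode simply named "reduce").
ctx.parallel_for(lX.shape(), lX.read(), lsum.reduce(reducer::sum<double>{}))
    ->*[] _CCCL_DEVICE(size_t i, auto x, double& s) { s += x(i); };

// Hypothetical opt-out ("reduce_no_init"): keep the current content of lsum and
// fold further contributions into it, e.g. when accumulating over several launches.
ctx.parallel_for(lY.shape(), lY.read(), lsum.reduce(reducer::sum<double>{}, no_init{}))
    ->*[] _CCCL_DEVICE(size_t i, auto y, double& s) { s += y(i); };
```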
case access_mode::reduce_no_init:
  return "reduce (no init)"; // op ?
case access_mode::reduce_do_init:
  return "reduce (do init)"; // op ?
Suggested change:
-    case access_mode::reduce_no_init:
-      return "reduce (no init)"; // op ?
-    case access_mode::reduce_do_init:
-      return "reduce (do init)"; // op ?
+    case access_mode::reduce:
+      return "reduce"; // op ?
+    case access_mode::reduce_init:
+      return "reduce (initialization stage)"; // op ?
…init to user, then use true_type and false_type internally. Also rename reduce_do_init to reduce for clarity, as this is the most common case.
/ok to test
This example should be reworked as a DOT product instead, to make it look more useful.
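A hedged sketch of what such a DOT-product example could look like with the new access mode; lX, lY, lsum and the value type are illustrative, and the reduce()/reducer::sum spelling follows the snippets above:

```cpp
// Sketch only: accumulate the dot product of X and Y into a single scalar
// logical data using the reduction access mode introduced by this PR.
ctx.parallel_for(lX.shape(), lX.read(), lY.read(), lsum.reduce(reducer::sum<double>{}))
    ->*[] _CCCL_DEVICE(size_t i, auto x, auto y, double& s) {
          s += x(i) * y(i);
        };
```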
    default:
      assert(false);
      abort();
  }
}

/**
Does this tag type belong here?
  size_t input_cnt) override
{
  cudaGraphNode_t dummy;
  cuda_safe_call(cudaGraphAddEmptyNode(&dummy, graph, input_nodes, input_cnt));
This is not implemented... Perhaps we need to add helpers?
🟩 CI finished in 32m 26s: Pass: 100%/54 | Total: 12h 23m | Avg: 13m 46s | Max: 19m 46s | Hits: 90%/256
Projects: CCCL Infrastructure, libcu++, CUB, Thrust, CUDA Experimental (+/-), python, CCCL C Parallel Library, Catch2Helper

Modifications in project or dependencies? Only CUDA Experimental (+/-).
🏃 Runner counts (total jobs: 54)
# | Runner |
---|---|
43 | linux-amd64-cpu16 |
5 | linux-amd64-gpu-v100-latest-1 |
4 | linux-arm64-cpu16 |
2 | windows-amd64-cpu16 |
Description
closes
This PR introduces a reduction access mode to make it much easier to write parallel_for kernels that also perform reductions into a logical data.
Checklist