
[FEA]: Support fancy iterators in cuda.parallel #2479

Open

gevtushenko opened this issue Sep 29, 2024 · 10 comments · May be fixed by #2788
Labels: 2.8.0 (target for 2.8.0 release), feature request (New feature or request)
@gevtushenko (Collaborator)

Is this a duplicate?

Area

General CCCL

Is your feature request related to a problem? Please describe.

Compared to existing solutions in PyTorch and CuPy, one of the distinguishing features of cuda.parallel is flexibility. Part of that flexibility comes from support for user-defined data types and operators. But compared to the CUDA C++ solution, the cuda.parallel API is still limited: it is missing fancy iterators and cache-modified iterators.

Describe the solution you'd like

Given that fancy iterator support might require rather invasive changes to cuda.parallel and the CCCL/c libraries, we should design fancy iterators before introducing more algorithms to cuda.parallel.
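
As a rough illustration (a hypothetical shape, not a committed design; the names mirror the proof-of-concept further down this thread), the goal is for fancy iterators to be accepted wherever a device array is accepted today:

# Hypothetical sketch only; count/map/parallel_algorithm are POC names.
d_in = count(0)                         # like thrust::counting_iterator
d_in = map(count(0), lambda x: 2 * x)   # like thrust::transform_iterator
parallel_algorithm(d_in, num_items, d_out)  # algorithm consumes the iterator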

Describe alternatives you've considered

No response

Additional context

No response

@gevtushenko gevtushenko added the feature request New feature or request. label Sep 29, 2024
@github-project-automation github-project-automation bot moved this to Todo in CCCL Sep 29, 2024
@rwgk (Contributor) commented Oct 15, 2024

Keeping track of a link I got from Georgii: #2335

@rwgk (Contributor) commented Oct 15, 2024

Tracking for easy future reference:

https://numba.pydata.org/numba-doc/dev/user/jitclass.html

Limitations

  • Support for jitclasses are available on CPU only. (Note: Support for GPU devices is planned for a future release.)

@rwgk (Contributor) commented Oct 15, 2024

I'm currently doing work on this branch:

https://github.com/rwgk/cccl/tree/python_random_access_iterators

Last commit rwgk@d1c4816 — Use TransformRAI to implement constant, counting, arbitrary RAIs.

Current thinking:

@rwgk rwgk moved this from Todo to In Progress in CCCL Oct 15, 2024
@rwgk (Contributor) commented Oct 16, 2024

Some points for the upcoming meeting, to capture the state of my understanding:


  • What Python API we want exactly is TBD. But we can figure out the core requirement, which is simply:
scalar_with_certain_dtype = RAI_object[random_access_array_index]

  • On the C++ side we need to build an actual C++ random access iterator, which Georgii has implemented like this:

"extern \"C\" __device__ {3} {4}(const void *self_ptr);\n"
"extern \"C\" __device__ void {5}(void *self_ptr, {0} offset);\n"
"struct __align__({2}) input_iterator_state_t {{\n"
"  using iterator_category = cuda::std::random_access_iterator_tag;\n"
"  using value_type = {3};\n"
"  using difference_type = {0};\n"
"  using pointer = {3}*;\n"
"  using reference = {3}&;\n"
"  __device__ value_type operator*() const {{ return {4}(this); }}\n"
"  __device__ input_iterator_state_t& operator+=(difference_type diff) {{\n"
"    {5}(this, diff);\n"
"    return *this;\n"
"  }}\n"
"  __device__ value_type operator[](difference_type diff) const {{\n"
"    return *(*this + diff);\n"
"  }}\n"
"  __device__ input_iterator_state_t operator+(difference_type diff) const {{\n"
"    input_iterator_state_t result = *this;\n"
"    result += diff;\n"
"    return result;\n"
"  }}\n"
"  char data[{1}];\n"
"}};\n",

That snippet is a format-string template: the placeholders are substituted with the iterator's types, sizes, and symbol names, and the resulting C++ source is compiled with nvrtc a little further down.
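
For illustration, if the concatenated literal above were held in a Python string called template (hypothetical name; Python's str.format uses the same placeholder syntax), the substitution for a counting iterator of int32 with 4-byte state might look like:

# {0}=difference_type, {1}=state size, {2}=alignment, {3}=value_type,
# {4}=dereference symbol, {5}=advance symbol (all values hypothetical):
iterator_src = template.format(
    "int", 4, 4, "int", "count_int32_dereference", "count_int32_advance")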

At this stage I'm very conservative: I'm aiming for minimal changes to Georgii's code to get the desired behavior when passing

  • a Python object that behaves like Python's itertools.repeat() / C++ thrust::constant_iterator,
  • a Python object that behaves like Python's itertools.count() / C++ thrust::counting_iterator,
  • arbitrary Python callables meeting the above requirements/limitations, which one might implement using thrust::transform_iterator in C++.

Non-goal at this stage: a cache-modified C++ input iterator. — I still need to learn what exactly is needed. In a follow-on step (to the work above), I want to solve this, also in a conservative fashion.

When that is done I want to take a step back to review:

  • User-facing Python API. — Ideally with customers involved.

  • Do we want to continue to work with ctypes?

  • Do we want to build the nvrtc input strings in C++, or could we use pynvrtc?

@rwgk (Contributor) commented Oct 17, 2024

Tracking progress:

I just created this Draft PR: #2595, currently @ commit 5ba7a0f413123cc05e6eb9f3690e8b571659c670

Copy-pasting the current PR description:


Goal: any unary_op(distance) that can be compiled by numba can be passed as the d_in argument of reduce_into().

Current status:

  • test_reduce.py input_generator: ["constant", "counting", "arbitrary", "nested_inner"] tests PASS.

  • input_generator: "nested_global" test FAILS because numba.cuda.compile() fails for this code:

        def other_unary_op(distance):
            permutation = (4, 2, 0, 3, 1)
            return permutation[distance % len(permutation)]
        def input_unary_op(distance):
            return 2 * other_unary_op(distance)

The error is:

E       numba.core.errors.TypingError: Failed in cuda mode pipeline (step: nopython frontend)
E       Untyped global name 'other_unary_op': Cannot determine Numba type of <class 'function'>
E       
E       File "tests/test_reduce.py", line 117:
E               def input_unary_op(distance):
E                   return 2 * other_unary_op(distance)
E                   ^

@gevtushenko (Collaborator, Author)

Let's take a look at some of the API that we need in cuda.parallel.itertools.
Below you can find a proof-of-concept implementation of count, repeat, map, and cache.

num_items = 4
output_array = numba.cuda.device_array(num_items, dtype=np.int32)


# Returns 42 N times
r = repeat(42)
parallel_algorithm(r, num_items, output_array)
print("expect: 42 42 42 42;  get: ", " ".join([str(x) for x in output_array.copy_to_host()]))


# Returns an integer sequence starting at 42
c = count(42)
parallel_algorithm(c, num_items, output_array)
print("expect: 42 43 44 45;  get: ", " ".join([str(x) for x in output_array.copy_to_host()]))


# Multiplies 42 (coming from repeat) by 2
def mult(x):
    return x * 2

mult_42_by_2 = map(r, mult)
parallel_algorithm(mult_42_by_2, num_items, output_array)
print("expect: 84 84 84 84;  get: ", " ".join([str(x) for x in output_array.copy_to_host()]))


# Adds 10 to result of multiplication of repeat by 2
def add(x):
    return x + 10

mult_42_by_2_plus10 = map(mult_42_by_2, add)
parallel_algorithm(mult_42_by_2_plus10, num_items, output_array)
print("expect: 94 94 94 94;  get: ", " ".join([str(x) for x in output_array.copy_to_host()]))


# Same as above, but for count
mult_count_by_2 = map(c, mult)
parallel_algorithm(mult_count_by_2, num_items, output_array)
print("expect: 84 86 88 90;  get: ", " ".join([str(x) for x in output_array.copy_to_host()]))

mult_count_by_2_and_add_10 = map(mult_count_by_2, add)
parallel_algorithm(mult_count_by_2_and_add_10, num_items, output_array)
print("expect: 94 96 98 100; get:", " ".join([str(x) for x in output_array.copy_to_host()]))


# Example of how composable iterators can wrap a pointer in a generic way
input_array = numba.cuda.to_device(np.array([4, 3, 2, 1], dtype=np.int32))
ptr = pointer(input_array) # TODO this transformation should be hidden on the transform implementation side
parallel_algorithm(ptr, num_items, output_array)
print("expect:  4  3  2 1  ; get:", " ".join([str(x) for x in output_array.copy_to_host()]))

input_array = numba.cuda.to_device(np.array([4, 3, 2, 1], dtype=np.int32))
ptr = pointer(input_array) # TODO this transformation should be hidden on the transform implementation side
tptr = map(ptr, mult)
parallel_algorithm(tptr, num_items, output_array)
print("expect:  8  6  4 2  ; get:", " ".join([str(x) for x in output_array.copy_to_host()]))


# Example of caching iterator
streamed_input = cache(input_array, 'stream')
parallel_algorithm(streamed_input, num_items, output_array)
print("expect:  4  3  2 1  ; get:", " ".join([str(x) for x in output_array.copy_to_host()]))
main.py
from numba.core import cgutils
from llvmlite import ir
from numba import types
from numba.core.typing import signature
from numba.core.extending import intrinsic, overload
from numba.core.errors import NumbaTypeError

import ctypes
import numba
import numba.cuda
import numpy as np


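# Each iterator class below follows the same protocol: compile an
# "<prefix>_advance" / "<prefix>_dereference" function pair to LTO-IR (linked
# into the kernel by the C++ host code), and expose host_address(), size(),
# and alignment() so the iterator state can be passed as a kernel parameter.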
class RawPointer:
    def __init__(self, ptr): # TODO Showcasing the case of int32, need dtype with at least primitive types, ideally any numba type
        self.val = ctypes.c_void_p(ptr)
        self.ltoirs = [numba.cuda.compile(RawPointer.pointer_advance, sig=numba.types.void(numba.types.CPointer(numba.types.uint64), numba.types.int32), output='ltoir'),
                       numba.cuda.compile(RawPointer.pointer_dereference, sig=numba.types.int32(numba.types.CPointer(numba.types.CPointer(numba.types.int32))), output='ltoir')]
        self.prefix = 'pointer'

    def pointer_advance(this, distance):
        this[0] = this[0] + numba.types.uint64(4 * distance) # TODO Showcasing the case of int32, need dtype with at least primitive types, ideally any numba type

    def pointer_dereference(this):
        return this[0][0]
    
    def host_address(self):
        return ctypes.byref(self.val)
    
    def size(self):
        return 8 # TODO should be using numba for user-defined types support
    
    def alignment(self):
        return 8 # TODO should be using numba for user-defined types support


def pointer(container):
    return RawPointer(container.device_ctypes_pointer.value)


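# ldcs lowers to an inline-PTX ld.global.cs ("cache streaming", evict-first)
# load, the same load modifier as cub::CacheModifiedInputIterator<LOAD_CS, ...>;
# it is what makes CacheModifiedPointer below a cache-modified iterator.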
@intrinsic
def ldcs(typingctx, base):
    signature = types.int32(types.CPointer(types.int32))

    def codegen(context, builder, sig, args):
        int32 = ir.IntType(32)
        int32_ptr = int32.as_pointer()
        ldcs_type = ir.FunctionType(int32, [int32_ptr])
        ldcs = ir.InlineAsm(ldcs_type, "ld.global.cs.b32 $0, [$1];", "=r, l")
        return builder.call(ldcs, args)

    return signature, codegen


class CacheModifiedPointer:
    def __init__(self, ptr): # TODO Showcasing the case of int32, need dtype with at least primitive types, ideally any numba type
        self.val = ctypes.c_void_p(ptr)
        self.ltoirs = [numba.cuda.compile(CacheModifiedPointer.cache_advance, sig=numba.types.void(numba.types.CPointer(numba.types.uint64), numba.types.int32), output='ltoir'),
                       numba.cuda.compile(CacheModifiedPointer.cache_dereference, sig=numba.types.int32(numba.types.CPointer(numba.types.CPointer(numba.types.int32))), output='ltoir')]
        self.prefix = 'cache'

    def cache_advance(this, distance):
        this[0] = this[0] + numba.types.uint64(4 * distance) # TODO Showcasing the case of int32, need dtype with at least primitive types, ideally any numba type

    def cache_dereference(this):
        return ldcs(this[0])
    
    def host_address(self):
        return ctypes.byref(self.val)
    
    def size(self):
        return 8 # TODO should be using numba for user-defined types support
    
    def alignment(self):
        return 8 # TODO should be using numba for user-defined types support


def cache(container, modifier='stream'):
    if modifier != 'stream':
        raise NotImplementedError("Only stream modifier is supported")
    return CacheModifiedPointer(container.device_ctypes_pointer.value)


class ConstantIterator:
    def __init__(self, val): # TODO Showcasing the case of int32, need dtype with at least primitive types, ideally any numba type
        thisty = numba.types.CPointer(numba.types.int32)
        self.val = ctypes.c_int32(val)
        self.ltoirs = [numba.cuda.compile(ConstantIterator.constant_int32_advance, sig=numba.types.void(thisty, numba.types.int32), output='ltoir'),
                       numba.cuda.compile(ConstantIterator.constant_int32_dereference, sig=numba.types.int32(thisty), output='ltoir')]
        self.prefix = 'constant_int32'

    def constant_int32_advance(this, _):
        pass

    def constant_int32_dereference(this):
        return this[0]
    
    def host_address(self):
        # TODO should use numba instead for support of user-defined types
        return ctypes.byref(self.val)
    
    def size(self):
        return ctypes.sizeof(self.val) # TODO should be using numba for user-defined types support
    
    def alignment(self):
        return ctypes.alignment(self.val) # TODO should be using numba for user-defined types support


def repeat(value):
    return ConstantIterator(value)


class CountingIterator:
    def __init__(self, count): # TODO Showcasing the case of int32, need dtype
        thisty = numba.types.CPointer(numba.types.int32)
        self.count = ctypes.c_int32(count)
        self.ltoirs = [numba.cuda.compile(CountingIterator.count_int32_advance, sig=numba.types.void(thisty, numba.types.int32), output='ltoir'),
                       numba.cuda.compile(CountingIterator.count_int32_dereference, sig=numba.types.int32(thisty), output='ltoir')]
        self.prefix = 'count_int32'

    def count_int32_advance(this, diff):
        this[0] += diff

    def count_int32_dereference(this):
        return this[0]
    
    def host_address(self):
        return ctypes.byref(self.count)
    
    def size(self):
        return ctypes.sizeof(self.count) # TODO should be using numba for user-defined types support
    
    def alignment(self):
        return ctypes.alignment(self.count) # TODO should be using numba for user-defined types support


def count(offset):
    return CountingIterator(offset)


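# map(it, op) builds a TransformIterator: source_advance / source_dereference
# are placeholder functions, and the @overload/@intrinsic pairs below lower
# calls to them into calls to extern symbols named "<it.prefix>_advance" /
# "<it.prefix>_dereference" (and op.__name__ for the operator), which are
# resolved against the wrapped iterator's LTO-IR at link time.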
def map(it, op):
    def source_advance(it_state_ptr, diff):
        pass

    def make_advance_codegen(name):
        retty = types.void
        statety = types.CPointer(types.int8)
        distty = types.int32

        def codegen(context, builder, sig, args):
            state_ptr, dist = args
            fnty = ir.FunctionType(ir.VoidType(), (ir.PointerType(ir.IntType(8)), ir.IntType(32)))
            fn = cgutils.get_or_insert_function(builder.module, fnty, name)
            builder.call(fn, (state_ptr, dist))

        return signature(retty, statety, distty), codegen


    def advance_codegen(func_to_overload, name):
        @intrinsic
        def intrinsic_impl(typingctx, it_state_ptr, diff):
            return make_advance_codegen(name)

        @overload(func_to_overload, target='cuda')
        def impl(it_state_ptr, diff):
            def impl(it_state_ptr, diff):
                return intrinsic_impl(it_state_ptr, diff)
            return impl

    def source_dereference(it_state_ptr):
        pass

    def make_dereference_codegen(name):
        retty = types.int32
        statety = types.CPointer(types.int8)

        def codegen(context, builder, sig, args):
            state_ptr, = args
            fnty = ir.FunctionType(ir.IntType(32), (ir.PointerType(ir.IntType(8)),))
            fn = cgutils.get_or_insert_function(builder.module, fnty, name)
            return builder.call(fn, (state_ptr,))

        return signature(retty, statety), codegen


    def dereference_codegen(func_to_overload, name):
        @intrinsic
        def intrinsic_impl(typingctx, it_state_ptr):
            return make_dereference_codegen(name)

        @overload(func_to_overload, target='cuda')
        def impl(it_state_ptr):
            def impl(it_state_ptr):
                return intrinsic_impl(it_state_ptr)
            return impl

    def make_op_codegen(name):
        retty = types.int32
        valty = types.int32

        def codegen(context, builder, sig, args):
            val, = args
            fnty = ir.FunctionType(ir.IntType(32), (ir.IntType(32),))
            fn = cgutils.get_or_insert_function(builder.module, fnty, name)
            return builder.call(fn, (val,))

        return signature(retty, valty), codegen


    def op_codegen(func_to_overload, name):
        @intrinsic
        def intrinsic_impl(typingctx, val):
            return make_op_codegen(name)

        @overload(func_to_overload, target='cuda')
        def impl(val):
            def impl(val):
                return intrinsic_impl(val)
            return impl

    advance_codegen(source_advance, f"{it.prefix}_advance")
    dereference_codegen(source_dereference, f"{it.prefix}_dereference")
    op_codegen(op, op.__name__)


    class TransformIterator:
        def __init__(self, it, op):
            self.it = it # TODO support row pointers
            self.op = op
            self.prefix = f'transform_{it.prefix}_{op.__name__}'
            self.ltoirs = it.ltoirs + [numba.cuda.compile(TransformIterator.transform_advance, sig=numba.types.void(numba.types.CPointer(numba.types.char), numba.types.int32), output='ltoir', abi_info={"abi_name": f"{self.prefix}_advance"}),
                                       numba.cuda.compile(TransformIterator.transform_dereference, sig=numba.types.int32(numba.types.CPointer(numba.types.char)), output='ltoir', abi_info={"abi_name": f"{self.prefix}_dereference"}),
                                       numba.cuda.compile(op, sig=numba.types.int32(numba.types.int32), output='ltoir')]
        
        def transform_advance(it_state_ptr, diff):
            source_advance(it_state_ptr, diff) # just a function call
        
        def transform_dereference(it_state_ptr):
            return op(source_dereference(it_state_ptr)) # just a function call

        def host_address(self):
            return self.it.host_address() # TODO support stateful operators

        def size(self):
            return self.it.size() # TODO fix for stateful op
        
        def alignment(self):
            return self.it.alignment() # TODO fix for stateful op

    
    return TransformIterator(it, op)


def parallel_algorithm(d_in, num_items, d_out):
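    # numba.cuda.compile(..., output='ltoir') returns a (code, return_type)
    # tuple; collect just the LTO-IR code blobs for linking on the C++ side.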
    ltoirs = [ltoir[0] for ltoir in d_in.ltoirs]

    LTOIRArrayType = ctypes.c_char_p * len(ltoirs)
    LTOIRSizesArrayType = ctypes.c_int * len(ltoirs)
    ltoir_pointers = [ctypes.c_char_p(ltoir) for ltoir in ltoirs]
    ltoir_sizes = [len(ltoir) for ltoir in ltoirs]
    input_ltoirs = LTOIRArrayType(*ltoir_pointers)
    input_ltoir_sizes = LTOIRSizesArrayType(*ltoir_sizes)

    bindings = ctypes.CDLL('build/libkernel.so')
    bindings.host_code.argtypes = [ctypes.c_int,  # size
                                   ctypes.c_int,  # alignment
                                   ctypes.c_void_p, # input_pointer
                                   ctypes.c_char_p, # prefix
                                   ctypes.c_int,    # num_items
                                   ctypes.c_void_p, # output_pointer
                                   ctypes.POINTER(ctypes.c_char_p), 
                                   ctypes.POINTER(ctypes.c_int),
                                   ctypes.c_int]

    output_pointer = d_out.device_ctypes_pointer.value
    bindings.host_code(d_in.size(), d_in.alignment(), d_in.host_address(), d_in.prefix.encode('utf-8'), num_items, output_pointer, input_ltoirs, input_ltoir_sizes, len(ltoirs))


# User Code

num_items = 4
output_array = numba.cuda.device_array(num_items, dtype=np.int32)

## Repeat
r = repeat(42)
parallel_algorithm(r, num_items, output_array)
print("expect: 42 42 42 42;  get: ", " ".join([str(x) for x in output_array.copy_to_host()]))

## Count
c = count(42)
parallel_algorithm(c, num_items, output_array)
print("expect: 42 43 44 45;  get: ", " ".join([str(x) for x in output_array.copy_to_host()]))

## Transform
def mult(x):
    return x * 2

mult_42_by_2 = map(r, mult)
parallel_algorithm(mult_42_by_2, num_items, output_array)
print("expect: 84 84 84 84;  get: ", " ".join([str(x) for x in output_array.copy_to_host()]))

def add(x):
    return x + 10

mult_42_by_2_plus10 = map(mult_42_by_2, add)
parallel_algorithm(mult_42_by_2_plus10, num_items, output_array)
print("expect: 94 94 94 94;  get: ", " ".join([str(x) for x in output_array.copy_to_host()]))

mult_count_by_2 = map(c, mult)
parallel_algorithm(mult_count_by_2, num_items, output_array)
print("expect: 84 86 88 90;  get: ", " ".join([str(x) for x in output_array.copy_to_host()]))

mult_count_by_2_and_add_10 = map(mult_count_by_2, add)
parallel_algorithm(mult_count_by_2_and_add_10, num_items, output_array)
print("expect: 94 96 98 100; get:", " ".join([str(x) for x in output_array.copy_to_host()]))

input_array = numba.cuda.to_device(np.array([4, 3, 2, 1], dtype=np.int32))
ptr = pointer(input_array) # TODO this transformation should be hidden on the transform implementation side
parallel_algorithm(ptr, num_items, output_array)
print("expect:  4  3  2 1  ; get:", " ".join([str(x) for x in output_array.copy_to_host()]))

input_array = numba.cuda.to_device(np.array([4, 3, 2, 1], dtype=np.int32))
ptr = pointer(input_array) # TODO this transformation should be hidden on the transform implementation side
tptr = map(ptr, mult)
parallel_algorithm(tptr, num_items, output_array)
print("expect:  8  6  4 2  ; get:", " ".join([str(x) for x in output_array.copy_to_host()]))

streamed_input = cache(input_array, 'stream')
parallel_algorithm(streamed_input, num_items, output_array)
print("expect:  4  3  2 1  ; get:", " ".join([str(x) for x in output_array.copy_to_host()]))
CMakeLists.txt
cmake_minimum_required(VERSION 3.21)

project(CCCL_C_Parallel LANGUAGES CUDA CXX)

add_library(kernel SHARED kernel.cpp)
set_property(TARGET kernel PROPERTY POSITION_INDEPENDENT_CODE ON)
set_property(TARGET kernel PROPERTY CXX_STANDARD 20)

find_package(CUDAToolkit REQUIRED)

# TODO Use static versions of cudart, nvrtc, and nvJitLink
target_link_libraries(kernel PRIVATE
  CUDA::cudart
  CUDA::nvrtc
  CUDA::nvJitLink
  CUDA::cuda_driver)
kernel.cpp
#include <string>
#include <cuda.h>
#include <cuda_runtime.h>
#include <iostream>
#include <nvJitLink.h>
#include <nvrtc.h>
#include <memory>

void check(nvrtcResult result)
{
  if (result != NVRTC_SUCCESS)
  {
    throw std::runtime_error(std::string("NVRTC error: ") + nvrtcGetErrorString(result));
  }
}

void check(CUresult result)
{
  if (result != CUDA_SUCCESS)
  {
    const char* str = nullptr;
    cuGetErrorString(result, &str);
    throw std::runtime_error(std::string("CUDA error: ") + str);
  }
}

void check(nvJitLinkResult result)
{
  if (result != NVJITLINK_SUCCESS)
  {
    throw std::runtime_error(std::string("nvJitLink error: ") + std::to_string(result));
  }
}


extern "C" void host_code(int iterator_size,
                          int iterator_alignment,
                          void *pointer_to_cpu_bytes_storing_value, 
                          const char *prefix,
                          int num_items,
                          void *pointer_to_gpu_memory,
                          const void **input_ltoirs, 
                          const int* input_ltoir_sizes,
                          int num_input_ltoirs)
{
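  // Flow: (1) assemble C++ source that declares the extern advance/dereference
  // symbols and wraps the opaque iterator state in an iterator_t of the given
  // size/alignment, (2) compile it to LTO-IR with nvrtc, (3) nvJitLink it
  // against the LTO-IR blobs compiled by numba on the Python side, and
  // (4) load the resulting cubin and launch the kernel.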
  std::string deref = std::string("#define DEREF ") + prefix + "_dereference\n" + "extern \"C\" __device__ int DEREF(const char *state); \n";
  std::string adv = std::string("#define ADV ") + prefix + "_advance\n" + "extern \"C\" __device__ void ADV(char *state, int distance); \n";
  std::string state = std::string("struct __align__(") + std::to_string(iterator_alignment) + R"XXX() iterator_t { 
    // using iterator_category = cuda::std::random_access_iterator_tag; // TODO add include to libcu++
    using value_type = int;
    using difference_type = int;
    using pointer = int;
    using reference = int;
    __device__ value_type operator*() const { return DEREF(data); }
    __device__ iterator_t& operator+=(difference_type diff) {
        ADV(data, diff);
        return *this;
    }
    __device__ value_type operator[](difference_type diff) const {
        return *(*this + diff);
    }
    __device__ iterator_t operator+(difference_type diff) const {
        iterator_t result = *this;
        result += diff;
        return result;
    }
    char data[)XXX" + std::to_string(iterator_size) + "]; };\n";

  // CUB kernel accepts an iterator and does some of the following operations on it
  std::string kernel_source = deref + adv + state + R"XXX(
    extern "C" __global__ void device_code(int num_items, iterator_t iterator, int *pointer) {
      iterator_t it = iterator + blockIdx.x; 
      for (int i = 0; i < num_items; i++) {
        pointer[i] = it[i];
      }
    }
  )XXX";

  nvrtcProgram prog;
  const char *name = "test_kernel";
  check(nvrtcCreateProgram(&prog, kernel_source.c_str(), name, 0, nullptr, nullptr));

  cudaDeviceProp deviceProp;
  cudaGetDeviceProperties(&deviceProp, 0);

  const int cc_major = deviceProp.major;
  const int cc_minor = deviceProp.minor;
  const std::string arch = std::string("-arch=sm_") + std::to_string(cc_major) + std::to_string(cc_minor);

  const char* args[] = { arch.c_str(), "-rdc=true", "-dlto" };
  const int num_args = sizeof(args) / sizeof(args[0]);

  std::size_t log_size{};
  nvrtcResult compile_result = nvrtcCompileProgram(prog, num_args, args);

  check(nvrtcGetProgramLogSize(prog, &log_size));

  std::unique_ptr<char[]> log{new char[log_size]};
  check(nvrtcGetProgramLog(prog, log.get()));

  if (log_size > 1)
  {
    std::cerr << log.get() << std::endl;
  }

  check(compile_result);

  std::size_t ltoir_size{};
  check(nvrtcGetLTOIRSize(prog, &ltoir_size));
  std::unique_ptr<char[]> ltoir{new char[ltoir_size]};
  check(nvrtcGetLTOIR(prog, ltoir.get()));
  check(nvrtcDestroyProgram(&prog));

  nvJitLinkHandle handle;
  const char* lopts[] = {"-lto", arch.c_str()};
  check(nvJitLinkCreate(&handle, 2, lopts));

  check(nvJitLinkAddData(handle, NVJITLINK_INPUT_LTOIR, ltoir.get(), ltoir_size, name));

  for (int ltoir_id = 0; ltoir_id < num_input_ltoirs; ltoir_id++) {
    check(nvJitLinkAddData(handle, NVJITLINK_INPUT_LTOIR, input_ltoirs[ltoir_id], input_ltoir_sizes[ltoir_id], name));
  }

  check(nvJitLinkComplete(handle));

  std::size_t cubin_size{};
  check(nvJitLinkGetLinkedCubinSize(handle, &cubin_size));
  std::unique_ptr<char[]> cubin{new char[cubin_size]};
  check(nvJitLinkGetLinkedCubin(handle, cubin.get()));
  check(nvJitLinkDestroy(&handle));

  CUlibrary library;
  CUkernel kernel;
  check(cuLibraryLoadData(&library, cubin.get(), nullptr, nullptr, 0, nullptr, nullptr, 0));
  check(cuLibraryGetKernel(&kernel, library, "device_code"));

  void *pointer_to_cpu_bytes_storing_pointer_to_gpu_memory = &pointer_to_gpu_memory;
  void *kernel_args[] = { &num_items, pointer_to_cpu_bytes_storing_value, pointer_to_cpu_bytes_storing_pointer_to_gpu_memory };

  check(cuLaunchKernel((CUfunction)kernel, 1, 1, 1, 1, 1, 1, 0, 0, kernel_args, nullptr));
  check(cuStreamSynchronize(0));
  check(cuLibraryUnload(library));
}

Before this proof-of-concept is merged, we have to address the following:

  • Each iterator has to have a value type. We'll have to add a dtype parameter to each of the APIs above, something along the lines of repeat(42, value_type=numba.int32) (maybe call it dtype for consistency). With this, we'll be able to extend the proof-of-concept to all primitive types supported by numba. This will potentially unblock the implementation of a zip iterator.
  • Iterator member functions have to be mangled to include the dtype. Without this, we'll have a problem with algorithms accepting more than one input iterator (like reduce-by-key or merge sort pairs). The issue will show up in merge_pairs(map(count(0, value_type=int32)), map(count(0, value_type=float32))): without mangling, the dereference member functions of count(int32) and count(float32) would share the symbol name count_dereference, leading to incorrect results or linkage failure. See the sketch below.
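
A minimal sketch of such mangling (hypothetical helper; assumes the numba dtype exposes a name attribute, e.g. numba.types.int32.name == 'int32'):

# Hypothetical: fold the dtype into the ABI names handed to numba.cuda.compile
# via abi_info, so iterators with different value types export distinct symbols.
def abi_names(prefix, dtype):
    # e.g. ("count_int32_advance", "count_int32_dereference")
    return (f"{prefix}_{dtype.name}_advance",
            f"{prefix}_{dtype.name}_dereference")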

Future work that can be addressed after initial support of primitive types:

  • map has to support stateful operators. This means that we could capture a runtime value, say an int or a pointer, and use it inside the unary operator. After [FEA]: Research stateful operators for cuda.parallel #2538 is addressed, the API might look like map(count(0), lambda i, d_input=d_input: d_input[i]). In this example, d_input is a runtime parameter. This would require passing an array of state pointers and forming a single buffer, like we do for the device for_each today (
    for_each_kernel_state make_for_kernel_state(cccl_op_t op, cccl_iterator_t iterator)
    {
      // Iterator is either a pointer or a stateful object, allocate space according to its size or alignment
      size_t iter_size = (cccl_iterator_kind_t::iterator == iterator.type) ? iterator.size : sizeof(void*);
      void* iterator_state = (cccl_iterator_kind_t::iterator == iterator.type) ? iterator.state : &iterator.state;
      // Do we need to validate user input? Alignments larger than the provided size?
      size_t user_size = (cccl_op_kind_t::stateful == op.type) ? op.size : 0;
      size_t user_align = (cccl_op_kind_t::stateful == op.type) ? op.alignment : 0;
      auto [min_size, user_op_offset] = calculate_kernel_state_sizes(iter_size, user_size, user_align);
      for_each_default local_buffer{};
      char* iter_start = (char*) &local_buffer;
      // Check if local blueprint provides enough space
      bool use_allocated_storage = sizeof(for_each_default) < min_size;
      if (use_allocated_storage)
      {
        // Allocate required space
        iter_start = new char[min_size];
      }
      // Memcpy into either local or allocated buffer
      memcpy(iter_start, iterator_state, iter_size);
      if (cccl_op_kind_t::stateful == op.type)
      {
        char* user_start = iter_start + user_op_offset;
        memcpy(user_start, op.state, user_size);
      }
      // Return either local buffer or unique_ptr
      if (use_allocated_storage)
      {
        return for_each_kernel_state{std::unique_ptr<char[]>{iter_start}, user_op_offset};
      }
      else
      {
        return for_each_kernel_state{local_buffer, user_op_offset};
      }
    }
    ).
  • Support user-defined types. The POC above relies on ctypes (it looks like numba is doing this as well) to get a CPU pointer to the kernel launch parameters. This is problematic for user-defined types registered in the Numba type system. Ideally, we'd reuse the machinery that Numba relies on to pack kernel parameters instead. That'd enable the dtype of iterators to be anything that Numba understands, as opposed to limiting the set to what ctypes knows about. See the sketch below.
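
For the size/alignment half of that, a rough sketch of reusing numba's machinery (hypothetical; assumes numba's CUDA target context and its get_abi_sizeof/get_abi_alignment helpers):

# Hypothetical sketch: derive size()/alignment() from numba's type system
# instead of ctypes, so dtypes can be anything numba understands.
from numba.cuda.descriptor import cuda_target

def size_and_alignment(numba_type):
    ctx = cuda_target.target_context
    llty = ctx.get_value_type(numba_type)  # LLVM representation of the type
    return ctx.get_abi_sizeof(llty), ctx.get_abi_alignment(llty)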

@rwgk (Contributor) commented Nov 3, 2024

Discussion items for sync when Georgii returns from PTO:

https://github.com/rwgk/cccl/tree/georgii_poc_2479

  • Transferred the files from the 2479 comment
  • Added Notes with commands for how to build and run
  • assert_expected_output_array("84 84 84 84")

https://github.com/rwgk/numba/tree/misc_doc_fixes — By-product of systematically working through Numba documentation top-to-bottom.

  • Will probably need a few days to extract these and see them through as proper PRs
  • Want to discuss the doc updates with Graham

georgii_poc_2479 main.py

@register_jitable fixes the kind of error I asked about on October 17 (PR #2595):
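
For reference, a minimal sketch of that fix, applied to the nested_global example from the October 17 comment (register_jitable is from numba.extending):

from numba.extending import register_jitable
import numba
import numba.cuda

@register_jitable
def other_unary_op(distance):
    permutation = (4, 2, 0, 3, 1)
    return permutation[distance % len(permutation)]

def input_unary_op(distance):
    return 2 * other_unary_op(distance)

# The nested call can now be typed, so this no longer raises TypingError:
ltoir, resty = numba.cuda.compile(
    input_unary_op, sig=numba.types.int32(numba.types.int32), output='ltoir')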

Unanswered question (Slack, cccl-tm): @numba.cuda.reduce vs cub::DeviceReduce

Why is the cuda.cooperative implementation so different (is it?) from cuda.parallel?

  • cuda.cooperative: ctypes not used, cuda.nvrtc (from cuda-python), does not seem to import any .so
  • cuda.parallel: ctypes, C++ nvrtc, imports libcccl.c.parallel.so
  • OK to spend time tinkering with the cuda.cooperative implementation?
  • Should I learn how @intrinsic works? What is a good starting point?
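
For what it's worth, a minimal @intrinsic example (the same typing-plus-codegen pattern the POC above uses): the decorated function returns a signature and a codegen callback that emits LLVM IR.

from numba import cuda, types
from numba.core.extending import intrinsic

@intrinsic
def add_one(typingctx, x):
    if not isinstance(x, types.Integer):
        return None  # typing fails for non-integer arguments
    sig = x(x)  # returns the same integer type it receives

    def codegen(context, builder, signature, args):
        one = context.get_constant(signature.args[0], 1)
        return builder.add(args[0], one)

    return sig, codegen

@cuda.jit
def kernel(out):
    out[0] = add_one(out[0])  # usable from device code like any function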

Is there anything we can use or learn from cupyx.jit.cub, cupyx.jit.thrust?

  • Leo pointed me there when I asked how closely cuda.parallel is meant to be tied to numba. Leo said eventually cuda.parallel should be independent of numba and work with other JIT systems. Then I asked what other Python JIT systems there are.
  • They also have Cython bindings for cub and thrust here.
  • They also have iterators (jit) here.

@rwgk (Contributor) commented Nov 9, 2024

I managed to combine Georgii's POC ConstantIterator and CountingIterator with the code I had under #2595:

It's pretty simple, and I got most of the way there pretty quickly, too, but then it took me several hours to get my head around the handling of the cccl_iterator_t::state.

(Next I'll try to plug in the POC map() function.)

@rwgk (Contributor) commented Nov 12, 2024

Georgii's POC map() function is now also integrated into the fancy_iterators branch (under the name cu_map(), to not shadow a Python built-in function). All existing tests and the new cu_map() test pass at this commit:

@rwgk rwgk linked a pull request Nov 13, 2024 that will close this issue
@cccl-authenticator-app cccl-authenticator-app bot moved this from In Progress to In Review in CCCL Nov 13, 2024
@cccl-authenticator-app cccl-authenticator-app bot moved this from In Review to In Progress in CCCL Nov 14, 2024
@jollylili jollylili added the 2.8.0 target for 2.8.0 release label Nov 15, 2024
@eriknw commented Nov 20, 2024

I have a quick comment regarding map.

I share concerns about shadowing Python builtins, but I think map is singularly special in a functional library. It is likely to be heavily used in code, examples, and documentation, so there is ample opportunity to reinforce the conventions for how to use map.

I would prefer to keep the name map if we can establish two conventions such as:

from cuda.parallel.iterators import map as cumap

and

import cuda.parallel.iterators as cuit

cuit.map(...)

Also, I prefer cumap to cu_map. We have the cuml, cudf, cugraph, cupy, cucim, etc. libraries; I don't see other functions with a cu_ prefix.
