CUB, torch, cuda_fp16 and __CUDA_NO_HALF_OPERATORS__ #1727

Artem-B · 2024-05-08T18:55:24Z

Artem-B
May 8, 2024

We've encountered another interesting conundrum, trying to build torch.

Torch builds with a bunch of __CUDA_NO_{HALF/BFLOAT16}... defines https://github.com/pytorch/pytorch/blob/faf0015052ee37db718bc5efa6673e0c25be1e8d/cmake/Dependencies.cmake#L1609

It works well enough for them, at least up until the CUB version included with cuda-12.4 (v2.3.0?).
However, attempting to build it with cub v2.3.2 runs into an issue. Apparently the newer version of CUB needs some of the fp16/bf16 overloads disabled by the torch build flags.

In file included from third_party/py/torch/aten/src/ATen/cuda/cub.cu:2:
In file included from third_party/py/torch/aten/src/ATen/cuda/cub.cuh:15:
In file included from third_party/gpus/cccl/cub/cub/cub.cuh:64:
In file included from third_party/gpus/cccl/cub/cub/device/device_histogram.cuh:52:
third_party/gpus/cccl/cub/cub/device/dispatch/dispatch_histogram.cuh:779:35: error: invalid operands to binary expression ('__half' and '__half')
  779 |                    (return sample >= min_level && sample < max_level;),
      |                            ~~~~~~ ^  ~~~~~~~~~
...
blaze-out/k8-opt-cuda12/bin/third_party/gpus/cuda/_virtual_includes/_stage/third_party/gpus/cuda/include/cuda_bf16.hpp:310:52: note: candidate function not viable: no known conversion from '__half' to 'const __nv_bfloat16' for 1st argument
  310 | __CUDA_HOSTDEVICE__ __CUDA_BF16_FORCEINLINE__ bool operator>=(const __nv_bfloat16 &lh, const __nv_bfloat16 &rh) { return __hge(lh, rh); }
      |                                                    ^          ~~~~~~~~~~~~~~~~~~~~~~~
blaze-out/k8-opt-cuda12/bin/third_party/gpus/cuda/_virtual_includes/_stage/third_party/gpus/cuda/include/cuda_bf16.hpp:367:52: note: candidate function not viable: no known conversion from '__half' to 'const __nv_bfloat162' for 1st argument
  367 | __CUDA_HOSTDEVICE__ __CUDA_BF16_FORCEINLINE__ bool operator>=(const __nv_bfloat162 &lh, const __nv_bfloat162 &rh) { return __hbge2(lh, rh); }
      |                                                    ^          ~~~~~~~~~~~~~~~~~~~~~~~~```

Note that the ambiguous overload here is a red herring. The real problem is that the correct overloads were not found because they were disabled.

If CUDA-provided overloads are enabled, then torch build fails:
https://github.com/pytorch/pytorch/actions/runs/8991128640/job/24733299645?pr=125707

[ 89%] Building CUDA object caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/ActivationHardshrinkKernel.cu.o
/var/lib/jenkins/workspace/aten/src/ATen/native/cuda/ActivationHardshrinkKernel.cu(23): error: more than one operator ">=" matches these operands:
            built-in operator "arithmetic >= arithmetic"
            function "operator>=(const __half &, const __half &)" (declared at line 311 of /usr/local/cuda/include/cuda_fp16.hpp)
            operand types are: scalar_t >= c10::Half

So, we somehow need to have the cake (have CUDA-provided overloads), and eat it (have those overloads disabled), to keep both cub and torch happy.

Ideally it would be great if CUDA headers would stash the overloads in some known namespace, which would allow cub to pull them in into its own namespace, but it's not an option, those overloads are just preprocessed away.

Edit: Not sure if it's possible, though, as the __CUDA_NO_* macros also disable member functions, not just free-standing ones.

Fixing torch would also be nice, but at the moment things work fine for them, and as far as torch is concerned, it's not a supported build configuration. They are still building with CUDA-12.1 and very recent versions of CUDA and related libraries are not their problem yet.

Considering that we're missing a relatively few overload functions, I wonder if it would make sense for CUB to carry its own set, and either always use them, or fall back to them if it's used in a build which disables those overloads in CUDA headers.

@jrhemstad @miscco @voznesenskym @malfet

jrhemstad · 2024-05-13T16:33:34Z

jrhemstad
May 13, 2024
Maintainer

So we've run into issues like this before and CUB actually already has some logic to optionally disable any mention of the half/bfloat types.

cccl/cub/cub/util_type.cuh

Lines 49 to 66 in f8a26b2

    
           #if !_NVHPC_CUDA 
        
           #  include <cuda_fp16.h> 
        
           #endif 
        
           #if !_NVHPC_CUDA && !defined(CUB_DISABLE_BF16_SUPPORT) 
        
           #  include <cuda_bf16.h> 
        
           // cuda_fp8.h transitively includes cuda_fp16.h, so we have to include the header under !CUB_DISABLE_BF16_SUPPORT 
        
           #  if _CCCL_CUDACC_VER >= 1108000 
        
           // cuda_fp8.h resets default for C4127, so we have to guard the inclusion 
        
           #    if defined(_CCCL_COMPILER_MSVC) 
        
           #      pragma warning(push) 
        
           #    endif 
        
           #    include <cuda_fp8.h> 
        
           #    if defined(_CCCL_COMPILER_MSVC) 
        
           #      pragma warning(pop) 
        
           #    endif 
        
           #  endif 
        
           #endif

Can you try adding -DCUB_DISABLE_BF16_SUPPORT=1 to the build?

5 replies

jrhemstad May 13, 2024
Maintainer

Actually, it looks like this is only for bfloat. We don't have a macro to disable cuda_fp16.h, but I think that's what you'd want here.

Artem-B May 13, 2024
Author

Considering that we're missing a relatively few overload functions, I wonder if it would make sense for CUB to carry its own set, and either always use them, or fall back to them if it's used in a build which disables those overloads in CUDA headers.

I've tried adding just free-standing operators for __half (the ones excluded by __CUDA_NO_HALF...) to the cub's namespace, and it appears to solve the pytorch build issue, but I suspect it's a half-measure that does not make cub completely functional when the user builds with __CUDA_NO....

We can live for a while with such a local patch, but it may help to move things forward if we disambiguate the situation here.

On the cccl side, if cccl can be made to work with __CUDA_NO* defined, that would solve the problem. If compilation with __CUDA_NO_* is not supported it should probably be explicitly documented. It would be useful to add an explicit check for the macro and an #error so the failure is obvious. Chasing secondary C++ errors is not fun for anyone.

As for pytorch, if cub does not support __CUDA_NO_*, they will have to eventually fix their code, either their use of __half, or their use of cub, otherwise they will not be able to build with newer versions of CUDA. Having the explicit error above, would make it clear that the issue is on their side.

For a bit more context, I've found an ancient thread mentioning the origin of this issue in torch. Looks like the macros there were added when CUDA introduced the overloads for half when torch already had their own, and disabling CUDA-side overloads worked well enough for them. https://discuss.pytorch.org/t/cuda-no-half2-operators-for-cuda-9-2/18365

This dates quite a while back, so I might miss some things but If I remember it correctly, CUDA9 added half operators in its half header, while Torch (Torch7 at this time) already shipped with its own.
The flags are used to keep the half definitions from the CUDA header, while not compiling the operators.

jrhemstad May 13, 2024
Maintainer

if cccl can be made to work with __CUDA_NO* defined, that would solve the problem

tbh, this is the first I learned the __CUDA_NO* macros existed. Without having looked into it at all, I agree that it would be the ideal solution to just make CCCL work when these macros are defined.

voznesenskym May 13, 2024

@malfet Can you comment on this from torch side?

Artem-B May 21, 2024
Author

Torch bug: pytorch/pytorch#126804

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CUB, torch, cuda_fp16 and __CUDA_NO_HALF_OPERATORS__ #1727

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 1 comment 5 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

CUB, torch, cuda_fp16 and __CUDA_NO_HALF_OPERATORS__ #1727

Artem-B May 8, 2024

Replies: 1 comment · 5 replies

jrhemstad May 13, 2024 Maintainer

jrhemstad May 13, 2024 Maintainer

Artem-B May 13, 2024 Author

jrhemstad May 13, 2024 Maintainer

voznesenskym May 13, 2024

Artem-B May 21, 2024 Author

Artem-B
May 8, 2024

Replies: 1 comment 5 replies

jrhemstad
May 13, 2024
Maintainer

jrhemstad May 13, 2024
Maintainer

Artem-B May 13, 2024
Author

jrhemstad May 13, 2024
Maintainer

Artem-B May 21, 2024
Author