Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace thrust::swap by cuda::std::swap #2985

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

bernhardmgruber
Copy link
Contributor

Fixes: #2948

Comment on lines 201 to 211
// FIXME(bgruber): swap_allocators already swaps m_allocator, so we are swapping twice here !!
swap_allocators(integral_constant<bool, allocator_traits<Alloc>::propagate_on_container_swap::value>(),
x.m_allocator);

thrust::swap(m_allocator, x.m_allocator);
// FIXME(bgruber): this should use ADL-two-step swap, but this creates an ambiguity with std::swap until
// https://github.com/NVIDIA/cccl/issues/2984 is resolved.
// swap(m_allocator, x.m_allocator);
::cuda::std::swap(m_allocator, x.m_allocator);
Copy link
Contributor Author

@bernhardmgruber bernhardmgruber Nov 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this looks like a bug to me. swap_allocators was introduced in 403effb, but did not replace swapping the allocator below. @griwes, I assume the ::cuda::std::swap(m_allocator, x.m_allocator); is not needed/wrong here?

Comment on lines +346 to 351
// FIXME(bgruber): it is really concerning, that swapping an allocator can throw. swap() should be noexcept in
// general.
NV_IF_TARGET(NV_IS_DEVICE,
(
// allocators must be equal when swapping containers with allocators that propagate on swap
assert(!is_allocator_not_equal(other));),
(if (is_allocator_not_equal(other)) { throw allocator_mismatch_on_swap(); }));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no ideas about the depths of allocators, but throwing an exception when swapping allocators strikes me as odd.

The reference for std::vector<T>::swap says:

[...] if get_allocator() != other.get_allocator(), the behavior is undefined

I would rather prefer to have a hard fault here, so we can mark all swap functions of thrust vectors as noexcept.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean swapping is move construction + move assignment so it can always potentially throw.

I would say it should not throw from our side, but especially allocators are potentially throwing a ton.

As an example think about a block allocator that reserves a huge slab of memory on construction and hands out slices of it. That could easily fail with an OOM

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean swapping is move construction + move assignment so it can always potentially throw.

For a generic implementation, yes. In the long term, relocation could be used to avoid the moved from state, which could simplify things.

As an example think about a block allocator that reserves a huge slab of memory on construction and hands out slices of it. That could easily fail with an OOM

How can swapping two block allocators using memory slabs fail with an OOM? I would still just swap pointers and some bookkeeping?

Copy link
Contributor

github-actions bot commented Dec 2, 2024

🟨 CI finished in 1h 45m: Pass: 96%/224 | Total: 6d 12h | Avg: 41m 58s | Max: 1h 16m | Hits: 40%/10436
  • 🟨 thrust: Pass: 92%/111 | Total: 2d 16h | Avg: 34m 42s | Max: 1h 16m | Hits: 29%/7408

    🔍 cpu: amd64 🔍
      🔍 amd64              Pass:  92%/103 | Total:  2d 11h | Avg: 34m 50s | Max:  1h 16m | Hits:  29%/7408  
      🟩 arm64              Pass: 100%/8   | Total:  4h 25m | Avg: 33m 10s | Max: 38m 20s
    🔍 ctk: 12.6 🔍
      🟩 11.1               Pass: 100%/15  | Total:  8h 16m | Avg: 33m 05s | Max:  1h 06m | Hits:  29%/1852  
      🟩 11.8               Pass: 100%/3   | Total:  1h 57m | Avg: 39m 02s | Max: 44m 53s
      🟩 12.5               Pass: 100%/4   | Total:  4h 29m | Avg:  1h 07m | Max:  1h 16m
      🔍 12.6               Pass:  91%/89  | Total:  2d 01h | Avg: 33m 22s | Max:  1h 12m | Hits:  29%/5556  
    🔍 cudacxx: nvcc12.6 🔍
      🟩 ClangCUDA18        Pass: 100%/4   | Total:  1h 57m | Avg: 29m 26s | Max: 31m 09s
      🟩 nvcc11.1           Pass: 100%/15  | Total:  8h 16m | Avg: 33m 05s | Max:  1h 06m | Hits:  29%/1852  
      🟩 nvcc11.8           Pass: 100%/3   | Total:  1h 57m | Avg: 39m 02s | Max: 44m 53s
      🟩 nvcc12.5           Pass: 100%/4   | Total:  4h 29m | Avg:  1h 07m | Max:  1h 16m
      🔍 nvcc12.6           Pass:  90%/85  | Total:  1d 23h | Avg: 33m 33s | Max:  1h 12m | Hits:  29%/5556  
    🔍 cudacxx_family: nvcc 🔍
      🟩 ClangCUDA          Pass: 100%/4   | Total:  1h 57m | Avg: 29m 26s | Max: 31m 09s
      🔍 nvcc               Pass:  92%/107 | Total:  2d 14h | Avg: 34m 54s | Max:  1h 16m | Hits:  29%/7408  
    🟨 cxx
      🟩 Clang9             Pass: 100%/6   | Total:  3h 14m | Avg: 32m 27s | Max: 37m 47s
      🟩 Clang10            Pass: 100%/3   | Total:  1h 47m | Avg: 35m 53s | Max: 37m 34s
      🟩 Clang11            Pass: 100%/4   | Total:  2h 12m | Avg: 33m 03s | Max: 35m 18s
      🟩 Clang12            Pass: 100%/4   | Total:  2h 16m | Avg: 34m 00s | Max: 36m 18s
      🟩 Clang13            Pass: 100%/4   | Total:  2h 20m | Avg: 35m 10s | Max: 37m 11s
      🟩 Clang14            Pass: 100%/4   | Total:  2h 20m | Avg: 35m 05s | Max: 38m 30s
      🟩 Clang15            Pass: 100%/4   | Total:  2h 16m | Avg: 34m 11s | Max: 36m 54s
      🟩 Clang16            Pass: 100%/4   | Total:  2h 17m | Avg: 34m 29s | Max: 37m 22s
      🟩 Clang17            Pass: 100%/4   | Total:  2h 24m | Avg: 36m 02s | Max: 39m 24s
      🟨 Clang18            Pass:  81%/11  | Total:  5h 00m | Avg: 27m 21s | Max: 37m 06s
      🟩 GCC6               Pass: 100%/2   | Total:  1h 00m | Avg: 30m 26s | Max: 33m 12s
      🟩 GCC7               Pass: 100%/6   | Total:  3h 09m | Avg: 31m 36s | Max: 36m 58s
      🟩 GCC8               Pass: 100%/6   | Total:  3h 16m | Avg: 32m 40s | Max: 38m 42s
      🟩 GCC9               Pass: 100%/6   | Total:  3h 18m | Avg: 33m 06s | Max: 37m 24s
      🟩 GCC10              Pass: 100%/4   | Total:  2h 18m | Avg: 34m 44s | Max: 37m 31s
      🟩 GCC11              Pass: 100%/7   | Total:  4h 21m | Avg: 37m 22s | Max: 44m 53s
      🟩 GCC12              Pass: 100%/4   | Total:  2h 27m | Avg: 36m 53s | Max: 42m 24s
      🟨 GCC13              Pass:  68%/16  | Total:  6h 20m | Avg: 23m 45s | Max: 38m 20s
      🟩 Intel2023.2.0      Pass: 100%/3   | Total:  2h 21m | Avg: 47m 19s | Max: 53m 14s
      🟩 MSVC14.16          Pass: 100%/1   | Total:  1h 06m | Avg:  1h 06m | Max:  1h 06m | Hits:  29%/1852  
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 18m | Avg:  1h 09m | Max:  1h 12m | Hits:  29%/3704  
      🟨 MSVC14.39          Pass:  50%/2   | Total:  1h 31m | Avg: 45m 47s | Max:  1h 06m | Hits:  29%/1852  
      🟩 NVHPC24.7          Pass: 100%/4   | Total:  4h 29m | Avg:  1h 07m | Max:  1h 16m
    🟨 cxx_family
      🟨 Clang              Pass:  95%/48  | Total:  1d 02h | Avg: 32m 44s | Max: 39m 24s
      🟨 GCC                Pass:  90%/51  | Total:  1d 02h | Avg: 30m 51s | Max: 44m 53s
      🟩 Intel              Pass: 100%/3   | Total:  2h 21m | Avg: 47m 19s | Max: 53m 14s
      🟨 MSVC               Pass:  80%/5   | Total:  4h 57m | Avg: 59m 27s | Max:  1h 12m | Hits:  29%/7408  
      🟩 NVHPC              Pass: 100%/4   | Total:  4h 29m | Avg:  1h 07m | Max:  1h 16m
    🟨 jobs
      🟩 Build              Pass: 100%/103 | Total:  2d 14h | Avg: 36m 24s | Max:  1h 16m | Hits:  29%/7408  
      🟥 TestCPU            Pass:   0%/4   | Total: 47m 08s | Avg: 11m 47s | Max: 24m 38s
      🟥 TestGPU            Pass:   0%/4   | Total: 55m 57s | Avg: 13m 59s | Max: 18m 27s
    🟨 std
      🟨 11                 Pass:  93%/30  | Total: 14h 25m | Avg: 28m 50s | Max: 59m 36s
      🟩 14                 Pass: 100%/29  | Total: 18h 12m | Avg: 37m 41s | Max:  1h 06m | Hits:  29%/3704  
      🟩 17                 Pass: 100%/27  | Total: 17h 44m | Avg: 39m 24s | Max:  1h 12m | Hits:  29%/1852  
      🟨 20                 Pass:  78%/23  | Total: 13h 07m | Avg: 34m 13s | Max:  1h 16m | Hits:  29%/1852  
    🟨 cmake_options
      🟨 -DTHRUST_DISPATCH_TYPE=Force32bit Pass:  50%/2   | Total: 43m 35s | Avg: 21m 47s | Max: 31m 57s
    🟨 gpu
      🟨 v100               Pass:  92%/111 | Total:  2d 16h | Avg: 34m 42s | Max:  1h 16m | Hits:  29%/7408  
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total:  1h 57m | Avg: 39m 02s | Max: 44m 53s
      🟩 90a                Pass: 100%/4   | Total:  1h 28m | Avg: 22m 13s | Max: 25m 42s
    
  • 🟩 cub: Pass: 100%/110 | Total: 3d 20h | Avg: 50m 14s | Max: 1h 13m | Hits: 66%/3028

    🟩 cpu
      🟩 amd64              Pass: 100%/102 | Total:  3d 12h | Avg: 49m 54s | Max:  1h 13m | Hits:  66%/3028  
      🟩 arm64              Pass: 100%/8   | Total:  7h 15m | Avg: 54m 26s | Max: 57m 07s
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total: 11h 40m | Avg: 46m 41s | Max:  1h 01m | Hits:  66%/757   
      🟩 11.8               Pass: 100%/3   | Total:  3h 34m | Avg:  1h 11m | Max:  1h 13m
      🟩 12.5               Pass: 100%/4   | Total:  4h 05m | Avg:  1h 01m | Max:  1h 03m
      🟩 12.6               Pass: 100%/88  | Total:  3d 00h | Avg: 49m 36s | Max:  1h 04m | Hits:  66%/2271  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/4   | Total:  3h 48m | Avg: 57m 03s | Max: 58m 29s
      🟩 nvcc11.1           Pass: 100%/15  | Total: 11h 40m | Avg: 46m 41s | Max:  1h 01m | Hits:  66%/757   
      🟩 nvcc11.8           Pass: 100%/3   | Total:  3h 34m | Avg:  1h 11m | Max:  1h 13m
      🟩 nvcc12.5           Pass: 100%/4   | Total:  4h 05m | Avg:  1h 01m | Max:  1h 03m
      🟩 nvcc12.6           Pass: 100%/84  | Total:  2d 20h | Avg: 49m 14s | Max:  1h 04m | Hits:  66%/2271  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/4   | Total:  3h 48m | Avg: 57m 03s | Max: 58m 29s
      🟩 nvcc               Pass: 100%/106 | Total:  3d 16h | Avg: 49m 58s | Max:  1h 13m | Hits:  66%/3028  
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total:  4h 49m | Avg: 48m 13s | Max: 51m 36s
      🟩 Clang10            Pass: 100%/3   | Total:  2h 42m | Avg: 54m 04s | Max: 56m 42s
      🟩 Clang11            Pass: 100%/4   | Total:  3h 31m | Avg: 52m 46s | Max: 54m 17s
      🟩 Clang12            Pass: 100%/4   | Total:  3h 30m | Avg: 52m 39s | Max: 57m 19s
      🟩 Clang13            Pass: 100%/4   | Total:  3h 32m | Avg: 53m 10s | Max: 57m 39s
      🟩 Clang14            Pass: 100%/4   | Total:  3h 35m | Avg: 53m 51s | Max: 57m 11s
      🟩 Clang15            Pass: 100%/4   | Total:  3h 21m | Avg: 50m 16s | Max: 51m 44s
      🟩 Clang16            Pass: 100%/4   | Total:  3h 34m | Avg: 53m 38s | Max: 56m 20s
      🟩 Clang17            Pass: 100%/4   | Total:  3h 34m | Avg: 53m 41s | Max: 57m 07s
      🟩 Clang18            Pass: 100%/11  | Total:  8h 59m | Avg: 49m 02s | Max: 58m 29s
      🟩 GCC6               Pass: 100%/2   | Total:  1h 26m | Avg: 43m 08s | Max: 43m 12s
      🟩 GCC7               Pass: 100%/6   | Total:  4h 50m | Avg: 48m 25s | Max: 51m 34s
      🟩 GCC8               Pass: 100%/6   | Total:  4h 56m | Avg: 49m 28s | Max: 56m 13s
      🟩 GCC9               Pass: 100%/6   | Total:  5h 12m | Avg: 52m 07s | Max:  1h 01m
      🟩 GCC10              Pass: 100%/4   | Total:  3h 40m | Avg: 55m 05s | Max: 57m 21s
      🟩 GCC11              Pass: 100%/7   | Total:  7h 03m | Avg:  1h 00m | Max:  1h 13m
      🟩 GCC12              Pass: 100%/4   | Total:  3h 40m | Avg: 55m 14s | Max: 57m 59s
      🟩 GCC13              Pass: 100%/16  | Total:  9h 08m | Avg: 34m 16s | Max: 57m 07s
      🟩 Intel2023.2.0      Pass: 100%/3   | Total:  2h 55m | Avg: 58m 29s | Max:  1h 00m
      🟩 MSVC14.16          Pass: 100%/1   | Total: 52m 09s | Avg: 52m 09s | Max: 52m 09s | Hits:  66%/757   
      🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 57m | Avg: 58m 54s | Max: 59m 44s | Hits:  66%/1514  
      🟩 MSVC14.39          Pass: 100%/1   | Total:  1h 04m | Avg:  1h 04m | Max:  1h 04m | Hits:  66%/757   
      🟩 NVHPC24.7          Pass: 100%/4   | Total:  4h 05m | Avg:  1h 01m | Max:  1h 03m
    🟩 cxx_family
      🟩 Clang              Pass: 100%/48  | Total:  1d 17h | Avg: 51m 29s | Max: 58m 29s
      🟩 GCC                Pass: 100%/51  | Total:  1d 15h | Avg: 47m 02s | Max:  1h 13m
      🟩 Intel              Pass: 100%/3   | Total:  2h 55m | Avg: 58m 29s | Max:  1h 00m
      🟩 MSVC               Pass: 100%/4   | Total:  3h 54m | Avg: 58m 32s | Max:  1h 04m | Hits:  66%/3028  
      🟩 NVHPC              Pass: 100%/4   | Total:  4h 05m | Avg:  1h 01m | Max:  1h 03m
    🟩 gpu
      🟩 v100               Pass: 100%/110 | Total:  3d 20h | Avg: 50m 14s | Max:  1h 13m | Hits:  66%/3028  
    🟩 jobs
      🟩 Build              Pass: 100%/102 | Total:  3d 17h | Avg: 52m 32s | Max:  1h 13m | Hits:  66%/3028  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 18m 08s | Avg: 18m 08s | Max: 18m 08s
      🟩 GraphCapture       Pass: 100%/1   | Total: 15m 03s | Avg: 15m 03s | Max: 15m 03s
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 04m | Avg: 21m 21s | Max: 25m 33s
      🟩 TestGPU            Pass: 100%/3   | Total:  1h 09m | Avg: 23m 08s | Max: 26m 08s
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total:  3h 34m | Avg:  1h 11m | Max:  1h 13m
      🟩 90a                Pass: 100%/4   | Total:  1h 32m | Avg: 23m 08s | Max: 24m 31s
    🟩 std
      🟩 11                 Pass: 100%/30  | Total:  1d 00h | Avg: 48m 52s | Max:  1h 10m
      🟩 14                 Pass: 100%/29  | Total:  1d 01h | Avg: 53m 07s | Max:  1h 11m | Hits:  66%/1514  
      🟩 17                 Pass: 100%/27  | Total: 23h 54m | Avg: 53m 08s | Max:  1h 13m | Hits:  66%/757   
      🟩 20                 Pass: 100%/24  | Total: 18h 04m | Avg: 45m 10s | Max:  1h 04m | Hits:  66%/757   
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 9m 23s | Avg: 4m 41s | Max: 7m 05s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total:  9m 23s | Avg:  4m 41s | Max:  7m 05s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total:  9m 23s | Avg:  4m 41s | Max:  7m 05s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total:  9m 23s | Avg:  4m 41s | Max:  7m 05s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total:  9m 23s | Avg:  4m 41s | Max:  7m 05s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total:  9m 23s | Avg:  4m 41s | Max:  7m 05s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total:  9m 23s | Avg:  4m 41s | Max:  7m 05s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total:  9m 23s | Avg:  4m 41s | Max:  7m 05s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 18s | Avg:  2m 18s | Max:  2m 18s
      🟩 Test               Pass: 100%/1   | Total:  7m 05s | Avg:  7m 05s | Max:  7m 05s
    
  • 🟩 python: Pass: 100%/1 | Total: 14m 13s | Avg: 14m 13s | Max: 14m 13s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 14m 13s | Avg: 14m 13s | Max: 14m 13s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 14m 13s | Avg: 14m 13s | Max: 14m 13s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 14m 13s | Avg: 14m 13s | Max: 14m 13s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 14m 13s | Avg: 14m 13s | Max: 14m 13s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 14m 13s | Avg: 14m 13s | Max: 14m 13s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 14m 13s | Avg: 14m 13s | Max: 14m 13s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 14m 13s | Avg: 14m 13s | Max: 14m 13s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 14m 13s | Avg: 14m 13s | Max: 14m 13s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
CUB
+/- Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 224)

# Runner
185 linux-amd64-cpu16
16 linux-arm64-cpu16
14 linux-amd64-gpu-v100-latest-1
9 windows-amd64-cpu16

@bernhardmgruber
Copy link
Contributor Author

Great, the changes so far break this:

  thrust::device_reference<T> ref1 = ...;
  thrust::device_reference<T> ref2 = ...;
thrust::swap(ref1, ref2);

It works with unqualified swap. However, the tests also fail for std::swap and cuda::std::swap. This needs more investigation.

@bernhardmgruber
Copy link
Contributor Author

I discussed this with @miscco today and we agreed that the proper way to swap anything in CCCL is:

using ::cuda::std::swap;
swap(a, b);

This must be made to work for any types of a and b and do the right thing.

Furthermore, thrust::swap should be deprecated+removed, since swapping two values is a primitive operation that belongs to the standard library.

A proper solution requires us to fix: #2984

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In Review
Development

Successfully merging this pull request may close these issues.

[BUG]: Suboptimal swap performance on universal vectors
2 participants