
BlockLoad/BlockStore to/from shared memory #885

Open
pauleonix opened this issue Apr 12, 2022 · 8 comments
Labels: cub (For all items related to CUB)

@pauleonix
Contributor

For some algorithms it makes no sense to have a whole block of data in registers at once. For others, a local (per-thread) buffer is a bad fit because of dynamic indexing. For these cases it would be great to have versions of cub::BlockLoad and cub::BlockStore that skip the step of copying values from shared memory into registers and instead provide them directly in shared memory, laid out such that each thread of a warp has all of its values on the same bank.

One idea would be to create versions of the Load / Store methods that don't take a local array, plus a device function or method for accessing the data in TempStorage, so that the user doesn't have to care about the explicit memory layout and implementation details can change without breaking user code.

Naturally, all of this would only work for algorithms that keep the whole tile in shared memory at the same time. But since the core implementation already exists, the extension should hopefully be straightforward?
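
To illustrate the idea, a minimal hand-rolled sketch (hypothetical, not CUB API; all names are made up): item i of thread t lives at shared-memory index i * BLOCK_THREADS + t, so each thread's items stay on a single bank and a warp reading "item i" touches 32 distinct banks, with an accessor that hides the layout.

  // Hypothetical sketch, not part of CUB: striped layout in shared memory where
  // item i of thread t lives at index i * BLOCK_THREADS + t.
  template <typename T, int BLOCK_THREADS, int ITEMS_PER_THREAD>
  struct BlockLoadToSharedSketch // name is made up
  {
      struct TempStorage { T items[BLOCK_THREADS * ITEMS_PER_THREAD]; };

      // Coalesced (striped) global reads that also define the shared-memory layout.
      __device__ static void Load(TempStorage& storage, const T* block_ptr)
      {
          for (int i = 0; i < ITEMS_PER_THREAD; ++i)
              storage.items[i * BLOCK_THREADS + threadIdx.x] =
                  block_ptr[i * BLOCK_THREADS + threadIdx.x];
      }

      // Accessor that hides the layout, so it can change without breaking user code.
      __device__ static T& Item(TempStorage& storage, int item)
      {
          return storage.items[item * BLOCK_THREADS + threadIdx.x];
      }
  };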

@gevtushenko
Collaborator

For some algorithms it makes no sense to have a whole block of data in registers at once. For others a local buffer is bad due to dynamic indexing.

Thank you for this request! I agree that there is value in this kind of functionality. Block load provides vectorized loads and shared-memory padding. While vectorized loads can be used outside of the block-load facility (cub::LoadDirectBlockedVectorized), the memory padding was never abstracted out and is only available as an implementation detail of the cub::BlockExchange facility.
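
For reference, a sketch of using the vectorized path on its own, assuming the input pointer meets the alignment requirements; the kernel name and dimensions are illustration values only:

  #include <cub/block/block_load.cuh>

  template <int BLOCK_THREADS, int ITEMS_PER_THREAD>
  __global__ void vectorized_load_kernel(int* d_in)
  {
      int items[ITEMS_PER_THREAD];
      int* block_ptr = d_in + blockIdx.x * BLOCK_THREADS * ITEMS_PER_THREAD;

      // Issues vector loads where possible, without instantiating cub::BlockLoad
      // and without any shared memory.
      cub::LoadDirectBlockedVectorized(threadIdx.x, block_ptr, items);

      // ... consume items[] ...
  }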

One idea would be to create versions of the Load / Store methods that don't [...]

I don't think this alternative API should be exposed as part of the existing cub::BlockLoad facility. Not all load algorithms use shared memory; specifying BLOCK_LOAD_DIRECT, for instance, leads to an empty shared-memory object, so for some tunings the proposed functionality would be impossible to implement within cub::BlockLoad. Apart from that, having shared memory referred to as temporary storage in the documentation and source code might create some confusion in the mentioned use case: it wouldn't act as temporary storage in the CUB sense, but as the actual storage.

But as the core implementation is already existing, the implementation should hopefully be straightforward?

Although the implementation might be straightforward, the API for exposing it is not. We are currently investigating a more generic container type for cub that should address your use case. I'll keep this issue updated with any developments in this area.

@pauleonix
Contributor Author

Thanks for the input. I was originally looking into doing a PR myself, but I agree that these API design decisions should be carefully made by the maintainers. Looking forward to 1.17 then!

@alliepiper
Collaborator

Just to chime in -- it won't be in place for 1.17, and it may be a while before the containers George described are ready. But there are some interesting developments planned in this area, so stay tuned 🙂

@pauleonix
Contributor Author

@allisonvacanti Too bad... Maybe I'll still play around with a BlockLoadToShared on a fork ;)

@jrhemstad jrhemstad added the cub For all items related to CUB label Feb 22, 2023
@jarmak-nv jarmak-nv assigned miscco and unassigned gevtushenko Feb 23, 2023
@pauleonix
Contributor Author

pauleonix commented Jun 21, 2023

I think I have a better understanding of CUB's shared memory philosophy by now. I would still be interested to hear about that container project, but right now I have a different question which, if posted as a new issue, would end up with a permutation of this issue's title, so I decided to ask it here:

After creating a histogram in shared memory using cub::BlockHistogram, I want to scan the histogram, i.e. I need to load values from shared memory into a blocked arrangement.

Is loading from shared memory in scope for cub::BlockLoad, or should it only be used to load from global memory? The docs seem to say "memory" rather than specifying "global memory", but at the same time they use the term "coalescing", which normally refers to global memory access. Can I expect this to work and keep working?

Clearly some combinations of types, algorithms and bank size will cause bank conflicts, so the algorithm has to be chosen with care.
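
For concreteness, roughly what I mean (a sketch only; kernel name, bin count, and block dimensions are arbitrary, chosen so that BINS is a multiple of the block size):

  #include <cub/block/block_histogram.cuh>
  #include <cub/block/block_load.cuh>
  #include <cub/block/block_scan.cuh>

  constexpr int BLOCK_THREADS    = 128;
  constexpr int ITEMS_PER_THREAD = 4;
  constexpr int BINS             = 256;
  constexpr int BINS_PER_THREAD  = BINS / BLOCK_THREADS;

  __global__ void block_histogram_scan(const unsigned char* d_samples, unsigned int* d_prefix)
  {
      using BlockHistogramT = cub::BlockHistogram<unsigned char, BLOCK_THREADS, ITEMS_PER_THREAD, BINS>;
      using BlockLoadT      = cub::BlockLoad<unsigned int, BLOCK_THREADS, BINS_PER_THREAD>; // BLOCK_LOAD_DIRECT
      using BlockScanT      = cub::BlockScan<unsigned int, BLOCK_THREADS>;

      __shared__ typename BlockHistogramT::TempStorage hist_storage;
      __shared__ typename BlockLoadT::TempStorage      load_storage;
      __shared__ typename BlockScanT::TempStorage      scan_storage;
      __shared__ unsigned int smem_hist[BINS];

      // Per-thread samples for this block's tile.
      unsigned char samples[ITEMS_PER_THREAD];
      cub::LoadDirectStriped<BLOCK_THREADS>(
          threadIdx.x, d_samples + blockIdx.x * BLOCK_THREADS * ITEMS_PER_THREAD, samples);

      // The histogram lands in shared memory.
      BlockHistogramT(hist_storage).Histogram(samples, smem_hist);
      __syncthreads();

      // The question: loading from shared memory into a blocked arrangement via BlockLoad.
      unsigned int bins[BINS_PER_THREAD];
      BlockLoadT(load_storage).Load(smem_hist, bins);

      BlockScanT(scan_storage).ExclusiveSum(bins, bins);

      for (int i = 0; i < BINS_PER_THREAD; ++i)
          d_prefix[blockIdx.x * BINS + threadIdx.x * BINS_PER_THREAD + i] = bins[i];
  }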

@gevtushenko
Collaborator

@pauleonix unless you provide something like cub::CacheModifiedInputIterator, I think there are no assumptions made by block load about the memory space, so it should work. That said, it's currently tested and used only for the case of global memory.

@jarmak-nv jarmak-nv transferred this issue from NVIDIA/cub Nov 8, 2023
@github-project-automation github-project-automation bot moved this to Todo in CCCL Nov 8, 2023
@miscco miscco removed their assignment Dec 6, 2023
@bernhardmgruber bernhardmgruber self-assigned this Aug 2, 2024
@bernhardmgruber
Contributor

Newer GPUs offer bulk copy operations from global to shared memory, e.g., LDGSTS on Ampere and TMA operations on Hopper. See for example the Tensor Memory Accelerator section in this blog post: https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth#asynchronous_execution. While we may employ some of these features in selected places, it is definitely worthwhile to generalize those loading facilities into new CUB BlockLoad algorithms, so we have to extend our software architecture to support shared memory as a destination. I will look into this at some point.
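
For reference, this is roughly what such a bulk copy looks like outside of CUB today, using cooperative_groups::memcpy_async (which can lower to cp.async / LDGSTS on Ampere and newer when alignment permits); a sketch with made-up kernel and tile names:

  #include <cooperative_groups.h>
  #include <cooperative_groups/memcpy_async.h>
  namespace cg = cooperative_groups;

  template <typename T, int BLOCK_THREADS, int ITEMS_PER_THREAD>
  __global__ void bulk_load_kernel(const T* d_in)
  {
      __shared__ T tile[BLOCK_THREADS * ITEMS_PER_THREAD];

      auto block = cg::this_thread_block();
      // Asynchronous block-wide copy global -> shared, without staging through registers.
      cg::memcpy_async(block,
                       tile,
                       d_in + blockIdx.x * BLOCK_THREADS * ITEMS_PER_THREAD,
                       sizeof(T) * BLOCK_THREADS * ITEMS_PER_THREAD);
      cg::wait(block); // waits for the copy and synchronizes the block

      // ... operate on tile[] in shared memory ...
  }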

@bernhardmgruber
Contributor

I have been playing around with BlockLoad a bit today, trying to fit in bulk copies from global to shared memory. Those are asynchronous and their destination is shared memory, so the existing API does not fit nicely: depending on the BlockLoadAlgorithm we want either a shared-memory-backed array or a register-backed array, so we cannot let the user create it and pass it in as an argument. I figured we would want to hide that array inside the implementation of BlockLoad and only give access to it:

  using block_load_t = cub::BlockLoad<T, ThreadsInBlock, ItemsPerThread, LoadAlgorithm>;
  using storage_t    = typename block_load_t::TempStorage;

  __shared__ storage_t storage;     // may store a shared memory array
  block_load_t block_load(storage); // may store a register array

  span<T, ItemsPerThread> data = block_load.Load(iterator);

Depending on LoadAlgorithm, the span returned from Load would either point to an array in registers or shared memory.

Later I wondered why we would expose an array at all and not just a range from which to pull individual elements. That would give us more implementation freedom, e.g. to load several times in smaller chunks and avoid large arrays in registers. But I don't know whether we would actually need that.

Also, several algorithms actually need the entire loaded block per thread in registers, because they later index into that array and don't process item by item.

However, since bulk copies are asynchronous, I need to wait for completion inside Load. But this is unnecessary if an algorithm needs multiple block loads, where a single synchronization after issuing all loads would suffice. I still have to figure out a good design for this, since it should not complicate the existing one or add overhead to synchronous load implementations. Maybe something as simple as a flag would be a good compromise:

  using block_load_keys_t = cub::BlockLoad<KeyT, ThreadsInBlock, ItemsPerThread, LoadAlgorithm>;
  using block_load_vals_t = cub::BlockLoad<ValT, ThreadsInBlock, ItemsPerThread, LoadAlgorithm>;
  using storage_keys_t    = typename block_load_keys_t::TempStorage;
  using storage_vals_t    = typename block_load_vals_t::TempStorage;

  __shared__ storage_keys_t storage_keys;
  __shared__ storage_vals_t storage_vals;
  block_load_keys_t block_load_keys(storage_keys);
  block_load_vals_t block_load_vals(storage_vals);

  span<KeyT, ItemsPerThread> keys = block_load_keys.Load(key_iterator, cub::no_sync);
  span<ValT, ItemsPerThread> vals = block_load_vals.Load(val_iterator); // sync only issued here

It feels a bit like a hack, but it's a lightweight solution.
