
BlockLoad/BlockStore to/from shared memory #885

Open
pauleonix opened this issue Apr 12, 2022 · 8 comments
Labels: cub (For all items related to CUB)

@pauleonix
Contributor

For some algorithms it makes no sense to have a whole block of data in registers at once. For others, a local (per-thread) buffer is a bad fit because of dynamic indexing. For these cases it would be great to have versions of cub::BlockLoad and cub::BlockStore that skip the step of copying values from shared memory into registers and instead provide them directly in shared memory, laid out such that each thread of a warp has all of its values on the same bank.

One idea would be to create versions of the Load / Store methods that don't take a local array, plus a device function or method for accessing the data in TempStorage, so that the user doesn't have to care about the explicit memory layout and implementation details can change without breaking user code.

Naturally, all of this would only work for algorithms that keep the whole tile in shared memory at the same time. But since the core implementation already exists, the extension should hopefully be straightforward?
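
To illustrate the idea, a minimal hand-rolled sketch (hypothetical, not CUB API; all names are made up): item i of thread t lives at shared-memory index i * BLOCK_THREADS + t, so each thread's items stay on a single bank and a warp reading "item i" touches 32 distinct banks, with an accessor that hides the layout.

  // Hypothetical sketch, not part of CUB: striped layout in shared memory where
  // item i of thread t lives at index i * BLOCK_THREADS + t.
  template <typename T, int BLOCK_THREADS, int ITEMS_PER_THREAD>
  struct BlockLoadToSharedSketch // name is made up
  {
      struct TempStorage { T items[BLOCK_THREADS * ITEMS_PER_THREAD]; };

      // Coalesced (striped) global reads that also define the shared-memory layout.
      __device__ static void Load(TempStorage& storage, const T* block_ptr)
      {
          for (int i = 0; i < ITEMS_PER_THREAD; ++i)
              storage.items[i * BLOCK_THREADS + threadIdx.x] =
                  block_ptr[i * BLOCK_THREADS + threadIdx.x];
      }

      // Accessor that hides the layout, so it can change without breaking user code.
      __device__ static T& Item(TempStorage& storage, int item)
      {
          return storage.items[item * BLOCK_THREADS + threadIdx.x];
      }
  };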

@gevtushenko
Collaborator

For some algorithms it makes no sense to have a whole block of data in registers at once. For others a local buffer is bad due to dynamic indexing.

Thank you for this request! I agree that there is value in this kind of functionality. Block load provides vectorized loads and shared-memory padding. While vectorized loads can be used outside of the block-load facility (cub::LoadDirectBlockedVectorized), the memory padding was never abstracted out and is only available as an implementation detail of the cub::BlockExchange facility.
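
For reference, a sketch of using the vectorized path on its own, assuming the input pointer meets the alignment requirements; the kernel name and dimensions are illustration values only:

  #include <cub/block/block_load.cuh>

  template <int BLOCK_THREADS, int ITEMS_PER_THREAD>
  __global__ void vectorized_load_kernel(int* d_in)
  {
      int items[ITEMS_PER_THREAD];
      int* block_ptr = d_in + blockIdx.x * BLOCK_THREADS * ITEMS_PER_THREAD;

      // Issues vector loads where possible, without instantiating cub::BlockLoad
      // and without any shared memory.
      cub::LoadDirectBlockedVectorized(threadIdx.x, block_ptr, items);

      // ... consume items[] ...
  }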

One idea would be to create versions of the Load / Store methods that don't [...]

I don't think this alternative API should be exposed as part of the existing cub::BlockLoad facility. Not all load algorithms use shared memory; specifying BLOCK_LOAD_DIRECT, for instance, leads to an empty shared-memory object, so for some tunings the proposed functionality would be impossible to implement within cub::BlockLoad. Apart from that, having shared memory referred to as temporary storage in the documentation and source code might create some confusion in the mentioned use case: it wouldn't act as temporary storage in the CUB sense, but as the actual storage.

But as the core implementation is already existing, the implementation should hopefully be straightforward?

Although the implementation might be straightforward, the API for exposing it is not. We are currently investigating a more generic container type for cub that should address your use case. I'll keep this issue updated with any developments in this area.

@pauleonix
Contributor Author

Thanks for the input. I was originally looking into doing a PR myself, but I agree that these API design decisions should be carefully made by the maintainers. Looking forward to 1.17 then!

@alliepiper
Collaborator

Just to chime in -- it won't be in place for 1.17, and it may be a while before the containers George described are ready. But there are some interesting developments planned in this area, so stay tuned 🙂

@pauleonix
Contributor Author

@allisonvacanti Too bad... Maybe I'll still play around with a BlockLoadToShared on a fork ;)

@jrhemstad jrhemstad added the cub For all items related to CUB label Feb 22, 2023
@jarmak-nv jarmak-nv assigned miscco and unassigned gevtushenko Feb 23, 2023
@pauleonix
Contributor Author

pauleonix commented Jun 21, 2023

I think I have a better understanding of CUB's shared memory philosophy by now. I would still be interested to hear about that container project, but right now I have a different question which, if posted as a new issue, would end up with a permutation of this issue's title, so I decided to ask it here:

After creating a histogram in shared memory using cub::BlockHistogram, I want to scan the histogram, i.e. I need to load values from shared memory into a blocked arrangement.

Is loading from shared memory in scope for cub::BlockLoad, or should it only be used to load from global memory? The docs seem to say "memory" rather than specifying "global memory", but at the same time they use the term "coalescing", which normally refers to global memory access. Can I expect this to work and keep working?

Clearly some combinations of types, algorithms and bank size will cause bank conflicts, so the algorithm has to be chosen with care.
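
For concreteness, roughly what I mean (a sketch only; kernel name, bin count, and block dimensions are arbitrary, chosen so that BINS is a multiple of the block size):

  #include <cub/block/block_histogram.cuh>
  #include <cub/block/block_load.cuh>
  #include <cub/block/block_scan.cuh>

  constexpr int BLOCK_THREADS    = 128;
  constexpr int ITEMS_PER_THREAD = 4;
  constexpr int BINS             = 256;
  constexpr int BINS_PER_THREAD  = BINS / BLOCK_THREADS;

  __global__ void block_histogram_scan(const unsigned char* d_samples, unsigned int* d_prefix)
  {
      using BlockHistogramT = cub::BlockHistogram<unsigned char, BLOCK_THREADS, ITEMS_PER_THREAD, BINS>;
      using BlockLoadT      = cub::BlockLoad<unsigned int, BLOCK_THREADS, BINS_PER_THREAD>; // BLOCK_LOAD_DIRECT
      using BlockScanT      = cub::BlockScan<unsigned int, BLOCK_THREADS>;

      __shared__ typename BlockHistogramT::TempStorage hist_storage;
      __shared__ typename BlockLoadT::TempStorage      load_storage;
      __shared__ typename BlockScanT::TempStorage      scan_storage;
      __shared__ unsigned int smem_hist[BINS];

      // Per-thread samples for this block's tile.
      unsigned char samples[ITEMS_PER_THREAD];
      cub::LoadDirectStriped<BLOCK_THREADS>(
          threadIdx.x, d_samples + blockIdx.x * BLOCK_THREADS * ITEMS_PER_THREAD, samples);

      // The histogram lands in shared memory.
      BlockHistogramT(hist_storage).Histogram(samples, smem_hist);
      __syncthreads();

      // The question: loading from shared memory into a blocked arrangement via BlockLoad.
      unsigned int bins[BINS_PER_THREAD];
      BlockLoadT(load_storage).Load(smem_hist, bins);

      BlockScanT(scan_storage).ExclusiveSum(bins, bins);

      for (int i = 0; i < BINS_PER_THREAD; ++i)
          d_prefix[blockIdx.x * BINS + threadIdx.x * BINS_PER_THREAD + i] = bins[i];
  }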

@gevtushenko
Collaborator

@pauleonix unless you provide something like cub::CacheModifiedInputIterator, I think there are no assumptions made by block load about the memory space, so it should work. That said, it's currently tested and used only for the case of global memory.

@jarmak-nv jarmak-nv transferred this issue from NVIDIA/cub Nov 8, 2023
@github-project-automation github-project-automation bot moved this to Todo in CCCL Nov 8, 2023
@miscco miscco removed their assignment Dec 6, 2023
@bernhardmgruber bernhardmgruber self-assigned this Aug 2, 2024
@bernhardmgruber
Contributor

Newer GPUs offer bulk copy operations from global to shared memory, e.g., LDGSTS on Ampere and TMA operations on Hopper. See for example the Tensor Memory Accelerator section in this blog post: https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth#asynchronous_execution. While we may employ some of these features in selected places, it is definitely worthwhile to generalize those loading facilities into new CUB BlockLoad algorithms, so we have to extend our software architecture to support shared memory as a destination. I will look into this at some point.
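
For reference, this is roughly what such a bulk copy looks like outside of CUB today, using cooperative_groups::memcpy_async (which can lower to cp.async / LDGSTS on Ampere and newer when alignment permits); a sketch with made-up kernel and tile names:

  #include <cooperative_groups.h>
  #include <cooperative_groups/memcpy_async.h>
  namespace cg = cooperative_groups;

  template <typename T, int BLOCK_THREADS, int ITEMS_PER_THREAD>
  __global__ void bulk_load_kernel(const T* d_in)
  {
      __shared__ T tile[BLOCK_THREADS * ITEMS_PER_THREAD];

      auto block = cg::this_thread_block();
      // Asynchronous block-wide copy global -> shared, without staging through registers.
      cg::memcpy_async(block,
                       tile,
                       d_in + blockIdx.x * BLOCK_THREADS * ITEMS_PER_THREAD,
                       sizeof(T) * BLOCK_THREADS * ITEMS_PER_THREAD);
      cg::wait(block); // waits for the copy and synchronizes the block

      // ... operate on tile[] in shared memory ...
  }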

@bernhardmgruber
Contributor

I have been playing around with BlockLoad a bit today, trying to fit in bulk copies from global to shared memory. Those are asynchronous and their destination is shared memory, so the existing API does not fit nicely: depending on the BlockLoadAlgorithm we want either a shared-memory-backed array or a register-backed array, so we cannot let the user create it and pass it in as an argument. I figured we would want to hide that array inside the implementation of BlockLoad and only give access to it:

  using block_load_t = cub::BlockLoad<T, ThreadsInBlock, ItemsPerThread, LoadAlgorithm>;
  using storage_t    = typename block_load_t::TempStorage;

  __shared__ storage_t storage;     // may store a shared memory array
  block_load_t block_load(storage); // may store a register array

  span<T, ItemsPerThread> data = block_load.Load(iterator);

Depending on LoadAlgorithm, the span returned from Load would either point to an array in registers or shared memory.

Later I wondered why we would expose an array at all and not just a range from which to pull individual elements. That would give us more implementation freedom, e.g. to load several times in smaller chunks and avoid large arrays in registers. But I don't know whether we would actually need that.

Also, several algorithms actually need the entire loaded block per thread in registers, because they later index into that array and don't process item by item.

However, since bulk copies are asynchronous, I need to wait for completion inside Load. But this is unnecessary if an algorithm needs multiple block loads, where a single synchronization after issuing all loads would suffice. I still have to figure out a good design for this, since it should not complicate the existing one or add overhead to synchronous load implementations. Maybe something as simple as a flag would be a good compromise:

  using block_load_keys_t = cub::BlockLoad<KeyT, ThreadsInBlock, ItemsPerThread, LoadAlgorithm>;
  using block_load_vals_t = cub::BlockLoad<ValT, ThreadsInBlock, ItemsPerThread, LoadAlgorithm>;
  using storage_keys_t    = typename block_load_keys_t::TempStorage;
  using storage_vals_t    = typename block_load_vals_t::TempStorage;

  __shared__ storage_keys_t storage_keys;
  __shared__ storage_vals_t storage_vals;
  block_load_keys_t block_load_keys(storage_keys);
  block_load_vals_t block_load_vals(storage_vals);

  span<KeyT, ItemsPerThread> keys = block_load_keys.Load(key_iterator, cub::no_sync);
  span<ValT, ItemsPerThread> vals = block_load_vals.Load(val_iterator); // sync only issued here

It feels a bit like a hack, but it's a lightweight solution.
