BlockLoad/BlockStore to/from shared memory #885
Comments
Thank you for this request! I agree that there is value in this kind of functionality. Block load provides vectorized loads and shared memory padding. While vectorized loads might be used outside of the block load facility, I don't think this alternative API should be exposed as part of the existing one. Although the implementation might be straightforward, the API for exposing it is not. We are currently investigating a more generic container type for CUB that should address your use case. I'll keep this issue updated with any developments in this area.
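For context, a minimal sketch of the existing `cub::BlockLoad` interface that this request wants to sidestep (the kernel and the choice of BLOCK_LOAD_TRANSPOSE are illustrative; the point is that the per-thread register array `items` is mandatory in the current API):

```cpp
#include <cub/block/block_load.cuh>

template <typename T, int ThreadsInBlock, int ItemsPerThread>
__global__ void ExampleKernel(const T* d_in)
{
    using BlockLoadT = cub::BlockLoad<T, ThreadsInBlock, ItemsPerThread, cub::BLOCK_LOAD_TRANSPOSE>;
    __shared__ typename BlockLoadT::TempStorage temp_storage;

    // Even when BLOCK_LOAD_TRANSPOSE stages the data through shared memory,
    // the items always end up in this per-thread register array.
    T items[ItemsPerThread];
    BlockLoadT(temp_storage).Load(d_in + blockIdx.x * ThreadsInBlock * ItemsPerThread, items);

    // ... the shared-memory staging area inside temp_storage stays opaque to the user ...
}
```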
Thanks for the input. I was originally looking into doing a PR myself, but I agree that these API design decisions should be carefully made by the maintainers. Looking forward to 1.17 then!
Just to chime in -- it won't be in place for 1.17, and it may be a while before the containers George described are ready. But there are some interesting developments planned in this area, so stay tuned 🙂
@allisonvacanti Too bad... Maybe I will just still play around with a
I think I have a better understanding of CUB's shared memory philosophy by now. I would still be interested to hear about that container project, but right now I have a different question that, if posted as a new issue, would end up with a permutation of this issue's title, so I decided to ask it here: after creating a histogram in shared memory, is loading from shared memory in scope for this request as well? Clearly some combinations of types, algorithms and bank sizes will cause bank conflicts, so the algorithm has to be chosen with care.
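To make the use case concrete, a minimal sketch of the situation described above (the kernel, bin count and plain `atomicAdd` accumulation are illustrative assumptions, not taken from the thread): a per-block histogram is accumulated in shared memory, and one would then like to hand it to a block-wide algorithm without first copying the bins into per-thread registers.

```cpp
template <int ThreadsInBlock, int NumBins>
__global__ void HistogramKernel(const int* d_samples, int samples_per_block)
{
    __shared__ unsigned int hist[NumBins];

    // Zero the bins cooperatively.
    for (int bin = threadIdx.x; bin < NumBins; bin += ThreadsInBlock)
        hist[bin] = 0;
    __syncthreads();

    // Accumulate this block's samples; values in d_samples are assumed to be valid bin indices.
    for (int i = threadIdx.x; i < samples_per_block; i += ThreadsInBlock)
        atomicAdd(&hist[d_samples[blockIdx.x * samples_per_block + i]], 1u);
    __syncthreads();

    // The histogram now lives in shared memory; feeding it into a block-wide
    // scan or sort today means copying bins into per-thread register arrays first.
}
```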
@pauleonix unless you provide something like
Newer GPUs offer bulk copy operations from global to shared memory, e.g., LDGSTS on Ampere and TMA operations on Hopper. See for example the Tensor Memory Accelerator section in this blog post: https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth#asynchronous_execution. While we may employ some of these features in selected places, it is definitely worthwhile to generalize those loading facilities into new CUB BlockLoad algorithms. That requires extending our software architecture to support shared memory as a destination. I will look into this at some point.
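For readers unfamiliar with these bulk copies, a minimal sketch using the `cooperative_groups::memcpy_async` facility, which lowers to `cp.async`/LDGSTS on hardware that supports it and falls back to a plain copy elsewhere (the kernel and tile sizing are illustrative assumptions):

```cpp
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

template <typename T, int ThreadsInBlock, int ItemsPerThread>
__global__ void BulkCopyKernel(const T* d_in)
{
    constexpr int tile_size = ThreadsInBlock * ItemsPerThread;
    __shared__ T tile[tile_size];

    auto block = cg::this_thread_block();

    // Asynchronous bulk copy global -> shared; all threads of the block cooperate.
    // Assumes d_in holds at least one full tile per block.
    cg::memcpy_async(block, tile, d_in + blockIdx.x * tile_size, sizeof(T) * tile_size);
    cg::wait(block); // wait until the copy has landed in shared memory

    // ... operate on tile[] directly in shared memory ...
}
```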
I have been playing around with BlockLoad a bit today, trying to fit bulk copies from global to shared memory. Those are asynchronous and the destination is shared memory, so the existing API does not fit nicely, since depending on the load algorithm the loaded data ends up either in shared memory or in registers. A sketch of what an adapted interface could look like:

```cpp
using block_load_t = cub::BlockLoad<T, ThreadsInBlock, ItemsPerThread, LoadAlgorithm>;
using storage_t    = typename block_load_t::TempStorage;

__shared__ storage_t storage;     // may store a shared memory array
block_load_t block_load(storage); // may store a register array
span<T, ItemsPerThread> data = block_load.Load(iterator);
```

Depending on `LoadAlgorithm`, the returned span would view either shared memory or registers. Later I wondered why we would expose an array at all and not just a range from which to pull individual elements. This would give us more implementation freedom to load several times in smaller chunks and avoid large arrays in registers. But I don't know whether we would actually need that. Also, several algorithms actually need the entire loaded block per thread in registers, because they later index into that array and don't process item by item.

However, since bulk copies are asynchronous, I need to wait for completion inside `Load()`. When several loads are issued back to back (e.g. keys and values), an explicit no-sync overload could defer that wait to the last call:

```cpp
using block_load_keys_t = cub::BlockLoad<KeyT, ThreadsInBlock, ItemsPerThread, LoadAlgorithm>;
using block_load_vals_t = cub::BlockLoad<ValT, ThreadsInBlock, ItemsPerThread, LoadAlgorithm>;
using storage_keys_t    = typename block_load_keys_t::TempStorage;
using storage_vals_t    = typename block_load_vals_t::TempStorage;

__shared__ storage_keys_t storage_keys;
__shared__ storage_vals_t storage_vals;

block_load_keys_t block_load_keys(storage_keys);
block_load_vals_t block_load_vals(storage_vals);

span<KeyT, ItemsPerThread> keys = block_load_keys.Load(key_iterator, cub::no_sync);
span<ValT, ItemsPerThread> vals = block_load_vals.Load(val_iterator); // sync only issued here
```

It feels a bit like a hack, but it's a lightweight solution.
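Outside of a future CUB interface, the deferred-wait pattern that the hypothetical `cub::no_sync` tag aims at can already be expressed with cooperative groups: issue several asynchronous copies and pay for a single wait covering all of them. A sketch under the assumption that the caller provides shared-memory tiles for keys and values:

```cpp
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

template <typename KeyT, typename ValT, int ThreadsInBlock, int ItemsPerThread>
__device__ void LoadKeysAndValues(const KeyT* d_keys, const ValT* d_vals,
                                  KeyT (&s_keys)[ThreadsInBlock * ItemsPerThread],
                                  ValT (&s_vals)[ThreadsInBlock * ItemsPerThread])
{
    constexpr int tile_size = ThreadsInBlock * ItemsPerThread;
    auto block = cg::this_thread_block();

    // Issue both copies without waiting in between ...
    cg::memcpy_async(block, s_keys, d_keys, sizeof(KeyT) * tile_size);
    cg::memcpy_async(block, s_vals, d_vals, sizeof(ValT) * tile_size);

    // ... and synchronize once for both of them.
    cg::wait(block);
}
```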
For some algorithms it makes no sense to have a whole block of data in registers at once. For others, a local buffer is bad due to dynamic indexing. For these it would be great to have versions of `cub::BlockLoad` and `cub::BlockStore` that skip the step of loading the values from shared memory into registers and just provide them in shared memory, such that each thread of a warp has all its values on the same bank.

One idea would be to create versions of the `Load`/`Store` methods that don't take a local array, and then have a device function or method to access data in `TempStorage`, so that the user doesn't have to care about the explicit memory layout and implementation details can be changed without breaking user code.

Naturally all this would only work for algorithms that store all data in shared memory at the same time. But as the core implementation already exists, the extension should hopefully be straightforward?
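Purely to illustrate the shape of the interface being asked for, a hypothetical sketch; `LoadToShared` and `SharedItem`, as well as `process`, `d_in` and `block_offset`, are invented names and do not exist in CUB:

```cpp
// Hypothetical interface sketch -- none of these members exist in CUB today.
using BlockLoadT = cub::BlockLoad<T, ThreadsInBlock, ItemsPerThread, cub::BLOCK_LOAD_TRANSPOSE>;
__shared__ typename BlockLoadT::TempStorage temp_storage;

BlockLoadT loader(temp_storage);
loader.LoadToShared(d_in + block_offset);  // hypothetical: stop after the shared-memory staging step
for (int i = 0; i < ItemsPerThread; ++i)
    process(loader.SharedItem(i));         // hypothetical accessor that hides the bank layout
```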