Question: ScanTileState's reliance on the longlong2 vector type for coherent reads and writes #1069
IlyaGrebnov asked this question in CUB
ScanTileState (as well as ReduceByKeyScanTileState) currently relies on the 128-bit longlong2 vector type to read and write the tile descriptor state coherently, without using __threadfence or splitting the state into an aggregate plus a status word. That would be understandable for atomic types, but 128-bit types are not atomic. From "Parallel Thread Execution ISA Version 8.0":
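For comparison, the "aggregate + status word" alternative mentioned above can be sketched on the host with C++ `std::atomic`. This is a sketch under stated assumptions, not CUB's implementation: `SplitTileDescriptor`, `publish`, `wait_and_read` and `run_split_demo` are hypothetical names, and on the device the release/acquire pair would roughly correspond to a `__threadfence` between the two stores (or release/acquire accesses through libcu++'s `cuda::atomic_ref`).

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Hypothetical split descriptor: the aggregate and the status flag live in
// separate words, and publication order is enforced with release/acquire
// instead of relying on one wide (128-bit) store being indivisible.
constexpr std::uint32_t kSplitInvalid = 0;
constexpr std::uint32_t kSplitAggregate = 1;

struct SplitTileDescriptor {
    std::atomic<long long> aggregate{0};
    std::atomic<std::uint32_t> status{kSplitInvalid};
};

// Writer: store the aggregate first, then publish the status with release
// semantics, so any reader that sees the status also sees the aggregate.
void publish(SplitTileDescriptor& d, long long value) {
    d.aggregate.store(value, std::memory_order_relaxed);
    d.status.store(kSplitAggregate, std::memory_order_release);
}

// Reader: spin until the status is published (acquire), then read the aggregate.
long long wait_and_read(const SplitTileDescriptor& d) {
    while (d.status.load(std::memory_order_acquire) == kSplitInvalid) {
        std::this_thread::yield();
    }
    return d.aggregate.load(std::memory_order_relaxed);
}

// Runs one writer and one reader concurrently and returns what the reader saw.
long long run_split_demo(long long value) {
    SplitTileDescriptor d;
    long long observed = 0;
    std::thread reader([&] { observed = wait_and_read(d); });
    std::thread writer([&] { publish(d, value); });
    writer.join();
    reader.join();
    return observed;
}
```

The trade-off is a second memory access plus an ordering constraint per descriptor, which a single wide store avoids only if that store is actually single-copy atomic.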
8.2.3. Memory Operations on Vector Data Types
The memory consistency model relates operations executed on memory locations with scalar data types, which have a maximum size and alignment of 64 bits. Memory operations with a vector data type are modeled as a set of equivalent memory operations with a scalar data type, executed in an unspecified order on the elements in the vector.
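To make the consequence of the quoted rule concrete, here is a deliberately simplified toy model in C++: a 16-byte descriptor {status, value} is written and read as two independent 8-byte accesses, each side picks an element order, and every interleaving of the four accesses is enumerated. The model is an assumption for illustration only (it treats each interleaving as sequentially consistent, which is already stronger than what PTX promises), and `ToyDescriptor` and `count_torn_reads` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::int64_t kInvalid = 0;
constexpr std::int64_t kReady = 1;
constexpr std::int64_t kValue = 42;

// Element-order choices: bit 0 = status accessed first, bit 1 = value first.
constexpr unsigned kStatusFirst = 1u;
constexpr unsigned kValueFirst = 2u;
constexpr unsigned kAnyOrder = kStatusFirst | kValueFirst;

struct ToyDescriptor { std::int64_t status; std::int64_t value; };

// Counts interleavings in which the reader sees status == kReady together
// with a stale value, i.e. a torn read of the "vector" descriptor.
int count_torn_reads(unsigned writer_orders, unsigned reader_orders) {
    int torn = 0;
    for (unsigned w = 0; w < 2; ++w) {
        if (!(writer_orders & (1u << w))) continue;
        for (unsigned r = 0; r < 2; ++r) {
            if (!(reader_orders & (1u << r))) continue;
            // Each 4-bit mask marks which of the 4 global steps are writer steps.
            for (unsigned mask = 0; mask < 16; ++mask) {
                int bits = 0;
                for (unsigned b = mask; b != 0; b >>= 1) bits += static_cast<int>(b & 1u);
                if (bits != 2) continue;

                ToyDescriptor mem{kInvalid, 0};
                ToyDescriptor seen{-1, -1};
                int wi = 0, ri = 0;
                for (int step = 0; step < 4; ++step) {
                    if (mask & (1u << step)) {
                        // w == 0: status then value; w == 1: value then status.
                        bool status_now = (w == 0) == (wi == 0);
                        if (status_now) mem.status = kReady; else mem.value = kValue;
                        ++wi;
                    } else {
                        bool status_now = (r == 0) == (ri == 0);
                        if (status_now) seen.status = mem.status; else seen.value = mem.value;
                        ++ri;
                    }
                }
                if (seen.status == kReady && seen.value != kValue) ++torn;
            }
        }
    }
    return torn;
}
```

With unconstrained element orders (`count_torn_reads(kAnyOrder, kAnyOrder)`) the model finds torn reads; only when the writer stores the value first and the reader loads the status first does the count drop to zero, and the quoted "unspecified order" is precisely what rules out relying on that.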
Similarly, libcu++ notes that "atomic_ref does not support types larger than 8 bytes."
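By contrast, a descriptor that fits in 8 bytes stays within the size limit quoted above. The following host-side C++ sketch assumes a 32-bit status and a 32-bit value packed into one `std::atomic<std::uint64_t>`; `pack`, `unpack_status`, `unpack_value` and `PackedTileDescriptor` are hypothetical names, not CUB's API. Because both fields travel in one 64-bit word, a single atomic load cannot observe one field updated without the other.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kStatusInvalid = 0;
constexpr std::uint32_t kStatusAggregate = 1;

// Status in the high 32 bits, value in the low 32 bits.
constexpr std::uint64_t pack(std::uint32_t status, std::uint32_t value) {
    return (static_cast<std::uint64_t>(status) << 32) | value;
}
constexpr std::uint32_t unpack_status(std::uint64_t word) {
    return static_cast<std::uint32_t>(word >> 32);
}
constexpr std::uint32_t unpack_value(std::uint64_t word) {
    return static_cast<std::uint32_t>(word & 0xFFFFFFFFu);
}

struct PackedTileDescriptor {
    std::atomic<std::uint64_t> word{pack(kStatusInvalid, 0)};

    // One 64-bit store publishes status and value together.
    void publish(std::uint32_t value) {
        word.store(pack(kStatusAggregate, value), std::memory_order_release);
    }

    // One 64-bit load observes both fields; returns false if not yet published.
    bool try_read(std::uint32_t& value) const {
        const std::uint64_t w = word.load(std::memory_order_acquire);
        if (unpack_status(w) == kStatusInvalid) return false;
        value = unpack_value(w);
        return true;
    }
};
```

This only works while the value leaves room for the status in 8 bytes; an 8-byte value plus a status already needs 16 bytes, which is the longlong2 case this question is about.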
So, how does this currently work in CUB? Does the hardware (SASS) actually provide atomic 128-bit loads and stores, with that guarantee simply not exposed by PTX, so that it could break on future hardware?