-
Hi All,

Code in question

I'm trying to understand why wrapLDSBufferForStore should need a "vectorDim" as an argument. It seems like it is used to change the thread write-access layout to the LDS buffer (under wrapLDSBufferForStore). IIUC, that is there just to match the logic under wrapMatrixForGlobalLoad. So my question is: why do we change the thread layout (i.e., how tid maps to indices) based on the vectorization dimension?

Proposed alternative

My alternative/challenger to this design is as follows: we make the thread layout agnostic to the vectorization dimension of the register buffer.

splitId.merge({"k_thread", dThreadName}, {4, 5}, "tid", {kThreads, dThreads});
if (vectorDim == GemmDimension::K) {
  splitId.merge({dIterName, "k_iter"}, {6, 7}, "iter",
                {dPerThread, kPerThread});
} else {
  splitId.merge({"k_iter", dIterName}, {6, 7}, "iter",
                {kPerThread, dPerThread});
}

Why?

This would allow wrapLDSBufferForStore to always use

tidIter.merge("tid", 0, {"k_thread", dThreadName, kpack_thread});

so that wrapLDSBufferForStore would no longer need the global vectorization dimension as an argument. I'd appreciate any inputs, especially ones explaining why we shouldn't do this (if any).
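For concreteness, here is a minimal standalone sketch of the index arithmetic the proposal would encode (plain C++, not rocMLIR transform code; the sizes, the GemmDimension enum, and the blocked embedding k = k_thread * kPerThread + k_iter are assumptions for illustration only). The point it shows: the tid -> {k_thread, d_thread} split stays fixed, and only the order in which "iter" walks (k_iter, d_iter) depends on the vectorization dimension.

#include <cstdint>
#include <cstdio>

// Hypothetical sizes standing in for the merge sizes in the snippet above.
constexpr int64_t kThreads = 4, dThreads = 16;
constexpr int64_t kPerThread = 2, dPerThread = 4;

enum class GemmDimension { K, MorN };

// Map a (tid, iter) pair to (k, d) coordinates under the proposal: the
// tid split never depends on vectorDim; only the iter ordering does.
// The blocked embedding used below is an assumption for illustration.
void coords(int64_t tid, int64_t iter, GemmDimension vectorDim,
            int64_t &k, int64_t &d) {
  int64_t kThread = tid / dThreads; // fixed {kThreads, dThreads} split
  int64_t dThread = tid % dThreads;
  int64_t kIter, dIter;
  if (vectorDim == GemmDimension::K) {
    // iter = {d_iter, k_iter}: k_iter varies fastest, so consecutive
    // iterations of one thread move along K (the vectorized dimension).
    dIter = iter / kPerThread;
    kIter = iter % kPerThread;
  } else {
    // iter = {k_iter, d_iter}: d_iter varies fastest.
    kIter = iter / dPerThread;
    dIter = iter % dPerThread;
  }
  k = kThread * kPerThread + kIter;
  d = dThread * dPerThread + dIter;
}

int main() {
  int64_t k, d;
  for (int64_t iter = 0; iter < kPerThread * dPerThread; ++iter) {
    coords(/*tid=*/1, iter, GemmDimension::K, k, d);
    printf("tid 1, iter %2lld -> (k=%lld, d=%lld)\n",
           (long long)iter, (long long)k, (long long)d);
  }
  return 0;
}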
-
I know (I think) how to answer that one. It's based on coalescence. If your physical layout is (say) MxK, then K is the contiguous dimension, and you want thread 0 on element (0,0), thread 1 on element (0,1), and so on. If you look at https://github.com/ROCmSoftwarePlatform/rocMLIR/pull/996/files, we were originally always distributing the thread ids on the K dimension. But if the matrix was KxM (or KxN), this would create non-coalesced access, i.e., the threads would access the matrix in a strided fashion. So I think that doing the change you propose would bring us back to the way it was before the PR I mentioned. I think the point is that I used the vectorization dimension as a measure of contiguity, i.e., if the vector dim is K, that means strideK == 1, and that is the dimension along which we should distribute our threads. Please note that "thread" here means workItem.
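To make the coalescence point concrete, here is a minimal standalone sketch (plain C++, not rocMLIR code; the row-major MxK layout and the sizes are assumptions for illustration) showing the offsets touched by consecutive threads when they are distributed along the contiguous dimension versus the strided one.

#include <cstdint>
#include <cstdio>

// Hypothetical row-major MxK tile: element (m, k) lives at offset m*K + k,
// so K is the contiguous (stride-1) dimension. Sizes are made up.
constexpr int64_t M = 8, K = 64;

int64_t offset(int64_t m, int64_t k) { return m * K + k; }

int main() {
  // Threads distributed along K: tid -> element (0, tid). Consecutive
  // threads touch consecutive offsets, so their loads can coalesce.
  printf("threads distributed along K (contiguous):\n");
  for (int64_t tid = 0; tid < 4; ++tid)
    printf("  tid %lld -> offset %lld\n", (long long)tid,
           (long long)offset(0, tid));

  // Threads distributed along M: tid -> element (tid, 0). Consecutive
  // threads are K elements apart: strided, non-coalesced access.
  printf("threads distributed along M (strided):\n");
  for (int64_t tid = 0; tid < 4; ++tid)
    printf("  tid %lld -> offset %lld\n", (long long)tid,
           (long long)offset(tid, 0));
  return 0;
}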
-
I'll note that, for your subsequent GEMM, that vectorization dimension will be forced by the MFMA layout.