-
Hi All,

Code in question

I'm trying to understand why wrapLDSBufferForStore should need a "vectorDim" as an argument. It seems like it is used to change the thread write-access layout to the LDS buffer (under wrapLDSBufferForStore). IIUC, that is there just to match the logic under wrapMatrixForGlobalLoad. So my question is: why do we change the thread layout (i.e., how tid maps to indices) based on the vectorization dimension?

Proposed alternative

My alternative/challenger to this design is as follows: we make the thread layout agnostic to the vectorization dimension of the register buffer.

splitId.merge({"k_thread", dThreadName}, {4, 5}, "tid", {kThreads, dThreads});
if (vectorDim == GemmDimension::K) {
  splitId.merge({dIterName, "k_iter"}, {6, 7}, "iter",
                {dPerThread, kPerThread});
} else {
  splitId.merge({"k_iter", dIterName}, {6, 7}, "iter",
                {kPerThread, dPerThread});
}

Why?

This would allow wrapLDSBufferForStore to always use

tidIter.merge("tid", 0, {"k_thread", dThreadName, kpack_thread});

so that wrapLDSBufferForStore would no longer need the global vectorization dimension as an argument. I'd appreciate any inputs, especially ones explaining why we shouldn't do this (if any).
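For concreteness, here is a minimal standalone sketch of the index arithmetic the proposal would encode (plain C++, not rocMLIR transform code; the sizes, the GemmDimension enum, and the blocked embedding k = k_thread * kPerThread + k_iter are assumptions for illustration only). The point it shows: the tid -> {k_thread, d_thread} split stays fixed, and only the order in which "iter" walks (k_iter, d_iter) depends on the vectorization dimension.

#include <cstdint>
#include <cstdio>

// Hypothetical sizes standing in for the merge sizes in the snippet above.
constexpr int64_t kThreads = 4, dThreads = 16;
constexpr int64_t kPerThread = 2, dPerThread = 4;

enum class GemmDimension { K, MorN };

// Map a (tid, iter) pair to (k, d) coordinates under the proposal: the
// tid split never depends on vectorDim; only the iter ordering does.
// The blocked embedding used below is an assumption for illustration.
void coords(int64_t tid, int64_t iter, GemmDimension vectorDim,
            int64_t &k, int64_t &d) {
  int64_t kThread = tid / dThreads; // fixed {kThreads, dThreads} split
  int64_t dThread = tid % dThreads;
  int64_t kIter, dIter;
  if (vectorDim == GemmDimension::K) {
    // iter = {d_iter, k_iter}: k_iter varies fastest, so consecutive
    // iterations of one thread move along K (the vectorized dimension).
    dIter = iter / kPerThread;
    kIter = iter % kPerThread;
  } else {
    // iter = {k_iter, d_iter}: d_iter varies fastest.
    kIter = iter / dPerThread;
    dIter = iter % dPerThread;
  }
  k = kThread * kPerThread + kIter;
  d = dThread * dPerThread + dIter;
}

int main() {
  int64_t k, d;
  for (int64_t iter = 0; iter < kPerThread * dPerThread; ++iter) {
    coords(/*tid=*/1, iter, GemmDimension::K, k, d);
    printf("tid 1, iter %2lld -> (k=%lld, d=%lld)\n",
           (long long)iter, (long long)k, (long long)d);
  }
  return 0;
}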
-
I know (I think) how to answer that one. It's based on coalescence. If your physical layout is (say) MxK, then K is the contiguous dimension, and you want thread 0 on element (0,0), thread 1 on element (0,1), and so on. If you look at https://github.com/ROCmSoftwarePlatform/rocMLIR/pull/996/files, we were originally always distributing the thread ids on the K dimension. But if the matrix was KxM (or KxN), this would create non-coalesced access, i.e., the threads would access the matrix in a strided fashion. So I think that doing the change you propose would bring us back to the way it was before the PR I mentioned. I think the point is that I used the vectorization dimension as a measure of contiguity, i.e., if the vector dim is K, that means strideK == 1, and that is the dimension along which we should distribute our threads. Please note that "thread" here means workItem.
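To make the coalescence point concrete, here is a minimal standalone sketch (plain C++, not rocMLIR code; the row-major MxK layout and the sizes are assumptions for illustration) showing the offsets touched by consecutive threads when they are distributed along the contiguous dimension versus the strided one.

#include <cstdint>
#include <cstdio>

// Hypothetical row-major MxK tile: element (m, k) lives at offset m*K + k,
// so K is the contiguous (stride-1) dimension. Sizes are made up.
constexpr int64_t M = 8, K = 64;

int64_t offset(int64_t m, int64_t k) { return m * K + k; }

int main() {
  // Threads distributed along K: tid -> element (0, tid). Consecutive
  // threads touch consecutive offsets, so their loads can coalesce.
  printf("threads distributed along K (contiguous):\n");
  for (int64_t tid = 0; tid < 4; ++tid)
    printf("  tid %lld -> offset %lld\n", (long long)tid,
           (long long)offset(0, tid));

  // Threads distributed along M: tid -> element (tid, 0). Consecutive
  // threads are K elements apart: strided, non-coalesced access.
  printf("threads distributed along M (strided):\n");
  for (int64_t tid = 0; tid < 4; ++tid)
    printf("  tid %lld -> offset %lld\n", (long long)tid,
           (long long)offset(tid, 0));
  return 0;
}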
-
I'll note that, for your subsequent GEMM, that vectorization dimension will be forced by the MFMA layout.