
[AMD] Relax the restriction of dot shape >= 16 #3908

Draft
wants to merge 6 commits into main

Conversation

giuseros
Contributor

This is my first PR in Triton, and it tries to address the limitation that tt.dot only accepts sizes of at least (M,N,K) == (16,16,16).

I modified semantic.py to relax the tt.dot restrictions on the sizes of the matrices (for gfx9 architectures). Please note that there is a supportMFMA function that only accepts M/N sizes that are multiples of 16 and K sizes that are multiples of 8.

Based on that, I relaxed the restriction in semantic.py to support (M,N,K) >= (16,16,8). This is the minimal change.

If we want to push further, we would need to change supportMFMA and add tests for smaller layouts (many of those smaller layouts are broadcast layouts; do we support those in the AMD backend?).

Please note: for now, if I try to feed a smaller layout (e.g., 8x8x8), the test fails with mismatches.
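
For reference, here is a minimal sketch of the kind of shape check being relaxed in semantic.py; the helper name, the is_amd_gfx9 flag, and the error message are assumptions for illustration, not the actual diff:

    # Hypothetical sketch, not the exact change in this PR.
    # On AMD gfx9 the MFMA path can handle K >= 8, so only M and N keep the >= 16 floor.
    def check_dot_shapes(lhs_shape, rhs_shape, is_amd_gfx9):
        M, K = lhs_shape[-2], lhs_shape[-1]
        N = rhs_shape[-1]
        min_k = 8 if is_amd_gfx9 else 16  # assumed gating; the real code may query the target differently
        assert M >= 16 and N >= 16 and K >= min_k, \
            f"dot requires M >= 16, N >= 16, K >= {min_k}; got M={M}, N={N}, K={K}"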

@giuseros giuseros requested a review from ptillet as a code owner May 14, 2024 16:57
@giuseros
Contributor Author

cc @zhanglx13 @binarman

@jlebar jlebar requested a review from antiagainst May 14, 2024 16:58
@zhanglx13 zhanglx13 marked this pull request as draft May 14, 2024 17:01
@binarman
Contributor

@giuseros

many of those smaller layouts are broadcast layouts; do we support those in the AMD backend?

Do you mean slice layout?
If so, the answer is yes, we do. There are some issues with WMMA at this point, but I think @joviliast is working on that at the moment.

@YixinSong-e

Nice! Do you have plans to support 8x8x8?

@giuseros
Contributor Author

@giuseros

many of those smaller layouts are broadcast layouts; do we support those in the AMD backend?

Do you mean slice layout? If so, the answer is yes, we do. There are some issues with WMMA at this point, but I think @joviliast is working on that at the moment.

I am not sure what a slice layout is, but instructions like mfma_4x4x1_16B work on 16 blocks. You can broadcast rows of A (or columns of B) to make this work as a single GEMM. Is this what @joviliast is working on?
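
To make the broadcast idea concrete, here is an illustrative NumPy sketch (plain Python, not backend code) of how a single 4x64 GEMM can be laid across 16 blocks when A is broadcast to every block:

    # Illustrative only: mfma_4x4x1 with 16 blocks computes, per block b,
    # a 4x4 tile C_b += A_b (4x1) @ B_b (1x4). Broadcasting the same A slice
    # to all 16 blocks and giving each block its own 4 columns of B yields
    # one 4 x 64 GEMM.
    import numpy as np

    M, N, K, BLOCKS = 4, 64, 1, 16
    A = np.random.rand(M, K).astype(np.float32)
    B = np.random.rand(K, N).astype(np.float32)

    C = np.zeros((M, N), dtype=np.float32)
    for b in range(BLOCKS):                       # one iteration per hardware block
        a_blk = A                                 # A is broadcast to every block
        b_blk = B[:, 4 * b:4 * (b + 1)]           # each block owns 4 columns of B
        C[:, 4 * b:4 * (b + 1)] += a_blk @ b_blk  # 4x4 += 4x1 @ 1x4

    assert np.allclose(C, A @ B)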

@giuseros
Contributor Author

Nice! Do you have plans to support 8x8x8?

Hi @YixinSong-e, I think we should support any size in the front-end and let the backend decide how to lower it. But we need other people to agree with this :)

@joviliast
Contributor

@giuseros

many of those smaller layouts are broadcast layouts; do we support those in the AMD backend?

Do you mean slice layout? If so, the answer is yes, we do. There are some issues with WMMA at this point, but I think @joviliast is working on that at the moment.

I am not sure what a slice layout is, but instructions like mfma_4x4x1_16B work on 16 blocks. You can broadcast rows of A (or columns of B) to make this work as a single GEMM. Is this what @joviliast is working on?

I believe slices for WMMA layouts are completely supported.

@binarman
Contributor

@giuseros

I am not sure what a slice layout is, but instructions like mfma_4x4x1_16B work on 16 blocks. You can broadcast rows of A (or columns of B) to make this work as a single GEMM. Is this what @joviliast is working on?

Ah, I see, thanks.

FYI: I've experimented with 3 types of layouts that use mfma4x4 in the ROCm fork.
They had the following tile sizes (A(MxK) * B(KxN)):

  1. 4(M) x 4(N) x 64(K)
  2. 4(M) x 64(N) x 4(K)
  3. 4(M) x 64(N) x 64(K)

So far, the most promising layout is the third one, but its use is limited because of the large difference in size between the first and second operands.

@giuseros
Contributor Author

@giuseros

I am not sure what a slice layout is, but instructions like mfma_4x4x1_16B work on 16 blocks. You can broadcast rows of A (or columns of B) to make this work as a single GEMM. Is this what @joviliast is working on?

Ah, I see, thanks.

FYI: I've experimented with 3 types of layouts that use mfma4x4 in the ROCm fork. They had the following tile sizes (A(MxK) * B(KxN)):

  1. 4(M) x 4(N) x 64(K)
  2. 4(M) x 64(N) x 4(K)
  3. 4(M) x 64(N) x 64(K)

So far, the most promising layout is the third one, but its use is limited because of the large difference in size between the first and second operands.

So I guess my point is that there are two natural next steps to this PR:

  • First, support every size for the non-accelerated layout
  • Second, introduce broadcast layouts and support those, at least for the cases where the size is too small to use any other mfma

I think the second point is not super important, because many frameworks simply use reduction mfma. It might be that we also want to skip the first point if there is higher-priority work to do.

@ThomasRaoux
Contributor

Can you provide a bit more info on the motivation? It sounds like this breaks portability.

@giuseros
Contributor Author

Can you provide a bit more info on the motivation? It sounds like this breaks portability.

Hi @ThomasRaoux, the point is that the AMD backend can accelerate smaller sizes than 16x16x16; that's why we are trying to add this relaxation in the frontend.

@antiagainst
Collaborator

Can you provide a bit more info on the motivation? It sounds like this breaks portability.

Do we require all implementations to support the same set of shapes? I think that'd be hard, right? Various ways to accelerate different dot variants are very important "innovations" these days. And we have different levels of support for various element types anyway.

I feel it might make sense to be less restrictive here and let the backend decide how best to lower it, and/or reject it if it cannot be supported?

else:
    assert lhs.shape[-2].value >= 16 and lhs.shape[-1].value >= 16 \
           and rhs.shape[-1].value >= 16, \
        f"All non-batch values in both first input shape ({lhs.shape}) and second input shape ({rhs.shape}) must be >= 16!"
if lhs.type.scalar.is_int():

The int type requirement is specific to CUDA, so we should remove it for the AMD backend.
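
As a hedged sketch of what "remove it for the AMD backend" could look like, where the is_cuda flag and the concrete int8 constraint are both assumptions for illustration rather than the file's actual logic:

    # Hypothetical illustration only: gate a backend-specific integer requirement.
    # How semantic.py actually queries the target, and what the CUDA-only
    # requirement is, may differ from this sketch.
    def check_int_dot(element_is_int8: bool, K: int, is_cuda: bool) -> None:
        if element_is_int8 and is_cuda:
            # Example of a CUDA-only constraint that would not apply to the AMD backend.
            assert K >= 32, "int8 dot assumed to require K >= 32 on the CUDA backend in this sketch"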

@@ -1319,6 +1319,9 @@ def _str_to_dot_input_precision(input_precision, builder):
def dot(lhs: tl.tensor, rhs: tl.tensor, acc: tl.tensor, input_precision: Optional[str], max_num_imprecise_acc: int,
        out_dtype: tl.dtype, builder: ir.builder) -> tl.tensor:

    def support_m16n16k8():

For fp8 and int8 on MI300, the mfma instructions are 32x32x16 and 16x16x32, which are not applicable here.
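
A hedged sketch of that point, where the element-type names and concrete values are assumptions for illustration rather than code from this PR:

    # Illustrative only: the minimum K an MFMA tile needs depends on the element
    # type, so a single m16n16k8 check does not cover fp8/int8 on MI300, where the
    # MFMA shapes are 32x32x16 and 16x16x32. Values below are assumptions.
    def min_mfma_k(element_type: str) -> int:
        if element_type in ("fp8", "int8"):
            return 16   # smallest K among the 32x32x16 / 16x16x32 variants
        return 8        # assumed: the case the m16n16k8 helper targets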
