Failed to lower a set between bDID and b. #3488
Comments
cc @naoyam, @cowanmeg and @samnordmann
Can you show the actual scheduling results of these tensors?
See T1_g_float.
Thanks. Does the second pattern work fine with change of For the first pattern, how is the parallelization interpreted? Since
That's right. I forgot where this pattern actually happened. Probably in dropout, which is replicated until sequence parallel is enabled.
Yes. I guess this is because the pointwise scheduler picked z as the reference TV, which has DID in it. So the schedule it proposes skips DID.
(I ran into this issue incidentally but haven't tried to reduce the repros or identify the reasons.)
Symptoms
Below are two minimal repros. Both run the following definition but with different parallelizations. The first test shards y but not x or z, and the second test shards x and z but not y.
Both tests fail to execute and throw errors like
Reasons for failure
Currently, isResharding doesn't map bDID and b. As a result, after InsertReshardingsPass, a set was added between y and z, which have two different shardings. This set was lowered to either an Allgather or a Scatter, both of which failed to execute. The failed Allgather tried to concatenate D input tensors of shape [1] into an output tensor of shape [1]. The failed Scatter tried to split an input tensor of shape [1] across D devices.
Failed attempts
My first reaction was to let isResharding ignore the DID on broadcast dimensions.
This was able to avoid the set and therefore the communication. However, the first test failed with a different error:
This is because the pointwise scheduler picked z as the reference TensorView, which isn't sharded, scheduled z on intra-GPU parallel types, and scheduled x in the same way, which produced multiple DIDx dimensions in its loop domain.
Potential solutions
Disallow b(DID) in favor of b.
Special-case the set between b(DID) and b, e.g., lower it to an alias operation instead.