
Allow allocation to be a split and different from loop. #3479

Open · wujingyue opened this issue Nov 26, 2024 · 9 comments · May be fixed by #3621

@wujingyue (Collaborator) commented Nov 26, 2024

This is a spin-off from #3458 (comment).

For a multi-GPU fusion to take a sharded input tensor, the allocation domain has to be a split of logical. For example,

logical: iM, iN
allocation: iDIDx{D}, iM/D, iN

The loop domain, ideally, shouldn't be set or used because a fusion/segment input comes from outside and is not generated by a loop. Below is an ideal usage, which I also committed to https://github.com/NVIDIA/Fuser/tree/bug3479.

TEST_F(AllocationDomainTest, InputAllocationIsSplit_Concrete) {
  auto fusion = std::make_unique<Fusion>();
  FusionGuard fg(fusion.get());

  TensorView* in = makeContigConcreteTensor({6});
  TensorView* out = set(in);
  fusion->addInput(in);
  fusion->addOutput(out);

  // Make the allocation domain a split of the logical domain; the loop domain
  // is left as the original logical domain.
  auto [outer, inner] = IterDomain::split(
      in->axis(0), IrBuilder::create<Val>(2, DataType::Index), true);
  in->setAllocationDomain({outer, inner}, true);

  FusionExecutorCache executor_cache(std::move(fusion));
  auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA);
  at::Tensor in_tensor = at::randn({6}, options);
  auto out_tensors = executor_cache.runFusionWithInputs({in_tensor});

  testValidate(
      executor_cache.fusion(), out_tensors, {in_tensor}, __LINE__, __FILE__);
}

This repro currently fails with the following error:

C++ exception with description " INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/ir/utils.cpp":965, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. dom0 has unreachable IDs. dom0: iS7{6}. dom1:

The code around

  auto replay_CasP = BestEffortReplay(
      new_IDs,
      producer->getLoopDomain(),
      logical_map.mapProducerToConsumer(producer->domain(), replayed));

tries to propagate the allocation domain from the producer to the consumer. However, the BestEffortReplay it creates fails to map the producer's allocation domain. As a result,

  if (auto it = p2c_map.find(id); it != p2c_map.end()) {

adds nothing to the consumer's allocation domain, leaving an empty allocation domain that doesn't post-dominate logical.

While I've worked around this problem by setting the loop domain to be the same as the allocation domain, @naoyam and I discussed some potential solutions in the original thread. One line of thought is to improve replayCasP; another is to propagate the allocation domain within a kernel through a different mechanism, since this is currently only needed for matmul. cc @zasdfgbnm
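
For reference, a minimal sketch of that workaround on top of the repro above (assuming TensorView::setLoopDomain accepts the same IterDomains used for the allocation domain):

  auto [outer, inner] = IterDomain::split(
      in->axis(0), IrBuilder::create<Val>(2, DataType::Index), true);
  in->setAllocationDomain({outer, inner}, true);
  // Workaround: make the loop domain identical to the allocation domain so
  // that replayCasP can map every allocation ID through the loop domain.
  in->setLoopDomain({outer, inner});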

wujingyue added a commit that referenced this issue Nov 26, 2024
@wujingyue changed the title from "Allocation can't be a split unless it's same as loop." to "Allow allocation to be a split and different from loop." on Nov 27, 2024
@naoyam (Collaborator) commented Dec 6, 2024

Let me see if I understand the issue correctly.

  • replayCasP has an option to propagate the allocation domain, but it assumes all IDs of the allocation domain are reachable between the root and loop domains. If not, the propagation of the allocation domain fails.
  • We could extend replayCasP to properly support the case like the above.
  • We could also just disable the propagation; however, the matmul scheduler relies on it.

It seems to me that we should change the matmul scheduler rather than replayCasP. I generally believe each scheduling primitive should do one task only. If we need to schedule the allocation domain, it should probably be done separately from the propagation of the loop domain.

Since this is only a problem with the matmul scheduler, I feel that we should extend the matmul scheduler to do scheduling of the allocation domain more explicitly. What do you think? @jacobhinkle @rdspring1 @protonu @zasdfgbnm

@jacobhinkle (Collaborator):

My .02: when I encountered this in the matmul scheduler, I was surprised it propagated the allocation domain. I think we should change that behavior since explicit is better than implicit.

@zasdfgbnm (Collaborator):

Isn't this guarded by a flag?

if (!opt.replay_allocation) {

Would a quick unblock be to just set this flag to false wherever it is used? I believe it was used in cacheBefore/cacheAfter.

But on the other hand, I do think replaying the allocation domain (or at least the stride order) is what we want. For example, if you have a fusion input with logical domain NCHW but stride order NHWC, then when you cacheAfter this fusion input you do want NHWC as the allocation domain of the cache, because otherwise you will not be able to vectorize. I think this is not a matter of matmul vs. other schedulers; it is only used by matmul because matmul is the only scheduler that is stride-order-aware today. Other schedulers will just schedule a bad kernel with no vectorization, which is what we should change in the future.
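
For concreteness, a minimal sketch of that scenario (assuming the test helpers and TensorView APIs used in the repro above; the exact calls are illustrative):

  TensorView* in = makeContigTensor(4); // logical: N, C, H, W
  fusion->addInput(in);
  TensorView* out = set(in);
  fusion->addOutput(out);

  // Declare that the input is physically laid out as NHWC.
  std::vector<IterDomain*> logical = in->getLogicalDomain();
  in->setAllocationDomain(
      {logical[0], logical[2], logical[3], logical[1]}, true);

  // cacheAfter inserts a copy between the input and its uses. To vectorize the
  // load, the cache's allocation (or at least its stride order) should follow
  // the input's NHWC layout rather than the logical NCHW order.
  TensorView* in_cache = in->cacheAfter();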

@naoyam (Collaborator) commented Dec 6, 2024

@zasdfgbnm For cacheAfter, I understand that we need to have a proper allocation order, but more strictly speaking, the allocation domain propagated from the input isn't actually the real allocation domain, right? The cache tensor should have some splits for parallelization that don't exist for the input tensor. So, the automatic propagation done by replayCasP isn't actually enough.

Yes, setting the replayCasP flag to false would unblock the issue for now, but I think the allocation domain really needs actual scheduling rather than just being set automatically behind the loop scheduling.

@wujingyue (Collaborator, Author):

> Yes, setting the replayCasP flag to false would unblock the issue for now

We have to do that conditionally -- propagate only for the matmul scheduler

@zasdfgbnm (Collaborator) commented Dec 7, 2024

> but more strictly speaking, the allocation domain propagated from the input isn't actually the real allocation domain, right? The cache tensor should have some splits for parallelization that don't exist for the input tensor. So, the automatic propagation done by replayCasP isn't actually enough.

It depends on how we define what a correct allocation domain is, which we haven't settled yet. For example, if a register tensor is:

Logical domain: [I0, I1]
Allocation domain: [I0, I1]
Loop domain: [BIDx{I0*I1/1024}, TIDx{256}, Vectorize{4}]
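
A minimal sketch of how a tensor could end up in that state (assuming the usual TensorView scheduling calls; tv is illustrative):

  // tv: logical domain [I0, I1]; allocation domain left at the default [I0, I1].
  tv->merge(0);      // [I0*I1]
  tv->split(0, 4);   // [I0*I1/4, 4]  (vectorization factor)
  tv->split(0, 256); // [I0*I1/1024, 256, 4]
  tv->axis(0)->parallelize(ParallelType::BIDx);
  tv->axis(1)->parallelize(ParallelType::TIDx);
  tv->axis(2)->parallelize(ParallelType::Vectorize);
  // The loop domain becomes [BIDx{I0*I1/1024}, TIDx{256}, Vectorize{4}], while
  // the allocation domain still says [I0, I1].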

Then do we consider the allocation of this tensor correct? The answer could be:

  1. Yes. Although I0 and I1 are only partially allocated, during lowering we have the capability of inferring the actual extents, so it is totally valid.
  2. No. The allocation domain should always faithfully represent how a tensor is allocated. Because we are not allocating [I0, I1] but just 4 elements, this allocation is wrong.

Option 1 is what we currently have in the legacy indexer. If we decide to keep option 1 in the new indexer, then I believe we should also allow an allocation of [I1, I0]. If this is the design decision, then the allocation domain propagated from the input should be pretty much sufficient for scheduling.

If our design decision is option 2, then of course we should schedule the allocation domain separately instead of propagating it along with the loop domain.

I personally don't have a preference between these two options, because to me both have pros and cons. I like option 1 because it makes writing a scheduler easier; in option 2, the scheduler needs to bookkeep the allocation domain. I dislike option 1 because, what is the point of having an allocation domain if we are not faithfully respecting it? I like option 2 because it makes the concept of "allocation domain" mentally easier to think about, and I dislike it because it makes writing a scheduler more difficult. For example, if all I want to say is "[I1, I0] is the correct stride order", I have to do all the heavy work of analyzing the loop domain and figuring out the correct allocation domain, instead of just naively setting it to [I1, I0] as in option 1 (see the sketch below).
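
For example, under option 1 the scheduler could simply record the desired stride order (a sketch, assuming the getLogicalDomain/setAllocationDomain TensorView APIs):

  // Option 1: just state the stride order [I1, I0]; lowering is expected to
  // infer the actual (possibly partial) allocation extents from the loop domain.
  std::vector<IterDomain*> logical = tv->getLogicalDomain();
  tv->setAllocationDomain({logical[1], logical[0]}, true);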

@naoyam (Collaborator) commented Dec 10, 2024

I see your point. From the scheduler's point of view, what matters is usually just the ordering, so option 1 would be good enough for the scheduler. We would need to make sure that lowering can infer the true allocation domain without any ambiguity, but I think that'd be the case.

In any case, I still prefer this approach:

> Would a quick unblock be to just set this flag to false wherever it is used? I believe it was used in cacheBefore/cacheAfter.

And have the matmul scheduler set the allocation-domain ordering it requires explicitly.

If we all agree on this, can someone volunteer to change the matmul scheduler?

@wujingyue (Collaborator, Author):

> If we all agree on this, can someone volunteer to change the matmul scheduler?

Add a knob like replay_allocation to cacheInputs and set that to true only in the matmul scheduler? If yes, I'm happy to take that.
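
A rough sketch of what that could look like (hypothetical; the extra parameter is illustrative and doesn't exist today):

  // Hypothetical knob threaded through the cache-insertion utility so that only
  // the matmul scheduler asks cacheAfter to replay the allocation domain.
  std::vector<TensorView*> cacheInputs(
      Fusion* fusion,
      bool unroll,
      bool replay_allocation = false); // hypothetical parameter

  // Matmul scheduler:
  //   cacheInputs(fusion, /*unroll=*/true, /*replay_allocation=*/true);
  // Other schedulers keep the default and skip the allocation replay.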

@naoyam (Collaborator) commented Dec 10, 2024

@wujingyue Yes, that should be good enough, although I still think we should do something with the matmul scheduler, since even @jacobhinkle was surprised by the automatic reordering.

@wujingyue wujingyue assigned wujingyue and unassigned naoyam Dec 10, 2024
wujingyue added a commit that referenced this issue Dec 19, 2024