[Feature] Support New Arguments for Expert Routing Policies. #17
Comments
Hi @jacklanda,
Exactly, the code does perform the expert selection, but it seems to force a forward pass through every expert. In conclusion, forwarding through every expert means it is in fact a dense activation.
Only the top K experts will undergo a forward pass, and this top K can vary for different MoE blocks. Here, we create the indexes used for the expert forward pass. To summarize, we iterate over the experts and, for each expert E, select the token indexes that need to undergo the forward pass of expert E, then perform the forward pass.
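The iterate-and-select loop described above can be sketched in plain Python. This is only an illustration, not the mergoo code: `sparse_moe_forward`, `top_k_indices`, and `make_expert` are hypothetical names, the router scores are assumed to be positive (softmax-style) probabilities, and plain lists stand in for tensors.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def top_k_indices(scores: Vector, k: int) -> List[int]:
    """Indexes of the k largest scores (ties broken by lowest index)."""
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]


def sparse_moe_forward(
    tokens: List[Vector],
    router_scores: List[Vector],
    experts: List[Callable[[Vector], Vector]],
    k: int,
) -> Tuple[List[Vector], List[int]]:
    """Run each expert only on the tokens routed to it.

    Returns the mixed outputs and, per expert, how many tokens it processed.
    """
    num_tokens = len(tokens)
    outputs = [[0.0] * len(tokens[0]) for _ in tokens]
    calls = [0] * len(experts)
    # Top-k expert ids for every token.
    selected = [top_k_indices(scores, k) for scores in router_scores]

    for e, expert in enumerate(experts):
        # Token indexes that need the forward pass of expert e.
        token_ids = [t for t in range(num_tokens) if e in selected[t]]
        if not token_ids:
            continue  # nothing routed here, so no computation happens
        calls[e] = len(token_ids)
        for t in token_ids:
            # Renormalize the routing weight over the selected experts.
            norm = sum(router_scores[t][j] for j in selected[t])
            weight = router_scores[t][e] / norm
            y = expert(tokens[t])
            outputs[t] = [o + weight * v for o, v in zip(outputs[t], y)]
    return outputs, calls


def make_expert(scale: float) -> Callable[[Vector], Vector]:
    """Toy expert: multiplies every feature by a constant."""
    return lambda x: [scale * v for v in x]


tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
router_scores = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.5, 0.4, 0.1]]
experts = [make_expert(1.0), make_expert(2.0), make_expert(3.0)]

outputs, calls = sparse_moe_forward(tokens, router_scores, experts, k=1)
```

With `k=1`, the third expert receives no tokens and is skipped entirely, so the total number of expert evaluations is `num_tokens * k` rather than `num_tokens * num_experts`.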
Thanks for your reply! Will this call cause extra useless computation? I believe only the selected k experts should call the corresponding FFN module to compute the output tensor for the input token. For a comparable implementation, the Mixtral modeling code does the same thing as A and B. I think it is just a tiny bug in development, not an error in design :)
If by "useless computation" you mean extra forward passes, then no: the forward pass is only done for the indexes that require it. Line 78 is responsible for selecting the batch IDs and token IDs that need the forward pass of a specific expert, so the tensor passed to that expert contains only those tokens. If by "useless computation" you are referring to preparing the expert mask as shown here, an alternative could be implemented. However, I believe that indexing is not an expensive operation.
Note that the Let's break it down:
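As a rough breakdown of the expert-mask indexing discussed here, the sketch below mimics a one-hot mask followed by a `where` lookup, in the style of the Mixtral modeling code that the thread compares against. The helper names `build_expert_mask` and `where` are hypothetical, and nested lists stand in for tensors, so this is an assumption-laden illustration rather than the project's actual implementation.

```python
from typing import List, Tuple


def build_expert_mask(selected: List[List[int]], num_experts: int) -> List[List[List[int]]]:
    """mask[e][slot][t] == 1 when token t picked expert e in top-k slot `slot`."""
    num_tokens = len(selected)
    top_k = len(selected[0])
    mask = [[[0] * num_tokens for _ in range(top_k)] for _ in range(num_experts)]
    for t, experts_for_token in enumerate(selected):
        for slot, e in enumerate(experts_for_token):
            mask[e][slot][t] = 1
    return mask


def where(mask_for_expert: List[List[int]]) -> List[Tuple[int, int]]:
    """(slot, token) pairs of the nonzero entries, in row-major order."""
    return [
        (slot, t)
        for slot, row in enumerate(mask_for_expert)
        for t, value in enumerate(row)
        if value
    ]


# Top-2 expert ids for three tokens, with four experts available.
selected = [[0, 1], [1, 2], [0, 1]]
mask = build_expert_mask(selected, num_experts=4)

# Per expert: which (slot, token) pairs need its forward pass.
routed = [where(mask[e]) for e in range(4)]
```

Expert 3 is never selected, so `routed[3]` is an empty list: the expert would receive an empty input and contribute no real computation. Building the mask itself is only indexing work.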
Understood. Initially, I was concerned that some tokens may not need any computation by the experts. However, in that case the input tensor would be empty, so it does not cause any useless computation. Thanks for all your help.
To respond to the title of this issue, I think it would also be helpful to allow users to select routing policies dynamically. On one hand, for the Top-k scenario, the users can pass an argument like Users could not pass the
On the other hand, as far as I know, there are many useful routing policies, such as "sequence-level routing". Hence, it would be great to support additional policies like that.
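To make the idea of a user-selected policy concrete, here is a minimal sketch. The `route` function, its `policy` argument, and the policy names `"token_top_k"` and `"sequence_top_k"` are all hypothetical and not part of mergoo. Sequence-level routing is interpreted here as one common reading: every token in a sequence shares the experts with the highest mean router score.

```python
from typing import List

Vector = List[float]


def top_k_indices(scores: Vector, k: int) -> List[int]:
    """Indexes of the k largest scores (ties broken by lowest index)."""
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]


def route(router_scores: List[Vector], policy: str = "token_top_k", k: int = 2) -> List[List[int]]:
    """Pick expert ids for every token of ONE sequence.

    router_scores has shape [num_tokens][num_experts].
    """
    if policy == "token_top_k":
        # Each token independently picks its own k best experts.
        return [top_k_indices(scores, k) for scores in router_scores]
    if policy == "sequence_top_k":
        # All tokens share the k experts with the best mean score.
        num_tokens = len(router_scores)
        num_experts = len(router_scores[0])
        mean = [
            sum(scores[e] for scores in router_scores) / num_tokens
            for e in range(num_experts)
        ]
        shared = top_k_indices(mean, k)
        return [list(shared) for _ in router_scores]
    raise ValueError(f"unknown routing policy: {policy!r}")


scores = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.1, 0.5, 0.4]]
per_token = route(scores, policy="token_top_k", k=1)
per_sequence = route(scores, policy="sequence_top_k", k=1)
```

Under this sketch, the first token picks expert 0 with token-level routing but is assigned expert 1 with sequence-level routing, because expert 1 has the highest average score over the sequence.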
Hi there, thanks for mergoo, an amazing code base for MoE model construction.
A crucial feature that may need to be implemented is that mergoo should let the user select the basic routing policy when constructing the MoE layer.
Specifically, I think the forward method shown here should be refactored to support policy selection (an argument passed by the user). As far as I know, the current code will construct a fully-activated MoE model, not a truly sparse MoE model. I would be delighted to share my code for this feature and file a PR for it 🤗.
Would you have any thoughts to share about it?