Using multi-node with LocalExecutor #130
Hey @LopezGG, thanks for the issue. Can you share a sample command you run with torchrun? We can add extra options based on that to support multi-node via the LocalExecutor.
Thank you for the quick reply @hemildesai. Usually, with AML I use something like the standard multi-node torchrun invocation from https://pytorch.org/docs/stable/elastic/run.html. $NODE_RANK, $MASTER_ADDR, and $MASTER_PORT are set automatically by AML or Slurm, or can be set manually. A change in your script might look like the following (I need to test it, though): changing num_nodes will feed into NeMo-Run/src/nemo_run/core/execution/base.py (lines 183 to 188 in b4e2258) and can be called from there.
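For reference, a minimal sketch of that kind of launch, assuming AML exports NODE_RANK, MASTER_ADDR, and MASTER_PORT on every node; the script name, node count, and GPU count below are placeholders, not values taken from this thread:

```python
# Illustrative only: a multi-node torchrun launch driven by environment
# variables that AML (or Slurm) sets on each node. "train.py", --nnodes=2,
# and --nproc_per_node=8 are placeholders.
import os
import subprocess

cmd = [
    "torchrun",
    "--nnodes=2",
    "--nproc_per_node=8",
    f"--node_rank={os.environ['NODE_RANK']}",
    f"--master_addr={os.environ['MASTER_ADDR']}",
    f"--master_port={os.environ['MASTER_PORT']}",
    "train.py",
]
# The same command runs on every node; only NODE_RANK differs per node.
subprocess.run(cmd, check=True)
```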
Sounds good, we will make num nodes configurable.
@hemildesai: Please hold off on this. I had to make some changes in the torchrun file as well. I'll write back when I have it working.
I noticed LocalExecutor has a hard-coded value for nnodes:
NeMo-Run/src/nemo_run/core/execution/local.py, lines 53 to 54 (at b4e2258)
Is there a reason multi-node is disabled? It feeds into torch_run, which seems to support multiple nodes:
NeMo-Run/src/nemo_run/run/torchx_backend/components/torchrun.py, lines 104 to 124 (at b4e2258)
Asking because I am using this with AML, where I can usually get multi-node working with torchrun.
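For illustration, here is a minimal sketch (not the actual NeMo-Run source) of what a configurable node count on LocalExecutor could look like; the field and method names are assumptions chosen to mirror the torchrun component parameters mentioned above:

```python
# A sketch, not the actual NeMo-Run code: a LocalExecutor whose node count is
# configurable instead of hard-coded. Field and method names are assumptions.
from dataclasses import dataclass


@dataclass
class LocalExecutor:
    ntasks_per_node: int = 1
    nnodes: int = 1  # today this is effectively fixed at 1

    def nnodes_value(self) -> int:
        # The torchrun component would read this instead of a constant 1.
        return self.nnodes

    def nproc_per_node(self) -> int:
        return self.ntasks_per_node


# Example: request 2 nodes with 8 processes per node.
executor = LocalExecutor(ntasks_per_node=8, nnodes=2)
```

With something along these lines, the torchrun component could take the node count from the executor rather than a constant.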