How to make it work with torch DDP #813

Open
fecet opened this issue Jun 22, 2023 · 4 comments

Comments

@fecet

fecet commented Jun 22, 2023

Debugging code for parallel training is very painful, and it would be very appealing to be able to do it in a notebook.

The most relevant projects I found are https://github.com/philtrade/Ddip and https://github.com/Bluefog-Lib/bluefog, but neither seems to be maintained any longer.

@mosesnyamai

mosesnyamai commented Jun 23, 2023 via email

@minrk
Member

minrk commented Jun 23, 2023

Is there a task here? I'm not sure why this has been opened as an issue on this repo.

You can use IPython Parallel for a certain kind of debugging in parallel (it's not a debugger and certainly not a parallel debugger, which is very challenging), but it can be used to get an interactive interpreter in each of your worker processes for poking around.

@fecet
Author

fecet commented Jun 23, 2023

Having an interactive interpreter during parallel training is already quite appealing, especially now that deep learning increasingly relies on a significant amount of resources. This could be a scenario where ipyparallel can shine.

This issue is primarily a feature request or question: initiating torch DDP requires some additional setup, which remains challenging for regular users like me.

@minrk
Member

minrk commented Jun 23, 2023

Got it. I don't think there's any feature request here, but perhaps an example or some docs. I don't know much about torch or what DDP is, but if you can provide some examples of how workers are started with it, I may be able to give a hint as to how to get them into IPython Parallel. Usually the easiest way is to start IPython Parallel and then run whatever startup code the workers use in the IPython session.
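
For illustration, the first approach could look roughly like the sketch below. It is not taken from the thread: it assumes that the "startup code" for torch DDP is a call to `torch.distributed.init_process_group` using the `env://` init method, that all engines run on one machine, and that the CPU-only `gloo` backend is acceptable. The helper names (`ddp_env`, `init_process_group_on_engine`, `start_ddp_engines`) and the default address and port are hypothetical.

```python
def ddp_env(rank, world_size, master_addr="127.0.0.1", master_port=29500):
    """Environment variables read by torch.distributed's env:// init method.

    The address and port defaults are assumptions for a single-machine run.
    """
    return {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
        "RANK": str(rank),
        "WORLD_SIZE": str(world_size),
    }


def init_process_group_on_engine(env, backend="gloo"):
    """Runs on an engine: set the env vars, then join the process group."""
    # Imports live inside the function because it executes in the engine
    # process, not in the notebook process that defines it.
    import os
    import torch.distributed as dist

    os.environ.update(env)
    dist.init_process_group(backend=backend, init_method="env://")
    return dist.get_rank()


def start_ddp_engines(n=2):
    """Start n local engines and initialise torch.distributed in each one."""
    import ipyparallel as ipp

    cluster = ipp.Cluster(n=n)
    rc = cluster.start_and_connect_sync()
    # Submit to every engine without blocking first: init_process_group
    # only returns once all ranks have joined the group.
    pending = [
        rc[engine_id].apply_async(init_process_group_on_engine, ddp_env(rank, n))
        for rank, engine_id in enumerate(rc.ids)
    ]
    ranks = [ar.get() for ar in pending]
    return cluster, rc, ranks
```

After `cluster, rc, ranks = start_ddp_engines(2)` in a notebook, each engine should hold an initialised process group, and further cells can be sent to all of them (for example with the `%%px` magic) to build the model and step through training interactively.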

The second way is to use some code injection to launch an IPython interpreter in each worker after they are launched in whatever way your tool usually does. That's all the information I can give without knowing anything about the situation.
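
One possible reading of the second approach, sketched under assumptions: the workers are launched as usual (for example by `torchrun`, which sets the `RANK` environment variable), and the training script calls a small hook that starts an IPython kernel in selected ranks via `IPython.embed_kernel`. The `DEBUG_RANK` variable and both helper names are hypothetical and exist only for this example.

```python
import os


def should_embed(environ=None):
    """Decide whether this worker should open an IPython kernel.

    DEBUG_RANK is a hypothetical variable: unset means never embed,
    "all" means every rank, otherwise it must equal this worker's RANK.
    """
    environ = os.environ if environ is None else environ
    wanted = environ.get("DEBUG_RANK")
    if wanted is None:
        return False
    rank = environ.get("RANK", "0")
    return wanted == "all" or wanted == rank


def maybe_embed_kernel(local_ns=None):
    """Call from the training script at the point you want to inspect."""
    if should_embed():
        # Imported lazily so that normal runs do not need IPython installed.
        from IPython import embed_kernel

        # Blocks this worker until the kernel is shut down.
        embed_kernel(local_ns=local_ns)
```

Placing `maybe_embed_kernel(local_ns=locals())` inside the training loop and launching with `DEBUG_RANK=0` would start a kernel in rank 0 only; a client such as `jupyter console --existing` can then attach to it. While one rank is stopped in the kernel, the other ranks will wait at their next collective operation, so this is for poking around rather than for stepping through all workers in lockstep.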
