TL;DR: How can I share information between processes before Lightning calls init_process_group?
My training script goes like this:
load config
create train folder (with date & time)
create data module
create model
create logger & callbacks (uses train folder)
train
test (uses train folder)
This works fine with 1 GPU, but with multiple GPUs (DDP strategy) the train folder causes problems. The model checkpoint, the logger, and even the test code all use this train folder, but with DDP every process creates its own folder, and I need one common folder. I tried the broadcasting and barrier methods I found here, but I need to share this information with the other processes before the Trainer class is initialized.
Is there a way to share information before the Trainer init, or am I going about this the wrong way?
(I am currently stuck on version 1.6.5, but I am willing to upgrade to solve this problem.)
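To make the "create train folder" step concrete, here is a minimal sketch of one possible workaround, not a confirmed Lightning feature. It assumes the extra DDP processes are started by the main process re-running the script with a copy of its environment, so a value stored in `os.environ` before the workers start is visible to them. The variable name `TRAIN_DIR` and the helper `get_train_dir` are hypothetical. If all ranks are started at the same time by an external launcher, this would likely not be enough on its own.

```python
import os
from datetime import datetime
from pathlib import Path

# Hypothetical environment variable name; not part of Lightning.
ENV_KEY = "TRAIN_DIR"


def get_train_dir(root: str = "runs") -> Path:
    """Return a run folder shared by every process that inherits os.environ."""
    if ENV_KEY not in os.environ:
        # Only the first process reaches this branch: it picks the
        # timestamped name and publishes it for processes started later.
        stamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
        os.environ[ENV_KEY] = str(Path(root) / stamp)
    train_dir = Path(os.environ[ENV_KEY])
    # exist_ok=True lets every process call this without racing on creation.
    train_dir.mkdir(parents=True, exist_ok=True)
    return train_dir
```

The script would call `get_train_dir()` in place of the current folder-creation step, before creating the logger and callbacks, so that every process that inherited the variable resolves the same path.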