Deep learning frameworks support multi-process data loading, such as the `num_workers` option of `DataLoader` in PyTorch, `MultiprocessIterator` in Chainer, etc. They use the `multiprocessing` module to launch worker processes, with `fork` as the default start method on Linux.

When using PFIO, if an HDFS connection is established before the fork, the connection state is also copied into the child processes. That connection is eventually destroyed when one of the workers finishes its work (this happens at the end of each epoch with PyTorch `DataLoader`). The remaining worker processes, however, still want to keep talking to HDFS, but because the connection has been closed unexpectedly and uncontrollably, they break.

As far as I know, the actual error message or symptom that users face differs depending on the situation (freezing, strange errors such as `RuntimeError: threads can only be started once`, etc.), which makes troubleshooting even more difficult.
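For illustration, here is a minimal sketch of the kind of setup that triggers the problem. The dataset class, the paths, and the exact PFIO calls (a v2-style `Hdfs()` handler with a file-like `open()`) are assumptions for the sake of the example, not taken from a real script:

```python
from torch.utils.data import Dataset, DataLoader
from pfio.v2 import Hdfs  # assumed v2-style handler; adjust to the PFIO version in use


class HdfsDataset(Dataset):
    """Hypothetical dataset that reads raw samples from HDFS via PFIO."""

    def __init__(self, paths):
        self.paths = paths
        # The HDFS connection is established here, in the parent process,
        # before DataLoader has a chance to fork any workers.
        self.fs = Hdfs()

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, i):
        # Forked workers inherit ``self.fs`` and therefore share the
        # parent's connection state, which is what eventually breaks.
        with self.fs.open(self.paths[i], 'rb') as f:
            return f.read()


dataset = HdfsDataset(['/user/me/data/{}.bin'.format(i) for i in range(1000)])
# num_workers > 0 makes DataLoader fork worker processes (on Linux, by default).
loader = DataLoader(dataset, batch_size=32, num_workers=4)

for batch in loader:  # iterating launches the workers
    pass
```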
The V2 API introduced proactive fork detection before entering PyArrow functions by checking process IDs: when a fork is detected, it raises an exception by default. With the vanilla `Hdfs()` class, developers can now recognize fork-after-HDFS-init as a bug, fix their code, and switch to `forkserver`. What do you think?
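The check itself can be as simple as remembering the PID at construction time and comparing it before each operation. The snippet below is only an illustrative sketch of that idea, not PFIO's actual implementation:

```python
import os


class ForkDetector:
    """Illustrative PID-based fork detection (not PFIO's actual code)."""

    def __init__(self):
        # Remember which process created the HDFS connection.
        self._owner_pid = os.getpid()

    def check(self):
        # A different PID means we are in a forked child that inherited
        # the parent's connection; fail loudly here instead of letting the
        # shared connection break in obscure ways later.
        if os.getpid() != self._owner_pid:
            raise RuntimeError(
                'HDFS connection created before fork was used in a forked '
                'process; set the multiprocessing start method to forkserver.')
```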
The workaround for this issue is to set the `multiprocessing` start method to `forkserver` before accessing HDFS. For a similar reason (to prevent the MPI context from being broken after fork), the ChainerCV and Chainer examples apply the same workaround, and it works for the PFIO + HDFS case, too:
https://github.com/chainer/chainercv/blob/master/examples/classification/train_imagenet_multi.py#L96-L100
https://github.com/chainer/chainer/blob/df53bff3f36920dfea6b07a5482297d27b31e5b7/examples/chainermn/imagenet/train_imagenet.py#L145-L148
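In a PyTorch training script, the workaround boils down to switching the start method before anything touches HDFS; a minimal sketch (the `main()` body is a placeholder):

```python
import multiprocessing


def main():
    # ... create the HDFS-backed Dataset and a DataLoader with
    # num_workers > 0 here, after the start method has been set ...
    pass


if __name__ == '__main__':
    # Must run before any HDFS connection is opened and before the
    # DataLoader launches its worker processes.
    multiprocessing.set_start_method('forkserver')
    main()
```

If changing the global start method is not desirable, recent PyTorch versions also accept a `multiprocessing_context='forkserver'` argument on `DataLoader`, which limits the change to the data-loading workers.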