Deep learning frameworks support multi-process data loading, such as the `num_workers` option of `DataLoader` in PyTorch, `MultiprocessIterator` in Chainer, etc. They use the `multiprocessing` module to launch worker processes, with `fork` as the default start method on Linux.

When using PFIO, if an HDFS connection is established before the fork, the connection state is also copied into the child processes. That connection is eventually destroyed when one of the workers finishes its work (this happens at the end of each epoch with PyTorch `DataLoader`). The remaining worker processes, however, still want to keep talking to HDFS, but because the connection has been closed unexpectedly and uncontrollably, they break.

As far as I know, the actual error message or symptom that users face differs depending on the situation (freezing, strange errors such as `RuntimeError: threads can only be started once`, etc.), which makes troubleshooting even more difficult.
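For illustration, here is a minimal sketch of the kind of setup that triggers the problem. The dataset class, the paths, and the exact PFIO calls (a v2-style `Hdfs()` handler with a file-like `open()`) are assumptions for the sake of the example, not taken from a real script:

```python
from torch.utils.data import Dataset, DataLoader
from pfio.v2 import Hdfs  # assumed v2-style handler; adjust to the PFIO version in use


class HdfsDataset(Dataset):
    """Hypothetical dataset that reads raw samples from HDFS via PFIO."""

    def __init__(self, paths):
        self.paths = paths
        # The HDFS connection is established here, in the parent process,
        # before DataLoader has a chance to fork any workers.
        self.fs = Hdfs()

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, i):
        # Forked workers inherit ``self.fs`` and therefore share the
        # parent's connection state, which is what eventually breaks.
        with self.fs.open(self.paths[i], 'rb') as f:
            return f.read()


dataset = HdfsDataset(['/user/me/data/{}.bin'.format(i) for i in range(1000)])
# num_workers > 0 makes DataLoader fork worker processes (on Linux, by default).
loader = DataLoader(dataset, batch_size=32, num_workers=4)

for batch in loader:  # iterating launches the workers
    pass
```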
The V2 API introduced proactive fork detection before entering PyArrow functions by checking process IDs: when a fork is detected, it raises an exception by default. With the vanilla `Hdfs()` class, developers can now recognize fork-after-HDFS-init as a bug, fix their code, and switch to `forkserver`. What do you think?
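The check itself can be as simple as remembering the PID at construction time and comparing it before each operation. The snippet below is only an illustrative sketch of that idea, not PFIO's actual implementation:

```python
import os


class ForkDetector:
    """Illustrative PID-based fork detection (not PFIO's actual code)."""

    def __init__(self):
        # Remember which process created the HDFS connection.
        self._owner_pid = os.getpid()

    def check(self):
        # A different PID means we are in a forked child that inherited
        # the parent's connection; fail loudly here instead of letting the
        # shared connection break in obscure ways later.
        if os.getpid() != self._owner_pid:
            raise RuntimeError(
                'HDFS connection created before fork was used in a forked '
                'process; set the multiprocessing start method to forkserver.')
```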
The workaround for this issue is to set the `multiprocessing` start method to `forkserver` before accessing HDFS. For a similar reason (to prevent the MPI context from being broken after fork), the ChainerCV and Chainer examples apply the same workaround, and it works for the PFIO + HDFS case, too:
https://github.com/chainer/chainercv/blob/master/examples/classification/train_imagenet_multi.py#L96-L100
https://github.com/chainer/chainer/blob/df53bff3f36920dfea6b07a5482297d27b31e5b7/examples/chainermn/imagenet/train_imagenet.py#L145-L148
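In a PyTorch training script, the workaround boils down to switching the start method before anything touches HDFS; a minimal sketch (the `main()` body is a placeholder):

```python
import multiprocessing


def main():
    # ... create the HDFS-backed Dataset and a DataLoader with
    # num_workers > 0 here, after the start method has been set ...
    pass


if __name__ == '__main__':
    # Must run before any HDFS connection is opened and before the
    # DataLoader launches its worker processes.
    multiprocessing.set_start_method('forkserver')
    main()
```

If changing the global start method is not desirable, recent PyTorch versions also accept a `multiprocessing_context='forkserver'` argument on `DataLoader`, which limits the change to the data-loading workers.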