Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use multiprocess if available #368

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

tbekolay
Copy link
Contributor

@tbekolay tbekolay commented Aug 4, 2020

Windows doesn't have a true fork implementation, so multiprocessing and pickling are limited compared to other platforms. There are also some situations in Linux / Mac OS X where we would like to pickle something, but it does not work without some code changes. This PR makes pickling for mulitprocessing work in more situations by using the multiprocess library, if it's available.multiprocess is a fork of multiprocessing that uses dill instead of pickle.

This PR is set up such that we do not require the user to have multiprocess installed, but if it is installed we'll use it -- except in Windows, where pickling is so limited that it will make things easier to require it. There are a few other routes we could go:

  1. Don't require it in Windows: This makes things simpler for Windows users who don't want to use multiprocessing. Installing the multiprocess package should work on all platforms, but if there's a platform where there are likely to be issues, it's Windows.

  2. Require it for all platforms: This will likely be a net benefit to everyone, but it introduces a new dependency. Also, dill will always be a little bit slower than pickle since it's able to pickle more things. However, if we're not pickling anything too large anyway, performance shouldn't be an issue, and it would be possible to simplify some parts of the codebase by using dill instead of cloudpickle and removing some of the hacks that we currently use to get Windows to pickle things that it normally can't.

I'm happy to modify this PR to implement either option if you want! I've been using doit for a relatively large data science project and it's been a huge help.

@schettino72
Copy link
Member

Thanks.

  • On runner.py MRunner there is a check if multiprocessing is available. That should be taken into account...

  • In that spirit I guess it would be better to NOT make multiprocess required on Windows.

  • Also it would be nice to consolidate try/catch of the imports. Maybe on "compat.py" file.

  • Thanks for taking care of CHANGES file. Please also mention this in the "install.rst" file.

  • Finally, need to think how to deal this on CI.

@schettino72
Copy link
Member

Require it for all platforms: This will likely be a net benefit to everyone

could you please elaborate

@Kwpolska
Copy link
Contributor

Kwpolska commented Aug 5, 2020

It will also be useful on macOS, where spawn became the default in 3.8, and fork doesn’t always work properly.

@schettino72
Copy link
Member

schettino72 commented Sep 5, 2020

uqfoundation/multiprocess#65

It seems multiprocess package forced a "regression" because one of the features added (Pool) does not work well with "spawn" method. This leads me to think that multiprocess could be accepted as a multiprocessing replacement but should not be installed by default even for Windows/MAC.

@tbekolay
Copy link
Contributor Author

Hi @schettino72, since I was touching the doit code for #377 I updated this one also to take your feedback into account.

On runner.py MRunner there is a check if multiprocessing is available. That should be taken into account...

Done, I implemented the check in compat.py and call it from MRunner now.

In that spirit I guess it would be better to NOT make multiprocess required on Windows.

Yes, that makes sense. I instead made it an optional dependency that you can install with pip install doit[multiprocess].

Also it would be nice to consolidate try/catch of the imports. Maybe on "compat.py" file.

Done, the imports are done in compat.py now.

Thanks for taking care of CHANGES file. Please also mention this in the "install.rst" file.

Done!

Finally, need to think how to deal this on CI.

I didn't add anything to CI, but you could either modify one of the existing builds in your build matrix to install with multiprocess, by changing

      - run: pip install . -r dev_requirements.txt

to

      - run: pip install .[multiprocess] -r dev_requirements.txt

for that build (might need quotes, like pip install '.[multiprocess]', not sure what kind of shell Github actions uses). Or you could add a new build to the matrix that does the above with the latest version of Python. Those are my ideas anyhow; running the tests locally with multiprocess works correctly.

If you like, I can try to add a test that would fail with multiprocessing but succeed with multiprocess.

from multiprocess import Process, Queue as MQueue
HAS_MULTIPROCESS = True
except ImportError:
from multiprocessing import Process, Queue as MQueue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now this will break for BSD users without any of them installed.

except ImportError:
from multiprocessing import Process, Queue as MQueue
HAS_MULTIPROCESS = False
Process # pyflakes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont think this should be top level.

Copy link
Member

@schettino72 schettino72 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I would suggest that instead of having two copies of the import try/except we have only one version. get_multiprocess_lib().

If None that just means "not available".

And it should take an optional parameter start_method, if set you call:

multiprocess.set_start_method('spawn')

If the problem could be solved just by changing the "start_method" we could add a configuration parameter for that (not required on this PR).

@schettino72
Copy link
Member

If you like, I can try to add a test that would fail with multiprocessing but succeed with multiprocess.

That would be great.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants