
Turn ParallelAnalysisBase into dask custom collection #136

Open

wants to merge 36 commits into master
Conversation

@yuxuanzhuang (Contributor) commented Aug 17, 2020

Fixes #135

Note that the only file changed relative to #132 is parallel.py. You can read https://github.com/yuxuanzhuang/pmda/pull/1/files to see the actual changes.
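
For background, a dask custom collection implements __dask_graph__, __dask_keys__, __dask_postcompute__, and __dask_scheduler__, and gains compute() from DaskMethodsMixin. A minimal, hypothetical sketch of the idea (class, attribute, and per-block function names are placeholders, not the parallel.py code in this PR):

```python
import dask
import dask.multiprocessing
from dask.base import DaskMethodsMixin


class ParallelSketch(DaskMethodsMixin):
    """Hypothetical illustration of the dask custom collection protocol."""

    def __init__(self, blocks):
        # one task per block of frames; keys and graph are built up front
        self._keys = [("block", i) for i in range(len(blocks))]
        self._graph = {key: (self._analyse_block, block)
                       for key, block in zip(self._keys, blocks)}

    @staticmethod
    def _analyse_block(block):
        return sum(block)                  # stand-in for per-block analysis

    def __dask_graph__(self):
        return self._graph

    def __dask_keys__(self):
        return self._keys

    def __dask_postcompute__(self):
        # finalize step: applied to the list of per-block results after compute()
        return list, ()

    # multiprocessing scheduler by default, as in this PR
    __dask_scheduler__ = staticmethod(dask.multiprocessing.get)


if __name__ == "__main__":
    print(ParallelSketch([[1, 2], [3, 4]]).compute())   # [3, 7]
```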

Changes made in this Pull Request:

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

@pep8speaks commented Aug 17, 2020

Hello @yuxuanzhuang! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 305:80: E501 line too long (80 > 79 characters)
Line 306:70: E128 continuation line under-indented for visual indent
Line 306:80: E501 line too long (80 > 79 characters)

Line 16:80: E501 line too long (84 > 79 characters)
Line 16:84: W504 line break after binary operator
Line 58:80: E501 line too long (104 > 79 characters)
Line 69:80: E501 line too long (115 > 79 characters)

Comment last updated at 2020-08-20 10:39:58 UTC

@yuxuanzhuang changed the title from "Joblist" to "Turn ParallelAnalysisBase into dask custom collection" on Aug 17, 2020
@orbeckst (Member) commented:
This PR really depends on PR #132 so we should look at that one first. Then you can rebase this one and it will become much cleaner.

@orbeckst (Member) left a comment:

Overall this looks like a really interesting way to move forward. This, together with the notebook, is a good study for how the next version of PMDA could look.

I have a bunch of initial questions/comments inline.

Also note that we would first need to merge PR #132 before really moving forward here.

We would also need to remove Python 2 support as soon as we become dependent on MDA 2.0.0 (but that's for PR #132).

Tests will obviously be needed...

        return self._keys

    # use the multiprocessing scheduler by default
    __dask_scheduler__ = staticmethod(dask.multiprocessing.get)
@orbeckst (Member):

Even though multiprocessing is the default scheduler, one can still use distributed, right?

@yuxuanzhuang (Contributor Author):

Right — the scheduler can be set via the global dask config, a context manager, or an argument to self.compute(). (https://docs.dask.org/en/latest/scheduler-overview.html#configuring-the-schedulers)
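
For example, with analysis standing in for such a collection (sketch only, not code from this PR; the last option needs dask.distributed installed):

```python
import dask
from dask.distributed import Client

# `analysis` is a placeholder for a ParallelAnalysisBase-style collection

# 1) global default for every subsequent compute()
dask.config.set(scheduler="processes")

# 2) scoped override via a context manager
with dask.config.set(scheduler="synchronous"):
    analysis.compute()

# 3) per-call override, e.g. hand the graph to a distributed cluster
client = Client()                          # local distributed cluster
analysis.compute(scheduler=client.get)
```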

pmda/parallel.py Outdated
np.array([el[5] for el in res]))

# this is crucial if the analysis does not iterate over
# the whole trajectory.
@orbeckst (Member):

Why is this crucial? What would happen? Add more comment.

@yuxuanzhuang (Contributor Author):

    def __dask_postpersist__(self):
        # we do not need a persist implementation
        raise NotImplementedError
@orbeckst (Member):

Will it not be possible to persist?

Presumably, that would have been possible previously if we had chosen persist in run instead of compute().
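
For reference, dask's contract is roughly that __dask_postpersist__ returns a rebuild callable plus extra arguments; a hypothetical sketch of what an implementation could look like (the _graph/_keys attributes are placeholders, not this PR's code):

```python
import copy

def __dask_postpersist__(self):
    # dask calls rebuild(dsk, *extra_args), where dsk maps this collection's
    # keys to their already-computed values; rebuild must return an
    # equivalent collection backed by that materialized graph.
    def rebuild(dsk, keys):
        new = copy.copy(self)      # shallow copy of the collection
        new._graph = dsk
        new._keys = keys
        return new

    return rebuild, (self._keys,)
```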

pmda/parallel.py Outdated
times_io), np.sum(times_compute)

@staticmethod
def _reduce(res, result_single_frame):
""" 'append' action for a time series"""
res.append(result_single_frame)
return res

def __getstate__(self):
@orbeckst (Member):

Does DaskMixin require the whole class to be picklable?

@yuxuanzhuang (Contributor Author):

Yes. In the old implementation the whole class had to be picklable as well.

FYI, the code here is no longer needed once MDAnalysis/mdanalysis#2893 is merged.
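
For illustration, the usual pattern is to drop the unpicklable attribute in __getstate__ and restore it in __setstate__; a generic sketch (not the actual code in parallel.py):

```python
import MDAnalysis as mda


class PicklableSketch:
    """Generic __getstate__/__setstate__ pattern, not the code in parallel.py."""

    def __init__(self, universe):
        self._universe = universe                  # unpicklable before MDA 2.0.0
        self._top = universe.filename              # topology file
        self._traj = universe.trajectory.filename  # trajectory file

    def __getstate__(self):
        # drop the attribute that cannot be pickled ...
        state = self.__dict__.copy()
        del state["_universe"]
        return state

    def __setstate__(self, state):
        # ... and rebuild it from picklable information after unpickling
        self.__dict__.update(state)
        self._universe = mda.Universe(self._top, self._traj)
```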

pmda/parallel.py Outdated
@@ -284,6 +281,69 @@ def _single_frame(self, ts, atomgroups):
"""
raise NotImplementedError

def prepare_jobs(self,
@orbeckst (Member):

prepare_jobs sounds confusing to me – what "jobs"? If it's part of the documented workflow then it could just be prepare.

prepare_dask would be more explicit but also a bit pointless because PMDA is fully intertwined with dask, so that's the only thing we would be preparing for. create_dask_graph is too long and says too much about implementation details.

All in all, I'd just call it prepare and add more docs stating clearly what is being prepared and under which circumstances a user needs to run it.
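
For example, the user-facing workflow with that name would read roughly as follows (the class and argument names are placeholders, not this PR's actual signature):

```python
# hypothetical user-facing workflow with the suggested name
rmsd = RMSDAnalysis(atomgroup)       # some ParallelAnalysisBase subclass
rmsd.prepare(n_blocks=4)             # partition frames and build the dask graph
result = rmsd.compute()              # execute with the chosen scheduler
```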
