-
Notifications
You must be signed in to change notification settings - Fork 49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parallel read and preprocess the data #371
base: master
Are you sure you want to change the base?
Parallel read and preprocess the data #371
Conversation
…tps://github.com/alchemistry/alchemlyb into 359-speed-up-the-readpreprocess-in-abfe-workflow
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #371 +/- ##
=======================================
Coverage 98.83% 98.84%
=======================================
Files 28 28
Lines 1895 1899 +4
Branches 407 408 +1
=======================================
+ Hits 1873 1877 +4
Misses 2 2
Partials 20 20 ☔ View full report in Codecov by Sentry. |
0f9ddae
to
d5fe5dd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this looks like a powerful new feature for the workflow. Do you have some simple performance benchmarks?
My primary concerns are (see comments)
- Making sure that joblib is explicitly installed.
- Is the new default
n_jobs=-1
safe? - add docs
@@ -13,14 +13,17 @@ The rules for this file: | |||
* release numbers follow "Semantic Versioning" https://semver.org | |||
|
|||
------------------------------------------------------------------------------ | |||
??/??/2024 orbeckst | |||
??/??/2024 orbeckst, xiki-tempula |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
merge & resolve conflicts and set up 2.5.0
@@ -11,3 +11,4 @@ dependencies: | |||
- pyarrow | |||
- matplotlib | |||
- loguru | |||
- joblib |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also update any other places where installation is needed (pyproject,toml, CI devtools/conda-envs/test_env.yaml, RTD, later the cf feedstock)
import pytest | ||
from alchemtest.amber import load_bace_example | ||
from alchemtest.gmx import load_ABFE | ||
from joblib import parallel_config |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just import joblib
so that it's clearer below what comes from joblib? In this way any un-qualified functions and classes are alchemlyb and everything else is external.
suffix="xvg", | ||
T=310, | ||
) | ||
with parallel_config(backend="threading"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find
with parallel_config(backend="threading"): | |
with joblib.parallel_config(backend="threading"): |
clearer when quickly reading the code.
@@ -125,6 +126,8 @@ def read(self, read_u_nk=True, read_dHdl=True): | |||
Whether to read the u_nk. | |||
read_dHdl : bool | |||
Whether to read the dHdl. | |||
n_jobs : int | |||
Number of parallel workers to use for reading the data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
explain what the default -1 does
@@ -115,7 +116,7 @@ def __init__( | |||
else: | |||
raise NotImplementedError(f"{software} parser not found.") | |||
|
|||
def read(self, read_u_nk=True, read_dHdl=True): | |||
def read(self, read_u_nk=True, read_dHdl=True, n_jobs=-1): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is default -1 really always the best choice? Did you try on a machine with, say, 16 cores, and 16 hyperthreaded cores (or really anything with hyperthreads)?
I am willing to make -1 the default if this is not throwing surprises for users. Otherwise the conservative 1 would be better and users can then explicitly enable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add versionchanged to docs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a paragraph about parallelization: how to enable it, what it does (for each file), any potential problems...
@@ -201,6 +219,7 @@ def run( | |||
overlap="O_MBAR.pdf", | |||
breakdown=True, | |||
forwrev=None, | |||
n_jobs=-1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see above
- Is -1 safe as new default?
- add docs (explanation)
- add versionchanged
@@ -307,7 +329,7 @@ def update_units(self, units=None): | |||
logger.info(f"Set unit to {units}.") | |||
self.units = units or None | |||
|
|||
def preprocess(self, skiptime=0, uncorr="dE", threshold=50): | |||
def preprocess(self, skiptime=0, uncorr="dE", threshold=50, n_jobs=-1): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see above
- Is -1 safe as new default?
- add docs (explanation)
- add versionchanged
Use joblib to parallelise the read and preprocess.