Parallel read and preprocess the data #371

Open
wants to merge 13 commits into master from 359-speed-up-the-readpreprocess-in-abfe-workflow

Conversation

@xiki-tempula (Collaborator) commented May 29, 2024

Use joblib to parallelise the read and preprocess steps of the ABFE workflow.
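For context, a minimal sketch of the general joblib pattern this relies on (illustrative only; the file list, parser call, and worker count are placeholders, not the PR's actual code):

from joblib import Parallel, delayed
from alchemlyb.parsing.gmx import extract_u_nk

# Placeholder file list; in the workflow each lambda window has its own xvg file.
files = ["lambda_00.xvg", "lambda_01.xvg"]

# Parse each file in its own worker; n_jobs=-1 means "use all available CPUs".
u_nk_list = Parallel(n_jobs=-1)(delayed(extract_u_nk)(f, T=310) for f in files)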

@xiki-tempula linked an issue May 29, 2024 that may be closed by this pull request

codecov bot commented May 29, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.84%. Comparing base (584588c) to head (54bb316).
Report is 28 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #371   +/-   ##
=======================================
  Coverage   98.83%   98.84%           
=======================================
  Files          28       28           
  Lines        1895     1899    +4     
  Branches      407      408    +1     
=======================================
+ Hits         1873     1877    +4     
  Misses          2        2           
  Partials       20       20           


@xiki-tempula marked this pull request as ready for review May 29, 2024 20:34
@xiki-tempula force-pushed the 359-speed-up-the-readpreprocess-in-abfe-workflow branch from 0f9ddae to d5fe5dd on June 3, 2024 09:03
@orbeckst (Member) left a comment

Overall this looks like a powerful new feature for the workflow. Do you have some simple performance benchmarks?

My primary concerns are (see inline comments):

  • Making sure that joblib is explicitly installed.
  • Is the new default n_jobs=-1 safe?
  • Add docs.

@@ -13,14 +13,17 @@ The rules for this file:
* release numbers follow "Semantic Versioning" https://semver.org

------------------------------------------------------------------------------
- ??/??/2024 orbeckst
+ ??/??/2024 orbeckst, xiki-tempula
@orbeckst (Member):

merge & resolve conflicts and set up 2.5.0

@@ -11,3 +11,4 @@ dependencies:
- pyarrow
- matplotlib
- loguru
- joblib
@orbeckst (Member):

Also update any other places where installation is needed (pyproject.toml, CI devtools/conda-envs/test_env.yaml, RTD, and later the conda-forge feedstock).

import pytest
from alchemtest.amber import load_bace_example
from alchemtest.gmx import load_ABFE
from joblib import parallel_config
@orbeckst (Member):

Maybe just import joblib so that it's clearer below what comes from joblib? That way, any unqualified functions and classes are from alchemlyb and everything else is external.

suffix="xvg",
T=310,
)
with parallel_config(backend="threading"):
@orbeckst (Member):

I find

Suggested change:
- with parallel_config(backend="threading"):
+ with joblib.parallel_config(backend="threading"):

clearer when quickly reading the code.
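For readers unfamiliar with it: joblib.parallel_config is a context manager that sets the default backend (and, optionally, the worker count) for any Parallel call made inside the block, which is presumably why the test forces the threading backend without touching the workflow code. A small self-contained illustration (the square function and worker count are made up for this example):

import joblib

def square(x):
    return x * x

# Everything inside the block uses the threading backend with at most two workers.
with joblib.parallel_config(backend="threading", n_jobs=2):
    results = joblib.Parallel()(joblib.delayed(square)(i) for i in range(4))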

@@ -125,6 +126,8 @@ def read(self, read_u_nk=True, read_dHdl=True):
Whether to read the u_nk.
read_dHdl : bool
Whether to read the dHdl.
n_jobs : int
Number of parallel workers to use for reading the data.
@orbeckst (Member):

explain what the default -1 does
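One possible wording, following joblib's convention that -1 means all available CPU cores (a sketch only, not the PR's final text):

n_jobs : int
    Number of parallel workers to use for reading the data.
    The default of -1 follows the joblib convention and uses all
    available CPU cores; set n_jobs=1 to read the files serially.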

@@ -115,7 +116,7 @@ def __init__(
else:
raise NotImplementedError(f"{software} parser not found.")

- def read(self, read_u_nk=True, read_dHdl=True):
+ def read(self, read_u_nk=True, read_dHdl=True, n_jobs=-1):
@orbeckst (Member):

Is default -1 really always the best choice? Did you try it on a machine with, say, 16 cores and 16 hyperthreaded cores (or really anything with hyperthreads)?

I am willing to make -1 the default if it does not throw surprises for users. Otherwise the conservative 1 would be better, and users can then explicitly enable parallelism.
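Whichever default is chosen, users can stay in control by passing n_jobs explicitly; a hedged usage sketch (the directory, prefix, and worker counts are placeholders modeled on the test snippet above; n_jobs on read() is the parameter added in this PR):

from alchemlyb.workflows import ABFE

# Placeholder input location; suffix and T mirror the test snippet above.
workflow = ABFE(software="GROMACS", dir="abfe_data", prefix="dhdl", suffix="xvg", T=310)

workflow.read(n_jobs=1)   # conservative: read files serially
workflow.read(n_jobs=4)   # explicit: four parallel workers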

@orbeckst (Member):

add versionchanged to docs

@orbeckst (Member):

Add a paragraph about parallelization: how to enable it, what it does (for each file), any potential problems...
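Such a paragraph could also show the single knob that enables it end-to-end; an illustrative sketch (paths and worker count are placeholders; n_jobs is the parameter this PR adds to read(), preprocess(), and run()):

from alchemlyb.workflows import ABFE

# Placeholder setup mirroring the test snippet; n_jobs controls the number of joblib workers.
workflow = ABFE(software="GROMACS", dir="abfe_data", prefix="dhdl", suffix="xvg", T=310)
workflow.run(n_jobs=4)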

@@ -201,6 +219,7 @@ def run(
overlap="O_MBAR.pdf",
breakdown=True,
forwrev=None,
n_jobs=-1,
@orbeckst (Member):

See above:

  • Is -1 safe as new default?
  • add docs (explanation)
  • add versionchanged

@@ -307,7 +329,7 @@ def update_units(self, units=None):
logger.info(f"Set unit to {units}.")
self.units = units or None

def preprocess(self, skiptime=0, uncorr="dE", threshold=50):
def preprocess(self, skiptime=0, uncorr="dE", threshold=50, n_jobs=-1):
@orbeckst (Member):

See above:

  • Is -1 safe as new default?
  • add docs (explanation)
  • add versionchanged

@orbeckst self-assigned this Oct 2, 2024
Development

Successfully merging this pull request may close these issues:

  • Speed up the read/preprocess in ABFE workflow
2 participants