Speed up the read/preprocess in ABFE workflow #359

Open
xiki-tempula opened this issue May 17, 2024 · 4 comments · May be fixed by #371

Comments

@xiki-tempula
Collaborator

In the current ABFE workflow setup, file reading and preprocessing run on a single thread, which wastes time when there are many files to read.
I think we could speed this up by wrapping the read and preprocess steps in multiprocessing, so that the files are handled in parallel.
I'm thinking of adding joblib as a new dependency for that. I'd appreciate advice on whether the community is happy with that.
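
To make the idea concrete, here is a minimal sketch of how the per-window read and preprocess steps could be fanned out with joblib. It is an illustration only, not the workflow's actual code: `read_window`, `preprocess_window`, and `read_and_preprocess` are hypothetical stand-ins, and the fake data replaces real file parsing. Only `joblib.Parallel` and `joblib.delayed` are real library calls.

```python
from joblib import Parallel, delayed


def read_window(path):
    """Hypothetical stand-in for parsing one lambda-window file."""
    # A real workflow would parse the file at `path`; here we fake a short series.
    return [float(len(path) + i) for i in range(5)]


def preprocess_window(series):
    """Hypothetical stand-in for per-window preprocessing (e.g. subsampling)."""
    return series[::2]  # keep every second sample


def read_and_preprocess(paths, n_jobs=1):
    """Read, then preprocess, every window using up to n_jobs workers per stage."""
    # Each window is independent of the others, so each stage is a plain map.
    # Parallel returns the results as a list in the same order as the inputs.
    raw = Parallel(n_jobs=n_jobs)(delayed(read_window)(p) for p in paths)
    return Parallel(n_jobs=n_jobs)(delayed(preprocess_window)(s) for s in raw)
```

Calling `read_and_preprocess(paths, n_jobs=-1)` would ask joblib to use all available cores, while `n_jobs=1` keeps the current serial behaviour.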

@orbeckst
Member

joblib is not a big dependency; however, we can also think about making it an optional dep.

Perhaps a good starting point for discussion is to see how much speed-up it can bring and if it's something people will likely always want to use. Do you have some benchmark comparisons for typical data sets and how it scales?

@xiki-tempula
Collaborator Author

The speed-up is substantial. Take 64 lambda windows, each with 6251 time points.
On a 64-core instance, the read goes from
7.98 s ± 114 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
to
625 ms ± 13.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The preprocess goes from
1min 10s ± 598 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
to
3.12 s ± 86.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
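
For reference, the speed-up factors implied by the mean timings above can be worked out as below. This is simple arithmetic on the quoted numbers; the parallel-efficiency figure additionally assumes that all 64 cores were used, which is not stated explicitly.

```python
def speedup(serial_s, parallel_s):
    """Ratio of serial to parallel wall-clock time."""
    return serial_s / parallel_s


N_CORES = 64  # assumption: every core of the 64-core instance was used

# Mean timings quoted above, in seconds (1min 10s = 70 s).
read_speedup = speedup(7.98, 0.625)
preprocess_speedup = speedup(70.0, 3.12)

# Parallel efficiency: achieved speed-up divided by the number of cores.
read_efficiency = read_speedup / N_CORES
preprocess_efficiency = preprocess_speedup / N_CORES

print(f"read: {read_speedup:.1f}x, efficiency {read_efficiency:.0%}")
print(f"preprocess: {preprocess_speedup:.1f}x, efficiency {preprocess_efficiency:.0%}")
```

That is roughly a 12.8x speed-up for the read and 22.4x for the preprocess.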

@xiki-tempula xiki-tempula linked a pull request May 29, 2024 that will close this issue
@xiki-tempula
Collaborator Author

Apparently joblib is a dependency of scikit-learn, which we already depend on. So we would always have joblib in our environment.

@orbeckst
Member

Just make it explicit anyway.
