
question about expected speedup when using parallelization via joblib #169

Open
nguyentr17 opened this issue Dec 4, 2024 · 3 comments

@nguyentr17

nguyentr17 commented Dec 4, 2024

Hi,

I came across your repository while searching for ways to train multiple NN models simultaneously on a single GPU. My model is pretty small (just a 1-layer MLP), and the VRAM used by each model is only 260 MB. However, when I use joblib to train multiple models at the same time, they do start simultaneously (according to the log), but the total training time is still the same as when training the models sequentially. Do you happen to have any tips, quick insights, or things to look at for this? I know this is not directly an issue with your package, but I would really appreciate any help.

My code is like this:

    from joblib import Parallel, delayed, parallel_backend

    # process_latent_pair trains one NN model
    with parallel_backend('loky', n_jobs=-1):
        parallel = Parallel(n_jobs=-1)
        parallel(
            delayed(process_latent_pair)(mi_estimator, iid, tid, cfg, exp_name, args, DEVICE)
            for iid in range(13)
            for tid in range(13)
        )

My environment:

python 3.9.19
torch==2.4.1
joblib==1.4.2
@xuyxu
Member

xuyxu commented Dec 5, 2024

Which model are you training? The parallel part is already implemented in torchensemble; you only need to pass the n_jobs parameter.

@nguyentr17
Author

Hi @xuyxu, I'm not using torchensemble; I'm using joblib directly. I haven't had any luck debugging this, and since you probably have a lot of experience with joblib, I'd like to ask for advice on what might be causing the lack of speedup.

@xuyxu
Member

xuyxu commented Dec 6, 2024

You can check how Joblib is used in torchensemble here; it differs slightly from your usage, so it may help.
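
For anyone debugging a similar lack of speedup, a minimal timing sketch may help separate "the jobs never overlap" from "the jobs overlap but compete for the same resource". It is not taken from torchensemble or from the code above: `fake_train` is a hypothetical stand-in that only sleeps, and the thread-based backend (`prefer="threads"`) is an assumption made to keep this toy example free of process start-up cost.

```python
import time

from joblib import Parallel, delayed


def fake_train(job_id, seconds=0.2):
    """Hypothetical stand-in for one training run; it only sleeps."""
    time.sleep(seconds)
    return job_id


def time_sequential(job_ids):
    """Run the jobs one after another and return (results, wall time)."""
    start = time.perf_counter()
    results = [fake_train(j) for j in job_ids]
    return results, time.perf_counter() - start


def time_parallel(job_ids, n_jobs=4):
    """Run the jobs through joblib and return (results, wall time)."""
    start = time.perf_counter()
    # Assumption: threads are enough here because time.sleep releases the GIL.
    results = Parallel(n_jobs=n_jobs, prefer="threads")(
        delayed(fake_train)(j) for j in job_ids
    )
    return results, time.perf_counter() - start


job_ids = list(range(4))
seq_results, seq_time = time_sequential(job_ids)
par_results, par_time = time_parallel(job_ids)
print(f"sequential: {seq_time:.2f}s, parallel: {par_time:.2f}s")
```

If this harness shows a clear gap with the stand-in task but not after swapping in the real training function, the jobs are probably being launched concurrently and the bottleneck likely lies elsewhere, for example in contention for the single GPU or in per-worker start-up cost.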
