deadlock when training torch model #1044

Open
egillax opened this issue May 15, 2023 · 8 comments

egillax commented May 15, 2023

Hi @dfalbel ,

I'm having an issue where model training just hangs. It doesn't always happen, but it has now happened twice in the last week. I'm using torch 0.10 on Ubuntu 20.

I'm attaching backtraces generated with gdb (thread apply all bt full) after attaching to the hanging process.

Any idea what could be going on or how to debug further?

gdb_deadlock1.txt.txt
gdb_deadlock2.txt.txt

dfalbel commented May 15, 2023

Hi @egillax,

Is the process running torch forked at some point?
Forking is generally not safe when using LibTorch; for parallel work it's better to use multi-process parallelization.
If you really need forking, you must make sure you don't use autograd in the main process (the one that will be forked) before forking, otherwise bad things can happen, including deadlocks like this one.

There's some discussion here: #971
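
For reference, a minimal sketch of the multi-process alternative, assuming a hypothetical fit_one_model() helper and the base parallel package (not code from this project): PSOCK workers are fresh R processes, whereas mclapply() forks the current one.

library(parallel)

# Hypothetical helper: builds and trains one torch model for a given
# hyperparameter setting; each worker loads torch for itself.
fit_one_model <- function(params) {
  library(torch)
  # ... build the model, train it, return metrics ...
  params
}

param_grid <- list(list(lr = 1e-3), list(lr = 1e-4))

# Risky with LibTorch: mclapply() forks the main R process, so autograd
# state created before the fork is inherited by the children.
# results <- mclapply(param_grid, fit_one_model)

# Safer: PSOCK workers are separate R processes (no fork), so each one
# initializes LibTorch independently.
cl <- makePSOCKcluster(2)
results <- parLapply(cl, param_grid, fit_one_model)
stopCluster(cl)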

egillax commented May 15, 2023

Hi @dfalbel,

No, there shouldn't be any forking; there should be only one process running the model training. I did notice that when the first deadlock happened there was another process from an older R session occupying GPU memory, but for the later deadlock there was only one process using the GPU. Not sure that's relevant.

dfalbel commented May 15, 2023

This is weird! It's interesting to see some JVM symbols in the backtrace. Do you know where they could come from?

Eg:

#7  0x00007fece5a97406 in ?? () from /usr/lib/jvm/java-11-openjdk-amd64/lib/server/libjvm.so

The deadlock situation seems very similar to what we saw in #971: autograd is running and tries to allocate more memory, which fails because the GPU is at its maximum; it then tries to free some memory by triggering the GC, which in turn tries to release memory but gets deadlocked.
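
A possible way to make that situation less likely, sketched here with a toy model rather than anything from this thread, is to keep GPU memory pressure down between optimization steps so the allocator is less likely to hit the free-memory callback in the middle of backward():

library(torch)

device <- if (cuda_is_available()) "cuda" else "cpu"
model <- nn_linear(10, 1)$to(device = device)
optimizer <- optim_sgd(model$parameters, lr = 0.01)
x <- torch_randn(64, 10, device = device)
y <- torch_randn(64, 1, device = device)

for (epoch in 1:5) {
  optimizer$zero_grad()
  loss <- nnf_mse_loss(model(x), y)
  loss$backward()
  optimizer$step()
  # Drop unreferenced R-side tensor handles and return cached CUDA blocks,
  # so the next backward() starts with more free GPU memory.
  gc(full = TRUE)
  if (cuda_is_available()) cuda_empty_cache()
}

Whether this avoids the deadlock is untested here; it only reduces how often the allocator has to ask the R GC for memory during autograd.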

dfalbel commented May 15, 2023

Looking at the traceback to understand what's happening:

  1. During autograd, a free-memory callback is called and thus the delete_tasks event loop starts running:
#5  0x00007f36286bd604 in EventLoop<void>::run() () from /home/efridgeirsson/StrategusModules/DeepPatientLevelPredictionModule_0.0.8/renv/library/R-4.1/x86_64-pc-linux-gnu/torch/lib/liblantern.so
No symbol table info available.
#6  0x00007f36286bc565 in wait_for_gc() () from /home/efridgeirsson/StrategusModules/DeepPatientLevelPredictionModule_0.0.8/renv/library/R-4.1/x86_64-pc-linux-gnu/torch/lib/liblantern.so
  2. A call to the R garbage collector is requested, and we see it gets called:
#11 0x00007fed64976562 in _lantern_Tensor_delete () from /home/efridgeirsson/StrategusModules/DeepPatientLevelPredictionModule_0.0.8/renv/library/R-4.1/x86_64-pc-linux-gnu/torch/lib/liblantern.so
No symbol table info available.
#12 0x00007fed65763ad9 in lantern_Tensor_delete (x=0x561f72906d00) at ../inst/include/lantern/lantern.h:316
No locals.
#13 delete_tensor (x=0x561f72906d00) at torch_api.cpp:143

However, this call happens on the main thread, which can't acquire the lock needed to delete tensors, so the deletion should be rescheduled to the autograd thread (the one running the delete_tasks event loop). That rescheduling should happen because of

https://github.com/mlverse/torch/blob/8a3b5b3f5da44c3254cef0eb48c948a7298a5a2d/src/lantern/src/Delete.cpp#L15C1-L24

For some reason, though, delete_tasks.is_running seems to return false, so the deletion happens in the main thread and deadlocks:

        __PRETTY_FUNCTION__ = "__pthread_mutex_lock"
        id = <optimized out>
#2  0x00007fed63581f54 in c10::cuda::CUDACachingAllocator::raw_delete(void*) () from /home/efridgeirsson/StrategusModules/DeepPatientLevelPredictionModule_0.0.8/renv/library/R-4.1/x86_64-pc-linux-gnu/torch/lib/libc10_cuda.so
No symbol table info available.
#3  0x00007fed63b37998 in c10::intrusive_ptr<c10::StorageImpl, c10::detail::intrusive_target_default_null_type<c10::StorageImpl> >::reset_() () from /home/efridgeirsson/StrategusModules/DeepPatientLevelPredictionModule_0.0.8/renv/library/R-4.1/x86_64-pc-linux-gnu/torch/lib/libc10.so
No symbol table info available.
#4  0x00007fed63b310ef in c10::TensorImpl::~TensorImpl() () from /home/efridgeirsson/StrategusModules/DeepPatientLevelPredictionModule_0.0.8/renv/library/R-4.1/x86_64-pc-linux-gnu/torch/lib/libc10.so
No symbol table info available.
#5  0x00007fed63b311b9 in c10::TensorImpl::~TensorImpl() () from /home/efridgeirsson/StrategusModules/DeepPatientLevelPredictionModule_0.0.8/renv/library/R-4.1/x86_64-pc-linux-gnu/torch/lib/libc10.so
No symbol table info available.
#6  0x00007fed648cdf3c in c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>::reset_() () from /home/efridgeirsson/StrategusModules/DeepPatientLevelPredictionModule_0.0.8/renv/library/R-4.1/x86_64-pc-linux-gnu/torch/lib/liblantern.so
No symbol table info available.
#7  0x00007fed648c9662 in c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>::~intrusive_ptr() () from /home/efridgeirsson/StrategusModules/DeepPatientLevelPredictionModule_0.0.8/renv/library/R-4.1/x86_64-pc-linux-gnu/torch/lib/liblantern.so
No symbol table info available.
#8  0x00007fed6486ecea in at::TensorBase::~TensorBase() () from /home/efridgeirsson/StrategusModules/DeepPatientLevelPredictionModule_0.0.8/renv/library/R-4.1/x86_64-pc-linux-gnu/torch/lib/liblantern.so
No symbol table info available.
#9  0x00007fed6486f162 in at::Tensor::~Tensor() () from /home/efridgeirsson/StrategusModules/DeepPatientLevelPredictionModule_0.0.8/renv/library/R-4.1/x86_64-pc-linux-gnu/torch/lib/liblantern.so


egillax commented May 15, 2023

I'm not sure where the JVM stuff is coming from. There are packages earlier in my pipeline that use Java to connect to a database and fetch the data.

I'm now trying to reproduce the issue in a simpler setting. Originally this happened on a server running my full pipeline; now I'm running only the affected code on the server with the same data, and separately on my laptop with fake data.

egillax commented May 22, 2023

I just ran into this again while running the affected code segment manually, so now I'm sure there is no forking happening anywhere. I've attached the gdb backtrace in case it helps. Is there anything else I can do to get more information about this?

gdb.txt

dfalbel commented May 22, 2023

Thanks for the backtrace, @egillax! Is the code that runs into this problem public? I'll try to reproduce it with a minimal example, but it would be nice to look at the code for other clues. So far it seems to be caused by a GC call during a backward() call on the GPU, but if that were the only cause it should happen much more often, since backward allocates a lot of memory and is very likely to trigger the GC.

egillax commented May 23, 2023

Yes, the code is public. The main training loop is in this class, which is instantiated and fit during hyperparameter tuning here. The issue has happened with both a ResNet and a Transformer.

I'm also trying to make a more minimal example I can share; I'll post it here if I manage.
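
For reference, a minimal sketch of the kind of loop involved, using a toy nn_sequential model rather than the actual ResNet/Transformer code, and assuming a CUDA device since the hang has only been seen on GPU:

library(torch)

# Toy stand-in for the real models: the relevant ingredients are GPU
# tensors, autograd, and many repeated backward() calls under memory
# pressure.
device <- if (cuda_is_available()) "cuda" else "cpu"

model <- nn_sequential(
  nn_linear(256, 512),
  nn_relu(),
  nn_linear(512, 1)
)$to(device = device)

optimizer <- optim_adam(model$parameters, lr = 1e-3)
x <- torch_randn(2048, 256, device = device)
y <- torch_randn(2048, 1, device = device)

for (step in 1:1000) {
  optimizer$zero_grad()
  loss <- nnf_mse_loss(model(x), y)
  loss$backward()   # the hang reported above occurs inside backward()
  optimizer$step()
}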
