deadlock when training torch model #1044
Hi @egillax, is the process running torch forked at some point? There's some discussion here: #971
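The kind of setup I mean is sketched below (hypothetical code, just to illustrate where fork()-based parallelism usually sneaks in; `fit_model()` is a placeholder, not your code):

```r
# Hypothetical illustration only: fork()-based parallelism in R, which does not
# play well with an already-initialized CUDA context (the situation behind #971).
library(parallel)

params <- list(list(lr = 1e-3), list(lr = 1e-4))  # placeholder hyperparameters

# Forked workers (mclapply / future multicore plans) inherit the parent's CUDA state:
# results <- mclapply(params, function(p) fit_model(p), mc.cores = 2)

# PSOCK workers are separate processes (no fork), so they avoid that failure mode,
# at the cost of each worker having to set up torch on its own:
# cl <- makePSOCKcluster(2)
# results <- parLapply(cl, params, function(p) fit_model(p))
# stopCluster(cl)
```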
Hi @dfalbel, no, there shouldn't be any forking. There should be only one process running the model training. I did notice, though, that when the first deadlock happened there was another process from an older rsession occupying GPU memory. For the later deadlock there was only one process using the GPU; not sure if that's relevant.
This is weird! It's interesting to see some JVM symbols in the backtrace. Do you know where they could come from? E.g.:
It seems that the deadlock situation is very similar to what we saw in #971: autograd is running and tries to allocate more memory, which is not possible because the GPU is at its max. It then tries to free some memory by calling the GC, which in turn tries to release some memory but gets deadlocked.
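If that's what's happening, keeping GPU memory pressure down should at least make the window smaller. Something along these lines might help as a stopgap (a sketch only, not a fix for the underlying problem; `train_one_batch()` is a placeholder for the real training step):

```r
# Sketch of a stopgap, not a fix: periodically drop unreferenced tensors and
# return cached blocks to the CUDA allocator, so backward() is less likely to
# find the GPU at its limit. train_one_batch() is a placeholder.
library(torch)

n_batches <- 1000  # placeholder
for (i in seq_len(n_batches)) {
  # loss <- train_one_batch(model, i)
  if (i %% 50 == 0) {
    gc()                 # lets R collect tensors that are no longer referenced
    cuda_empty_cache()   # releases cached, unused blocks back to the driver
  }
}
```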
Looking at the traceback to understand what's happening:
However, this call is in the main thread, and thus can't acquire the lock to delete tensors, and should be rescheduled to the autograd thread (the one that is running the `delete_tasks` event loop). This reschedule should happen because of … For some reason, though, it seems that …
I'm not sure where the JVM stuff is coming from. There are packages earlier in my pipeline that use Java to connect to a database and fetch the data. I'm trying now to reproduce the issue in a simpler setting. Originally this happened on a server running my full pipeline. Now I'm trying to run only the affected code on the server with the same data, and separately on my laptop with fake data.
I just ran into this again when running the affected code segment manually, and now I'm sure there is no forking happening anywhere. I've attached the gdb backtrace in case it helps. Is there anything else I can do to get more info about this?
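I could also log GPU memory around each epoch to check whether it really is sitting at the limit right before the hang; roughly like the helper below (assuming this torch version exposes `cuda_memory_stats()`; otherwise nvidia-smi gives the same picture):

```r
# Rough helper to record allocator state per epoch; the exact fields returned
# by cuda_memory_stats() depend on the torch version, so just dump the list.
library(torch)

log_gpu_memory <- function(tag) {
  if (!cuda_is_available()) return(invisible(NULL))
  cat("---", tag, "---\n")
  str(cuda_memory_stats(), max.level = 2)
}

# e.g. at the end of every epoch:
# log_gpu_memory(sprintf("epoch %d", epoch))
```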
@egillax! Thanks for the backtrace. Is the code running into this problem public? I'll try to reproduce it with a minimal example, but it would be nice to take a look at the code to see if I could find other clues. So far, it seems that it's caused by a …
Yes, the code is public. The main training loop is in this class, which is instantiated and fit during hyperparameter tuning here. The issue has happened with both a ResNet and a Transformer. I'm also trying to make a more minimal example I can share; I'll post it here if I manage, roughly along the lines of the sketch below.
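A rough skeleton of that minimal example (everything here is a placeholder: the dimensions, model, and data are made up, not the actual PatientLevelPrediction code):

```r
# Skeleton of a minimal example: a plain torch training loop sized so the GPU
# runs close to its memory limit, since that's when the hang has appeared.
library(torch)

device <- if (cuda_is_available()) "cuda" else "cpu"

model <- nn_sequential(
  nn_linear(4096, 8192),
  nn_relu(),
  nn_linear(8192, 1)
)
model$to(device = device)

opt <- optim_adam(model$parameters, lr = 1e-3)

for (step in 1:1000) {
  x <- torch_randn(512, 4096, device = device)   # fake data
  y <- torch_randn(512, 1, device = device)
  opt$zero_grad()
  loss <- nnf_mse_loss(model(x), y)
  loss$backward()   # the hang shows up during backward, per the backtraces
  opt$step()
}
```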
Hi @dfalbel,
I'm having an issue where model training just hangs. This doesn't always happen, but it has now happened twice in the last week. I'm using torch 0.10 on Ubuntu 20.
I'm attaching backtraces generated with gdb using `thread apply all bt full` after attaching to the hanging process. Any idea what could be going on or how to debug further?
gdb_deadlock1.txt.txt
gdb_deadlock2.txt.txt