I have previously filed issue hpc#45, but got no answer there. Maybe you have an idea what's going on?
We are running RH 7.7, kernel 3.10.0-1062.9.1.el7.x86_64, on AMD EPYC 7702 64-Core processors. Quite often, when an OpenMPI job that uses UCX for all communication is killed before it finalizes cleanly, its processes get stuck in D-state and cannot be killed, so the node has to be rebooted. Looking at /proc/<PID>/stack shows this type of stack:
I haven't seen this happen after a job exits cleanly and UCX is finalized. So my impression is that when I kill a job and UCX cannot release the xpmem pointers, something goes wrong with the process cleanup in the kernel.
Do you have any ideas as to what might be the reason?
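In case it helps with reproducing this, here is a minimal sketch of how the stuck processes can be listed on a node. It is not part of UCX or xpmem; the function names `parse_state` and `find_d_state` are hypothetical. It assumes a Linux `/proc` filesystem, and reading `/proc/<PID>/stack` generally requires root.

```python
import os


def parse_state(stat_line):
    """Return (comm, state) from the contents of /proc/<PID>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the last ')' rather than on whitespace.
    lparen = stat_line.index("(")
    rparen = stat_line.rindex(")")
    comm = stat_line[lparen + 1:rparen]
    state = stat_line[rparen + 1:].split()[0]
    return comm, state


def find_d_state(proc_root="/proc"):
    """Return a list of (pid, comm) for processes in uninterruptible sleep."""
    if not os.path.isdir(proc_root):
        return []
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                comm, state = parse_state(f.read())
        except (OSError, ValueError, IndexError):
            # The process may have exited while we were scanning.
            continue
        if state == "D":
            stuck.append((int(entry), comm))
    return stuck


if __name__ == "__main__":
    for pid, comm in find_d_state():
        print(pid, comm)
        try:
            with open(f"/proc/{pid}/stack") as f:
                print(f.read())
        except OSError as exc:
            print(f"  (cannot read kernel stack: {exc})")
```

Run as root on an affected node after killing the job, this prints each D-state PID together with its kernel stack.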