Unkillable processes caused by xpmem #12

angainor · 2021-12-10T16:52:40Z

I have previously filed issue hpc#45, but got no answer there. Maybe you have an idea what's going on?

We are running RH 7.7, kernel 3.10.0-1062.9.1.el7.x86_64, on AMD EPYC 7702 64-Core Processors. Quite often, when an OpenMPI job that uses UCX for all communication is killed before it finalizes cleanly, its processes get stuck in D state and cannot be killed, so the node has to be rebooted. Looking at /proc/<PID>/stack shows stacks like this:

b1118-mn: [<ffffffffb9192d48>] call_rwsem_down_read_failed+0x18/0x30
b1118-mn: [<ffffffffc088f73d>] xpmem_clear_PTEs_range+0x24d/0x300 [xpmem]
b1118-mn: [<ffffffffc088f80b>] xpmem_clear_PTEs+0x1b/0x20 [xpmem]
b1118-mn: [<ffffffffc088d8d0>] xpmem_remove_seg+0x50/0xf0 [xpmem]
b1118-mn: [<ffffffffc088dbfa>] xpmem_remove_segs_of_tg+0x4a/0x80 [xpmem]
b1118-mn: [<ffffffffc088d82f>] xpmem_teardown+0x3f/0x90 [xpmem]
b1118-mn: [<ffffffffc0891011>] xpmem_mmu_release+0x111/0x180 [xpmem]
b1118-mn: [<ffffffffb901c2f7>] __mmu_notifier_release+0x57/0x140
b1118-mn: [<ffffffffb8ffb065>] exit_mmap+0x175/0x1a0
b1118-mn: [<ffffffffb8e982d7>] mmput+0x67/0xf0
b1118-mn: [<ffffffffb8ea1ff8>] do_exit+0x288/0xa50
b1118-mn: [<ffffffffb8ea283f>] do_group_exit+0x3f/0xa0
b1118-mn: [<ffffffffb8eb364e>] get_signal_to_deliver+0x1ce/0x5e0
b1118-mn: [<ffffffffb8e2c527>] do_signal+0x57/0x6f0
b1118-mn: [<ffffffffb8e2cc32>] do_notify_resume+0x72/0xc0
b1118-mn: [<ffffffffb958457c>] retint_signal+0x48/0x8c
b1118-mn: [<ffffffffffffffff>] 0xffffffffffffffff

I haven't seen this happen after a job exits cleanly and UCX is finalized. My impression is that when I kill a job and UCX cannot release the xpmem pointers, something goes wrong with the process cleanup in the kernel.

Do you have any ideas as to what might be the reason?
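
In case it helps anyone check their own nodes, stacks like the one above can be gathered with a small script. This is only a minimal sketch, not the exact procedure used for this report: it assumes a Linux procfs, that the caller is allowed to read /proc/<PID>/stack (normally root only), and the helper names `parse_state` and `find_d_state` are hypothetical.

```python
import os


def parse_state(stat_text):
    """Return the one-letter state from the contents of /proc/<PID>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the LAST ')' instead of on whitespace.
    return stat_text.rsplit(")", 1)[1].split()[0]


def find_d_state(proc_root="/proc"):
    """Return sorted (pid, stack_text) pairs for processes in D state."""
    if not os.path.isdir(proc_root):
        return []
    results = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "stat")) as f:
                state = parse_state(f.read())
        except (OSError, IndexError):
            continue  # process vanished or stat was unreadable
        if state != "D":
            continue
        try:
            with open(os.path.join(base, "stack")) as f:
                stack = f.read()
        except OSError as exc:
            # Typically a permission error when not running as root.
            stack = f"<stack unavailable: {exc}>"
        results.append((int(entry), stack))
    return sorted(results)


if __name__ == "__main__":
    for pid, stack in find_d_state():
        print(f"PID {pid}")
        print(stack)
```

Run as root on an affected node, this prints the kernel stack of every process currently in uninterruptible sleep, which makes it easy to see whether they are all blocked under xpmem_teardown.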
