Unkillable processes caused by xpmem #12

Open
angainor opened this issue Dec 10, 2021 · 0 comments

I have previously filed issue hpc#45, but got no answer there. Maybe you have an idea what's going on?

We are running RH 7.7, kernel 3.10.0-1062.9.1.el7.x86_64, on AMD EPYC 7702 64-Core processors. Quite often, when an OpenMPI job that uses UCX for all communication is killed before it finalizes cleanly, its processes get stuck in D state and cannot be killed; the node has to be rebooted. Looking at /proc/<PID>/stack for these processes shows stacks like this:

b1118-mn: [<ffffffffb9192d48>] call_rwsem_down_read_failed+0x18/0x30
b1118-mn: [<ffffffffc088f73d>] xpmem_clear_PTEs_range+0x24d/0x300 [xpmem]
b1118-mn: [<ffffffffc088f80b>] xpmem_clear_PTEs+0x1b/0x20 [xpmem]
b1118-mn: [<ffffffffc088d8d0>] xpmem_remove_seg+0x50/0xf0 [xpmem]
b1118-mn: [<ffffffffc088dbfa>] xpmem_remove_segs_of_tg+0x4a/0x80 [xpmem]
b1118-mn: [<ffffffffc088d82f>] xpmem_teardown+0x3f/0x90 [xpmem]
b1118-mn: [<ffffffffc0891011>] xpmem_mmu_release+0x111/0x180 [xpmem]
b1118-mn: [<ffffffffb901c2f7>] __mmu_notifier_release+0x57/0x140
b1118-mn: [<ffffffffb8ffb065>] exit_mmap+0x175/0x1a0
b1118-mn: [<ffffffffb8e982d7>] mmput+0x67/0xf0
b1118-mn: [<ffffffffb8ea1ff8>] do_exit+0x288/0xa50
b1118-mn: [<ffffffffb8ea283f>] do_group_exit+0x3f/0xa0
b1118-mn: [<ffffffffb8eb364e>] get_signal_to_deliver+0x1ce/0x5e0
b1118-mn: [<ffffffffb8e2c527>] do_signal+0x57/0x6f0
b1118-mn: [<ffffffffb8e2cc32>] do_notify_resume+0x72/0xc0
b1118-mn: [<ffffffffb958457c>] retint_signal+0x48/0x8c
b1118-mn: [<ffffffffffffffff>] 0xffffffffffffffff
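
For anyone who wants to check their own nodes for the same symptom, here is a minimal sketch of how the stuck processes can be found: it walks /proc, picks out processes in D state, and reports those whose kernel stack contains frames from the xpmem module. The function names (`parse_state`, `xpmem_frames`, `scan`) are made up for this example and are not part of xpmem or UCX, and reading /proc/<PID>/stack usually requires root.

```python
import os


def parse_state(stat_text):
    """Return (comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' instead of on whitespace.
    start = stat_text.index("(")
    end = stat_text.rindex(")")
    comm = stat_text[start + 1:end]
    state = stat_text[end + 1:].split()[0]
    return comm, state


def xpmem_frames(stack_text):
    """Return the stack lines that belong to the xpmem kernel module."""
    return [line.strip() for line in stack_text.splitlines() if "[xpmem]" in line]


def scan(proc_root="/proc"):
    """Yield (pid, comm, stack) for every process currently in D state."""
    if not os.path.isdir(proc_root):
        return
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                comm, state = parse_state(f.read())
            if state != "D":
                continue
            # Reading the kernel stack normally requires root.
            with open(os.path.join(proc_root, entry, "stack")) as f:
                stack = f.read()
        except (OSError, ValueError):
            continue  # process vanished, permission denied, or unreadable stat
        yield int(entry), comm, stack


if __name__ == "__main__":
    for pid, comm, stack in scan():
        hits = xpmem_frames(stack)
        if hits:
            print(f"{pid} ({comm}): {len(hits)} xpmem frames")
            print(stack)
```

Run as root on an affected node, this prints the PID, command name, and full stack of each D-state process blocked inside xpmem. It only identifies such processes; it does not unblock them.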

I haven't seen this happen after a job exits cleanly and UCX is finalized. So my impression is that when I kill a job and UCX cannot release the xpmem pointers, something goes wrong with the process cleanup in the kernel.

Do you have any ideas as to what might be the reason?
