Interrupting stuck network processes #1356

Mellvik · 2022-07-01T20:13:45Z

My intention was to put this on the 0.7.0 list, but it may as well be considered a regular bug.

My memory tells me we've touched on this one before, but searching the repository didn't give me any hints.

Anyway, stuck network processes hardly ever react to signals: kill -9, whatever. Say we do a 'net stop' while an incoming telnet session is active: the telnetd process is stuck forever, and a reboot is required.

There are many such examples, in particular when running elks-to-elks networking, where timeouts eventually end up as hung processes, but this is the easiest to reproduce.

Attacking this problem would be really helpful. And I'm interested in doing so. @ghaerr, any ideas about where to start?

-M

ghaerr · 2022-07-01T20:35:49Z

This is going to be a really tough one - the biggest reason is that even if we "fix" the interruptible_sleep() that the processes are likely waiting on so that they "return out of the kernel" (and are killed), the kernel network routines will be left with their semaphores in possibly incorrect/bad states (e.g. a semaphore may be left in the always off or on position, which will fail when networking is "restarted".)

In order to see this, I advise you to start by looking at elks/net/ipv4/af_inet.c: each of the "interruptible_sleep" routines will need to be unwound properly, on a case-by-case basis. Remember how sleep works: a process is effectively unscheduled until a wake_up call is made, except that if the process receives a signal, it will wake and its kernel task will continue. This means it will return from interruptible_sleep, and code can check current->signal to see whether the wakeup was the result of a wake_up call or a signal. THEN, each individually coded sleep can possibly be unwound properly.
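
As a rough illustration of that wake-up check, here is a small user-space model, not ELKS kernel code. The names `toy_task`, `toy_wait_queue`, `toy_wake_up`, `toy_interruptible_sleep` and `toy_wait_for_reply` are all hypothetical stand-ins, and returning `-EINTR` on a signal is an assumption made for the sketch.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for a kernel task: only the pending-signal mask matters here. */
struct toy_task {
    unsigned long signal;   /* nonzero means a signal is pending */
};

/* Hypothetical stand-in for a wait queue with a single "event happened" flag. */
struct toy_wait_queue {
    int event_ready;        /* set by toy_wake_up() */
};

/* Models a wake_up call: the awaited event has happened. */
static void toy_wake_up(struct toy_wait_queue *q)
{
    q->event_ready = 1;
}

/* Models an interruptible sleep. A real kernel would unschedule the task
 * here; this model only checks that there is a reason to return. */
static void toy_interruptible_sleep(struct toy_task *t, struct toy_wait_queue *q)
{
    assert(q->event_ready || t->signal);
}

/* Wait for the event. Returns 0 after a normal wake-up, or -EINTR
 * (an assumed convention) when the sleep was ended by a signal. */
static int toy_wait_for_reply(struct toy_task *t, struct toy_wait_queue *q)
{
    while (!q->event_ready) {
        toy_interruptible_sleep(t, q);
        if (t->signal)
            return -EINTR;  /* caller must now unwind its own state */
    }
    q->event_ready = 0;     /* consume the event */
    return 0;
}
```

The point of the model is only the shape of the check after the sleep returns: the caller learns which of the two causes woke it, and everything it set up before sleeping still has to be undone by hand in the signal case.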

Let's take inet_bind() in the above file to start: the rwlock semaphore is DOWN, so that would have to be reversed, and ktcp has been given a request, for which there has (presumably) not been a response. We also need to cancel tcpdev_clear_data_avail(). After all this, we could then allow the process to be killed, which occurs after the current system call is completed. But if ktcp is still running, it will be put into a bad state, so it needs to be killed also, and it may be sleeping in a similar kernel routine, with the same or a different set of semaphores gated/unlocked/etc.

This whole dilemma is the reason many UNIX systems, and sometimes Linux, still hang and require a reboot, despite repeated kill requests. In some sense, what is needed is a kind of "kernel reset" that resets all network variables to their starting state, and kills all associated processes - a big kluge, and not always possible since the kernel doesn't "know" which processes might be "network" processes.
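
To make the unwinding concrete, here is a toy model of a bind-like call that takes a lock, hands over a request, and then has to undo both when a signal ends the wait. It is a sketch under stated assumptions and does not reproduce the real inet_bind(): `toy_sem`, `toy_net`, `toy_down`, `toy_up` and `toy_bind` are hypothetical, and the `-EINTR` return value is assumed.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical binary semaphore: count 1 = free, 0 = held. */
struct toy_sem {
    int count;
};

/* Hypothetical networking state shared with a ktcp-like daemon. */
struct toy_net {
    struct toy_sem rwlock;
    int request_pending;    /* request handed over, not yet answered */
    int reply_ready;        /* set when the daemon has answered */
};

static void toy_down(struct toy_sem *s)
{
    assert(s->count > 0);   /* the model never blocks on the lock */
    s->count--;
}

static void toy_up(struct toy_sem *s)
{
    s->count++;
}

/* Bind-like call. pending_signal models current->signal at wake-up. */
static int toy_bind(struct toy_net *n, unsigned long pending_signal)
{
    toy_down(&n->rwlock);
    n->request_pending = 1;

    /* ... the interruptible sleep would sit here ... */

    if (!n->reply_ready && pending_signal) {
        /* Woken by a signal: undo everything done before the sleep. */
        n->request_pending = 0;     /* withdraw the request */
        toy_up(&n->rwlock);         /* reverse the DOWN */
        return -EINTR;
    }

    assert(n->reply_ready);         /* normal wake-up: reply has arrived */
    n->reply_ready = 0;
    n->request_pending = 0;
    toy_up(&n->rwlock);
    return 0;
}
```

What the model leaves out is exactly the hard part described above: the daemon on the other side has already seen the request, so withdrawing it locally does not by itself put ktcp back into a consistent state.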

Take a look and we'll go from here after you've looked further at it. Frankly, even if we "fix" it, we really can't guarantee kernel correctness afterwards, which might cause very strange problems for ourselves and users after a kill.

Mellvik · 2022-07-02T09:57:47Z

Thank you @ghaerr - that was a cold shower indeed. Very thorough - much appreciated. Maybe we could lower the level of ambition, ignore the havoc a -9 signal may cause, and, say, enable ktcp to terminate all open connections when terminating, regardless of cause. My case is mostly 'net stop', forgetting that there are open connections, idle or active. I'll take a look at the code you suggested in an attempt to understand the big picture.

-M

ghaerr · 2022-07-02T16:11:19Z

> enable ktcp to terminate all open connections when terminating, regardless of cause.

Well, all sockets opened are automatically closed by ktcp when it exits. I don't think this currently changes any processes sleeping on their own socket connections to the kernel though. However, knowing that there is no /dev/tcpdev open (meaning ktcp has exited) may allow for a killed process to exit its kernel task state more cleanly, without worry for ktcp corruption as mentioned above.
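
A minimal sketch of that decision, as a user-space model rather than kernel code: after a sleep returns, the choice of how to unwind could depend on whether the tcpdev device is still open. The `toy_tcpdev` struct, its `in_use` flag, the `toy_on_wakeup` function and the action names are all hypothetical and invented for this illustration.

```c
#include <assert.h>

/* Hypothetical view of the tcpdev device: nonzero while a ktcp-like
 * daemon still holds it open. */
struct toy_tcpdev {
    int in_use;
};

/* What a socket call could do once its interruptible sleep returns. */
enum toy_wake_action {
    TOY_CONTINUE,           /* reply arrived, finish the call normally */
    TOY_KEEP_SLEEPING,      /* no reply and no signal: nothing changed */
    TOY_DROP_LOCAL_STATE,   /* killed, daemon gone: only local cleanup */
    TOY_UNWIND_WITH_DAEMON  /* killed, daemon alive: must also fix its state */
};

static enum toy_wake_action toy_on_wakeup(const struct toy_tcpdev *dev,
                                          unsigned long pending_signal,
                                          int reply_ready)
{
    if (reply_ready)
        return TOY_CONTINUE;
    if (!pending_signal)
        return TOY_KEEP_SLEEPING;
    if (!dev->in_use)
        return TOY_DROP_LOCAL_STATE;    /* nothing on the other side to corrupt */
    return TOY_UNWIND_WITH_DAEMON;
}
```

In this model a sleeper is not woken just because the daemon exits, which matches the behaviour described above; the closed device only makes the exit path simpler once a kill arrives.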

That would mean one would execute net stop, then possibly kill the remaining network processes... is that what you're doing now, when seeing the hanging processes?

Mellvik · 2022-10-11T06:54:59Z

Yes, that's it. And yes, it does seem like a fairly common situation: the device going away. It's like the carrier going away on a serial line or modem, except that in this case there may be many processes using it. Anyway, getting this one to work would be a big step.

-M

Mellvik added the bug Defect in the product label Jul 1, 2022
