[Question] Multiple ni_process_reap: issue in wickedd logs #741
Comments
Still happening 3.5 years later... lots of errors... very verbose... openSUSE Leap 15.3
What information do you need from me to debug this?
It's about the handle_hangup callback called on poll(2) hangup event:
In src/process.c, there is a:
I've pushed some test code to the https://github.com/mtomaschewski/wicked/tree/pipe-test branch, which may help narrow down what is causing it. It shows the commands that were executed, and there is a skeleton in testing that uses a very simple ni_process exec.
I didn't find it in the logs on my machines. As you're getting these messages, could you please provide the logs with the command?
Here's what I see (so far) after applying the patch and rebooting:
Looking at the code and noting that the wait times above are all less than one millisecond, it appears that the bug in the code is simply a mistaken assumption: just because a socket that is connected to process X has been closed, that doesn't mean that process X has finished exiting. So it's possible that the first invocation of
Yes. A process can close its fds and still do something (e.g. cleanup) before it exits. Further, it can also be that the process has already exited, but the kernel hasn't cleaned up the process resources yet and is not ready to report the exit status. We use the ni_process functions for many things, to send or to receive some data (or nothing), so it's quite an important place with several use cases. And "almost always" we need the data + error code at the end. (When you look into e.g.
Yes, I'd tend to keep it as is (no fixed "deadline", or e.g. 100ms, which is already a "very long time"), but change the error message to debug, call the blocking waitpid as we have done until now, and then either log a debug message again (so we see in debug how long it took) or, when it took 1sec or more, log this as an error or warning?
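To illustrate the race and the proposed handling, here is a minimal standalone sketch. It is not wicked code: `reap_child_timed`, `demo_hangup_then_reap` and `REAP_WARN_MS` are hypothetical names made up for this example, and the logging is plain `fprintf` rather than wicked's log functions. The child closes its pipe end, lingers for a moment, and only then exits; the parent sees the hangup first and then falls back to a blocking `waitpid`, measuring how long the wait took.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <poll.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical threshold: the thread suggests warning at 1 second or more. */
#define REAP_WARN_MS 1000L

static long elapsed_ms(const struct timespec *a, const struct timespec *b)
{
	return (b->tv_sec - a->tv_sec) * 1000L +
	       (b->tv_nsec - a->tv_nsec) / 1000000L;
}

/* Blocking reap that reports how long it had to wait.
 * Returns the wait time in milliseconds, or -1 on error. */
static long reap_child_timed(pid_t pid, int *status)
{
	struct timespec t0, t1;
	pid_t rv;
	long ms;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	do {
		rv = waitpid(pid, status, 0);
	} while (rv < 0 && errno == EINTR);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	if (rv != pid)
		return -1;

	ms = elapsed_ms(&t0, &t1);
	if (ms >= REAP_WARN_MS)
		fprintf(stderr, "warning: pid %ld took %ld ms to exit\n", (long)pid, ms);
	else
		fprintf(stderr, "debug: pid %ld reaped after %ld ms\n", (long)pid, ms);
	return ms;
}

/* The child closes its pipe end, lingers delay_ms, then exits with code 7.
 * Returns the child's exit code, or -1 on error. */
static int demo_hangup_then_reap(unsigned int delay_ms)
{
	struct timespec linger = { delay_ms / 1000, (long)(delay_ms % 1000) * 1000000L };
	struct pollfd pfd;
	int fds[2], status = 0, rv;
	pid_t pid;

	if (pipe(fds) < 0)
		return -1;
	pid = fork();
	if (pid < 0) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}
	if (pid == 0) {
		close(fds[0]);
		close(fds[1]);            /* the parent sees the hangup from here on */
		nanosleep(&linger, NULL); /* "cleanup" work after closing the fd */
		_exit(7);
	}

	close(fds[1]);
	pfd.fd = fds[0];
	pfd.events = POLLIN;
	do {
		rv = poll(&pfd, 1, -1);   /* returns once the write end is closed */
	} while (rv < 0 && errno == EINTR);
	close(fds[0]);
	if (rv < 0)
		return -1;

	/* A waitpid(pid, &status, WNOHANG) at this point may still return 0,
	 * because the child is alive; so block and measure instead. */
	if (reap_child_timed(pid, &status) < 0)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling `demo_hangup_then_reap(20)` from a small `main` should print a debug line with a wait of roughly the linger time and return the child's exit code; only a wait at or above the threshold would be reported as a warning.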
Changed error log message to debug with an additional warning when it took 1sec or more.
It would be better to carefully change the code to use signal handling (via signalfd, which also allows polling for signals) and finally call the notify_callback, but this would be much more intrusive than the above and would need extensive tests.
Yes, I think we're saying more or less the same thing. Your approach is slightly more airtight because it guarantees the process has fully exited. But either way, we would define some "error threshold" of 1 second or whatever that would be considered abnormal and trigger a warning, whereas below this value no warning would be generated. Thanks.
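For reference, a minimal sketch of what the signalfd idea could look like on Linux. This is an assumption-laden illustration, not the wicked implementation: `run_and_reap_via_signalfd` is a hypothetical name, the child simply exits with the given code, and a real integration would register the signalfd with the existing poll loop and call the process's notify callback instead of returning the exit code.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <sys/signalfd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that exits with exit_code and reap it only after SIGCHLD
 * shows up on a signalfd. Returns the child's exit code, or -1 on error. */
static int run_and_reap_via_signalfd(int exit_code)
{
	struct signalfd_siginfo si;
	struct pollfd pfd;
	sigset_t mask, old;
	int sfd, status = 0, result = -1;
	pid_t pid, rv;

	/* SIGCHLD must be blocked so it is queued for the signalfd
	 * instead of being delivered asynchronously. */
	sigemptyset(&mask);
	sigaddset(&mask, SIGCHLD);
	if (sigprocmask(SIG_BLOCK, &mask, &old) < 0)
		return -1;

	sfd = signalfd(-1, &mask, SFD_CLOEXEC);
	if (sfd < 0)
		goto restore;

	pid = fork();
	if (pid < 0)
		goto out;
	if (pid == 0)
		_exit(exit_code);

	pfd.fd = sfd;
	pfd.events = POLLIN;
	for (;;) {
		if (poll(&pfd, 1, -1) < 0) {
			if (errno == EINTR)
				continue;
			break;
		}
		/* Consume the queued SIGCHLD notification. */
		if (read(sfd, &si, sizeof(si)) != (ssize_t)sizeof(si))
			break;
		/* The signal says some child changed state; WNOHANG is enough
		 * here because the exit status is ready once SIGCHLD is sent. */
		rv = waitpid(pid, &status, WNOHANG);
		if (rv == pid) {
			if (WIFEXITED(status))
				result = WEXITSTATUS(status);
			break;
		}
		if (rv < 0 && errno != EINTR)
			break;
		/* rv == 0: the signal was for something else, keep polling. */
	}
out:
	close(sfd);
restore:
	sigprocmask(SIG_SETMASK, &old, NULL);
	return result;
}
```

The point of this design is that the reap is driven by the kernel's own "child changed state" notification rather than by a hangup on a pipe or socket, so the exit status is available by the time the handler runs.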
Yes, I think so too.
Yes, if you prefer, we can keep it as an error. If we run into cases where it takes >= 1s to get the exit status, something is fishy and we need to take a much closer look at all this (rewrite to use signalfd, IMO). Thanks!
process: log command of reaped child (gh##741)
Hello!
I have Leap 42.3 installed on VirtualBox.
Wicked version: 0.6.40.
There are multiple ni_process_reap: messages in the logs, such as:
The command journalctl -e _PID=1188 gives no result, so I can't determine which process (1176, 1188, etc.) is the problem.
Can you please give a hint as to what ni_process_reap: is and how to solve it, if it is indeed a problem? Thanks!
wickedd_0_6_40_ni_process_reap.txt