Stderr task finished with error #589
Comments
In the moment documentation
Comparing the events POSEIDON-3V and POSEIDON-3W, the occurrence times do not correlate. This makes a network event an unlikely cause: we would expect network issues above the TCP layer to interfere with all active connections between two hosts, not just individual ones. The submissions related to the failing executions are unsuspicious and do not allow us to reproduce the error. From a theoretical point of view, we expect neither Poseidon nor the payload to cause this issue, because even with a failing payload Nomad should return a proper error code. For the last event (turtle graphics), we find the exec requests in the Nomad logs (agent-4). Nomad Logs
We see that both exec sessions run for about six seconds. Afterwards, the Stdout (main) exec session ended normally, while the Stderr (FIFO handling) exec session ended with an error: Despite these insights, I was able neither to reproduce the error (by testing the named pipe and turtle graphics) nor to find helpful guidance on this error online. Also, the resources of the agent were barely used that day (no peak load). How should we proceed with this issue? PS: We learned the lesson that the
On second thought, this might be a bug in the moby repository. The error is defined in Go's io library for pipes. For each of our exec sessions, Nomad establishes a WebSocket connection to the Preferably, we find a way to reproduce this issue. Alternatively, we ignore this issue, as it does not seem to impact the user experience. Furthermore, we found that this pipe error affects not only this Sentry issue but also
While (not) studying, I had to rethink this error, and it occurred to me that Nomad is throwing this error and not the Docker daemon. The error logged by Nomad can only come from the library itself because it is not transferred over the HTTP connection to Moby. Therefore, the error must originate in Nomad. The error is thrown here and comes from the go-dockerclient library. However, when following the logic, the error likely occurs here or here. Both are So we backtrace the stream coming from Nomad [1] [2] [3] [4] [5]. Suddenly, we see Next time, we will examine how Nomad's
Now, after suspecting everyone, we finally identified the failing dependency. Go-DockerClient: Nomad still uses the unofficial version of Docker's Go client. The race condition occurs in about 14 of 197,536 executions (i.e., 98,768 Poseidon executions) in production, which is roughly one in 14,000, and about once every 24,254 executions in the local environment. The error happens because Go's runtime sometimes switches goroutines between these two lines. Another goroutine is still reading the same stream (stdin) and throws an error when the stream closes. If the latter happens in between the two lines, the error occurs. I will create an issue at the repository: fsouza/go-dockerclient#1076
Awesome, what a great discovery and in-depth analysis of the underlying issue! I am glad you found the root cause and created a corresponding issue with such a detailed description that's very easy to follow 🥳. I also tried to reproduce the example and succeeded in the end. Still, you might want to expand the reproduction steps a little.
On my system, the error occurred too (after more than 1600 executions, invoked with
Great, thank you for these additions. I adjusted the reproduction steps.
Since we created the upstream issue and caught the erroneous behavior with #590, we are closing this issue.
Nomad switched from fsouza/go-dockerclient to the native Docker client: hashicorp/nomad#23966. Hence, this issue will soon become obsolete for Poseidon. For now, however, there is a regression that I reported in hashicorp/nomad#24171. This issue also blocks the upgrade started in #715.
Sentry Issue: POSEIDON-3V
error executing command in allocation:
unexpected EOF