Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stderr task finished with error #589

Closed
sentry-io bot opened this issue May 15, 2024 · 8 comments
Closed

Stderr task finished with error #589

sentry-io bot opened this issue May 15, 2024 · 8 comments
Labels
bug Something isn't working

Comments

@sentry-io
Copy link

sentry-io bot commented May 15, 2024

Sentry Issue: POSEIDON-3V

Stderr task finished with error
  • error executing command in allocation:unexpected EOF
    • Verify if a network event caused this event or if something stopped the runner unexpectedly
@MrSerth MrSerth added the bug Something isn't working label May 29, 2024
@mpass99
Copy link
Contributor

mpass99 commented Jul 23, 2024

In the moment documentation

Comparing the event of POSEIDON-3V with POSEIDON-3W, the occurence times do not correlate. This makes network events as cause more unlikely as we would expect that network issues above the TCP layer interfere with all active connections between two hosts and not only individual.

The submissions related to the failing executions are unsuspicious and do not allow the error reproduction.

From a theoretical consideration, we do not expect Poseidon nor the payload to cause this issue, because even with a failing payload Nomad should return a proper error code.

For the last event (turtle graphics), we find the exec requests in the Nomad logs (agent-4).

Nomad Logs

2024-06-27T18:20:25.645Z [INFO]  client: task exec session starting: exec_id=09cef026-e16d-d113-6919-88ade6f18336 alloc_id=4f62817a-4e83-e6d5-6336-b5b416881505 task=default-task command=["/bin/bash", "-c", "echo -ne \"\\x1EPoseidon /sbin/setuser user $(date +%s%3N)\\x1E\"; /sbin/setuser user /bin/bash -c \"echo -ne \\\"\\\\x1EPoseidon mkfifo \\$(date +%s%3N)\\\\x1E\\\"; mkfifo /tmp/stderr_1719512425631783877.fifo && (cat /tmp/stderr_1719512425631783877.fifo; rm /tmp/stderr_1719512425631783877.fifo); ec=\\$?; echo -ne \\\"\\\\x1EPoseidon exit \\$ec \\$(date +%s%3N)\\\\x1E\\\" && exit \\$ec\""] tty=true action=""
2024-06-27T18:20:25.646Z [INFO]  client: task exec session starting: exec_id=03079427-9997-54a6-0a8f-97d6b11b7327 alloc_id=4f62817a-4e83-e6d5-6336-b5b416881505 task=default-task command=["/bin/bash", "-c", "echo -ne \"\\x1EPoseidon unset \\${!NOMAD_@} $(date +%s%3N)\\x1E\"; unset ${!NOMAD_@} && /bin/bash -c \"echo -ne \\\"\\\\x1EPoseidon /sbin/setuser user \\$(date +%s%3N)\\\\x1E\\\"; /sbin/setuser user /bin/bash -c \\\"echo -ne \\\\\\\"\\\\\\\\x1EPoseidon until \\\\\\$(date +%s%3N)\\\\\\\\x1E\\\\\\\"; until [ -e /tmp/stderr_1719512425631783877.fifo ]; do sleep 0.01; done; (echo -ne \\\\\\\"\\\\\\\\x1EPoseidon python3 -B /usr/lib/python3.8/webpython.py -f aufgabe323.py \\\\\\$(date +%s%3N)\\\\\\\\x1E\\\\\\\"; env CODEOCEAN=true /bin/bash -c \\\\\\\"python3 -B /usr/lib/python3.8/webpython.py -f aufgabe323.py\\\\\\\"; ec=\\\\\\$?; echo -ne \\\\\\\"\\\\\\\\x1EPoseidon exit \\\\\\$ec \\\\\\$(date +%s%3N)\\\\\\\\x1E\\\\\\\" && exit \\\\\\$ec) 2> /tmp/stderr_1719512425631783877.fifo\\\"\""] tty=true action=""
2024-06-27T18:20:31.900Z [INFO]  client: task exec session ended with an error: error="failed to start exec: io: read/write on closed pipe" code=0xc000ad61a0
2024-06-27T18:20:31.949Z [INFO]  client: task exec session ended: exec_id=03079427-9997-54a6-0a8f-97d6b11b7327

We see that both exec sessions run for about six seconds. After it, the Stdout (main) exec session ended normally, the Stderr (FIFO handling) exec session ended with an error: error="failed to start exec: io: read/write on closed pipe" code=0xc000ad61a0. failed to start exec is a Nomad wrapper for Docker errors (when executing our command).

Despite these insights, I was not able to reproduce the error (by testing the named pipe and turtle graphics) nor find helpful internet guidance on this error. Also, the resources of the agent had barely been used that day (no peak load).

How should we proceed with this issue?

PS: We learned the lesson that the unexpected EOF error might hide some error in the Nomad logs which Nomad is not passing to us.

@mpass99
Copy link
Contributor

mpass99 commented Jul 23, 2024

On second thought, this might be a bug in the moby repository.

The error is defined in Go's io library for Pipes.

For each of our exec sessions, Nomad establishes a WebSocket connection to the /exec/:id/start route handled here.
In this exact function, Docker (Moby) is creating multiple io.Pipes for the StdIn. It seems that the handling of the pipes contains a race condition. This might be triggered by the behavior of our StdIn for the Stderr exec session, which just blocks till the execution is done. While Moby supports not having a StdIn stream, Nomad requires one.

Preferably, we find a way to reproduce this issue. Alternatively, we ignore this issue as it does not seem to impact the user experience. (Or, we reconsider having two exec sessions. Docker, instead, multiplexes StdOut and StdErr via a simple protocol. By re-implementing the client side as a shell script, we could use their Go handlers for decoding. As we already require a script (/sbin/setuser) in the Docker images, another script for encoding wouldn't change a lot.)

Furthermore, we found that this pipe error affects not only this Sentry issue but also unexpected EOF errors of POSEIDON-3W (main exec session) [1] [2] [3].
unexpected EOF errors from issue POSEIDON-5C are not caused by this Pipe error (as it is a WebSocket connection between CodeOcean and Poseidon).

This was referenced Jul 23, 2024
@mpass99
Copy link
Contributor

mpass99 commented Jul 25, 2024

While (not) studying, I had to rethink this error and it came to my mind that Nomad is throwing this error and not the Docker daemon. The error logged by Nomad can just be from the library itself because it is not transferred over the HTTP connection to Moby. Therefore, the error must originate in Nomad.

The error is thrown here and comes from the go-dockerclient library. However, when following the logic, the error likely occurs here or here. Both are io.Copy calls, copying between the WebSocket connection to moby and the Nomad handled Reader/Writer. io.Copy just returns the error of either the reading or writing call. It is more likely that the error originates from the Nomad Reader/Writer than from Go's net/http library.

So we backtrace the stream coming from Nomad [1] [2] [3] [4] [5]. Suddenly, we see io.Pipes being created - exactly those that throw such error, we see in the logs.

Next time, we will examine how Nomad's StreamToExecOptions function allows reading/writing to a pipe after closing it.

@mpass99
Copy link
Contributor

mpass99 commented Jul 25, 2024

Now, after suspecting everyone, we finally identified the failing dependency.

Go-DockerClient: The unofficial version of Docker's Go client is still used by Nomad.

The race condition happens in about 14 of 197.536 executions (aka 98.768 Poseidon Executions) (production); about every 24254 executions in the local environment.

The error happens because sometimes Go's runtime switches goroutines between these two lines. Another Goroutine is still reading the same stream (stdin) and throws an error when the stream closes. If the latter happens in between the two lines, the error occurs.

I will create an issue at the repository: fsouza/go-dockerclient#1076

@MrSerth
Copy link
Member

MrSerth commented Aug 1, 2024

Awesome, what a great discovery and in-depth analysis of the underlying issue! I am glad you found the root cause and created a corresponding issue with such a detailed description that's very easy to follow 🥳.

I also went ahead and aimed to reproduce the example and succeeded in the end. Still, you might want to expand the reproduction steps a little (click to expand)
  • I had to call the file with the file extension go run -race main.go since otherwise I got an error package main is not in std (/opt/homebrew/Cellar/go/1.22.5/libexec/src/main)
  • Then, I didn't figured out how to include the dependencies (besides downloading it through go install github.com/fsouza/go-dockerclient@latest) and finally ended up creating a go.mod with go mod init main and added the dependency through go get github.com/fsouza/go-dockerclient
  • Finally, I had to pull the image with docker pull ubuntu

On my system, the error occurred too (after more than 1600 executions, invoked with go run -race main.go):

[...]
hello world 1677
failed to start exec: io: read/write on closed pipeexit status 2

@mpass99
Copy link
Contributor

mpass99 commented Aug 7, 2024

Great, thank you for these additions. I adjusted the Reproduction steps.
Interesting, how much this race condition/Go's scheduling differs from system to system.

@MrSerth
Copy link
Member

MrSerth commented Aug 7, 2024

Since we created the upstream issue and catched the erroneous behavior with #590, we are closing this issue.

@MrSerth
Copy link
Member

MrSerth commented Oct 11, 2024

Nomad switched from from fsouza/go-dockerclient to the native docker client: hashicorp/nomad#23966

Hence, this issue will become obsolete for Poseidon soon. For now, however, there is a regression I reported in hashicorp/nomad#24171. This issue also blocks the upgrade started in #715.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants