uv_run hangs on macOS arm64/Rosetta in x86_64 docker container #4279

Open
bradking opened this issue Jan 9, 2024 · 18 comments

@bradking
Contributor

bradking commented Jan 9, 2024

Version: 1.43.0, 1.47.0, others
Platform: macOS arm64 + Rosetta + Docker + Linux/x86_64

On a macOS Apple Silicon host using Docker and Rosetta to run a Linux/x86_64 container, the following test program hangs and leaves a defunct child process. The same program works well in containers with native architectures, or on plain Linux and macOS hosts.

The example in demo.tar.gz is a small C program that uses libuv to run sh -c "echo echo", read its output, wait for exit, clean up, and repeat. It should iterate 10000 times, but hangs with a defunct child sh. For convenience, the contents of demo.tar.gz are reproduced below:

test-uv.c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <uv.h>

static void my_on_alloc(uv_handle_t* handle, size_t suggested_size,
                        uv_buf_t* buf)
{
  (void)handle;
  *buf =
    uv_buf_init((char*)malloc(suggested_size), (unsigned int)suggested_size);
}

static void my_on_read(uv_stream_t* stream, ssize_t nread, const uv_buf_t* buf)
{
  if (nread > 0) {
    fwrite(buf->base, 1, nread, stdout);
  } else if (nread < 0) {
    uv_close((uv_handle_t*)stream, NULL);
  }
  free(buf->base);
}

static void my_on_exit(uv_process_t* handle, int64_t exitStatus,
                       int termSignal)
{
  (void)exitStatus;
  (void)termSignal;
  uv_close((uv_handle_t*)handle, NULL);
}

static void run()
{
  /* Create a child process that creates its
     own child that prints output.  */
  const char* cmd[] = { "sh", "-c", "echo echo", 0 };

  /* Create the event loop.  */
  uv_loop_t loop;
  uv_loop_init(&loop);

  /* Create the child stdout pipe.  */
  uv_file fd[2];
  uv_pipe(fd, 0, 0);

  uv_stdio_container_t stdio[3];
  stdio[0].flags = UV_INHERIT_FD;
  stdio[0].data.fd = 0;
  stdio[1].flags = UV_INHERIT_FD;
  stdio[1].data.fd = fd[1];
  stdio[2].flags = UV_INHERIT_FD;
  stdio[2].data.fd = 2;

  uv_process_options_t options;
  memset(&options, 0, sizeof(options));
  options.file = cmd[0];
  options.args = (char**)cmd;
  options.stdio = stdio;
  options.stdio_count = 3;
  options.exit_cb = my_on_exit;

  /* Spawn the child process.  */
  uv_process_t process;
  int spawn_result = uv_spawn(&loop, &process, &options);
  if (spawn_result < 0) {
    fprintf(stderr, "uv_spawn failed: %s\n", uv_strerror(spawn_result));
  }

  /* The child stdout pipe write end is not needed in the parent.  */
  close(fd[1]);

  /* Read the child stdout pipe.  */
  uv_pipe_t outPipe;
  uv_pipe_init(&loop, &outPipe, 0);
  uv_pipe_open(&outPipe, fd[0]);
  uv_read_start((uv_stream_t*)&outPipe, my_on_alloc, my_on_read);

  /* Run the event loop to completion.  */
  uv_run(&loop, UV_RUN_DEFAULT);

  /* Destroy the event loop.  */
  int close_result = uv_loop_close(&loop);
  if (close_result < 0) {
    fprintf(stderr, "uv_loop_close failed: %s\n", uv_strerror(close_result));
  }
}

int main(void)
{
  int i;
  for (i = 0; i < 10000; ++i) {
    fprintf(stderr, "%04d\n", i);
    run();
  }
  return 0;
}
Dockerfile
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y --no-install-recommends gcc libc-dev libuv1-dev
WORKDIR /root
COPY test-uv.c /root/
README.md

On a macOS Apple Silicon host:

docker build --platform linux/amd64 -t demo .
docker run --platform linux/amd64 -t --rm demo \
  bash -c 'gcc test-uv.c -luv && ./a.out'

The test process in the container eventually hangs.

@bradking
Contributor Author

bradking commented Jan 9, 2024

For reference, I narrowed this example down from CMake Issue 25562. CMake uses libuv to run child processes, and works on many platforms, but hangs in libuv on this one.

@santigimeno
Member

If you run the program with strace -f, does the output provide any hints?

@bradking
Contributor Author

I just ran the program under strace -f. The hang does not occur and all 10000 iterations complete. Then I ran the process by itself first, waited for it to hang, and used strace -f -p $PID 2>log to attach to it. That causes the process to resume running. If I Ctrl-C the strace process then the test program eventually hangs again. Attaching strace wakes it up again. Each time the log file shows the following at the top:

strace: Process 59 attached
syscall_0xf0003ffff128(0xf0003ffff0c0, 0xf0003ffff0c0, 0xf0003ffff110, 0, 0x8000ffff782f2d90, 0xd7) = 0x8000ffff782f2fe4
syscall_0xf0003ffff128(0xf0003ffff0c0, 0, 0xf0003ffff110, 0, 0x800000181ed0, 0x87) = 0x8000001820f0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=..., si_uid=0, si_status=0, si_utime=1, si_stime=0} ---
syscall_0xf0003ffff128(0xf0003ffff0c0, 0, 0x6, 0, 0x8000ffff782f2d90, 0xde) = 0x8000ffff782f2fe4
read(0, NULL, 0)                        = 140737489936624
[ Process PID=59 runs in x32 mode. ]
syscall_0x5554ffffbf40(0x4, 0x5554ffffa94f, 0x11, 0, 0x2, 0x3f) = 0xffffffff
syscall_0x5554fffff4b8(0x4, 0x5554ffffa94f, 0x11, 0, 0x2, 0x40) = 0xffffffff
syscall_0x5554fffff4b8(0x4, 0x5554ffffa94f, 0x11, 0, 0x2, 0x40) = 0xffffffff
[ Process PID=59 runs in 64 bit mode. ]
read(0, 0xc000, 3)                      = 9223653509552877540
read(0, NULL, 3)                        = 4294967295

These lines are identical every time except for the si_pid value, since a different child exits each time.

@bnoordhuis
Member

[ Process PID=59 runs in x32 mode. ]
...
[ Process PID=59 runs in 64 bit mode. ]

That's... unusual. strace is saying the process's personality changed between syscalls.

It checks bit 30 in the syscall number (i.e. rax & 0x40000000) to distinguish between x86_64 and x32 syscalls, so maybe it gets confused because all the other values are bogus-looking too.
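As a rough illustration of that heuristic (not strace's actual source; the mask is __X32_SYSCALL_BIT from the kernel headers, and the sample number is one of the bogus-looking values from the trace above):

#include <stdio.h>

/* __X32_SYSCALL_BIT from the Linux uapi headers.  */
#define X32_SYSCALL_BIT 0x40000000UL

int main(void)
{
  /* One of the bogus-looking "syscall numbers" from the trace above.  */
  unsigned long nr = 0xf0003ffff128UL;
  /* Bit 30 set means an x32 syscall, clear means a plain x86_64 one.  */
  printf("0x%lx -> %s\n", nr, (nr & X32_SYSCALL_BIT) ? "x32" : "x86_64");
  return 0;
}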

I can't really make heads or tails of that trace. Can you perhaps try dumping core or attaching with gdb when the process hangs and a) inspect registers, and b) obtain a backtrace?

@bradking
Contributor Author

I think strace's heuristics are getting confused. It prints the x32/64 bit messages repeatedly. It's also printing most of the syscalls in raw form.

gdb doesn't work well in the container either. I found some posts about using ROSETTA_DEBUGSERVER_PORT to run a process under a gdbserver and debug it from another terminal, but that doesn't actually start the process until a remote debugger attaches to the session. When I do that and issue "continue" in gdb, the process hangs immediately in epoll_pwait:

gdb Immediate Hang Backtrace
(gdb) where
#0  0x00007fffffebcb10 in ?? ()
#1  0x0000555555574130 in uv__io_poll (loop=0x5554fffff1d0, timeout=-1) at /root/libuv/src/unix/linux.c:1430
#2  0x000055555555eb95 in uv_run (loop=0x5554fffff1d0, mode=UV_RUN_DEFAULT) at /root/libuv/src/unix/core.c:447
#3  0x0000555555559dcb in run () at test-uv.c:83
#4  0x0000555555559e8c in main () at test-uv.c:99

I've not been able to get a backtrace for the real hang, but I patched libuv as follows to print more information:

debug-prints.patch
diff --git a/src/unix/linux.c b/src/unix/linux.c
index 8eeb352e..64a6d351 100644
--- a/src/unix/linux.c
+++ b/src/unix/linux.c
@@ -1427,7 +1427,9 @@ void uv__io_poll(uv_loop_t* loop, int timeout) {
      */
     lfields->current_timeout = timeout;

+    fprintf(stderr, "before epoll_pwait\n");
     nfds = epoll_pwait(epollfd, events, ARRAY_SIZE(events), timeout, sigmask);
+    fprintf(stderr, "after epoll_pwait\n");

     /* Update loop->time unconditionally. It's tempting to skip the update when
      * timeout == 0 (i.e. non-blocking poll) but there is no guarantee that the
diff --git a/src/unix/process.c b/src/unix/process.c
index dd58c18d..12b6b02c 100644
--- a/src/unix/process.c
+++ b/src/unix/process.c
@@ -76,6 +76,7 @@ extern char **environ;

 #ifdef UV_USE_SIGCHLD
 static void uv__chld(uv_signal_t* handle, int signum) {
+  write(2, "c\n", 2);
   assert(signum == SIGCHLD);
   uv__wait_children(handle->loop);
 }
diff --git a/src/unix/signal.c b/src/unix/signal.c
index bc4206e6..46139c56 100644
--- a/src/unix/signal.c
+++ b/src/unix/signal.c
@@ -185,6 +185,8 @@ static void uv__signal_handler(int signum) {
   uv_signal_t* handle;
   int saved_errno;

+  write(2, "a\n", 2);
+
   saved_errno = errno;
   memset(&msg, 0, sizeof msg);

@@ -197,6 +199,7 @@ static void uv__signal_handler(int signum) {
        handle != NULL && handle->signum == signum;
        handle = RB_NEXT(uv__signal_tree_s, &uv__signal_tree, handle)) {
     int r;
+    write(2, "b\n", 2);

     msg.signum = signum;
     msg.handle = handle;

The changes print before and after epoll_pwait, and print a, b, and c at various points in uv__signal_handler and uv__chld. With that patch, the test program produces the following output:

...
0006
before epoll_pwait
after epoll_pwait
echo
before epoll_pwait
after epoll_pwait
before epoll_pwait
a
b
after epoll_pwait
before epoll_pwait
after epoll_pwait
c
0007
before epoll_pwait
after epoll_pwait
echo
before epoll_pwait
after epoll_pwait
before epoll_pwait

This shows that the signal handling and process reaping work several times (earlier iterations are omitted as ...; the last one shown is 0006), but eventually the process hangs in epoll_pwait and SIGCHLD is not received/processed. The number of iterations before it hangs seems random; sometimes it takes a few hundred. I can wake up the hung process with strace -p $pid, and then it processes SIGCHLD and output continues as normal:

a
b
after epoll_pwait
before epoll_pwait
after epoll_pwait
c
0008
...

Then the hang doesn't occur again until I Ctrl-C the strace process.

@vtjnash
Member

vtjnash commented Jan 12, 2024

The signal-handling code does look like it has an illegal data race here, since this variable update might need to be seq-cst (it is read outside the lock, but needs to be sequenced with the read from the previous epoll wakeup; yet the variable is written after the pipe write, so it would appear to be a data race even with seq-cst annotations). It currently isn't even marked atomic on read or write:

if (sh->caught_signals > sh->dispatched_signals) {
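For illustration only (hypothetical names modeled on the fields quoted above, not libuv's actual layout), making the handler-written counter a C11 atomic would look roughly like this:

#include <stdatomic.h>

/* Sketch: the counter written from the signal handler is atomic so the
   event-loop thread reliably observes increments made on other cores.  */
typedef struct {
  _Atomic unsigned int caught_signals;   /* bumped by the signal handler */
  unsigned int dispatched_signals;       /* touched only by the loop thread */
} sigcounts_t;

static void on_signal(sigcounts_t* sh)
{
  atomic_fetch_add(&sh->caught_signals, 1); /* seq-cst by default */
}

static int has_undispatched(sigcounts_t* sh)
{
  return atomic_load(&sh->caught_signals) > sh->dispatched_signals;
}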

@vtjnash
Member

vtjnash commented Jan 12, 2024

We should maybe consider refactoring the code so that we can share this with uv_async, since that code is correctly sequenced for use from signals, and this signal-handler code does not look correct for use even from threads:

libuv/src/unix/async.c

Lines 65 to 70 in a7cbda9

int uv_async_send(uv_async_t* handle) {
_Atomic int* pending;
_Atomic int* busy;
pending = (_Atomic int*) &handle->pending;
busy = (_Atomic int*) &handle->u.fd;
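
For context, a rough sketch of the pattern that makes that path safe to call from other threads; the names here are hypothetical and this is not libuv's actual implementation, just the pending-flag-plus-wakeup idea:

#include <stdatomic.h>
#include <unistd.h>

/* Hypothetical async-wakeup sketch: the sender flips an atomic flag and only
   writes the wakeup byte on the 0 -> 1 transition; the loop thread clears
   the flag after draining the pipe.  */
typedef struct {
  _Atomic int pending;
  int wakeup_fd; /* write end of the loop's wakeup pipe */
} my_async_t;

static void my_async_send(my_async_t* a)
{
  if (atomic_exchange(&a->pending, 1) == 0) {
    char byte = 0;
    (void)write(a->wakeup_fd, &byte, 1); /* wake the event loop */
  }
}

static void my_async_drain(my_async_t* a)
{
  /* Runs on the loop thread after the wakeup fd polls readable.  */
  atomic_store(&a->pending, 0);
}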

@bnoordhuis
Member

but this variable is written after the pipe write

I assume this == handle->caught_signals? It's "protected" by a syscall-induced memory barrier but yeah, non-atomic stores can take a while to propagate to other cores so it's quite possible for events to get lost. I'll open a pull request to fix that.

handle->dispatched_signals is only accessed from the event loop thread. There should be no synchronization issues there.

bnoordhuis added a commit to bnoordhuis/libuv that referenced this issue Jan 15, 2024
Changes to the `caught_signals` field should be immediately visible to
other threads, otherwise "we get signal" events can get lost.

Tentative fix for the linked issue. I say "tentative" because the cause
is not fully understood but all signs point to libuv's handling of
SIGCHLD signals, and they only manifest on architectures with weaker
memory models than x86.

Fixes: libuv#4279
@bnoordhuis
Member

#4287

@vtjnash
Member

vtjnash commented Jan 15, 2024

Ah, I see now that yes, it has lock exclusion around it, so non-atomic access is okay: it only gets accessed on the main thread when the lock is released and the handle is not in the list of handlers, but changing the list requires that lock, and releasing the lock provides the needed memory barrier.

@vtjnash
Member

vtjnash commented Jan 15, 2024

I guess the trace above does seem to indicate a kernel bug, since the signal itself appears to be missing until something triggers the kernel to check for it? Although it might be an issue with the rosetta/docker emulator also? I ran the test inside an ubuntu-aarch64 VM and did not run into the issue.

@vtjnash
Member

vtjnash commented Jan 15, 2024

The simplest fix might be to switch to signalfd instead (available since Linux 2.6.27 / glibc 2.8), which works almost identically to libuv's emulation of it. We already did this on macOS for essentially identical reasons (though kevent supports waiting for child pids natively, while poll/epoll seems to require some inefficient workarounds).
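
A minimal standalone sketch of that approach (not libuv code; it assumes SIGCHLD is blocked before the fd is read, and in a real loop the fd would go into the epoll set rather than the blocking read shown here):

#include <signal.h>
#include <stdio.h>
#include <sys/signalfd.h>
#include <unistd.h>

int main(void)
{
  sigset_t mask;
  sigemptyset(&mask);
  sigaddset(&mask, SIGCHLD);

  /* SIGCHLD must be blocked so it can only be consumed through the fd.  */
  sigprocmask(SIG_BLOCK, &mask, NULL);

  int sfd = signalfd(-1, &mask, SFD_CLOEXEC);
  if (sfd == -1) {
    perror("signalfd");
    return 1;
  }

  /* Spawn a child so a SIGCHLD actually arrives.  */
  if (fork() == 0)
    _exit(0);

  /* The fd becomes readable once SIGCHLD is pending.  */
  struct signalfd_siginfo si;
  if (read(sfd, &si, sizeof(si)) == (ssize_t)sizeof(si))
    printf("got signal %u from pid %u\n", si.ssi_signo, si.ssi_pid);

  close(sfd);
  return 0;
}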

@arichardson

I saw a similar hang when signals are delivered while in a signal handler: https://github.com/libuv/libuv/pull/3755/files. Does the added test look similar to the issue you are seeing?

@vtjnash
Member

vtjnash commented Jan 16, 2024

No, that is a different issue. In that example, you showed there is a case where handle->dispatched_signals++ is not counted correctly: it is supposed to be a count of messages read from the pipe, but we incorrectly skip counting messages that were read but not dispatched. It seems that count should be incremented with each message read, moving the increment a bit earlier in that method. The PR got stuck because it changed a lot more than that, which obscured the necessary part of your fix.

EDIT: I slightly misread the PR diff. This case seems to be an issue with the PR itself, but not an issue with the v1.x code.

@bnoordhuis
Member

Although it might be an issue with the rosetta/docker emulator also?

I'm leaning towards this. The fact that it manifests neither on native ARM nor native x86 is suggestive.

Switching to signalfd won't work. Any thread can be the receiver of a signal. signalfd only works correctly when combined with pthread_sigmask.
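
To illustrate the constraint (a hypothetical sketch, not a proposed libuv change): the mask has to be installed before any other threads exist, because each thread inherits its creator's mask, and libuv cannot enforce that for threads the embedder creates.

#include <pthread.h>
#include <signal.h>
#include <stddef.h>

static void* worker(void* arg)
{
  (void)arg;
  /* This thread inherits the blocked SIGCHLD from main().  */
  return NULL;
}

int main(void)
{
  sigset_t mask;
  sigemptyset(&mask);
  sigaddset(&mask, SIGCHLD);

  /* Block SIGCHLD before creating any threads; threads created afterwards
     inherit the mask, so the signal can only surface via signalfd.  */
  pthread_sigmask(SIG_BLOCK, &mask, NULL);

  pthread_t t;
  pthread_create(&t, NULL, worker, NULL);
  pthread_join(t, NULL);
  return 0;
}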

@vtjnash
Member

vtjnash commented Jan 18, 2024

Oh, yeah, I see now that the design of signalfd seems to be just as bad as that of sigwaitinfo, which it is documented to be identical to (and which was the other possible option for fixing this). I assumed they had designed the handling better, to be like kevent, but alas, no. The problem with most alternative options (except kevent, which doesn't have this issue) is indeed that the signal must be blocked on all threads, since setting it to SIG_IGN globally also breaks signalfd.

I am also not certain whether this is actually considered a kernel bug, or just undefined behavior in the POSIX implementation on Linux. I have observed that the kernel picks a thread at random to handle a signal, and then waits to run the signal handler until that thread is eligible to be scheduled and isn't blocking signals. But if that thread dies or doesn't need to be rescheduled, then the kernel may choose to delay delivery of that signal indefinitely. (I have run into this exact problem in the past: I had blocked SIGCHLD on a high-priority thread once, and it caused libuv to miss occasional SIGCHLD notifications, because the kernel happened to pick that thread to handle them and that one thread blocked them.)

@etcwilde

I've had a few reports of libuv hanging in uv__io_poll on kevent on macOS without emulation or containerization.
This sample is from CMake 3.28.1 running on an arm64 device without Rosetta emulation.

Call graph:
    7710 Thread_132034   DispatchQueue_1: com.apple.main-thread  (serial)
      7710 start  (in dyld) + 2360  [0x18a0cd0e0]
        7710 main  (in cmake) + 9284  [0x100f9d9a0]
          7710 cmcmd::ExecuteCMakeCommand(std::vector<std::basic_string<char>> const&, std::unique_ptr<cmConsoleBuf>)  (in cmake) + 8460  [0x1011b6b14]
            7710 cmSystemTools::RunSingleCommand(std::vector<std::basic_string<char>> const&, std::basic_string<char>*, std::basic_string<char>*, int*, char const*, cmSystemTools::OutputOption, std::chrono::duration<double, std::ratio<1l, 1l>>, cmProcessOutput::Encoding)  (in cmake) + 820  [0x100e90114]
              7710 uv_run  (in cmake) + 284  [0x101160a8c]
                7710 uv__io_poll  (in cmake) + 712  [0x101166e10]
                  7710 kevent  (in libsystem_kernel.dylib) + 8  [0x18a40ea00]

@vtjnash
Member

vtjnash commented Jan 18, 2024

That must be unrelated, since it is a different kernel and does not use signals, but yes, it sounds similar (cf. https://gitlab.kitware.com/cmake/cmake/-/issues/25562).
