Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The wget binary leaks defunct ssl_client processes #158

Open
aleskandro opened this issue Feb 29, 2024 · 7 comments
Open

The wget binary leaks defunct ssl_client processes #158

aleskandro opened this issue Feb 29, 2024 · 7 comments

Comments

@aleskandro
Copy link

aleskandro commented Feb 29, 2024

The included wget binary (from busybox) seems to leak defunct ssl_client processes.

When using wget for health checking a registry serving on SSL, the host's ulimit for forking is eventually reached, and the registry becomes nonfunctional.

Affected versions and env

  • docker.io/library/registry:2.8.2
  • docker.io/library/registry:2.8.3
  • docker.io/library/registry:3.0.0-alpha.1

Server: Docker Engine - Community
Engine:
Version: 25.0.3
API version: 1.44 (minimum version 1.24)
Go version: go1.21.6
Git commit: f417435
Built: Tue Feb 6 21:14:27 2024
OS/Arch: linux/amd64

Kernel: Linux seraph 6.8.0-0.rc5.41.fc41.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Feb 19 14:05:40 UTC 2024 x86_64 GNU/Linux
Host OS: Fedora CoreOS

Steps to reproduce

$ docker run -d --rm --name myreg docker.io/library/registry:2.8.3
$ docker exec -it myreg sh

/ # ps aux
PID   USER     TIME  COMMAND
    1 root      0:00 registry serve /etc/docker/registry/config.yml
   18 root      0:00 sh
   24 root      0:00 ps aux
/ # wget https://google.com
Connecting to google.com (142.250.186.174:443)
Connecting to www.google.com (172.217.18.4:443)
saving to 'index.html'
index.html           100% |********************************************************************************************************************************************************************************| 19417  0:00:00 ETA
'index.html' saved
/ # ps aux
PID   USER     TIME  COMMAND
    1 root      0:00 registry serve /etc/docker/registry/config.yml
   18 root      0:00 sh
   26 root      0:00 [ssl_client]
   27 root      0:00 [ssl_client]
   28 root      0:00 ps aux
/ # 

Other Infos

I tried it on the base image, alpine:3.18.6, and it didn't reproduce

aleskandro added a commit to openshift-qe/baremetal-qe-infra that referenced this issue Feb 29, 2024
The registry@ services become unhealthy and get killed after a while they started.

This is due to distribution/distribution-library-image#158. The healthchecks are based on wget, and wget leaks a defunct process everytime the healthcheck launches. Eventually, we reach the ulimit, no more forks are allowed and the container is restarted.

This commit changes the healtcheck to use curl, that is installed by the healthcheck itself if not already available. This is a workaround to revert once a fix for distribution/distribution-library-image#158 lands and a new image is published.
aleskandro added a commit to openshift-qe/baremetal-qe-infra that referenced this issue Feb 29, 2024
The registry@ services become unhealthy and get killed after a while they started.

This is due to distribution/distribution-library-image#158. The healthchecks are based on wget, and wget leaks a defunct process everytime the healthcheck launches. Eventually, we reach the ulimit, no more forks are allowed and the container is restarted.

This commit changes the healtcheck to use curl, that is installed by the healthcheck itself if not already available. This is a workaround to revert once a fix for distribution/distribution-library-image#158 lands and a new image is published.
@milosgajdos
Copy link
Member

I'm a bit confused, you report that the issue affects v3.0.0-alpha1 but then conclude that it didn't reproduce on the base image?

I tried it on the base image, alpine:3.18.6, and it didn't reproduce

The example shows you've reproduced it on the v2.8.3 which sucks, indeed. I've noticed that that version has some critical SSL CVEs that need patching.

Can you grab the latest Dockerfile from this repo and build it and see if you can repro the issue?

@aleskandro
Copy link
Author

Yeah, it reproduces on the 3.0.0-alpha.1 I pulled too: docker.io/library/registry@sha256:36c26a131c88e3bc3195ae98070282c63b8015aa4b84b049d8b9826def317f6b precisely.

I'm a bit confused, you report that the issue affects v3.0.0-alpha1 but then conclude that it didn't reproduce on the base image?
I tried it on the base image, alpine:3.18.6, and it didn't reproduce

I got confused too actually, but had to go working on other issues and wanted to raise the issue at least.

Can you grab the latest Dockerfile from this repo and build it and see if you can repro the issue?

I'll double check this too as soon as i can

@aleskandro
Copy link
Author

aleskandro commented Mar 4, 2024

Ok, I was able to reproduce in the base image by having /bin/sleep inf as command:

docker run alpine:3.19 /bin/sleep inf
docker exec ... # same reproducer as above

This is also reproducible on the ubuntu image when using the busybox version of wget actually (as far as the PID1 process cannot collect the defunct process?).

I'll check where I can open an issue for busybox.

In the meantime, would it be ok to onboard curl or wget (the non-busybox version) in this image? Proposed change in #159

aleskandro added a commit to aleskandro/distribution-library-image that referenced this issue Mar 4, 2024
When running busybox wget v1.31.x to request an SSL page, there is a
leak of defunct processes that eventually could hit the limit for fork
syscalls, leading to the unavailability of main registry process.

Installing wget from the alpine repository resolves the issue as this
only affects the busybox distribution. Moreover, this should
transparently resolve the issue for deployments
using wget as healthcheck/readiness/liveness probe.

xref: https://bugs.busybox.net/show_bug.cgi?id=15967

Closes distribution#158
@aleskandro
Copy link
Author

Opened an upstream bug for busybox at https://bugs.busybox.net/show_bug.cgi?id=15967

@milosgajdos
Copy link
Member

A couple of questions -- sorry if they come out as dumb, but I'm still confused and would like to understand the problem better:

  • what does busybox have to do with this issue when the distribution binary is actually packaged into alpine?
  • why do we need to install wget when it's clearly available in the alpine image we use for distribution? are you saying the wget binary in the alpine image originates in busybox?

@aleskandro
Copy link
Author

aleskandro commented Mar 4, 2024

The wget available in the alpine image is a busybox applet, affected by the above mentioned bug.

If we use that binary to run an healthcheck about the registry availability (e.g., wget --spider https://localhost:5000/v2), eventually, the registry's container will fill up the maximum number of allowed forks and will not respond anymore.

/ # wget --help # same as `busybox wget --help`
BusyBox v1.36.1 (2023-11-06 11:32:24 UTC) multi-call binary. <------------ the shipped and bugged 'wget'

Usage: wget [-cqS] [--spider] [-O FILE] [-o LOGFILE] [--header STR]
	[--post-data STR | --post-file FILE] [-Y on/off]
	[-P DIR] [-U AGENT] [-T SEC] URL...

Retrieve files via HTTP or FTP

	--spider	Only check URL existence: $? is 0 if exists
	--header STR	Add STR (of form 'header: value') to headers
	--post-data STR	Send STR using POST method
	--post-file FILE	Send FILE using POST method
	-c		Continue retrieval of aborted transfer
	-q		Quiet
	-P DIR		Save to DIR (default .)
	-S    		Show server response
	-T SEC		Network read timeout is SEC seconds
	-O FILE		Save to FILE ('-' for stdout)
	-o LOGFILE	Log messages to FILE
	-U STR		Use STR for User-Agent header
	-Y on/off	Use proxy
/ # apk add wget
fetch https://dl-cdn.alpinelinux.org/alpine/v3.18/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/v3.18/community/x86_64/APKINDEX.tar.gz
(1/4) Installing libunistring (1.1-r1)
(2/4) Installing libidn2 (2.3.4-r1)
(3/4) Installing pcre2 (10.42-r1)
(4/4) Installing wget (1.21.4-r0)
Executing busybox-1.36.1-r5.trigger
OK: 10 MiB in 19 packages
/ # wget --help
GNU Wget 1.21.4, a non-interactive network retriever. <----- after installing wget
Usage: wget [OPTION]... [URL]...

Mandatory arguments to long options are mandatory for short options too.

Startup:
  -V,  --version                   display the version of Wget and exit
  -h,  --help                      print this help
  -b,  --background                go to background after startup
  -e,  --execute=COMMAND           execute a `.wgetrc'-style command

Logging and input file:
  -o,  --output-file=FILE          log messages to FILE
  -a,  --append-output=FILE        append messages to FILE
  -d,  --debug                     print lots of debugging information
  -q,  --quiet                     quiet (no output)
  -v,  --verbose                   be verbose (this is the default)
....

aleskandro added a commit to aleskandro/distribution-library-image that referenced this issue Mar 4, 2024
When running busybox wget v1.31.x to request an SSL page, there is a
leak of defunct processes that eventually could hit the limit for fork
syscalls, leading to the unavailability of main registry process.

Installing wget from the alpine repository resolves the issue as this
only affects the busybox distribution. Moreover, this should
transparently resolve the issue for deployments
using wget as healthcheck/readiness/liveness probe.

xref: https://bugs.busybox.net/show_bug.cgi?id=15967

Closes distribution#158

Signed-off-by: aleskandro <[email protected]>
@milosgajdos
Copy link
Member

FYI: we bumped the base alpine image to 3.21 in the latest v3 release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants