Linux network namespace sysctl safety verifier.
Ensure that net sysctls are network-namespace-safe.
usage: verify.py [-h] [-v]

optional arguments:
  -h, --help     show this help message and exit
  -v, --verbose  Verbose output
Currently, this must be run as root, in order to use CLONE_NEWNET.
$ sudo ./verify.py -v
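To illustrate why root is needed, here is a minimal sketch of running a function in a fresh network namespace. It is not taken from verify.py: it assumes a fork-then-unshare(2) approach through ctypes, and the helper name run_in_new_netns is hypothetical. Without root (more precisely, CAP_SYS_ADMIN), the unshare call fails and the child exits with status 1.

```python
import ctypes
import os
import traceback

CLONE_NEWNET = 0x40000000  # value from <linux/sched.h>


def run_in_new_netns(func):
    """Fork, move the child into a new netns, call func(), return exit code.

    Hypothetical helper for illustration; the real tool may use clone(2).
    """
    pid = os.fork()
    if pid == 0:
        # Child: never return into the parent's code path.
        status = 1
        try:
            libc = ctypes.CDLL(None, use_errno=True)
            if libc.unshare(CLONE_NEWNET) != 0:
                err = ctypes.get_errno()
                raise OSError(err, os.strerror(err))
            func()
            status = 0
        except Exception:
            traceback.print_exc()
        finally:
            os._exit(status)
    # Parent: wait for the child and translate its wait status.
    _, wait_status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(wait_status)  # Python 3.9+
```

A return value of 0 means func ran to completion inside the new namespace; 1 means either unshare or func failed.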
The premise behind this tool is simple:
- Take a snapshot of all values in /proc/sys/net.
- Create a child process with a new netns (using CLONE_NEWNET).
- In the child netns, modify every writable value in /proc/sys/net.
- Exit the child netns.
- Take a second snapshot of /proc/sys/net.
- Compare the snapshots and report any differences.
Anything in the parent which changed as a result of manipulations in the child is considered a "leak".
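The snapshot and compare steps can be sketched roughly as follows. This is an illustration of the idea, not the code in verify.py; the function names snapshot and diff_snapshots are hypothetical, and unreadable files are simply skipped.

```python
import os


def snapshot(root="/proc/sys/net"):
    """Map each readable file under root to its raw contents."""
    values = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    values[path] = f.read()
            except OSError:
                # Some sysctls are write-only or fail on read; skip them.
                continue
    return values


def diff_snapshots(before, after):
    """Return {path: (old, new)} for every path whose value differs.

    A path present in only one snapshot shows up with None on the other side.
    """
    changed = {}
    for path in sorted(before.keys() | after.keys()):
        old, new = before.get(path), after.get(path)
        if old != new:
            changed[path] = (old, new)
    return changed
```

Under these assumptions, a "leak" is any entry returned by diff_snapshots(before, after) when the only writes in between happened in the child netns.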
The Linux kernel provides runtime-configurable kernel parameters known as
"sysctls", which are accessed via /proc/sys/.
Linux also supports network namespaces (netns), which enable isolated
virtual network stacks and are used heavily by containerization platforms like
LXC or Docker. See network_namespaces(7).
It's generally understood that the "net" sysctls (under /proc/sys/net) are
supposed to be "netns safe", meaning that manipulating sysctls from one network
namespace cannot affect any other network namespace. This isn't exactly
guaranteed, though.
It may be desirable to allow a container to write to net sysctls, specifically
parameters of devices which exist only within the container's netns. However,
the latest version of Docker (20.10.6 as of this writing) mounts all of
/proc/sys read-only, to prevent changes made in a container from "leaking"
out of the container. This protection mechanism makes it more difficult (and
less secure) to run a libvirt QEMU VM inside of a Docker container.
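One way to see this read-only mount from inside a container is to look for a mount entry at /proc/sys and inspect its options. The sketch below assumes the usual /proc/self/mounts layout (device, mount point, filesystem type, comma-separated options); the function name proc_sys_is_readonly is hypothetical.

```python
def proc_sys_is_readonly(mounts_text, target="/proc/sys"):
    """Return True/False for the last mount at target, or None if none exists.

    mounts_text is the content of a mount table such as /proc/self/mounts.
    Later entries shadow earlier ones, so the last match wins.
    """
    result = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[1] != target:
            continue
        options = fields[3].split(",")
        result = "ro" in options
    return result
```

To try it, read /proc/self/mounts and pass the text to the function. A result of None means nothing is mounted at /proc/sys itself, which is what one would typically expect outside a container.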
This tool was inspired by conversation on this runc issue.
This tool helped uncover several bugs in the Linux kernel's sysctl implementation, all of which have since been fixed by this tool's author:
Bug 1: Several nf_conntrack sysctls are global and writable by any netns
- Affected sysctls:
  - net.nf_conntrack_max
  - net.netfilter.nf_conntrack_max
  - net.netfilter.nf_conntrack_expect_max
- First broken: (long ago; since introduction of net namespaces)
- Fix: netfilter: conntrack: Make global sysctls readonly in non-init netns
- Fixed in Kernels:
  - 5.13+: v5.13-rc1 (2671fa4dc010)
  - 5.12: v5.12.2 (671c54ea8c7f)
  - 5.11: v5.11.19 (fbf85a34ce17)
  - 5.10: v5.10.35 (d3598eb3915c)
  - 5.4: v5.4.120 (baea536cf51f)
  - 4.19: v4.19.191 (9b288479f7a9)
  - 4.14: v4.14.233 (68122479c128)
  - 4.9: v4.9.269 (da50f56e826e)
Bug 2: tcp_allowed_congestion_control is global and writable by any netns
- Affected sysctls:
  - net.ipv4.tcp_allowed_congestion_control
- First broken: v5.7
- Fix: net: Make tcp_allowed_congestion_control readonly in non-init netns
- Fixed in Kernels:
  - 5.12+: v5.12-rc8 (97684f0970f6)
  - 5.11: v5.11.16 (1ccdf1bed140)
  - 5.10: v5.10.32 (35d7491e2f77)
  - 5.4: (n/a)
  - 4.19: (n/a)
  - 4.14: (n/a)
  - 4.4: (n/a)
Bug 3: Setting tcp_congestion_control can globally affect tcp_allowed_congestion_control
- Related sysctls:
  - net.ipv4.tcp_congestion_control (affects)
  - net.ipv4.tcp_allowed_congestion_control (affected)
- First broken: v4.15
- Fix: net: Only allow init netns to set default tcp cong to a restricted algo
- Fixed in Kernels:
  - 5.13+: v5.13-rc1 (8d432592f30f)
  - 5.12: v5.12.4 (e7d7bedd507b)
  - 5.11: v5.11.21 (efe1532a6e1a)
  - 5.10: v5.10.37 (6c1ea8bee75d)
  - 5.4: v5.4.119 (9884f745108f)
  - 4.19: v4.19.191 (992de06308d9)
  - 4.14: (n/a)
  - 4.9: (n/a)
Additionally, a safety check was added to the kernel to prevent certain classes of bugs from going unnoticed:
- 31c4d2f160eb: net: Ensure net namespace isolation of sysctls
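The "Fixed in Kernels" lists above can be turned into a rough check of whether a given kernel release string already carries a fix. The sketch below uses the Bug 1 versions; the names BUG1_FIXED, parse_release, and has_fix are hypothetical. Treat the result as a hint only, since distribution kernels may backport fixes without changing the upstream version number.

```python
import re

# Minimum fixed stable release per series for Bug 1, copied from the list above.
BUG1_FIXED = {
    (5, 12): 2,
    (5, 11): 19,
    (5, 10): 35,
    (5, 4): 120,
    (4, 19): 191,
    (4, 14): 233,
    (4, 9): 269,
}
BUG1_MAINLINE = (5, 13)  # fixed in v5.13-rc1 and every later series


def parse_release(release):
    """Extract (major, minor, patch) from a string like '5.10.35-generic'."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not m:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(m[1]), int(m[2]), int(m[3] or 0)


def has_fix(release, fixed=BUG1_FIXED, mainline=BUG1_MAINLINE):
    """Return True/False if the series is listed, None if it is unknown."""
    major, minor, patch = parse_release(release)
    if (major, minor) >= mainline:
        return True
    min_patch = fixed.get((major, minor))
    if min_patch is None:
        return None  # series not covered by the list
    return patch >= min_patch
```

For the running kernel, the release string is available from os.uname().release.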