New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bpf-restrict-fs: preserve cgroup_hash map during PID1 reexecution #32659
base: main
Are you sure you want to change the base?
Conversation
Important An -rc1 tag has been created and a release is being prepared, so please note that PRs introducing new features and APIs will be held back until the new version has been released. |
f22e47c
to
fb11209
Compare
src/core/bpf-restrict-fs.c
Outdated
assert(r <= 0); | ||
if (r < 0) | ||
return log_error_errno(r, "bpf-restrict-fs: Failed to set inner map fd: %m"); | ||
if (map_fd > 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fd are valid if >= 0
Hmm, the pre-existing code is indeed really broken about this. By passing over the map fd things certainly will be better, but this still means that over the reexec we'll unload the bpf program and thus not enforce strict fs anymore. Which really sucks I guess. I think there are multiple options we have here:
|
Maybe a compromise that is relatively simple to implement is this:
I think that should be reasonably easy to implement hence: just pin the two things, don#t pass over any fds, instead try to reopen the pinned programs when we initialize. And be careful with picking the names for the pinned programs so that we can eventually add a version suffix to it, if we want. See nsresourced for inspiration how the pinning works. It should be very simple I guess. Pinning is nice anyway, since it allows other programs/admins to inspect the program and the maps more easily. |
Just to make it clearer: with the patch applied the fs strict enforcement will be disabled only during the short period of time when PID1 will basically "manager_free() + exec() + manager_new()".
The BPF program could be simply updated by creating and loading a new instance of the BPF program, passing the map fd to the new instance and then unload the old BPF instance via the fd that was passed over by the previous instance of PID1. I'm not really seeing the point to check the version here since the BPF program is part of PID1 (at least on openSUSE) and we always assume that PID1 is re-executing only toward newer versions that must be compat with its previous state. Regarding making the BPF program available via bpffs that doesn't seem to be useful either: PID1 already has all the infra to serialize/deserialize its state so passing the prog/map fds over is trivial. Moreover the BPF program is not going to be shared with any other processes and PID1 is not supposed to exit at any time (unless it crashes). So we might want to pin it just in case PID1 crashes but I don't see the point to make the objects available via bpffs. |
When PID1 is re-executed, "cgroup_hash" map was destroyed and the "restrict_fs" eBPF program was restarted with a new and empty map, loosing all entries for the registered services (those with RestrictFileSystem= set): $ systemctl start test-restrict-fs.service $ bpftool map dump name cgroup_hash key: dd 0e 00 00 00 00 00 00 inner_map_id: 21 Found 1 element $ systemctl daemon-reexec $ bpftool map dump name cgroup_hash Found 0 elements The fix consists in keeping the fds of the eBPF prog/link/map opened in order to make sure that the eBPF keeps running properly while PID1 is being re-executed. When the new instance of PID1 is started, the old instance of the eBPF program is replaced by a new version and the old eBPF map containing all the registered services is reused by the new prog instance.
fb11209
to
2dc98f5
Compare
That's not necessarily "short", we cannot know how much time passes during such a cycle on an overloaded system.
BPF maps are closely related to BPF programs. they are the program's data structures after all. You wouldn't separate those from a C program either, I am pretty sure we shouldn't do that here either.
Don't let @DaanDeMeyer know that you don't care about version downgrades... I am actually fine if things break gracefully for downgrades. I am not on board if they fail in confusing and weird ways because we updated the bpf program but not the bpf map belonging to it.
dunno. I think we should pin all our programs, and some of the newer bpf programs we added already do this. It just makes things a lot nicer to work with since you can introspect things with bpftool. Hence, pinning is a win in any case. Also, pinning is trivial.
pinning means making things visible in bpffs. |
When PID1 is reexecuted, "cgroup_hash" map was destroyed and the "restrict_fs" eBPF program was restarted with a new and empty map, loosing all entries for previously registered services with RestrictFileSystem= set:
$ systemctl start test-restrict-fs.service
$ bpftool map dump name cgroup_hash
key: dd 0e 00 00 00 00 00 00 inner_map_id: 21
Found 1 element
$ systemctl daemon-reexec
$ bpftool map dump name cgroup_hash
Found 0 elements
This patch fixes this issue by preserving the eBPF map and by reusing it when "restrict_fs" is being initialized again during PID reexec.