Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpf-restrict-fs: preserve cgroup_hash map during PID1 reexecution #32659

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

fbuihuu
Copy link
Contributor

@fbuihuu fbuihuu commented May 6, 2024

When PID1 is reexecuted, "cgroup_hash" map was destroyed and the "restrict_fs" eBPF program was restarted with a new and empty map, loosing all entries for previously registered services with RestrictFileSystem= set:

$ systemctl start test-restrict-fs.service

$ bpftool map dump name cgroup_hash
key: dd 0e 00 00 00 00 00 00 inner_map_id: 21
Found 1 element

$ systemctl daemon-reexec

$ bpftool map dump name cgroup_hash
Found 0 elements

This patch fixes this issue by preserving the eBPF map and by reusing it when "restrict_fs" is being initialized again during PID reexec.

@github-actions github-actions bot added util-lib please-review PR is ready for (re-)review by a maintainer labels May 6, 2024
Copy link

github-actions bot commented May 6, 2024

Important

An -rc1 tag has been created and a release is being prepared, so please note that PRs introducing new features and APIs will be held back until the new version has been released.

src/core/manager-serialize.c Outdated Show resolved Hide resolved
src/core/manager.c Outdated Show resolved Hide resolved
@fbuihuu fbuihuu force-pushed the fix-restrict-fs-on-reexec branch from f22e47c to fb11209 Compare May 6, 2024 12:57
assert(r <= 0);
if (r < 0)
return log_error_errno(r, "bpf-restrict-fs: Failed to set inner map fd: %m");
if (map_fd > 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fd are valid if >= 0

@poettering
Copy link
Member

Hmm, the pre-existing code is indeed really broken about this.

By passing over the map fd things certainly will be better, but this still means that over the reexec we'll unload the bpf program and thus not enforce strict fs anymore. Which really sucks I guess.

I think there are multiple options we have here:

  1. Pass over both the bpf map fd and the bpf program "link" fd, so that the program is always pinned. This means a reexec would not update the BPF programs anymore. They have seldomly changed, so maybe this is OK, but if we do this maybe send along some version id just in case, and refuse to reuse too old maps/programs.

  2. Use the bpf "pinning" feature for this (we already use that in systemd-nsresourced). It means a bpf program/map can be marked as "pinned" in kernel, so that they stay around even after the last fd is dropped. these objects are made visible via bpffs then, and that's also how one can reacquire an fd (which libbpf hides if used properly). This creates the same versioning probablem as above. We could simply include the version id in the pinning name maybe?

  3. Neither pin, nor pass fds over, and instead just rebuild the full map after coming back, accepting the window where no bpf program runs.

  4. Be smart about version updates: i.e. pin or pass fds for old map/program. but after reexec instead of reusing them, create everything anew, rebuild a new map. Then install it, and finally close the old fds, thus deinstalling the old stuff, leaving only the old one in place.

@poettering
Copy link
Member

Maybe a compromise that is relatively simple to implement is this:

  1. Pin both program and map for now.

  2. Consider the name you pick a version identifier for program and map. i.e. when the bpf program is changed these names must be changed.

  3. Don't bother with any schemes for BPF program updates just now, since so far we never came into the situation, and we can think about it then. i.e. don't bother with loading the new bpf program, and rebuilding the map for now. Instead leave that until we actually change the BPF program.

I think that should be reasonably easy to implement hence: just pin the two things, don#t pass over any fds, instead try to reopen the pinned programs when we initialize. And be careful with picking the names for the pinned programs so that we can eventually add a version suffix to it, if we want.

See nsresourced for inspiration how the pinning works. It should be very simple I guess.

Pinning is nice anyway, since it allows other programs/admins to inspect the program and the maps more easily.

@poettering poettering added reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks bpf and removed please-review PR is ready for (re-)review by a maintainer labels May 6, 2024
@fbuihuu
Copy link
Contributor Author

fbuihuu commented May 13, 2024

Hmm, the pre-existing code is indeed really broken about this.

By passing over the map fd things certainly will be better, but this still means that over the reexec we'll unload the bpf program and thus not enforce strict fs anymore. Which really sucks I guess.

Just to make it clearer: with the patch applied the fs strict enforcement will be disabled only during the short period of time when PID1 will basically "manager_free() + exec() + manager_new()".

  1. Pass over both the bpf map fd and the bpf program "link" fd, so that the program is always pinned. This means a reexec would not update the BPF programs anymore. They have seldomly changed, so maybe this is OK, but if we do this maybe send along some version id just in case, and refuse to reuse too old maps/programs.

The BPF program could be simply updated by creating and loading a new instance of the BPF program, passing the map fd to the new instance and then unload the old BPF instance via the fd that was passed over by the previous instance of PID1.

I'm not really seeing the point to check the version here since the BPF program is part of PID1 (at least on openSUSE) and we always assume that PID1 is re-executing only toward newer versions that must be compat with its previous state.

Regarding making the BPF program available via bpffs that doesn't seem to be useful either: PID1 already has all the infra to serialize/deserialize its state so passing the prog/map fds over is trivial. Moreover the BPF program is not going to be shared with any other processes and PID1 is not supposed to exit at any time (unless it crashes).

So we might want to pin it just in case PID1 crashes but I don't see the point to make the objects available via bpffs.

When PID1 is re-executed, "cgroup_hash" map was destroyed and the "restrict_fs"
eBPF program was restarted with a new and empty map, loosing all entries for
the registered services (those with RestrictFileSystem= set):

   $ systemctl start test-restrict-fs.service

   $ bpftool map dump name cgroup_hash
   key: dd 0e 00 00 00 00 00 00  inner_map_id: 21
   Found 1 element

   $ systemctl daemon-reexec

   $ bpftool map dump name cgroup_hash
   Found 0 elements

The fix consists in keeping the fds of the eBPF prog/link/map opened in order
to make sure that the eBPF keeps running properly while PID1 is being
re-executed. When the new instance of PID1 is started, the old instance of the
eBPF program is replaced by a new version and the old eBPF map containing all
the registered services is reused by the new prog instance.
@fbuihuu fbuihuu force-pushed the fix-restrict-fs-on-reexec branch from fb11209 to 2dc98f5 Compare May 14, 2024 08:08
@github-actions github-actions bot added please-review PR is ready for (re-)review by a maintainer and removed reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks labels May 14, 2024
@poettering
Copy link
Member

Just to make it clearer: with the patch applied the fs strict enforcement will be disabled only during the short period of time when PID1 will basically "manager_free() + exec() + manager_new()".

That's not necessarily "short", we cannot know how much time passes during such a cycle on an overloaded system.

  1. Pass over both the bpf map fd and the bpf program "link" fd, so that the program is always pinned. This means a reexec would not update the BPF programs anymore. They have seldomly changed, so maybe this is OK, but if we do this maybe send along some version id just in case, and refuse to reuse too old maps/programs.

The BPF program could be simply updated by creating and loading a new instance of the BPF program, passing the map fd to the new instance and then unload the old BPF instance via the fd that was passed over by the previous instance of PID1.

BPF maps are closely related to BPF programs. they are the program's data structures after all. You wouldn't separate those from a C program either, I am pretty sure we shouldn't do that here either.

I'm not really seeing the point to check the version here since the BPF program is part of PID1 (at least on openSUSE) and we always assume that PID1 is re-executing only toward newer versions that must be compat with its previous state.

Don't let @DaanDeMeyer know that you don't care about version downgrades...

I am actually fine if things break gracefully for downgrades. I am not on board if they fail in confusing and weird ways because we updated the bpf program but not the bpf map belonging to it.

Regarding making the BPF program available via bpffs that doesn't seem to be useful either: PID1 already has all the infra to serialize/deserialize its state so passing the prog/map fds over is trivial. Moreover the BPF program is not going to be shared with any other processes and PID1 is not supposed to exit at any time (unless it crashes).

dunno. I think we should pin all our programs, and some of the newer bpf programs we added already do this. It just makes things a lot nicer to work with since you can introspect things with bpftool. Hence, pinning is a win in any case.

Also, pinning is trivial.

So we might want to pin it just in case PID1 crashes but I don't see the point to make the objects available via bpffs.

pinning means making things visible in bpffs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bpf please-review PR is ready for (re-)review by a maintainer util-lib
Development

Successfully merging this pull request may close these issues.

None yet

3 participants