bpf-restrict-fs: preserve cgroup_hash map during PID1 reexecution #32659

fbuihuu · 2024-05-06T08:53:58Z

When PID1 is reexecuted, "cgroup_hash" map was destroyed and the "restrict_fs" eBPF program was restarted with a new and empty map, loosing all entries for previously registered services with RestrictFileSystem= set:

$ systemctl start test-restrict-fs.service

$ bpftool map dump name cgroup_hash
key: dd 0e 00 00 00 00 00 00 inner_map_id: 21
Found 1 element

$ systemctl daemon-reexec

$ bpftool map dump name cgroup_hash
Found 0 elements

This patch fixes this issue by preserving the eBPF map and by reusing it when "restrict_fs" is being initialized again during PID reexec.

github-actions · 2024-05-06T08:57:04Z

Important

An -rc1 tag has been created and a release is being prepared, so please note that PRs introducing new features and APIs will be held back until the new version has been released.

src/core/manager-serialize.c

src/core/manager.c

poettering · 2024-05-06T17:04:45Z

src/core/bpf-restrict-fs.c

- assert(r <= 0);
- if (r < 0)
- return log_error_errno(r, "bpf-restrict-fs: Failed to set inner map fd: %m");
+ if (map_fd > 0) {


fd are valid if >= 0

poettering · 2024-05-06T17:19:04Z

Hmm, the pre-existing code is indeed really broken about this.

By passing over the map fd things certainly will be better, but this still means that over the reexec we'll unload the bpf program and thus not enforce strict fs anymore. Which really sucks I guess.

I think there are multiple options we have here:

Pass over both the bpf map fd and the bpf program "link" fd, so that the program is always pinned. This means a reexec would not update the BPF programs anymore. They have seldomly changed, so maybe this is OK, but if we do this maybe send along some version id just in case, and refuse to reuse too old maps/programs.
Use the bpf "pinning" feature for this (we already use that in systemd-nsresourced). It means a bpf program/map can be marked as "pinned" in kernel, so that they stay around even after the last fd is dropped. these objects are made visible via bpffs then, and that's also how one can reacquire an fd (which libbpf hides if used properly). This creates the same versioning probablem as above. We could simply include the version id in the pinning name maybe?
Neither pin, nor pass fds over, and instead just rebuild the full map after coming back, accepting the window where no bpf program runs.
Be smart about version updates: i.e. pin or pass fds for old map/program. but after reexec instead of reusing them, create everything anew, rebuild a new map. Then install it, and finally close the old fds, thus deinstalling the old stuff, leaving only the old one in place.

poettering · 2024-05-06T17:23:29Z

Maybe a compromise that is relatively simple to implement is this:

Pin both program and map for now.
Consider the name you pick a version identifier for program and map. i.e. when the bpf program is changed these names must be changed.
Don't bother with any schemes for BPF program updates just now, since so far we never came into the situation, and we can think about it then. i.e. don't bother with loading the new bpf program, and rebuilding the map for now. Instead leave that until we actually change the BPF program.

I think that should be reasonably easy to implement hence: just pin the two things, don#t pass over any fds, instead try to reopen the pinned programs when we initialize. And be careful with picking the names for the pinned programs so that we can eventually add a version suffix to it, if we want.

See nsresourced for inspiration how the pinning works. It should be very simple I guess.

Pinning is nice anyway, since it allows other programs/admins to inspect the program and the maps more easily.

fbuihuu · 2024-05-13T08:33:41Z

Hmm, the pre-existing code is indeed really broken about this.

By passing over the map fd things certainly will be better, but this still means that over the reexec we'll unload the bpf program and thus not enforce strict fs anymore. Which really sucks I guess.

Just to make it clearer: with the patch applied the fs strict enforcement will be disabled only during the short period of time when PID1 will basically "manager_free() + exec() + manager_new()".

Pass over both the bpf map fd and the bpf program "link" fd, so that the program is always pinned. This means a reexec would not update the BPF programs anymore. They have seldomly changed, so maybe this is OK, but if we do this maybe send along some version id just in case, and refuse to reuse too old maps/programs.

The BPF program could be simply updated by creating and loading a new instance of the BPF program, passing the map fd to the new instance and then unload the old BPF instance via the fd that was passed over by the previous instance of PID1.

I'm not really seeing the point to check the version here since the BPF program is part of PID1 (at least on openSUSE) and we always assume that PID1 is re-executing only toward newer versions that must be compat with its previous state.

Regarding making the BPF program available via bpffs that doesn't seem to be useful either: PID1 already has all the infra to serialize/deserialize its state so passing the prog/map fds over is trivial. Moreover the BPF program is not going to be shared with any other processes and PID1 is not supposed to exit at any time (unless it crashes).

So we might want to pin it just in case PID1 crashes but I don't see the point to make the objects available via bpffs.

When PID1 is re-executed, "cgroup_hash" map was destroyed and the "restrict_fs" eBPF program was restarted with a new and empty map, loosing all entries for the registered services (those with RestrictFileSystem= set): $ systemctl start test-restrict-fs.service $ bpftool map dump name cgroup_hash key: dd 0e 00 00 00 00 00 00 inner_map_id: 21 Found 1 element $ systemctl daemon-reexec $ bpftool map dump name cgroup_hash Found 0 elements The fix consists in keeping the fds of the eBPF prog/link/map opened in order to make sure that the eBPF keeps running properly while PID1 is being re-executed. When the new instance of PID1 is started, the old instance of the eBPF program is replaced by a new version and the old eBPF map containing all the registered services is reused by the new prog instance.

poettering · 2024-05-22T21:32:57Z

Just to make it clearer: with the patch applied the fs strict enforcement will be disabled only during the short period of time when PID1 will basically "manager_free() + exec() + manager_new()".

That's not necessarily "short", we cannot know how much time passes during such a cycle on an overloaded system.

Pass over both the bpf map fd and the bpf program "link" fd, so that the program is always pinned. This means a reexec would not update the BPF programs anymore. They have seldomly changed, so maybe this is OK, but if we do this maybe send along some version id just in case, and refuse to reuse too old maps/programs.

The BPF program could be simply updated by creating and loading a new instance of the BPF program, passing the map fd to the new instance and then unload the old BPF instance via the fd that was passed over by the previous instance of PID1.

BPF maps are closely related to BPF programs. they are the program's data structures after all. You wouldn't separate those from a C program either, I am pretty sure we shouldn't do that here either.

I'm not really seeing the point to check the version here since the BPF program is part of PID1 (at least on openSUSE) and we always assume that PID1 is re-executing only toward newer versions that must be compat with its previous state.

Don't let @DaanDeMeyer know that you don't care about version downgrades...

I am actually fine if things break gracefully for downgrades. I am not on board if they fail in confusing and weird ways because we updated the bpf program but not the bpf map belonging to it.

Regarding making the BPF program available via bpffs that doesn't seem to be useful either: PID1 already has all the infra to serialize/deserialize its state so passing the prog/map fds over is trivial. Moreover the BPF program is not going to be shared with any other processes and PID1 is not supposed to exit at any time (unless it crashes).

dunno. I think we should pin all our programs, and some of the newer bpf programs we added already do this. It just makes things a lot nicer to work with since you can introspect things with bpftool. Hence, pinning is a win in any case.

Also, pinning is trivial.

So we might want to pin it just in case PID1 crashes but I don't see the point to make the objects available via bpffs.

pinning means making things visible in bpffs.

github-actions bot added util-lib please-review PR is ready for (re-)review by a maintainer labels May 6, 2024

YHNdnzj reviewed May 6, 2024

View reviewed changes

src/core/manager-serialize.c Outdated Show resolved Hide resolved

src/core/manager.c Outdated Show resolved Hide resolved

fbuihuu force-pushed the fix-restrict-fs-on-reexec branch from f22e47c to fb11209 Compare May 6, 2024 12:57

poettering requested changes May 6, 2024

View reviewed changes

poettering added reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks bpf and removed please-review PR is ready for (re-)review by a maintainer labels May 6, 2024

fbuihuu force-pushed the fix-restrict-fs-on-reexec branch from fb11209 to 2dc98f5 Compare May 14, 2024 08:08

github-actions bot added please-review PR is ready for (re-)review by a maintainer and removed reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks labels May 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bpf-restrict-fs: preserve cgroup_hash map during PID1 reexecution #32659

bpf-restrict-fs: preserve cgroup_hash map during PID1 reexecution #32659

fbuihuu commented May 6, 2024 •

edited by github-actions bot

github-actions bot commented May 6, 2024

poettering May 6, 2024

poettering commented May 6, 2024

poettering commented May 6, 2024

fbuihuu commented May 13, 2024

poettering commented May 22, 2024

bpf-restrict-fs: preserve cgroup_hash map during PID1 reexecution #32659

Are you sure you want to change the base?

bpf-restrict-fs: preserve cgroup_hash map during PID1 reexecution #32659

Conversation

fbuihuu commented May 6, 2024 • edited by github-actions bot

github-actions bot commented May 6, 2024

poettering May 6, 2024

Choose a reason for hiding this comment

poettering commented May 6, 2024

poettering commented May 6, 2024

fbuihuu commented May 13, 2024

poettering commented May 22, 2024

fbuihuu commented May 6, 2024 •

edited by github-actions bot