Be careful about freeing callback trampolines #64

Merged
allisonkarlitskaya merged 3 commits into main from trampoline-rescue on Feb 2, 2024

Conversation

allisonkarlitskaya (Owner)

Our approach to handling Source and Slot objects is fairly clever: we
tie the call trampoline and closure to the same object that holds a
reference to the source object on the C side. When we are about to
`__del__()` that object, we unref the source, preventing any further
events from being dispatched. In this way, we can be completely sure
that systemd will never call our trampoline after it's been freed.

Unfortunately, this isn't good enough: we have a lot of cases where we
free a Source while it is currently being dispatched. Until now we've
never noticed a problem, but Cockpit recently added a stress test for
inotify (`test_fsinfo_watch_identity_changes`) which dispatches thousands
of events and runs long enough that garbage collection gets invoked,
freeing trampolines while they are still running. Python does not
hold a reference to the data, and this causes crashes on some
architectures.

Let's give Source and Slot a common base class (Trampoline) that models
their common behaviour. This helper class also changes the `__del__()`
behaviour: in case some external caller has requested deferral of the
destruction of trampolines, we add them to a list just before we get
deleted, preventing the FFI wrapper from being destroyed along with us.

We know that the problem described above is only a problem if we're
dispatching from systemd's event loop, so set up deferral on entry to
the loop and drop the deferred objects on exit.

Closes #63
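
To make the mechanism concrete, here is a minimal sketch of the deferral idea (not the exact systemd_ctypes code; the `cfunctype` parameter and the unref step are illustrative assumptions):

```python
from typing import Callable, ClassVar, List, Optional


class Trampoline:
    # When the event loop sets this (on the class), __del__ parks the ctypes
    # callback object here instead of letting it be freed while C code might
    # still be executing it.
    deferred: ClassVar[Optional[List[object]]] = None

    def __init__(self, cfunctype: type, callback: Callable[..., int]) -> None:
        # The ctypes function-pointer wrapper must stay alive for as long as
        # the C side might invoke it.
        self.trampoline = cfunctype(callback)

    def __del__(self) -> None:
        if Trampoline.deferred is not None:
            # We might be inside a dispatch right now: keep the FFI wrapper
            # alive until the event loop finishes the current iteration.
            Trampoline.deferred.append(self.trampoline)
        # ... unref the underlying sd_event_source / sd_bus_slot here ...
```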

Comment on lines 49 to 50
if self.deferred is not None:
    self.deferred.append(self.trampoline)

martinpitt (Collaborator)

This makes me a bit nervous -- `deferred` is a class variable and is also set as such (on the class), but `__del__` is an instance method. Can it happen that event sources or bus slots get freed at the same time in parallel threads? Does Python do the locking here, or do we not support ctypes from multiple threads at all?

Also, could this be `Trampoline.deferred`, to point out that this is a class variable? Accessing it via `self` feels misleading (even though it probably does the same thing).
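
(For illustration only, not project code, here is the distinction being pointed at: reading through `self` falls back to the class attribute, but assigning through `self` would shadow it with an instance attribute.)

```python
class Demo:
    deferred = None


d = Demo()
print(d.deferred)       # None -- lookup falls back to the class attribute
Demo.deferred = []
print(d.deferred)       # []   -- still the (updated) class attribute
d.deferred = ["oops"]   # assignment via the instance creates a new attribute
print(Demo.deferred)    # []   -- the class attribute remains unchanged
```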

allisonkarlitskaya (Owner, Author)

It could definitely be accessed via the class.

allisonkarlitskaya (Owner, Author)

Done.

martinpitt (Collaborator)

I forgot about the GIL! So parallel access should be safe.

Comment on lines +96 to +97
# We can be sure we're not dispatching callbacks anymore
libsystemd.Trampoline.deferred = None

martinpitt (Collaborator)

Ah, so that's the final cleanup of the deferred trampolines. Is there only ever one instance of Selector? This doesn't feel correct if there could be multiple ones.

allisonkarlitskaya (Owner, Author)

Indeed. This is why I'm not super happy about this fix. If we tried to run independent mainloops in separate threads, this would indeed be incorrect. Trying to do something "more correct" here is hard, though, and we never have anything but the default loop running in the main thread, so ...

martinpitt (Collaborator)

That's fine. This could do with an assertion that there's only a single instance, but good enough now!
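
For context, a rough sketch of the entry/exit pattern being discussed, assuming a single Selector driving one sd-event loop and the Trampoline sketch above (the method and attribute names here are illustrative, not the exact API):

```python
class Selector:
    def __init__(self, event) -> None:
        self.event = event
        # From here on, trampolines that get garbage-collected are parked on
        # the class-wide list instead of being freed, since their callbacks
        # may be mid-dispatch.
        Trampoline.deferred = []

    def dispatch(self) -> None:
        # Dispatching may drop the last Python reference to a Source or Slot;
        # its trampoline then lands on Trampoline.deferred rather than being
        # destroyed underneath the running callback.
        self.event.run(0)

    def close(self) -> None:
        # We can be sure we're not dispatching callbacks anymore, so the
        # parked trampolines (and any future ones) may be freed normally.
        Trampoline.deferred = None
```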

allisonkarlitskaya merged commit 833f5ce into main on Feb 2, 2024
24 checks passed
martinpitt deleted the trampoline-rescue branch on February 2, 2024 at 08:42
Development

Successfully merging this pull request may close these issues: crash in library wrapper
2 participants