Be careful about freeing callback trampolines #64

Merged
allisonkarlitskaya merged 3 commits into main from trampoline-rescue on Feb 2, 2024

Conversation

allisonkarlitskaya (Owner)

Our approach to handling Source and Slot objects is fairly clever: we
tie the call trampoline and closure to the same object that holds a
reference to the source object on the C side. When we are about to
`__del__()` that object, we unref the source, preventing any further
events from being dispatched. In this way, we can be completely sure
that systemd will never call our trampoline after it's been freed.

Unfortunately, this isn't good enough: we have a lot of cases where we
free a Source while it is currently being dispatched. Until now we've
never noticed a problem, but Cockpit recently added a stress test for
inotify (`test_fsinfo_watch_identity_changes`) which dispatches thousands
of events and runs long enough that garbage collection gets invoked,
freeing trampolines while they are still running. Python does not
hold a reference to the data, and this causes crashes on some
architectures.

Let's give Source and Slot a common base class (Trampoline) that models
their common behaviour. This helper class also changes the `__del__()`
behaviour: in case some external caller has requested deferral of the
destruction of trampolines, we add them to a list just before we get
deleted, preventing the FFI wrapper from being destroyed along with us.

We know that the problem described above is only a problem if we're
dispatching from systemd's event loop, so set up deferral on entry to
the loop and drop the deferred objects on exit.

Closes #63
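
To make the mechanism concrete, here is a minimal sketch of the deferral idea (not the exact systemd_ctypes code; the `cfunctype` parameter and the unref step are illustrative assumptions):

```python
from typing import Callable, ClassVar, List, Optional


class Trampoline:
    # When the event loop sets this (on the class), __del__ parks the ctypes
    # callback object here instead of letting it be freed while C code might
    # still be executing it.
    deferred: ClassVar[Optional[List[object]]] = None

    def __init__(self, cfunctype: type, callback: Callable[..., int]) -> None:
        # The ctypes function-pointer wrapper must stay alive for as long as
        # the C side might invoke it.
        self.trampoline = cfunctype(callback)

    def __del__(self) -> None:
        if Trampoline.deferred is not None:
            # We might be inside a dispatch right now: keep the FFI wrapper
            # alive until the event loop finishes the current iteration.
            Trampoline.deferred.append(self.trampoline)
        # ... unref the underlying sd_event_source / sd_bus_slot here ...
```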

Comment on lines 49 to 50
if self.deferred is not None:
    self.deferred.append(self.trampoline)

martinpitt (Collaborator)

This makes me a bit nervous -- `deferred` is a class variable and is also set as such (on the class), but `__del__` is an instance method. Can it happen that event sources or bus slots get freed at the same time in parallel threads? Does Python do the locking here, or do we not support ctypes from multiple threads at all?

Also, could this be `Trampoline.deferred`, to point out that this is a class variable? Accessing it via `self` feels misleading (even though it probably does the same thing).
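
(For illustration only, not project code, here is the distinction being pointed at: reading through `self` falls back to the class attribute, but assigning through `self` would shadow it with an instance attribute.)

```python
class Demo:
    deferred = None


d = Demo()
print(d.deferred)       # None -- lookup falls back to the class attribute
Demo.deferred = []
print(d.deferred)       # []   -- still the (updated) class attribute
d.deferred = ["oops"]   # assignment via the instance creates a new attribute
print(Demo.deferred)    # []   -- the class attribute remains unchanged
```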

allisonkarlitskaya (Owner, Author)

It could definitely be accessed via the class.

allisonkarlitskaya (Owner, Author)

Done.

martinpitt (Collaborator)

I forgot about the GIL! So parallel access should be safe.

Comment on lines +96 to +97
# We can be sure we're not dispatching callbacks anymore
libsystemd.Trampoline.deferred = None

martinpitt (Collaborator)

Ah, so that's the final cleanup of the deferred trampolines. Is there only ever one instance of Selector? This doesn't feel correct if there could be multiple ones.

allisonkarlitskaya (Owner, Author)

Indeed. This is why I'm not super happy about this fix. If we tried to run independent mainloops in separate threads, this would indeed be incorrect. Trying to do something "more correct" here is hard, though, and we never have anything but the default loop running in the main thread, so ...

martinpitt (Collaborator)

That's fine. This could do with an assertion that there's only a single instance, but good enough now!
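
For context, a rough sketch of the entry/exit pattern being discussed, assuming a single Selector driving one sd-event loop and the Trampoline sketch above (the method and attribute names here are illustrative, not the exact API):

```python
class Selector:
    def __init__(self, event) -> None:
        self.event = event
        # From here on, trampolines that get garbage-collected are parked on
        # the class-wide list instead of being freed, since their callbacks
        # may be mid-dispatch.
        Trampoline.deferred = []

    def dispatch(self) -> None:
        # Dispatching may drop the last Python reference to a Source or Slot;
        # its trampoline then lands on Trampoline.deferred rather than being
        # destroyed underneath the running callback.
        self.event.run(0)

    def close(self) -> None:
        # We can be sure we're not dispatching callbacks anymore, so the
        # parked trampolines (and any future ones) may be freed normally.
        Trampoline.deferred = None
```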

allisonkarlitskaya merged commit 833f5ce into main on Feb 2, 2024
24 checks passed
martinpitt deleted the trampoline-rescue branch on February 2, 2024 at 08:42
Development

Successfully merging this pull request may close these issues: crash in library wrapper
2 participants