Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid deadlock after SIGUSR1 #709

Merged
merged 4 commits into from
Sep 20, 2024
Merged

Avoid deadlock after SIGUSR1 #709

merged 4 commits into from
Sep 20, 2024

Conversation

DavidePrincipi
Copy link
Member

Change the behavior of SIGUSR1 handers:

  • While Agent waits for actions to complete, it continues to listen and process events.
  • While Agent waits for actions to complete, it continues to listen and process the task queue. This is important to avoid a deadlock if a running action submits a subtask back to the Agent and waits for its completion.

The PR also embeds access to a shared map inside Mutex Lock/Unlock calls, and fixes some flaky test suites.

Refs NethServer/dev#7016

Actions can run for a long time, so wait for their completion before stopping
the event handlers.
Listen for new tasks until workers finish. A subtask created by a parent worker
needs to be processed to avoid entering a deadlock state.

When SIGUSR1 is received, tasks are still pulled from the Redis queue until all
running workers finish. At that point, the Redis BRPOP goroutine
is canceled, and we can exit the actions loop.

This commit also protects the shared taskCancelFunctions map with a Mutex
because it is accessed by many goroutines simultaneously.
@DavidePrincipi
Copy link
Member Author

🤖 ChatGPT > Here’s a summary of the changes introduced by the patch:

Patch 1/4: Stop Event Handlers Later

  • Changes:
    • Introduces separate context variables for actions and events.
    • Changes shutdown handling to wait for action completion before stopping event handlers.
    • Ensures proper synchronization by waiting for completion of action and event channels.

Patch 2/4: Fix Shutdown Tests

  • Changes:
    • Updates .gitignore to include tstate.
    • Upgrades Docker Python image from python:3.8-bullseye to python:3.11.
    • Updates robotframework version in requirements.txt from 6.1.1 to 7.1.
    • Modifies shutdown test cases to ensure commands are properly processed and agents are correctly monitored during shutdown.

Patch 3/4: Fix Flaky Task Cancellation Test

  • Changes:
    • Adjusts sleep duration in 20step2 script to avoid indefinite sleeping.
    • Updates 50__cancellation.robot to handle task cancellation more reliably and to verify task status and logs correctly.

Patch 4/4: Avoid Deadlock After SIGUSR1

  • Changes:
    • Modifies task handling to prevent deadlocks by ensuring tasks are fully processed before exiting.
    • Introduces a mutex to safely handle the taskCancelFunctions map.
    • Updates listenActionsAsync to properly handle SIGUSR1 and synchronize task processing with new context handling.

These changes enhance the stability and reliability of task processing and shutdown procedures in the agent.

@DavidePrincipi DavidePrincipi merged commit f8f32f5 into main Sep 20, 2024
3 checks passed
@DavidePrincipi DavidePrincipi deleted the bug-7016 branch September 20, 2024 07:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants