Avoid deadlock after SIGUSR1 #709

DavidePrincipi · 2024-09-19T16:18:12Z

Change the behavior of SIGUSR1 handlers:

- While the Agent waits for actions to complete, it continues to listen for and process events.
- While the Agent waits for actions to complete, it continues to listen to and process the task queue. This avoids a deadlock when a running action submits a subtask back to the Agent and waits for its completion.

The PR also embeds access to a shared map inside Mutex Lock/Unlock calls, and fixes some flaky test suites.

Actions can run for a long time, so wait for their completion before stopping the event handlers.

Listen for new tasks until workers finish. A subtask created by a parent worker needs to be processed to avoid entering a deadlock state. When SIGUSR1 is received, tasks are still pulled from the Redis queue until all running workers finish. At that point, the Redis BRPOP goroutine is canceled, and we can exit the actions loop. This commit also protects the shared taskCancelFunctions map with a Mutex because it is accessed by many goroutines simultaneously.
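The idea in this commit message can be sketched as a small select loop. This is an assumption-laden illustration, not the Agent's code: a Go channel stands in for the Redis queue and the BRPOP goroutine, closing `shutdown` stands in for SIGUSR1, and `actionsLoop` is a hypothetical name.

```go
package main

import "fmt"

// actionsLoop is a hypothetical stand-in for the Agent's actions loop.
// It starts one worker goroutine per task received from queue. After
// shutdown is closed it keeps accepting tasks until no worker is running,
// then returns the number of tasks that were processed.
func actionsLoop(queue <-chan func(), shutdown <-chan struct{}) int {
	done := make(chan struct{}) // one message per finished worker
	running, processed := 0, 0
	stopping := false
	for {
		if stopping && running == 0 {
			return processed // all workers finished: stop pulling tasks
		}
		select {
		case <-shutdown:
			stopping = true
			shutdown = nil // a nil channel never becomes ready again
		case task := <-queue:
			// Still accepted while stopping, so subtasks can complete.
			running++
			go func() {
				task()
				done <- struct{}{}
			}()
		case <-done:
			running--
			processed++
		}
	}
}

func main() {
	queue := make(chan func())
	shutdown := make(chan struct{})
	parentStarted := make(chan struct{})

	// The parent task submits a subtask after shutdown was requested
	// and blocks until that subtask completes.
	parent := func() {
		close(parentStarted)
		<-shutdown
		subDone := make(chan struct{})
		queue <- func() { close(subDone) }
		<-subDone
	}

	go func() { queue <- parent }()
	go func() {
		<-parentStarted
		close(shutdown) // simulated SIGUSR1 while the parent is running
	}()

	fmt.Println("tasks processed:", actionsLoop(queue, shutdown))
}
```

If the loop stopped reading from the queue as soon as shutdown was requested, the parent would block forever on its subtask, which is the deadlock this PR describes.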

DavidePrincipi · 2024-09-19T16:27:35Z

🤖 ChatGPT > Here’s a summary of the changes introduced by the patch:

Patch 1/4: Stop Event Handlers Later

Changes:
- Introduces separate context variables for actions and events.
- Changes shutdown handling to wait for action completion before stopping event handlers.
- Ensures proper synchronization by waiting for completion of action and event channels.
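The ordering described in this list can be sketched as follows. This is a simplified illustration under assumptions: the variable names, the `shutdownOrder` function and the sleep that simulates a long-running action are all hypothetical, and the real Agent code may wire its contexts differently.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// shutdownOrder is a hypothetical sketch: it uses one context for actions
// and one for events, and stops the event handlers only after the actions
// side has reported completion. It returns the observed order of steps.
func shutdownOrder() []string {
	actionsCtx, cancelActions := context.WithCancel(context.Background())
	eventsCtx, cancelEvents := context.WithCancel(context.Background())

	var order []string
	actionsDone := make(chan struct{})
	eventsDone := make(chan struct{})

	// Stand-in for the actions side: after shutdown is requested, a
	// running action still needs some time to complete.
	go func() {
		defer close(actionsDone)
		<-actionsCtx.Done()
		time.Sleep(10 * time.Millisecond) // simulated long-running action
		order = append(order, "actions finished")
	}()

	// Stand-in for the event handlers: they run until their own context
	// is canceled.
	go func() {
		defer close(eventsDone)
		<-eventsCtx.Done()
		order = append(order, "events stopped")
	}()

	cancelActions() // shutdown requested
	<-actionsDone   // wait for action completion first
	cancelEvents()  // only now stop the event handlers
	<-eventsDone
	return order
}

func main() {
	fmt.Println(shutdownOrder())
}
```

With a single shared context, both sides would be canceled at once and the event handlers could stop while actions are still running.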

Patch 2/4: Fix Shutdown Tests

Changes:
- Updates .gitignore to include tstate.
- Upgrades Docker Python image from python:3.8-bullseye to python:3.11.
- Updates robotframework version in requirements.txt from 6.1.1 to 7.1.
- Modifies shutdown test cases to ensure commands are properly processed and agents are correctly monitored during shutdown.

Patch 3/4: Fix Flaky Task Cancellation Test

Changes:
- Adjusts sleep duration in 20step2 script to avoid indefinite sleeping.
- Updates 50__cancellation.robot to handle task cancellation more reliably and to verify task status and logs correctly.

Patch 4/4: Avoid Deadlock After SIGUSR1

Changes:
- Prevents deadlocks by continuing to pull tasks from the Redis queue after SIGUSR1 until all running workers finish, so that a subtask submitted by a running action is still processed.
- Introduces a mutex to safely handle the taskCancelFunctions map.
- Updates listenActionsAsync to properly handle SIGUSR1 and synchronize task processing with new context handling.

These changes enhance the stability and reliability of task processing and shutdown procedures in the agent.

DavidePrincipi added 4 commits September 19, 2024 18:06

Stop event handlers later

100a66d

Actions can run for a long time, so wait for their completion before stopping the event handlers.

Fix shutdown tests

8e2ced4

Fix flaky task cancellation test

c8a885c

DavidePrincipi self-assigned this Sep 19, 2024

DavidePrincipi requested a review from Tbaile September 19, 2024 16:23

Tbaile approved these changes Sep 20, 2024

DavidePrincipi merged commit f8f32f5 into main Sep 20, 2024
3 checks passed

DavidePrincipi deleted the bug-7016 branch September 20, 2024 07:25
