
Support performing an action when a job is aborted #838

Open
wallart1 opened this issue Dec 19, 2024 · 5 comments

Comments

@wallart1

wallart1 commented Dec 19, 2024

Summary

Support performing an action when a job is aborted.

Steps to reproduce the problem

When a job is aborted, there are cases where actions beyond simply terminating tasks need to be taken. I ran into this with a job that runs a ZFS scrub with the -w option, which waits for the asynchronous scrub to finish before returning. When I abort the job, the scrub command terminates, but the scrub itself keeps running. To actually stop the scrub, one must issue an additional command just for this purpose.

There may also be jobs that, when voluntarily aborted, require additional cleanup or recovery actions.

Your Setup

Just a single server.

Operating system and version?

Linux Mint 22

Node.js version?

v20.17.0

Cronicle software version?

Version 0.9.59

Are you using a multi-server setup, or just a single server?

Single

Are you using the filesystem as back-end storage, or S3/Couchbase?

Filesystem

Can you reproduce the crash consistently?

Log Excerpts

@jhuckaby
Owner

Aborting a job will send a SIGTERM to the outermost process. Can you use the Shell Plugin and provide a shell wrapper that traps SIGTERM and acts on it? Example:

#!/bin/bash

# Define a function to handle the SIGTERM signal
cleanup() {
    echo "Caught SIGTERM signal. Running cleanup..."
    # Add your custom cleanup commands here
    # For example, stop services, clean temporary files, etc.
    echo "Cleanup done."
    exit 0
}

# Trap SIGTERM signal
trap cleanup SIGTERM

# Run your job here
/path/to/my/script.sh
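The trap pattern above can be exercised without Cronicle by sending the script SIGTERM yourself. A minimal self-contained check (the `demo` wrapper and messages are illustrative, not part of Cronicle):

```shell
#!/bin/bash
# Self-contained check of the trap pattern: send this shell SIGTERM
# and confirm the handler runs instead of the remaining commands.
demo() {
    cleanup() {
        echo "Caught SIGTERM signal. Running cleanup..."
        exit 0
    }
    trap cleanup SIGTERM
    kill -TERM $BASHPID   # simulate Cronicle aborting the job
    echo "unreachable"    # skipped: the trap handler exits first
}
out=$(demo)
echo "$out"
```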

@wallart1
Author

wallart1 commented Dec 19, 2024

That works! Thanks for the advice.

EDIT: Oops. See below.

@wallart1
Author

Sorry. I spoke too soon. It doesn't actually cancel the ZFS scrub. Here is the job log after I aborted the job:

# Job ID: jm4vyxval0b
# Event Title: Scrub fivebays
# Hostname: foghorn
# Date/Time: 2024/12/19 18:44:38 (GMT-5)

+ trap cleanup SIGTERM
+ zpool scrub -w fivebays
Caught SIGTERM, killing child: 511024
Child did not exit, killing harder: 511024

# Job failed at 2024/12/19 18:45:12 (GMT-5).
# Error: Job Aborted: Manually aborted by user: admin
# End of log.

Here is the script:

#!/bin/bash
set -x

# Function to handle the SIGTERM signal (when job is aborted)
cleanup() {
    echo "Caught SIGTERM signal. Cancelling scrub of fivebays."
    zpool scrub -s fivebays
    exit $?
}

trap cleanup SIGTERM

zpool scrub -w fivebays

wallart1 reopened this Dec 20, 2024
@wallart1
Author

I notice that the SIGTERM message in the log is not the same as the one in the script. Is something preempting it?

@jhuckaby
Copy link
Owner

Ah, I think I see the problem:

Caught SIGTERM, killing child: 511024
Child did not exit, killing harder: 511024

So, Cronicle gives the child 10 seconds to shut down after sending the SIGTERM. If the child does not exit in time, Cronicle sends a SIGKILL (which cannot be trapped).

You can increase the timeout in the configuration here: https://github.com/jhuckaby/Cronicle/blob/master/docs/Configuration.md#child_kill_timeout
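For reference, that setting lives in Cronicle's `conf/config.json` alongside the other top-level keys; the default matches the 10 seconds described above. A sketch raising it (60 is an illustrative value, not a recommendation):

```json
{
  "child_kill_timeout": 60
}
```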
