
Dynamically extending job liveness #702

Open
ddorian opened this issue Nov 15, 2022 · 4 comments
Labels
Issue contains: Exploration & Design decisions 🤯 (We don't know how this will be implemented yet)
Issue contains: Some Python 🐍 (This issue involves writing some Python code)
Issue contains: Some SQL 🐘 (This feature requires changing the SQL model)
Issue type: Feature ⭐️ (Add a new feature that didn't exist before)

Comments

@ddorian

ddorian commented Nov 15, 2022

So there's "Retry stalled jobs", where you set a RUNNING_JOBS_MAX_TIME and retry jobs that are stalled: https://procrastinate.readthedocs.io/en/stable/howto/retry_stalled_jobs.html.

Now, what happens if your jobs are dynamic in nature and can take a very long time?

Say you're converting videos on a video-uploading website.
Someone uploads a 1-minute video, someone else uploads a 24-hour video.

You need a way to "extend" the lifetime of the 24-hour video's job while also keeping the ability to retry it if it failed or stalled somehow.

For this you'd need a thread that extends a timestamp in the database every, say, 30 seconds.
Stalled jobs would then only be those where too much time has passed since the last extension.
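Roughly what I have in mind, as a sketch (the last_heartbeat column and the helper are made up, they're not part of procrastinate today):

```python
# Sketch only: a background thread that keeps a made-up
# procrastinate_jobs.last_heartbeat column fresh while the real work runs.
import threading


def run_with_heartbeat(connection, job_id, work, interval=30):
    stop = threading.Event()

    def beat():
        # Event.wait() returns False on timeout, True once stop is set
        while not stop.wait(interval):
            with connection.cursor() as cur:
                cur.execute(
                    "UPDATE procrastinate_jobs SET last_heartbeat = now() WHERE id = %s",
                    (job_id,),
                )
            connection.commit()

    heartbeat = threading.Thread(target=beat, daemon=True)
    heartbeat.start()
    try:
        return work()  # e.g. the 24-hour video conversion
    finally:
        stop.set()
        heartbeat.join()
```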

Makes sense? Or maybe there's another way?

@ewjoachim
Member

ewjoachim commented Nov 15, 2022

I think it could make sense to store something in the database, but I'm not exactly sure whether procrastinate should be in charge of that, and what exactly it would entail. Do you think it's something you could be doing on your side? Rather than calling get_stalled_jobs, use list_jobs and have your own logic determine whether tasks are stalled or not?
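Something like this could live entirely in your code (a rough sketch: the heartbeat timestamps would be yours to maintain, since procrastinate doesn't store one; list_jobs / retry_job usage mirrors the retry_stalled_jobs howto, but double-check the JobManager docs):

```python
# Sketch of user-side stalled-job detection, outside procrastinate.
import datetime

STALL_AFTER = datetime.timedelta(minutes=10)


def find_stalled_jobs(app, last_heartbeats):
    """`last_heartbeats` maps job id -> datetime of the last heartbeat you recorded."""
    now = datetime.datetime.now(datetime.timezone.utc)
    for job in app.job_manager.list_jobs(status="doing"):
        beat = last_heartbeats.get(job.id)
        if beat is None or now - beat > STALL_AFTER:
            yield job  # then e.g. app.job_manager.retry_job(job)
```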

I'm always a bit hesitant to add new mechanisms, especially when they need threads and such.

@ddorian
Author

ddorian commented Nov 16, 2022

> I think it could make sense to store something in the database, but I'm not exactly sure whether procrastinate should be in charge of that,

I think that's part of the job queue's responsibility.

> and what exactly it would entail.

Just keeping a timestamp on the job, extending it while the job runs, and comparing it when doing list_jobs.

There are two ways: either keep the job locked somehow (I think that's what Celery/RabbitMQ do), which can be done by keeping a transaction open (but that's a heavy, long-lived transaction), or extend the time (what SQS does). Both options are sketched below.

> I'm always a bit hesitant to add new mechanisms, especially when they need threads and such.

With async it shouldn't be heavy; I've previously used gevent, which also isn't heavy.
Or you could use a single thread to extend all active jobs in the current process, which shouldn't be heavy either.
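At the SQL level the two options would look roughly like this (last_heartbeat is a made-up column, not part of procrastinate's schema):

```python
# Option 1: keep the row locked for the whole job by holding a transaction open
# (what I think Celery/RabbitMQ-style brokers effectively do); heavy for long jobs.
LOCK_SQL = "SELECT * FROM procrastinate_jobs WHERE id = %(job_id)s FOR UPDATE"
# ... run the job inside that transaction and commit at the end ...

# Option 2: SQS-style extension. The worker periodically bumps a timestamp,
# and "stalled" means the timestamp got too old.
EXTEND_SQL = """
    UPDATE procrastinate_jobs
       SET last_heartbeat = now()
     WHERE id = %(job_id)s
"""
STALLED_SQL = """
    SELECT id FROM procrastinate_jobs
     WHERE status = 'doing'
       AND last_heartbeat < now() - %(max_idle)s::interval
"""
```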

@caire-bear

caire-bear commented May 9, 2023

I'd also like this feature, for parsing a large file in a job.

In NSQ you would touch a message so it wouldn't time out and get requeued by the message broker. It was up to the client to keep the heartbeat going to extend the time to process the message; if the broker didn't hear back within some amount of time, it would requeue the message. I think we used a light async periodic callback (tornado.ioloop.PeriodicCallback) to do it. Doing something like this would require adding some concept of a timeout timestamp to each procrastinate job that gets extended with each touch, and also defining a timeout for each queue.
https://github.com/nsqio/nsq/blob/1362af17d50b7129b47c0291e7f2e0b7eef2bb62/nsqd/channel.go#L333

A similar mechanism in procrastinate may be helpful, so that a long-running job that's still making progress can keep working, while retry_stalled_jobs still keeps other queues timely. A rough sketch of the client side is below.
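For illustration, the client-side part could be as small as this, with plain asyncio instead of tornado (touch here stands in for whatever would extend the job's timeout; procrastinate has no such API today):

```python
# Sketch of an NSQ-style "touch" loop: run the job while periodically calling
# `touch()`, which would push the job's timeout timestamp forward.
import asyncio


async def run_with_touch(work, touch, interval=30):
    async def touch_loop():
        while True:
            await asyncio.sleep(interval)
            await touch()  # hypothetical: whatever extends the job's timeout

    touch_task = asyncio.create_task(touch_loop())
    try:
        return await work()
    finally:
        touch_task.cancel()
```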

@ewjoachim
Member

I believe this may be similar to what's discussed in #740: using a heartbeat on jobs rather than a timeout to evaluate whether a job is dead. Also, a worker might still be sending heartbeats while the job is stuck in an infinite loop, and in that case we'd still need a timeout. There's a mechanism for customizing retries; maybe there should be a similar one for customizing timeouts.

ewjoachim added the labels Issue contains: Some SQL 🐘, Issue type: Feature ⭐️, Issue contains: Exploration & Design decisions 🤯 and Issue contains: Some Python 🐍 on Jan 6, 2024