Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

batch: add on-demand retry for preemption #7581

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

juimonen
Copy link

The Batch backend currently doesn’t retry preempted jobs with on-demand VMs, which is problematic since our goal is to save costs and avoid failing large tasks due to a few preempted machines. This patch reintroduces, in part, functionality removed in commit 49d675d, enabling the job to restart on a STANDARD VM after encountering a VMPreemption error.

We have tested this initially and it seems to work, after preemption tries have gone, the job is restarted in standard VM.

Probably you notice that I'm a newbie in scala, so please just review and point out the stupidities, or just do a rewrite (patch is quite small). Anyway this is desperately needed in our project, thank you.

The Batch backend currently doesn’t retry preempted jobs with on-demand
VMs, which is problematic since our goal is to save costs and avoid
failing large tasks due to a few preempted machines. This patch
reintroduces, in part, functionality removed in commit 49d675d,
enabling the job to restart on a STANDARD VM after encountering a
VMPreemption error.
@juimonen juimonen requested a review from a team as a code owner November 19, 2024 13:18
@@ -825,7 +838,7 @@ class GcpBatchAsyncBackendJobExecutionActor(override val standardParams: Standar
override def executeAsync(): Future[ExecutionHandle] = {

// Want to force runtimeAttributes to evaluate so we can fail quickly now if we need to:
def evaluateRuntimeAttributes = Future.fromTry(Try(runtimeAttributes))
def evaluateRuntimeAttributes = Future.fromTry(Try(runtimeAttributes.copy(preemptible = preemptible)))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why does this need copy?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had some issues of getting the attributes updating with the new preemptible value... so I was getting the job respawned, but with original preemption value. This is me probably not understanding the intricacies of the scala class initialization, so when those parameters are updated. Should they be updated without setting the preempt with this copy?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah just tested and somehow the preemption value from initialization is not propagated without the copy, so I get spot VM instead of standard...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This occurs because the runtimeAttributes is being updated to set the local "preemptible" value, this is not the right place to do it, unfortunately, I can't recall the right place to set this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I now recall that the runtimeAttributes was some kind of object, that I just could not set the value (without copy). I guess some kind of static object created up somewhere. I think I found it from some test files where this kind of copies where made.

Anyway I see the batch API provision model is calculated from preemptible in runtimeAttributes, so if someone can guide me how to set it and where... So this is currently doing exactly what I want: after preempted job a new job with standard VM is created. Basically I just want to inject the "correct" preemptible value into runtimeAttributes after preempt fail. Of course it is doing "unnecessary" copy in the first time around, but how to get my hands on the runtimeAttributes on the second round?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants