-
Notifications
You must be signed in to change notification settings - Fork 360
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
batch: add on-demand retry for preemption #7581
base: develop
Are you sure you want to change the base?
Conversation
The Batch backend currently doesn’t retry preempted jobs with on-demand VMs, which is problematic since our goal is to save costs and avoid failing large tasks due to a few preempted machines. This patch reintroduces, in part, functionality removed in commit 49d675d, enabling the job to restart on a STANDARD VM after encountering a VMPreemption error.
@@ -825,7 +838,7 @@ class GcpBatchAsyncBackendJobExecutionActor(override val standardParams: Standar | |||
override def executeAsync(): Future[ExecutionHandle] = { | |||
|
|||
// Want to force runtimeAttributes to evaluate so we can fail quickly now if we need to: | |||
def evaluateRuntimeAttributes = Future.fromTry(Try(runtimeAttributes)) | |||
def evaluateRuntimeAttributes = Future.fromTry(Try(runtimeAttributes.copy(preemptible = preemptible))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious why does this need copy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had some issues of getting the attributes updating with the new preemptible value... so I was getting the job respawned, but with original preemption value. This is me probably not understanding the intricacies of the scala class initialization, so when those parameters are updated. Should they be updated without setting the preempt with this copy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah just tested and somehow the preemption value from initialization is not propagated without the copy, so I get spot VM instead of standard...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This occurs because the runtimeAttributes is being updated to set the local "preemptible" value, this is not the right place to do it, unfortunately, I can't recall the right place to set this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I now recall that the runtimeAttributes was some kind of object, that I just could not set the value (without copy). I guess some kind of static object created up somewhere. I think I found it from some test files where this kind of copies where made.
Anyway I see the batch API provision model is calculated from preemptible in runtimeAttributes, so if someone can guide me how to set it and where... So this is currently doing exactly what I want: after preempted job a new job with standard VM is created. Basically I just want to inject the "correct" preemptible value into runtimeAttributes after preempt fail. Of course it is doing "unnecessary" copy in the first time around, but how to get my hands on the runtimeAttributes on the second round?
The Batch backend currently doesn’t retry preempted jobs with on-demand VMs, which is problematic since our goal is to save costs and avoid failing large tasks due to a few preempted machines. This patch reintroduces, in part, functionality removed in commit 49d675d, enabling the job to restart on a STANDARD VM after encountering a VMPreemption error.
We have tested this initially and it seems to work, after preemption tries have gone, the job is restarted in standard VM.
Probably you notice that I'm a newbie in scala, so please just review and point out the stupidities, or just do a rewrite (patch is quite small). Anyway this is desperately needed in our project, thank you.