Feature request: allowing for Retry to work with SageMaker steps #140
Comments
Interesting, I think that's a feature the Step Functions or SageMaker service needs to support. Step Functions will retry with the same parameters. A workaround that could be done today is to catch errors, go to another step that creates a new job name, then go back to the TrainingStep which reads the JobName from StepInput. Crude ASCII diagram:
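The loop described in this comment can be sketched as an Amazon States Language definition, written here as a plain Python dict. This is only an illustration under assumptions: the state names (NewJobName, Train), the "train-" prefix and the $.Generated path are made up for the example, the name is produced with the States.UUID intrinsic instead of a separate Lambda, and the remaining CreateTrainingJob parameters are omitted.

```python
import json

# Service integration ARN for a synchronous SageMaker training job.
TRAIN_RESOURCE = "arn:aws:states:::sagemaker:createTrainingJob.sync"

# Hypothetical state names and paths; only the ASL field names are real.
definition = {
    "StartAt": "NewJobName",
    "States": {
        # Writes a fresh job name into the state input on every visit.
        "NewJobName": {
            "Type": "Pass",
            "Parameters": {
                "JobName.$": "States.Format('train-{}', States.UUID())",
            },
            "ResultPath": "$.Generated",
            "Next": "Train",
        },
        # Reads the job name from the step input instead of a fixed value.
        "Train": {
            "Type": "Task",
            "Resource": TRAIN_RESOURCE,
            "Parameters": {
                "TrainingJobName.$": "$.Generated.JobName",
                # Other CreateTrainingJob parameters are omitted in this sketch.
            },
            # On any error, keep the input, record the error, and loop back
            # to generate a new name before trying again.
            "Catch": [
                {
                    "ErrorEquals": ["States.ALL"],
                    "ResultPath": "$.Error",
                    "Next": "NewJobName",
                }
            ],
            "End": True,
        },
    },
}

definition_json = json.dumps(definition, indent=2)
```

As written, the loop has no upper bound on attempts. A real state machine would probably also carry an attempt counter and a Choice state to stop after a few tries.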
Or perhaps
Any update?
I found a workaround for this.

    training_step = steps.TrainingStep(
        "Train Step",
        estimator=xgb,
        data={
            "train": sagemaker.TrainingInput(train_s3_file, content_type="application/x-parquet"),
            "validation": sagemaker.TrainingInput(validation_s3_file, content_type="application/x-parquet"),
        },
        job_name=ExecutionInput()["dummy"],
        parameters={
            "TrainingJobName.$": "States.Format('{}-{}-{}', $$.StateMachine.Name, $$.Execution.Name, $$.State.RetryCount)",
        },
        retry=default_retryer,
    )

Please note how I set the job_name to
Is this feature being implemented? I am facing the same issue, although not related to retry. Which means we can only use
In my example above the ExecutionInput is not used; it is literally a dummy.
The TrainingJobName is overwritten with the value given in parameters, which is built from the name of the state machine, the execution name and the retry count. This generates a new name for every execution and for every retry.
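To see what names that format string yields, here is a small Python imitation of it. This is not Step Functions code: training_job_name and is_valid_job_name are hypothetical helpers, and the example values ("MyFlow", "run1") are invented. The validity check assumes the documented SageMaker constraint on TrainingJobName (at most 63 characters, alphanumerics and hyphens only).

```python
import re

# Assumed SageMaker constraint on TrainingJobName: alphanumerics and
# hyphens, starting and ending with an alphanumeric, at most 63 characters.
MAX_JOB_NAME_LEN = 63
JOB_NAME_RE = re.compile(r"[a-zA-Z0-9](-*[a-zA-Z0-9])*")


def training_job_name(state_machine: str, execution: str, retry_count: int) -> str:
    """Imitate States.Format('{}-{}-{}', ...) from the workaround."""
    return f"{state_machine}-{execution}-{retry_count}"


def is_valid_job_name(name: str) -> bool:
    """Check length and character set of a candidate job name."""
    return len(name) <= MAX_JOB_NAME_LEN and JOB_NAME_RE.fullmatch(name) is not None


# One name per attempt: the retry count changes, so the name does too.
names = [training_job_name("MyFlow", "run1", attempt) for attempt in range(3)]
```

One thing to watch with this approach: the state machine name and the execution name both count toward the 63-character limit, and an execution name containing characters such as underscores would produce a name that fails the check above.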
Currently, the Retry mechanism does not work with TrainingStep and ProcessingStep, because the full job name must be specified to the step constructor. If the step fails after the job has already been created, every retry fails when submitting the job because the job name has already been used. This happens for almost any error (including capacity errors), excluding throttling errors.
A possible solution might be to add an alternative parameter to specify a job name prefix, instead of a full name, and let SageMaker add some random suffix.
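A rough sketch of what such prefix handling might look like, in plain Python. The function job_name_from_prefix and its suffix_len parameter are hypothetical and not part of any SDK; the 63-character limit is the assumed SageMaker constraint on job names. To help with retries, something like this would have to run each time the job is submitted, not once when the workflow is defined.

```python
import uuid

MAX_JOB_NAME_LEN = 63  # assumed SageMaker limit on job names


def job_name_from_prefix(prefix: str, suffix_len: int = 8) -> str:
    """Return prefix plus a random hex suffix, trimmed to the length limit."""
    suffix = uuid.uuid4().hex[:suffix_len]
    # Leave room for the hyphen and the suffix.
    room = MAX_JOB_NAME_LEN - len(suffix) - 1
    trimmed = prefix[:room].rstrip("-")
    if not trimmed:
        raise ValueError("prefix must contain at least one usable character")
    return f"{trimmed}-{suffix}"


# Two calls with the same prefix give two different job names.
first = job_name_from_prefix("xgb-train")
second = job_name_from_prefix("xgb-train")
```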