
Mistake in validation of Node Termination Handler #16587

Open
flipsed opened this issue May 22, 2024 · 0 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

flipsed commented May 22, 2024

/kind bug

1. What kops version are you running? The command kops version will display
this information.

1.28

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

1.28

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

kops1.28.4 replace --force -f /path/to/kops.yaml

5. What happened after the commands executed?

Error: error replacing cluster: spec.cloudProvider.aws.nodeTerminationHandler.enableScheduledEventDraining: Forbidden: scheduled event draining cannot be disabled in Queue Processor mode

6. What did you expect to happen?

I would expect to be able to have enableScheduledEventDraining disabled in the config while in SQS mode. The kops validation runs this code, which is problematic:

func validateNodeTerminationHandler(cluster *kops.Cluster, spec *kops.NodeTerminationHandlerSpec, fldPath *field.Path) (allErrs field.ErrorList) {
	if spec.IsQueueMode() {
		if spec.EnableSpotInterruptionDraining != nil && !*spec.EnableSpotInterruptionDraining {
			allErrs = append(allErrs, field.Forbidden(fldPath.Child("enableSpotInterruptionDraining"), "spot interruption draining cannot be disabled in Queue Processor mode"))
		}
		if spec.EnableScheduledEventDraining != nil && !*spec.EnableScheduledEventDraining {
			allErrs = append(allErrs, field.Forbidden(fldPath.Child("enableScheduledEventDraining"), "scheduled event draining cannot be disabled in Queue Processor mode"))
		}
		if !fi.ValueOf(spec.EnableRebalanceDraining) && fi.ValueOf(spec.EnableRebalanceMonitoring) {
			allErrs = append(allErrs, field.Forbidden(fldPath.Child("enableRebalanceMonitoring"), "rebalance events can only drain in Queue Processor mode"))
		}
	}
	return allErrs
}

Based on the AWS Node Termination Handler documentation, enableScheduledEventDraining is only applicable in IMDS mode. While performing kops and Kubernetes upgrades of our cluster, we ran into the error above.

Looking at the AWS Node Termination Handler source code, we can see that scheduled event draining is only used when !imdsDisabled (i.e., when IMDS is enabled):

	if !imdsDisabled && nthConfig.EnableScheduledEventDraining {
		//will retry 4 times with an interval of 2 seconds.
		pollCtx, cancelPollCtx := context.WithTimeout(context.Background(), 8*time.Second)
		err = wait.PollUntilContextCancel(pollCtx, 2*time.Second, true, func(context.Context) (done bool, err error) {
			err = handleRebootUncordon(nthConfig.NodeName, interruptionEventStore, *node)
			if err != nil {
				log.Warn().Err(err).Msgf("Unable to complete the uncordon after reboot workflow on startup, retrying")
			}
			return false, nil
		})
		if err != nil {
			log.Warn().Err(err).Msgf("All retries failed, unable to complete the uncordon after reboot workflow")
		}
		cancelPollCtx()
	}

We should be able to disable scheduled event draining while in SQS mode, since the flag has no effect there, @johngmyers. Maybe I'm missing something here?

7. Please provide your cluster manifest.
This is the relevant part:

  nodeTerminationHandler:
    enabled: true
    enableSQSTerminationDraining: true
    managedASGTag: "aws-node-termination-handler/managed"
    cpuRequest: 200m
    prometheusEnable: true
    enableRebalanceMonitoring: false
    enableRebalanceDraining: false
    enableSpotInterruptionDraining: true
    enableScheduledEventDraining: false

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else do we need to know?

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label May 22, 2024