strategy.drainTimeout not working as intended? #346

Open
jess-belliveau opened this issue Nov 18, 2022 · 5 comments

@jess-belliveau

Is this a BUG REPORT or FEATURE REQUEST?:
BUG REPORT

What happened:
I am setting strategy.drainTimeout to 1000 seconds, but the node is terminated immediately after the node drain is issued.

What you expected to happen:
I expect upgrade-manager to wait 1000 seconds after the drain is issued before terminating the instance.

How to reproduce it (as minimally and precisely as possible):

➜ cat ru-drain.yml
apiVersion: upgrademgr.keikoproj.io/v1alpha1
kind: RollingUpgrade
metadata:
  annotations:
    app.kubernetes.io/managed-by: instance-manager
    instancemgr.keikoproj.io/upgrade-scope: <snip>-instance-manager-platform-apm-us-west-2a
  name: platform-apm-us-west-2a-20220715002858-19
  namespace: instance-manager
spec:
  asgName: <snip>-instance-manager-platform-apm-us-west-2a
  forceRefresh: true
  nodeIntervalSeconds: 10
  postDrain:
    waitSeconds: 300
  postDrainDelaySeconds: 45
  strategy:
    drainTimeout: 1000      <- this is the field I'm setting
    maxUnavailable: 1
    mode: eager

Anything else we need to know?:
Am I interpreting the spec correctly?

Environment:

  • rolling-upgrade-controller version: v1.0.6
  • Kubernetes version:
$ kubectl version -o yaml
serverVersion:
  buildDate: "2022-10-24T20:32:54Z"
  compiler: gc
  gitCommit: b07006b2e59857b13fe5057a956e86225f0e82b7
  gitTreeState: clean
  gitVersion: v1.21.14-eks-fb459a0
  goVersion: go1.16.15
  major: "1"
  minor: 21+
  platform: linux/amd64

Other debugging information (if applicable):

  • RollingUpgrade status:
➜ kd rollingupgrades platform-apm-us-west-2a-20220715002858-20 -n instance-manager
Name:         platform-apm-us-west-2a-20220715002858-20
Namespace:    instance-manager
Labels:       <none>
Annotations:  app.kubernetes.io/managed-by: instance-manager
              instancemgr.keikoproj.io/upgrade-scope: snip-instance-manager-platform-apm-us-west-2a
API Version:  upgrademgr.keikoproj.io/v1alpha1
Kind:         RollingUpgrade
Metadata:
  Creation Timestamp:  2022-11-18T05:41:11Z
  Generation:          1
  Managed Fields:
    API Version:  upgrademgr.keikoproj.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:app.kubernetes.io/managed-by:
          f:instancemgr.keikoproj.io/upgrade-scope:
          f:kubectl.kubernetes.io/last-applied-configuration:
      f:spec:
        .:
        f:asgName:
        f:forceRefresh:
        f:nodeIntervalSeconds:
        f:postDrain:
          .:
          f:waitSeconds:
        f:postDrainDelaySeconds:
        f:strategy:
          .:
          f:drainTimeout:
          f:maxUnavailable:
          f:mode:
    Manager:      kubectl-client-side-apply
    Operation:    Update
    Time:         2022-11-18T05:41:11Z
    API Version:  upgrademgr.keikoproj.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:completePercentage:
        f:currentStatus:
        f:endTime:
        f:lastDrainTime:
        f:lastTerminationTime:
        f:nodesProcessed:
        f:startTime:
        f:statistics:
        f:totalNodes:
        f:totalProcessingTime:
    Manager:         manager
    Operation:       Update
    Time:            2022-11-18T05:43:18Z
  Resource Version:  228511895
  UID:               2eebdb9d-f8d8-4688-8985-7d713d9245f2
Spec:
  Asg Name:               snip-instance-manager-platform-apm-us-west-2a
  Force Refresh:          true
  Node Interval Seconds:  10
  Post Drain:
    Wait Seconds:            300
  Post Drain Delay Seconds:  45
  Strategy:
    Drain Timeout:    1000
    Max Unavailable:  1
    Mode:             eager
Status:
  Complete Percentage:    100%
  Current Status:         completed
  End Time:               2022-11-18T05:43:18Z
  Last Drain Time:        2022-11-18T05:43:16Z
  Last Termination Time:  2022-11-18T05:43:16Z
  Nodes Processed:        1
  Start Time:             2022-11-18T05:41:11Z
  Statistics:
    Duration Count:       1
    Duration Sum:         2.545233409s
    Step Name:            kickoff
    Duration Count:       1
    Duration Sum:         2m1.447352312s
    Step Name:            desired_node_ready
    Duration Count:       1
    Duration Sum:         41.598µs
    Step Name:            predrain_script
    Duration Count:       1
    Duration Sum:         180.544516ms
    Step Name:            drain
    Duration Count:       1
    Duration Sum:         6.235µs
    Step Name:            postdrain_script
    Duration Count:       1
    Duration Sum:         54.047µs
    Step Name:            post_wait
    Duration Count:       1
    Duration Sum:         225.887155ms
    Step Name:            terminate
    Duration Count:       1
    Duration Sum:         4.774µs
    Step Name:            post_terminate
    Duration Count:       1
    Duration Sum:         9.999999708s
    Step Name:            terminated
    Duration Count:       1
    Duration Sum:         2m13.853890401s
    Step Name:            total
  Total Nodes:            1
  Total Processing Time:  2m7s
Events:                   <none>
  • controller logs:
upgrade-manager-controller-manager-859c65b5db-gzfns manager 2022-11-18T05:43:14.771Z	INFO	controllers.RollingUpgrade	***Reconciling***
upgrade-manager-controller-manager-859c65b5db-gzfns manager 2022-11-18T05:43:14.771Z	INFO	controllers.RollingUpgrade	operating on existing rolling upgrade	{"scalingGroup": "snip-instance-manager-platform-apm-us-west-2a", "update strategy": {"type":"randomUpdate","mode":"eager","maxUnavailable":1,"drainTimeout":1000}, "name": "instance-manager/platform-apm-us-west-2a-20220715002858-20"}
upgrade-manager-controller-manager-859c65b5db-gzfns manager 2022-11-18T05:43:16.447Z	INFO	controllers.RollingUpgrade	scaling group details	{"scalingGroup": "snip-instance-manager-platform-apm-us-west-2a", "desiredInstances": 1, "launchConfig": "", "name": "instance-manager/platform-apm-us-west-2a-20220715002858-20"}
upgrade-manager-controller-manager-859c65b5db-gzfns manager 2022-11-18T05:43:16.447Z	INFO	controllers.RollingUpgrade	checking if rolling upgrade is completed	{"name": "instance-manager/platform-apm-us-west-2a-20220715002858-20"}
upgrade-manager-controller-manager-859c65b5db-gzfns manager 2022-11-18T05:43:16.447Z	INFO	controllers.RollingUpgrade	rolling upgrade configured for forced refresh	{"instance": "i-0bbb077b2dab36ac5", "name": "instance-manager/platform-apm-us-west-2a-20220715002858-20"}
upgrade-manager-controller-manager-859c65b5db-gzfns manager 2022-11-18T05:43:16.447Z	INFO	controllers.RollingUpgrade	drift detected in scaling group	{"driftedInstancesCount/DesiredInstancesCount": "(1/1)", "name": "instance-manager/platform-apm-us-west-2a-20220715002858-20"}
upgrade-manager-controller-manager-859c65b5db-gzfns manager 2022-11-18T05:43:16.447Z	INFO	controllers.RollingUpgrade	selecting batch for rotation	{"batch size": 1, "name": "instance-manager/platform-apm-us-west-2a-20220715002858-20"}
upgrade-manager-controller-manager-859c65b5db-gzfns manager 2022-11-18T05:43:16.447Z	INFO	controllers.RollingUpgrade	found in-progress instances	{"instances": ["i-0bbb077b2dab36ac5"]}
upgrade-manager-controller-manager-859c65b5db-gzfns manager 2022-11-18T05:43:16.447Z	INFO	controllers.RollingUpgrade	rolling upgrade configured for forced refresh	{"instance": "i-0bbb077b2dab36ac5", "name": "instance-manager/platform-apm-us-west-2a-20220715002858-20"}
upgrade-manager-controller-manager-859c65b5db-gzfns manager 2022-11-18T05:43:16.447Z	INFO	controllers.RollingUpgrade	rotating batch	{"instances": ["i-0bbb077b2dab36ac5"], "name": "instance-manager/platform-apm-us-west-2a-20220715002858-20"}
upgrade-manager-controller-manager-859c65b5db-gzfns manager 2022-11-18T05:43:16.447Z	INFO	controllers.RollingUpgrade	no InService instances in the batch	{"batch": ["i-0bbb077b2dab36ac5"], "instances(InService)": [], "name": "instance-manager/platform-apm-us-west-2a-20220715002858-20"}
upgrade-manager-controller-manager-859c65b5db-gzfns manager 2022-11-18T05:43:16.447Z	INFO	controllers.RollingUpgrade	waiting for desired nodes	{"name": "instance-manager/platform-apm-us-west-2a-20220715002858-20"}
upgrade-manager-controller-manager-859c65b5db-gzfns manager 2022-11-18T05:43:16.447Z	INFO	controllers.RollingUpgrade	desired nodes are ready	{"name": "instance-manager/platform-apm-us-west-2a-20220715002858-20"}
upgrade-manager-controller-manager-859c65b5db-gzfns manager 2022-11-18T05:43:16.447Z	INFO	controllers.RollingUpgrade	draining the node	{"instance": "i-0bbb077b2dab36ac5", "node name": "ip-172-29-72-153.us-west-2.compute.internal", "name": "instance-manager/platform-apm-us-west-2a-20220715002858-20"}
upgrade-manager-controller-manager-859c65b5db-gzfns manager WARNING: ignoring DaemonSet-managed Pods: kube-system/cilium-9chsq, kube-system/clamav-akp-ck4p6, kube-system/ebs-csi-node-28fxf, kube-system/kiam-agent-w5lxw, kube-system/kube-proxy-s5d2n, kube-system/node-local-dns-t4z8q, monitoring/node-exporter-4qsg9, ossec/ossec-akp-x55sv
upgrade-manager-controller-manager-859c65b5db-gzfns manager evicting pod nginx-ing-utility/nginx-ingress-utility-controller-65d6447d75-rpb6t
upgrade-manager-controller-manager-859c65b5db-gzfns manager evicting pod nginx-ing-grpc/nginx-ingress-grpc-controller-7dd4c7b9f-q2mc5
upgrade-manager-controller-manager-859c65b5db-gzfns manager evicting pod nginx-ing-public/nginx-ingress-public-controller-c664fcc7c-82p9x
upgrade-manager-controller-manager-859c65b5db-gzfns manager evicting pod nginx-ing-default/nginx-ingress-default-controller-69d64b6b5-n9n5d
upgrade-manager-controller-manager-859c65b5db-gzfns manager evicting pod nginx-ing-bff/nginx-ingress-bff-controller-588cc868fc-d4vt2
### should the 1000-second pause not happen here?
upgrade-manager-controller-manager-859c65b5db-gzfns manager 2022-11-18T05:43:16.627Z	INFO	controllers.RollingUpgrade	instances drained successfully, terminating	{"name": "instance-manager/platform-apm-us-west-2a-20220715002858-20"}
upgrade-manager-controller-manager-859c65b5db-gzfns manager 2022-11-18T05:43:16.628Z	INFO	controllers.RollingUpgrade	terminating instance	{"instance": "i-0bbb077b2dab36ac5", "name": "instance-manager/platform-apm-us-west-2a-20220715002858-20"}
upgrade-manager-controller-manager-859c65b5db-gzfns manager 2022-11-18T05:43:16.867Z	INFO	controllers.RollingUpgrade	***Reconciling***
@jess-belliveau
Author

Ah, I should have mentioned: the problem we are facing is that the pods are still in a Terminating state when the underlying node is terminated. We are trying to configure the RollingUpgrade with a wait that allows the pods to terminate gracefully as part of the drain.

@shreyas-badiger
Collaborator

shreyas-badiger commented Nov 18, 2022

@jess-belliveau drainTimeout does exactly what the name suggests: if the node drain doesn't complete within the drainTimeout value, the rolling upgrade (RU) is marked as failed. So if the drain command completes within a second, the node is terminated right after.

If I understand it correctly, you are trying to delay the node termination.
You should consider using the postDrain field in the spec, where you can specify a wait before the termination is initiated.
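The distinction can be illustrated with the coreutils `timeout` utility (nothing below is upgrade-manager code, just an analogy): a timeout caps how long a step may take, it does not add a pause after it.

```shell
# A timeout is an upper bound, not a pause: when the wrapped command finishes
# early, control returns immediately -- just like a fast drain under drainTimeout.
timeout 1000 true
echo "fast drain: exit=$?"   # exit=0, printed immediately; no 1000 s wait

# Only when the command overruns its budget does the timeout bite
# (GNU timeout exits 124 in that case).
timeout 1 sleep 10 || echo "slow drain: exit=$? (killed at the 1 s budget)"
```

This matches the behaviour in the logs above: the drain finished in well under a second, so termination followed immediately.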

@jess-belliveau
Author

jess-belliveau commented Nov 20, 2022

@shreyas-badiger, thanks for the response.

If you look at my spec at the start, I have actually set postDrain.waitSeconds:

  postDrain:
    waitSeconds: 300

I hadn't even realised that this field doesn't appear to work either; I'm not seeing a 300-second pause anywhere.

@shreyas-badiger
Collaborator

@jess-belliveau I think the implementation for postDrain.waitSeconds is missing.
If you have some bandwidth, can you contribute?
If not, I think you can use the postDrain script:

out, err := r.runScript(script, target)

@jess-belliveau
Author

jess-belliveau commented Nov 23, 2022

Thanks @shreyas-badiger - I might be able to loop back in the future and see what contributions I can make.

For the time being, we are having promising results with:

"postDrain":
  "script": |
    count=10
    while [ "$count" -gt 0 ]; do
      count=$(kubectl get pods -A \
        --field-selector spec.nodeName="$INSTANCE_NAME" \
        -o jsonpath='{range .items[?(.metadata.ownerReferences[*].kind!="DaemonSet")]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}' \
        | wc -l)
      echo "$count pods draining"
      sleep 10
    done

The only caveat is that we had to add some binaries to the rolling-upgrade-controller image: kubectl, wc and sleep.
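One possible hardening of the workaround above, sketched here as an untested suggestion rather than anything from the controller: the same polling loop wrapped in a function with an overall cap, so the hook cannot spin forever if a pod never finishes terminating. The `drain_wait` name and the `max_polls` cap are illustrative, not part of upgrade-manager; the kubectl query is the same as in the script above.

```shell
# Sketch only: poll until no non-DaemonSet pods remain on the node,
# giving up after max_polls iterations instead of looping indefinitely.
drain_wait() {
  node="$1"
  max_polls=60                  # ~10 minutes at a 10 s poll interval
  count=1                       # force at least one check
  while [ "$count" -gt 0 ] && [ "$max_polls" -gt 0 ]; do
    count=$(kubectl get pods -A --field-selector spec.nodeName="$node" \
      -o jsonpath='{range .items[?(.metadata.ownerReferences[*].kind!="DaemonSet")]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}' \
      | wc -l)
    echo "$count pods draining"
    max_polls=$((max_polls - 1))
    if [ "$count" -gt 0 ]; then
      sleep 10
    fi
  done
}

# usage inside the postDrain script: drain_wait "$INSTANCE_NAME"
```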
