snc-library: Delete the failed pods before checking for available ones #921
base: master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.

/cherry-pick release-4.16
@praveenkumar: once the present PR merges, I will cherry-pick it on top of release-4.16 in a new PR and assign it to you. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@@ -243,6 +243,7 @@ function no_operators_degraded() {

 function all_pods_are_running_completed() {
     local ignoreNamespace=$1
+    ${OC} delete pods --field-selector=status.phase=Failed -A
     ! ${OC} get pod --no-headers --all-namespaces --field-selector=metadata.namespace!="${ignoreNamespace}" | grep -v Running | grep -v Completed
My understanding of `all_pods_are_running_completed` is that it tells us if there are pods which are neither in the `Running` state nor in the `Completed` state. If we start deleting these pods before doing the check, is it worth keeping the check at all?

Which pods get to this weird `ContainerStatusUnknown` state? Is there any explanation for this? Maybe we can only delete pods in this `ContainerStatusUnknown` state if we are sure it's always a false positive for `all_pods_are_running_completed`?
> My understanding of `all_pods_are_running_completed` is that it tells us if there are pods which are neither in the `Running` state nor in the `Completed` state. If we start deleting these pods before doing the check, is it worth keeping the check at all?

We are only deleting the pods which are in the `Failed` phase; a pod can also be in other phases such as `Pending` (https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-phase).
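For reference, pods can be listed and deleted per phase with a field selector; a quick sketch of this behaviour (assuming `oc` is on the PATH, the same works with `kubectl`):

```
# List pods grouped by phase across all namespaces; only the Failed ones are
# targeted by the deletion added in this PR, the other phases stay untouched.
for phase in Pending Running Succeeded Failed Unknown; do
    echo "== pods in phase: ${phase} =="
    oc get pods -A --no-headers --field-selector=status.phase="${phase}"
done

# The deletion only matches pods whose phase is Failed.
oc delete pods -A --field-selector=status.phase=Failed
```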
> Which pods get to this weird `ContainerStatusUnknown` state? Is there any explanation for this?

We did try to describe those pods but didn't get a proper explanation, we just get the phase as `Failed` :(
> Maybe we can only delete pods in this `ContainerStatusUnknown` state if we are sure it's always a false positive for `all_pods_are_running_completed`?

When a container goes to this state its phase becomes `Failed`, and that is what the `--field-selector` option matches on. There is no field-selector support for the `ContainerStatusUnknown` state. It shouldn't produce a false positive because the check makes sure there is no failed container, but also no container left in the `Pending` state.
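Indeed, field selectors only understand a handful of pod fields (`status.phase` among them), not container state reasons. If one ever wanted to target only the `ContainerStatusUnknown` pods, it would have to be filtered client-side; a sketch of that idea (not what this PR does):

```
# Delete only the pods whose STATUS column shows ContainerStatusUnknown.
# The STATUS column is the 4th field of `oc get pods -A --no-headers` output.
oc get pods -A --no-headers \
    | awk '$4 == "ContainerStatusUnknown" {print $1, $2}' \
    | while read -r namespace name; do
          oc delete pod -n "${namespace}" "${name}"
      done
```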
Do we have a strict equivalence between `Failed` and `ContainerStatusUnknown`? In other words, if a container is in the `Failed` phase, it always has `ContainerStatusUnknown`, and if a container has `ContainerStatusUnknown`, it is always in the `Failed` phase. The latter is true, but is the former also true: `Failed` implies `ContainerStatusUnknown`, and nothing else? We don't want to ignore `Failed` containers which are not `ContainerStatusUnknown`.
> Do we have a strict equivalence between `Failed` and `ContainerStatusUnknown`? In other words, if a container is in the `Failed` phase, it always has `ContainerStatusUnknown`, and if a container has `ContainerStatusUnknown`, it is always in the `Failed` phase.

The second assumption is correct: if a container has `ContainerStatusUnknown` then it is always in the `Failed` phase.
> The latter is true, but is the former also true: `Failed` implies `ContainerStatusUnknown`, and nothing else? We don't want to ignore `Failed` containers which are not `ContainerStatusUnknown`.

There can be many reasons for a container to be in the `Failed` phase, like resource limits or a dependent resource not being available, and most of the time it retries and moves to another state like `ContainerCreating` or `Pending`. If we delete a `Failed` pod which is associated with a deployment (which is the case for all the operators), then a new pod comes up. Even when a pod in the `ContainerStatusUnknown` state is deleted, if it is associated with a deployment a new one will come up, so it is not like we are ignoring it; we are making sure that once we delete the failed pods, the new pods created by the respective deployments come up green.
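A short sketch of the behaviour described above, assuming the failed pod is owned by a Deployment; the namespace and deployment names here are purely illustrative:

```
# Hypothetical names, for illustration only.
NAMESPACE="openshift-example-operator"
DEPLOYMENT="example-operator"

# Deleting a Failed pod that is owned by a Deployment (via its ReplicaSet)
# does not lose anything: the controller immediately creates a replacement.
oc delete pods -n "${NAMESPACE}" --field-selector=status.phase=Failed

# Waiting on the rollout is one way to confirm the replacement pod is healthy.
oc rollout status -n "${NAMESPACE}" "deployment/${DEPLOYMENT}" --timeout=5m
```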
With this explanation, it makes sense to add a `retry_failed_pods` method which will do the `oc delete`. It's really unexpected that the method `all_pods_are_running_or_completed` would delete pods or ensure the failed pods are restarted.
Added `retry_failed_pods`, and this method is called from `all_pods_are_running_or_completed`.
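For readers following along, a minimal sketch of what such a helper could look like, reusing the `${OC}` and `retry` conventions already present in snc-library.sh; the actual implementation in the PR may differ:

```
function retry_failed_pods() {
    # Failed pods (including those stuck in ContainerStatusUnknown) are deleted
    # so that their owning controllers can recreate them.
    retry ${OC} delete pods --field-selector=status.phase=Failed -A
}

function all_pods_are_running_completed() {
    local ignoreNamespace=$1
    retry_failed_pods
    ! ${OC} get pod --no-headers --all-namespaces --field-selector=metadata.namespace!="${ignoreNamespace}" | grep -v Running | grep -v Completed
}
```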
If you call `retry_failed_pods` from `all_pods_are_running_or_completed`, you still have the same issue I described in the previous comment:

> It's really unexpected that the method `all_pods_are_running_or_completed` would delete pods or ensure the failed pods are restarted.
We already delete the failed pods here (line 271 in 26b44f5):

    retry ${OC} delete pods --field-selector=status.phase=Failed -A

Are there new failed pods appearing while we are waiting for all the pods to be ready?

`all_pods_are_running_or_completed` has an `ignore_namespace` parameter, should we also ignore that namespace when deleting?
> Are there new failed pods appearing while we are waiting for all the pods to be ready?

@cfergeau yes, that's the reason we are putting it in the `all_pods_are_running_or_completed` function.

> `all_pods_are_running_or_completed` has an `ignore_namespace` parameter, should we also ignore that namespace when deleting?

No, we don't need to ignore the namespace when deleting failed pods.
> @cfergeau yes, that's the reason we are putting it in the `all_pods_are_running_or_completed` function.

So even something like this is not enough?
diff --git a/snc-library.sh b/snc-library.sh
index f3f2831..3997a06 100755
--- a/snc-library.sh
+++ b/snc-library.sh
@@ -248,13 +248,14 @@ function wait_till_cluster_stable() {
fi
if [ $count -eq $numConsecutive ]; then
echo "Cluster has stabilized"
- retry ${OC} delete pods --field-selector=status.phase=Failed -A
break
fi
sleep 30s
a=$((a + 1))
done
+ retry ${OC} delete pods --field-selector=status.phase=Failed -A
+
# Wait till all the pods are either running or complete state
retry all_pods_are_running_completed "${ignoreNamespace}"
}
The removal of this `oc delete pods` could be part of this PR.

> No, we don't need to ignore the namespace when deleting failed pods.

Why? This makes `all_pods_are_running_or_completed` even more odd and confusing.
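If the ignored namespace were to be excluded from the deletion as well, field selectors can be combined in a single call; a sketch of that option (not what the PR currently does):

```
# Field selectors are comma-separated, so the Failed-phase filter and the
# namespace exclusion can be expressed together.
retry ${OC} delete pods -A \
    --field-selector="status.phase=Failed,metadata.namespace!=${ignoreNamespace}"
```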
Sometimes pods go to the `ContainerStatusUnknown` state, where the pod is not able to send its status to the kubelet; it stays there until manually deleted, and because of that our snc script fails. In this PR we delete the pods which are in the `Failed` phase (which is the case for the `ContainerStatusUnknown` ones) and then check the pods' availability.

```
+ sleep 256
+ all_pods_are_running_completed none
+ local ignoreNamespace=none
+ ./openshift-clients/linux/oc get pod --no-headers --all-namespaces '--field-selector=metadata.namespace!=none'
+ grep -v Running
+ grep -v Completed
openshift-kube-apiserver   installer-11-crc   0/1   ContainerStatusUnknown   1   19m
+ exit=1
+ wait=512
+ count=10
+ '[' 10 -lt 10 ']'
+ echo 'Retry 10/10 exited 1, no more retries left.'
Retry 10/10 exited 1, no more retries left.
```

fixes: crc-org#920
/cherry-pick release-4.17
@praveenkumar: once the present PR merges, I will cherry-pick it on top of release-4.17 in a new PR and assign it to you. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@praveenkumar: The following tests failed, say `/retest` to rerun all failed tests:

Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.