Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bad flakiness in e2e tests on Autopilot 1.29 #3746

Open
markmandel opened this issue Apr 5, 2024 · 4 comments
Open

bad flakiness in e2e tests on Autopilot 1.29 #3746

markmandel opened this issue Apr 5, 2024 · 4 comments
Assignees
Labels
kind/bug These are bugs.

Comments

@markmandel
Copy link
Member

Seems to be pretty consistent on these tests:

time="2024-04-05 17:26:01.545" level=info msg="Checking Fleet Ready replicas" expected=5 fleet=simple-fleet-1.0l4pgd fleetStatus="{Replicas:0 ReadyReplicas:0 ReservedReplicas:0 AllocatedReplicas:0 Players:0xc000785730 Counters:map[] Lists:map[]}" test=TestAllocatorWithSelectors
time="2024-04-05 17:26:03.546" level=info msg="Checking Fleet Ready replicas" expected=5 fleet=simple-fleet-1.0l4pgd fleetStatus="{Replicas:0 ReadyReplicas:0 ReservedReplicas:0 AllocatedReplicas:0 Players:0xc00047cfb0 Counters:map[] Lists:map[]}" test=TestAllocatorWithSelectors
time="2024-04-05 17:26:05.545" level=info msg="Checking Fleet Ready replicas" expected=5 fleet=simple-fleet-1.0l4pgd fleetStatus="{Replicas:0 ReadyReplicas:0 ReservedReplicas:0 AllocatedReplicas:0 Players:0xc00059f890 Counters:map[] Lists:map[]}" test=TestAllocatorWithSelectors
time="2024-04-05 17:26:07.544" level=info msg="Checking Fleet Ready replicas" expected=5 fleet=simple-fleet-1.0l4pgd fleetStatus="{Replicas:0 ReadyReplicas:0 ReservedReplicas:0 AllocatedReplicas:0 Players:0xc000785bd0 Counters:map[] Lists:map[]}" test=TestAllocatorWithSelectors
time="2024-04-05 17:26:09.155" level=info msg="error waiting for fleet condition, dumping Fleet and Gameserver data" error="context deadline exceeded" fleet=simple-fleet-1.0l4pgd fleetStatus="{Replicas:0 ReadyReplicas:0 ReservedReplicas:0 AllocatedReplicas:0 Players:<nil> Counters:map[] Lists:map[]}" test=TestAllocatorWithSelectors
time="2024-04-05 17:26:09.155" level=info msg="Dumping Events:" fleet=simple-fleet-1.0l4pgd kind= test=TestAllocatorWithSelectors
time="2024-04-05 17:26:09.292" level=info msg="Event!" fleet=simple-fleet-1.0l4pgd lastTimestamp="2024-04-05 17:16:09 +0000 UTC" message="Created GameServerSet simple-fleet-1.0l4pgd-qzg65" reason=CreatingGameServerSet test=TestAllocatorWithSelectors type=Normal
    framework.go:346: 
        	Error Trace:	/go/src/agones.dev/agones/test/e2e/framework/framework.go:346
        	            				/go/src/agones.dev/agones/test/e2e/allocator_test.go:172
        	Error:      	Received unexpected error:
        	            	context deadline exceeded
        	Test:       	TestAllocatorWithSelectors
        	Messages:   	error waiting for fleet condition on fleet: simple-fleet-1.0l4pgd
--- FAIL: TestAllocatorWithSelectors (602.32s)

Also

time="2024-04-05 17:06:04.671" level=info msg="ServiceAccount 1712336764/agones-sdk is created"
time="2024-04-05 17:06:04.846" level=info msg="RoleBinding 1712336764/agones-sdk-access is created"
panic: test timed out after 30m0s
running tests:
	TestRestAllocatorWithDeprecatedRequired (9m55s)

goroutine 964 [running]:
testing.(*M).startAlarm.func1()
	/usr/local/go/src/testing/testing.go:2259 +0x259
created by time.goFunc
	/usr/local/go/src/time/sleep.go:176 +0x45

goroutine 1 [chan receive, 11 minutes]:
testing.(*T).Run(0xc000105860, {0x2a4a031, 0x27}, 0x2b1a110)
	/usr/local/go/src/testing/testing.go:1649 +0x871
testing.runTests.func1(0x0?)
	/usr/local/go/src/testing/testing.go:2054 +0x85
testing.tRunner(0xc000105860, 0xc000821868)
	/usr/local/go/src/testing/testing.go:1595 +0x262
testing.runTests(0xc0001aae60?, {0x3c47920, 0x61, 0x61}, {0xc0008218e0?, 0x441c74?, 0x3c59f20?})
	/usr/local/go/src/testing/testing.go:2052 +0x8ae
testing.(*M).Run(0xc0001aae60)
	/usr/local/go/src/testing/testing.go:1925 +0xcd8
agones.dev/agones/test/e2e.TestMain(0x10c20f37eb6e55c0?)
	/go/src/agones.dev/agones/test/e2e/main_test.go:94 +0xaa5
main.main()
	_testmain.go:241 +0x308

goroutine 656 [select]:
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext({0x2db0398, 0xc00028c380}, {0x2da2678?, 0xc00012e240}, 0x1, 0x0, 0x0?)
	/go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:91 +0x3d3
k8s.io/apimachinery/pkg/util/wait.PollUntilContextTimeout({0x2db0248, 0x3c8a900}, 0x77359400, 0x1?, 0x1?, 0x5?)
	/go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:48 +0xd9
agones.dev/agones/test/e2e/framework.(*Framework).WaitForFleetCondition(0xc0003f4ab0, 0xc000524b60, 0xc0001df200, 0xc000aa1d40)
	/go/src/agones.dev/agones/test/e2e/framework/framework.go:353 +0x365
agones.dev/agones/test/e2e/framework.(*Framework).AssertFleetCondition(0x2d95540?, 0xc000524b60?, 0xc0001df200, 0x0?)
	/go/src/agones.dev/agones/test/e2e/framework/framework.go:345 +0x5a
agones.dev/agones/test/e2e.TestRestAllocatorWithDeprecatedRequired(0xc000524b60)
	/go/src/agones.dev/agones/test/e2e/allocator_test.go:252 +0x325
testing.tRunner(0xc000524b60, 0x2b1a110)
	/usr/local/go/src/testing/testing.go:1595 +0x262
created by testing.(*T).Run in goroutine 1
	/usr/local/go/src/testing/testing.go:1648 +0x846

Retry might be working in some cases, but this panic is definitely stopping retries -- and 30m is a long time for a 5 GameServer fleet not to spin up. It's very odd.

Some links to failed builds:

@markmandel markmandel added the kind/bug These are bugs. label Apr 5, 2024
@markmandel
Copy link
Member Author

Just looking in the cluster, I do see:

markmandel@cloudshell:~ (agones-images)$ kubectl get ns
NAME                       STATUS        AGE
1711579203                 Terminating   8d
1711696994                 Terminating   7d10h
1711795766                 Terminating   6d7h
1712076562                 Terminating   3d1h
1712340564                 Active        6m34s
agones-system              Active        17d
default                    Active        43d
gke-gmp-system             Active        43d
gke-managed-cim            Active        43d
gke-managed-filestorecsi   Active        43d
gke-managed-system         Active        43d
gmp-public                 Active        43d
kube-node-lease            Active        43d
kube-public                Active        43d
kube-system                Active        43d

...which is weird.

markmandel@cloudshell:~ (agones-images)$ kubectl describe ns 1711579203
Name:         1711579203
Labels:       kubernetes.io/metadata.name=1711579203
              owner=e2e-test
Annotations:  <none>
Status:       Terminating
Conditions:
  Type                                         Status  LastTransitionTime               Reason                 Message
  ----                                         ------  ------------------               ------                 -------
  NamespaceDeletionDiscoveryFailure            False   Fri, 05 Apr 2024 16:43:44 +0000  ResourcesDiscovered    All resources successfully discovered
  NamespaceDeletionGroupVersionParsingFailure  False   Wed, 27 Mar 2024 23:07:22 +0000  ParsedGroupVersions    All legacy kube types successfully parsed
  NamespaceDeletionContentFailure              True    Wed, 27 Mar 2024 23:07:22 +0000  ContentDeletionFailed  Failed to delete all resource types, 1 remaining: unexpected items still remain in namespace: 1711579203 for gvr: /v1, Resource=pods
  NamespaceContentRemaining                    True    Wed, 27 Mar 2024 23:07:22 +0000  SomeResourcesRemain    Some resources are remaining: gameservers.agones.dev has 12 resource instances, pods. has 12 resource instances
  NamespaceFinalizersRemaining                 True    Wed, 27 Mar 2024 23:07:22 +0000  SomeFinalizersRemain   Some content in the namespace has finalizers remaining: agones.dev in 12 resource instances
markmandel@cloudshell:~ (agones-images)$ kubectl get pods -n 1711579203
NAME                          READY   STATUS        RESTARTS   AGE
preferred-w4gnw-dkjjx-5s6bs   0/2     Terminating   0          8d
preferred-w4gnw-dkjjx-8h7wx   0/2     Terminating   0          8d
preferred-w4gnw-dkjjx-bkd7m   0/2     Terminating   0          8d
preferred-w4gnw-dkjjx-c4bdk   0/2     Terminating   0          8d
preferred-w4gnw-dkjjx-dj84f   0/2     Terminating   0          8d
preferred-w4gnw-dkjjx-gjhw7   0/2     Terminating   0          8d
preferred-w4gnw-dkjjx-npfvd   0/2     Terminating   0          8d
preferred-w4gnw-dkjjx-qnp8n   0/2     Terminating   0          8d
preferred-w4gnw-dkjjx-sl5qm   0/2     Terminating   0          8d
preferred-w4gnw-dkjjx-tmdqt   0/2     Terminating   0          8d
preferred-w4gnw-dkjjx-tx5wf   0/2     Terminating   0          8d
preferred-w4gnw-dkjjx-wcs2m   0/2     Terminating   0          8d
markmandel@cloudshell:~ (agones-images)$ kubectl get pod -n 1711579203 preferred-w4gnw-dkjjx-5s6bs -o yaml 
apiVersion: v1
kind: Pod
metadata:
  annotations:
    agones.dev/container: game-server
    agones.dev/ready-container-id: containerd://f39362db1ec461c97ae03e1ca5d20df5d946c107d425ee1dd20bdebac57ffc78
    agones.dev/sdk-version: 1.40.0-dev-5c2e964
    autopilot.gke.io/host-port-assignment: '{"min":7000,"max":8000,"portsAssigned":{"1":7240}}'
    autopilot.gke.io/primary-container: game-server
    autopilot.gke.io/resource-adjustment: '{"input":{"containers":[{"requests":{"cpu":"30m"},"name":"agones-gameserver-sidecar"},{"limits":{"cpu":"30m","memory":"32Mi"},"requests":{"cpu":"30m","memory":"32Mi"},"name":"game-server"}]},"output":{"containers":[{"limits":{"ephemeral-storage":"1Gi"},"requests":{"cpu":"30m","ephemeral-storage":"1Gi","memory":"2Gi"},"name":"agones-gameserver-sidecar"},{"limits":{"cpu":"470m","ephemeral-storage":"1Gi","memory":"62Mi"},"requests":{"cpu":"470m","ephemeral-storage":"1Gi","memory":"62Mi"},"name":"game-server"}]},"modified":true}'
    autopilot.gke.io/warden-version: 2.9.20
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
  creationTimestamp: "2024-03-27T22:54:37Z"
  deletionGracePeriodSeconds: 30
  deletionTimestamp: "2024-03-27T23:06:09Z"
  labels:
    agones.dev/gameserver: preferred-w4gnw-dkjjx-5s6bs
    agones.dev/role: gameserver
    agones.dev/safe-to-evict: "false"
  name: preferred-w4gnw-dkjjx-5s6bs
  namespace: "1711579203"
  ownerReferences:
  - apiVersion: agones.dev/v1
    blockOwnerDeletion: true
    controller: true
    kind: GameServer
    name: preferred-w4gnw-dkjjx-5s6bs
    uid: ec61fed6-63c7-4c80-9a85-9e7dc70d5015
  resourceVersion: "47363486"
  uid: d608d658-758d-4a07-960b-7ada2526a6ad
spec:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchLabels:
              agones.dev/role: gameserver
          topologyKey: kubernetes.io/hostname
        weight: 100
  containers:
  - args:
    - --grpc-port=9357
    - --http-port=9358
    env:
    - name: GAMESERVER_NAME
      value: preferred-w4gnw-dkjjx-5s6bs
    - name: POD_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: FEATURE_GATES
      value: CountsAndLists=true&DisableResyncOnSDKServer=false&Example=true&FleetAllocationOverflow=false&GKEAutopilotExtendedDurationPods=true&PlayerAllocationFilter=true&PlayerTracking=true
    - name: LOG_LEVEL
      value: Info
    image: us-docker.pkg.dev/agones-images/ci/agones-sdk:1.40.0-dev-5c2e964
    imagePullPolicy: Always
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /healthz
        port: 8080
        scheme: HTTP
      initialDelaySeconds: 3
      periodSeconds: 3
      successThreshold: 1
      timeoutSeconds: 1
    name: agones-gameserver-sidecar
    resources:
      limits:
        ephemeral-storage: 1Gi
      requests:
        cpu: 30m
        ephemeral-storage: 1Gi
        memory: 2Gi
    securityContext:
      capabilities:
        drop:
        - NET_RAW
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-8jf5m
      readOnly: true
  - env:
    - name: AGONES_SDK_GRPC_PORT
      value: "9357"
    - name: AGONES_SDK_HTTP_PORT
      value: "9358"
    image: us-docker.pkg.dev/agones-images/examples/simple-game-server:0.28
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /gshealthz
        port: 8080
        scheme: HTTP
      initialDelaySeconds: 5
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 1
    name: game-server
    ports:
    - containerPort: 7654
      hostPort: 7240
      protocol: UDP
    resources:
      limits:
        cpu: 470m
        ephemeral-storage: 1Gi
        memory: 62Mi
      requests:
        cpu: 470m
        ephemeral-storage: 1Gi
        memory: 62Mi
    securityContext:
      capabilities:
        drop:
        - NET_RAW
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: empty
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostname: preferred-w4gnw-dkjjx-5s6bs
  nodeName: gk3-gke-autopilot-e2e-te-nap-qpzyqh5z-032f3139-lcj6
  nodeSelector:
    cloud.google.com/extended-duration-pods: "0"
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: gke.io/optimize-utilization-scheduler
  securityContext:
    seccompProfile:
      type: Unconfined
  serviceAccount: agones-sdk
  serviceAccountName: agones-sdk
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  - effect: NoSchedule
    key: kubernetes.io/arch
    operator: Equal
    value: amd64
  volumes:
  - emptyDir: {}
    name: empty
  - name: kube-api-access-8jf5m
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-03-27T23:10:10Z"
    status: "False"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2024-03-27T22:57:02Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-03-27T23:10:10Z"
    message: 'containers with unready status: [agones-gameserver-sidecar game-server]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-03-27T23:10:10Z"
    message: 'containers with unready status: [agones-gameserver-sidecar game-server]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-03-27T22:57:02Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://72f4fa8e6a90b882fe3be9eac808c0beed140feb526b22790f446fac46505c7a
    image: us-docker.pkg.dev/agones-images/ci/agones-sdk:1.40.0-dev-5c2e964
    imageID: us-docker.pkg.dev/agones-images/ci/agones-sdk@sha256:9aa78a749ccdf2b9b6c05484066e2d6f58e40843e87b218ec17cade8c04e267d
    lastState: {}
    name: agones-gameserver-sidecar
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://72f4fa8e6a90b882fe3be9eac808c0beed140feb526b22790f446fac46505c7a
        exitCode: 137
        finishedAt: "2024-03-27T23:06:09Z"
        reason: Error
        startedAt: "2024-03-27T22:57:12Z"
  - containerID: containerd://f39362db1ec461c97ae03e1ca5d20df5d946c107d425ee1dd20bdebac57ffc78
    image: us-docker.pkg.dev/agones-images/examples/simple-game-server:0.28
    imageID: us-docker.pkg.dev/agones-images/examples/simple-game-server@sha256:81d04419aa309fdb553a13be0614540da64a7df2ff9701f38a35c68398fe92e8
    lastState: {}
    name: game-server
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://f39362db1ec461c97ae03e1ca5d20df5d946c107d425ee1dd20bdebac57ffc78
        exitCode: 0
        finishedAt: "2024-03-27T23:05:39Z"
        reason: Completed
        startedAt: "2024-03-27T22:57:17Z"
  hostIP: 10.132.15.220
  hostIPs:
  - ip: 10.132.15.220
  phase: Running
  podIP: 10.64.134.69
  podIPs:
  - ip: 10.64.134.69
  qosClass: Burstable
  startTime: "2024-03-27T22:57:02Z"

So the pods are stuck for some reason?

@markmandel
Copy link
Member Author

kubectl delete pod --all --force --grace-period=0 -n <namespace> did work to cleanup the Pods though.

@markmandel
Copy link
Member Author

I did recently notice:

markmandel@cloudshell:~ (agones-images)$ kubectl get crds | grep agones
fleetautoscalers.autoscaling.agones.dev                2024-04-03T07:47:43Z
fleets.agones.dev                                      2024-04-03T07:47:42Z
gameserverallocationpolicies.multicluster.agones.dev   2024-04-03T07:47:43Z
gameserversets.agones.dev                              2024-04-03T07:47:44Z

But that seems to be after cleanup happened? GameServers where definitely there before.

Can CRDs go into a Terminating state?

@zmerlynn zmerlynn self-assigned this Apr 9, 2024
@zmerlynn
Copy link
Collaborator

zmerlynn commented Apr 9, 2024

Tracking this in an internal bug, will report back.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug These are bugs.
Projects
None yet
Development

No branches or pull requests

2 participants