Bubble reconcile control plane reconcile failure to cluster status #7745

Open
wants to merge 1 commit into main

@cxbrowne1207 (Member) commented Feb 29, 2024

Issue #, if available:

Description of changes:
We encountered an issue where an upgrade completed even though the controller had failed, because the failure was not bubbled up to the cluster status. With these changes, the failure would have been surfaced in the status and the upgrade would have failed, as it should. Other phases in the controller reconciliation need the same treatment, but for now this PR only covers the control plane reconciliation phase.

In this PR, we set the failure message in the reconciliation phase where the error occurs, and then clear it right after. We don't want the failure message to persist for too long, especially if the next reconciliation phase returns a re-queue signal and lasts a while (essentially waiting for some state to be met).

Testing (if applicable):

  • Unit tests
  • Ran the TestVSphereKubernetes128BottlerocketTo129StackedEtcdUpgrade test with the controller build that encounters an error during control plane reconciliation

Cluster with failureMessage and failureReason

[ec2-user@ip-172-31-61-197 eks-anywhere]$ k get clusters -o yaml
apiVersion: v1
items:
- apiVersion: anywhere.eks.amazonaws.com/v1alpha1
  kind: Cluster
  metadata:
    annotations:
      anywhere.eks.amazonaws.com/eksa-cilium: ""
      anywhere.eks.amazonaws.com/management-components-version: v0.19.0-dev+latest
    creationTimestamp: "2024-02-29T16:40:14Z"
    finalizers:
    - clusters.anywhere.eks.amazonaws.com/finalizer
    generation: 2
    name: eksa-test-67419a2
    namespace: default
    resourceVersion: "5183"
    uid: 60cfd82b-43ec-4b67-9858-15458fc90f26
  spec:
    clusterNetwork:
      cniConfig:
        cilium: {}
      dns: {}
      pods:
        cidrBlocks:
        - 192.168.0.0/16
      services:
        cidrBlocks:
        - 10.96.0.0/12
    controlPlaneConfiguration:
      count: 1
      endpoint:
        host: 195.17.199.69
      machineGroupRef:
        kind: VSphereMachineConfig
        name: eksa-test-67419a2-cp
      machineHealthCheck:
        maxUnhealthy: 100%
    datacenterRef:
      kind: VSphereDatacenterConfig
      name: eksa-test-67419a2
    eksaVersion: v0.19.0-dev+latest
    kubernetesVersion: "1.29"
    machineHealthCheck:
      maxUnhealthy: 100%
      nodeStartupTimeout: 10m0s
      unhealthyMachineTimeout: 5m0s
    managementCluster:
      name: eksa-test-67419a2
    workerNodeGroupConfigurations:
    - count: 1
      machineGroupRef:
        kind: VSphereMachineConfig
        name: eksa-test-67419a2
      machineHealthCheck:
        maxUnhealthy: 40%
      name: md-0
  status:
    childrenReconciledGeneration: 3
    conditions:
    - lastTransitionTime: "2024-02-29T16:40:42Z"
      status: "True"
      type: Ready
    - lastTransitionTime: "2024-02-29T16:40:14Z"
      status: "True"
      type: ControlPlaneInitialized
    - lastTransitionTime: "2024-02-29T16:40:42Z"
      status: "True"
      type: ControlPlaneReady
    - lastTransitionTime: "2024-02-29T16:40:30Z"
      status: "True"
      type: DefaultCNIConfigured
    - lastTransitionTime: "2024-02-29T16:40:14Z"
      status: "True"
      type: WorkersReady
    failureMessage: 'applying control plane objects: failed to reconcile object controlplane.cluster.x-k8s.io/v1beta1,
      Kind=KubeadmControlPlane, eksa-system/eksa-test-67419a2: admission webhook "validation.kubeadmcontrolplane.controlplane.cluster.x-k8s.io"
      denied the request: KubeadmControlPlane.cluster.x-k8s.io "eksa-test-67419a2"
      is invalid: spec.kubeadmConfigSpec.clusterConfiguration.featureGates.EtcdLearnerMode:
      Forbidden: cannot be modified'
    failureReason: ControlPlaneReconciliationError
    observedGeneration: 2
    reconciledGeneration: 1
kind: List
metadata:
  resourceVersion: ""

Documentation added/planned (if applicable):

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@eks-distro-bot (Collaborator)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from cxbrowne1207. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@eks-distro-bot (Collaborator)

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@cxbrowne1207 (Member Author)

/test all

@eks-distro-bot added the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) on Feb 29, 2024
@cxbrowne1207 force-pushed the bubble-reconcile-failure-to-status branch 2 times, most recently from ade054e to 3a533f4, on February 29, 2024 16:23

codecov bot commented Feb 29, 2024

Codecov Report

Attention: Patch coverage is 81.13208%, with 10 lines in your changes missing coverage. Please review.

Project coverage is 73.63%. Comparing base (4583834) to head (83152c5).
Report is 257 commits behind head on main.

Files                                               Patch %   Lines
pkg/providers/vsphere/reconciler/reconciler.go      60.00%    4 Missing ⚠️
pkg/providers/docker/reconciler/reconciler.go       80.00%    2 Missing ⚠️
pkg/providers/snow/reconciler/reconciler.go         80.00%    2 Missing ⚠️
pkg/providers/tinkerbell/reconciler/reconciler.go   80.00%    2 Missing ⚠️

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7745      +/-   ##
==========================================
+ Coverage   73.48%   73.63%   +0.14%     
==========================================
  Files         579      588       +9     
  Lines       36357    37187     +830     
==========================================
+ Hits        26718    27383     +665     
- Misses       7875     8015     +140     
- Partials     1764     1789      +25     


@cxbrowne1207 force-pushed the bubble-reconcile-failure-to-status branch from 3a533f4 to 83152c5 on February 29, 2024 16:51
@cxbrowne1207 marked this pull request as ready for review on February 29, 2024 21:19
@cxbrowne1207 (Member Author)

/hold

Labels: do-not-merge/hold, size/L (denotes a PR that changes 100-499 lines, ignoring generated files)