Backup implementation: controller, plugins, collection, webhooks #841

Open

zerospiel wants to merge 2 commits into main from backup_impl_2
Conversation

@zerospiel (Contributor) commented Dec 27, 2024:

  • rename Backup to ManagementBackup
  • remove the Oneshot parameter from the Spec
  • reconcile scheduled backups (collect statuses, create schedules, etc.)
  • reconcile backups (collect statuses, create velero backups)
  • collect the required velero backup spec for the whole backup (see the sketch after this list)
  • label Credential references (clusterIdentities) in order to include them in the backup
  • backup validation webhook
  • backup controller watches velero resources
  • amend backup controller logic to better handle scheduled and non-scheduled backups
  • set velero-maintained plugins settings
  • add custom plugins set via the mgmt spec
  • reconcile all the velero plugins either during the installation or based on the BSL objects existing in the cluster
  • rename k0smotron-related provider labels to the correct ones from the k0sproject
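
For orientation, a minimal sketch of how the collected backup spec could be wrapped into a velero Schedule, assuming the velero v1 Go API; the function name, label key, and TTL below are illustrative placeholders, not the values this PR uses:

package backup

import (
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

    velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
)

// buildManagementSchedule wraps the collected backup spec into a velero
// Schedule; the cron descriptor (e.g. "@every 5m") would come from
// .spec.backup.schedule.
func buildManagementSchedule(name, cronSpec string) *velerov1.Schedule {
    return &velerov1.Schedule{
        ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: "velero"},
        Spec: velerov1.ScheduleSpec{
            Schedule: cronSpec,
            Template: velerov1.BackupSpec{
                // Select every object carrying one of the backup labels the
                // controller applies (e.g. on Credential-referenced
                // clusterIdentities); this key is purely illustrative.
                OrLabelSelectors: []*metav1.LabelSelector{
                    {MatchLabels: map[string]string{"example.com/managed-backup": "true"}},
                },
                // Arbitrary retention chosen for the sketch.
                TTL: metav1.Duration{Duration: 30 * 24 * time.Hour},
            },
        },
    }
}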

-- TEMPORARILY NOT VALID --

NOTE: Because of the changes introduced in #699 (mutation), it is currently impossible to restore the full management on an empty kind installation without extra steps. The restoration process is as follows:

$ kubectl patch deploy -n velero velero \
--type='json' \
--patch='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--restore-resource-priorities=\"customresourcedefinitions,namespaces,storageclasses,volumesnapshotclass.snapshot.storage.k8s.io,volumesnapshotcontents.snapshot.storage.k8s.io,volumesnapshots.snapshot.storage.k8s.io,datauploads.velero.io,persistentvolumes,persistentvolumeclaims,serviceaccounts,secrets,configmaps,limitranges,pods,replicasets.apps,clusterclasses.cluster.x-k8s.io,endpoints,services,providertemplates.hmc.mirantis.com,servicetemplates.hmc.mirantis.com,-,clusterbootstraps.run.tanzu.vmware.com,clusters.cluster.x-k8s.io,clusterresourcesets.addons.cluster.x-k8s.io,apps.kappctrl.k14s.io,packageinstalls.packaging.carvel.dev\""}]'
  • perform a velero restore (e.g. velero restore create <name> --existing-resource-policy update --from-backup <backup-name>)
  • wait for the restore to reach the Completed state (e.g. with velero restore get; a programmatic equivalent is sketched after this list)
  • probably a temporary manual step: restart the hmc-controller-manager and capi-controller-manager deployments (the latter is likely needed because of a bug in the former)
  • done
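
A hedged programmatic equivalent of the "wait for Completed" step, using the velero v1 API types with a controller-runtime client; waitForRestore and the polling intervals are ad-hoc choices for this sketch, not part of this PR:

package backup

import (
    "context"
    "time"

    "k8s.io/apimachinery/pkg/util/wait"
    "sigs.k8s.io/controller-runtime/pkg/client"

    velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
)

// waitForRestore polls the named velero Restore until it reaches the
// Completed phase, mirroring what `velero restore get` is used for above.
func waitForRestore(ctx context.Context, c client.Client, name string) error {
    return wait.PollUntilContextTimeout(ctx, 10*time.Second, 15*time.Minute, true,
        func(ctx context.Context) (bool, error) {
            restore := &velerov1.Restore{}
            if err := c.Get(ctx, client.ObjectKey{Namespace: "velero", Name: name}, restore); err != nil {
                // Treat lookup errors as "not ready yet" and keep polling.
                return false, nil
            }
            return restore.Status.Phase == velerov1.RestorePhaseCompleted, nil
        })
}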

To properly test the feature:

  • have a k0rdent instance with this PR
  • enable backup in the Management object by setting .spec.backup to something like {enabled: true, schedule: "@every 5m"} (a rough Go sketch of this spec shape follows these steps)
  • install a BackupStorageLocation for velero, e.g.:
---
apiVersion: v1 # optional; the velero namespace should already exist at this point
kind: Namespace
metadata:
  labels:
    kubernetes.io/metadata.name: velero
    name: velero
  name: velero
spec: {}
---
apiVersion: v1
data:
  # base64-encoded AWS credentials file; decoded, it contains e.g. these 3 lines:
  # [default]
  # aws_access_key_id = <key>
  # aws_secret_access_key = <secret_key>
  cloud: <base64>
kind: Secret
metadata:
  name: cloud-credentials
  namespace: velero
type: Opaque
---
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  config:
    region: <eu-central-1> # region of the S3 bucket
  default: true
  objectStorage:
    bucket: <bucket-name> # bucket name
  provider: aws
  credential:
    name: cloud-credentials
    key: cloud
  • wait for the scheduled backup to be created
  • fully drop the k0rdent instance (to imitate a disaster)
  • perform the aforementioned restoration steps
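
For orientation only, a rough Go sketch of the spec shape the .spec.backup example above implies; apart from enabled and schedule, which appear in this description, every field and type name here is an assumption, not the PR's actual API:

// package v1alpha1 is a plausible home for such a type (assumption).
package v1alpha1

// Backup is a hypothetical rendering of the Management .spec.backup block.
type Backup struct {
    // Enabled turns backup reconciliation on or off.
    Enabled bool `json:"enabled"`
    // Schedule accepts cron syntax or descriptors such as "@every 5m".
    Schedule string `json:"schedule"`
    // CustomPlugins would carry the "custom plugins set via mgmt spec"
    // feature (plugin name -> image); its shape is assumed.
    CustomPlugins map[string]string `json:"customPlugins,omitempty"`
}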

#814 (implement backups reconciliation)

@zerospiel linked an issue on Dec 27, 2024 that may be closed by this pull request
@zerospiel force-pushed the backup_impl_2 branch 2 times, most recently from 80beb93 to 9694287 on December 30, 2024 12:16
@zerospiel changed the title from "Backup impl 2" to "Backup implementation: controller, plugins, collection, webhooks" on Dec 30, 2024
@zerospiel force-pushed the backup_impl_2 branch 9 times, most recently from 56f0846 to dd516c2 on January 6, 2025 16:01
@zerospiel force-pushed the backup_impl_2 branch 5 times, most recently from d8b1204 to e581f96 on January 8, 2025 17:52
@zerospiel marked this pull request as ready for review on January 8, 2025 18:07
@zerospiel force-pushed the backup_impl_2 branch 3 times, most recently from dca5c63 to d95cd1c on January 9, 2025 13:33
@zerospiel (Contributor, Author) commented:

The logic is valid, but I'm still struggling with the mutation. In terms of restoration it works only partially: precisely two restores are required and both end up PartiallyFailed, but afterwards the restoration as a whole is indeed successful. Trying to figure out how to mitigate this.

@zerospiel (Contributor, Author) commented:

OK, I've figured out how to mitigate the issue with the mutatingwebhookconfigurations and the template resources being rejected on creation: it has to be done on the client side during the restoration process (we cannot control it), namely by patching the velero deploy to add the providertemplates and servicetemplates resources to the high-priority list so they are applied before the k0rdent deployment (the problem does not affect the clustertemplates, since those sort alphabetically before deployments.apps).

The patch is as follows:

$ kubectl patch deploy -n velero velero \
--type='json' \
--patch='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--restore-resource-priorities=\"customresourcedefinitions,namespaces,storageclasses,volumesnapshotclass.snapshot.storage.k8s.io,volumesnapshotcontents.snapshot.storage.k8s.io,volumesnapshots.snapshot.storage.k8s.io,datauploads.velero.io,persistentvolumes,persistentvolumeclaims,serviceaccounts,secrets,configmaps,limitranges,pods,replicasets.apps,clusterclasses.cluster.x-k8s.io,endpoints,services,providertemplates.hmc.mirantis.com,servicetemplates.hmc.mirantis.com,-,clusterbootstraps.run.tanzu.vmware.com,clusters.cluster.x-k8s.io,clusterresourcesets.addons.cluster.x-k8s.io,apps.kappctrl.k14s.io,packageinstalls.packaging.carvel.dev\""}]'

First commit message:

* rename Backup to ManagementBackup
* remove Oneshot parameter from the Spec
* reconcile scheduled backups
  (collect statuses, create schedules, etc.)
* reconcile backups
  (collect statuses, create velero backups)
* collect the required velero backup
  spec for the whole backup
* label Credential references (clusterIdentities)
  in order to include them in backup
* backup validation webhook
* backup controller watches velero resources
* amend backup controller logic
  to better handle scheduled and
  non-scheduled backups
* set velero maintained plugins settings
* add custom plugins set via mgmt spec
* reconcile all the velero plugins either
  during the installation or based
  on the BSL objects existing in a cluster
* rename k0smotron related provider labels
  to the correct ones from the k0sproject
A Collaborator commented on this part of the diff:

return fmt.Errorf("failed to create uncached client: %w", err)
}

if err := r.config.InstallVeleroCRDs(uncachedCl); err != nil {

we should install velero as part of the hmc chart as a dependency (similar to cert-manager)

A Collaborator commented on this part of the diff:

}
}

veleroSchedule := &velerov1api.Schedule{

I think we need to control execution ourselves (discussed separately)

A Collaborator commented on this part of the diff:

q.Add(getManagementNameIfEnabled(ctx))
},
}).
Watches(&hmcv1alpha1.ClusterDeployment{}, handler.EnqueueRequestsFromMapFunc(func(ctx context.Context, _ client.Object) []ctrl.Request {

we won't need all of the watchers in that case
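
A sketch of the trimmed-down watch setup that suggestion implies, with the controller reacting only to the Management object and the velero Backups it creates; the hmcv1alpha1 import path, the reconciler type name, and the stub bodies are assumptions added only to keep the snippet self-contained:

package backup

import (
    "context"

    hmcv1alpha1 "github.com/Mirantis/hmc/api/v1alpha1" // assumed import path
    velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/handler"
)

// ManagementBackupReconciler is a stand-in name for the PR's reconciler.
type ManagementBackupReconciler struct {
    client.Client
}

// Reconcile is stubbed out; the real logic lives in the PR.
func (r *ManagementBackupReconciler) Reconcile(context.Context, ctrl.Request) (ctrl.Result, error) {
    return ctrl.Result{}, nil
}

// getManagementNameIfEnabled mirrors the helper visible in the diff above;
// only its signature matters for this sketch.
func getManagementNameIfEnabled(context.Context) ctrl.Request { return ctrl.Request{} }

// SetupWithManager keeps only two event sources: the Management object itself
// and the velero Backup objects, dropping the ClusterDeployment (and similar)
// watchers shown in the diff.
func (r *ManagementBackupReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&hmcv1alpha1.Management{}).
        Watches(&velerov1.Backup{}, handler.EnqueueRequestsFromMapFunc(
            func(ctx context.Context, _ client.Object) []ctrl.Request {
                return []ctrl.Request{getManagementNameIfEnabled(ctx)}
            })).
        Complete(r)
}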

Second commit message:

* install velero via flux rather than code
* TODO: code removal due to the chart installation
* adjusted roles for the velero chart
* removed unnecessary controller values
* fix bug in providertemplates ctrl
  when ownerreferences are being updated
  but requeue is not set
* TODO: actually remove the code
* TODO: rework controller to ticker
  but watch the mgmt events and manage schedule
  instead of velero schedule
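
A minimal sketch of the ticker-style direction the last TODO points at, where the reconciler computes the next run itself and requeues accordingly; robfig/cron is an assumed choice for parsing the schedule string, and the actual rework may look different:

package backup

import (
    "time"

    "github.com/robfig/cron/v3"
    ctrl "sigs.k8s.io/controller-runtime"
)

// nextBackupResult turns the schedule string from .spec.backup.schedule
// ("@every 5m", plain cron syntax, ...) into a requeue decision, so the
// reconciler itself acts as the ticker instead of a velero Schedule object.
func nextBackupResult(scheduleSpec string, lastRun time.Time) (due bool, res ctrl.Result, err error) {
    sched, err := cron.ParseStandard(scheduleSpec)
    if err != nil {
        return false, ctrl.Result{}, err
    }

    now := time.Now()
    next := sched.Next(lastRun)
    if now.Before(next) {
        // Not due yet: wake up exactly when the next run is expected.
        return false, ctrl.Result{RequeueAfter: next.Sub(now)}, nil
    }

    // Due now: the caller would create the velero Backup object, record the
    // run time in status, and requeue for the run after this one.
    return true, ctrl.Result{RequeueAfter: sched.Next(now).Sub(now)}, nil
}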