Backup implementation: controller, plugins, collection, webhooks #841

Open

zerospiel wants to merge 2 commits into main from backup_impl_2
Conversation

@zerospiel (Contributor) commented Dec 27, 2024:

  • rename Backup to ManagementBackup
  • remove the Oneshot parameter from the Spec
  • reconcile scheduled backups (collect statuses, create schedules, etc.)
  • reconcile backups (collect statuses, create velero backups)
  • collect the required velero backup spec for the whole backup (see the sketch after this list)
  • label Credential references (clusterIdentities) in order to include them in the backup
  • backup validation webhook
  • backup controller watches velero resources
  • amend backup controller logic to better handle scheduled and non-scheduled backups
  • set velero-maintained plugins settings
  • add custom plugins set via the mgmt spec
  • reconcile all the velero plugins either during the installation or based on the BSL objects existing in the cluster
  • rename k0smotron-related provider labels to the correct ones from the k0sproject
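
For orientation, a minimal sketch of how the collected backup spec could be wrapped into a velero Schedule, assuming the velero v1 Go API; the function name, label key, and TTL below are illustrative placeholders, not the values this PR uses:

package backup

import (
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

    velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
)

// buildManagementSchedule wraps the collected backup spec into a velero
// Schedule; the cron descriptor (e.g. "@every 5m") would come from
// .spec.backup.schedule.
func buildManagementSchedule(name, cronSpec string) *velerov1.Schedule {
    return &velerov1.Schedule{
        ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: "velero"},
        Spec: velerov1.ScheduleSpec{
            Schedule: cronSpec,
            Template: velerov1.BackupSpec{
                // Select every object carrying one of the backup labels the
                // controller applies (e.g. on Credential-referenced
                // clusterIdentities); this key is purely illustrative.
                OrLabelSelectors: []*metav1.LabelSelector{
                    {MatchLabels: map[string]string{"example.com/managed-backup": "true"}},
                },
                // Arbitrary retention chosen for the sketch.
                TTL: metav1.Duration{Duration: 30 * 24 * time.Hour},
            },
        },
    }
}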

-- TEMPORARILY NOT VALID --

NOTE: Because of the changes introduced in #699 (mutation), it is currently impossible to restore the full management on an empty kind installation without extra steps. The restoration process is as follows:

$ kubectl patch deploy -n velero velero \
--type='json' \
--patch='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--restore-resource-priorities=\"customresourcedefinitions,namespaces,storageclasses,volumesnapshotclass.snapshot.storage.k8s.io,volumesnapshotcontents.snapshot.storage.k8s.io,volumesnapshots.snapshot.storage.k8s.io,datauploads.velero.io,persistentvolumes,persistentvolumeclaims,serviceaccounts,secrets,configmaps,limitranges,pods,replicasets.apps,clusterclasses.cluster.x-k8s.io,endpoints,services,providertemplates.hmc.mirantis.com,servicetemplates.hmc.mirantis.com,-,clusterbootstraps.run.tanzu.vmware.com,clusters.cluster.x-k8s.io,clusterresourcesets.addons.cluster.x-k8s.io,apps.kappctrl.k14s.io,packageinstalls.packaging.carvel.dev\""}]'
  • perform a velero restore (e.g. velero restore create <name> --existing-resource-policy update --from-backup <backup-name>)
  • wait for the restore to reach the Completed state (e.g. with velero restore get; a programmatic equivalent is sketched after this list)
  • probably a temporary manual step: restart the hmc-controller-manager and capi-controller-manager deployments (the latter is likely needed because of a bug in the former)
  • done
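
A hedged programmatic equivalent of the "wait for Completed" step, using the velero v1 API types with a controller-runtime client; waitForRestore and the polling intervals are ad-hoc choices for this sketch, not part of this PR:

package backup

import (
    "context"
    "time"

    "k8s.io/apimachinery/pkg/util/wait"
    "sigs.k8s.io/controller-runtime/pkg/client"

    velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
)

// waitForRestore polls the named velero Restore until it reaches the
// Completed phase, mirroring what `velero restore get` is used for above.
func waitForRestore(ctx context.Context, c client.Client, name string) error {
    return wait.PollUntilContextTimeout(ctx, 10*time.Second, 15*time.Minute, true,
        func(ctx context.Context) (bool, error) {
            restore := &velerov1.Restore{}
            if err := c.Get(ctx, client.ObjectKey{Namespace: "velero", Name: name}, restore); err != nil {
                // Treat lookup errors as "not ready yet" and keep polling.
                return false, nil
            }
            return restore.Status.Phase == velerov1.RestorePhaseCompleted, nil
        })
}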

To properly test the feature:

  • have a k0rdent instance with this PR
  • enable backup in the Management object by setting .spec.backup to something like {enabled: true, schedule: "@every 5m"} (a rough Go sketch of this spec shape follows these steps)
  • install a BackupStorageLocation for velero, e.g.:
---
apiVersion: v1 # optional; the velero namespace should already exist at this point
kind: Namespace
metadata:
  labels:
    kubernetes.io/metadata.name: velero
    name: velero
  name: velero
spec: {}
---
apiVersion: v1
data:
  # base64-encoded AWS credentials file; decoded, it contains e.g. these 3 lines:
  # [default]
  # aws_access_key_id = <key>
  # aws_secret_access_key = <secret_key>
  cloud: <base64>
kind: Secret
metadata:
  name: cloud-credentials
  namespace: velero
type: Opaque
---
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  config:
    region: <eu-central-1> # region of the S3 bucket
  default: true
  objectStorage:
    bucket: <bucket-name> # bucket name
  provider: aws
  credential:
    name: cloud-credentials
    key: cloud
  • wait for the scheduled backup to be created
  • fully drop the k0rdent instance (to imitate a disaster)
  • perform the aforementioned restoration steps
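
For orientation only, a rough Go sketch of the spec shape the .spec.backup example above implies; apart from enabled and schedule, which appear in this description, every field and type name here is an assumption, not the PR's actual API:

// package v1alpha1 is a plausible home for such a type (assumption).
package v1alpha1

// Backup is a hypothetical rendering of the Management .spec.backup block.
type Backup struct {
    // Enabled turns backup reconciliation on or off.
    Enabled bool `json:"enabled"`
    // Schedule accepts cron syntax or descriptors such as "@every 5m".
    Schedule string `json:"schedule"`
    // CustomPlugins would carry the "custom plugins set via mgmt spec"
    // feature (plugin name -> image); its shape is assumed.
    CustomPlugins map[string]string `json:"customPlugins,omitempty"`
}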

#814 (implement backups reconciliation)

@zerospiel linked an issue on Dec 27, 2024 that may be closed by this pull request
@zerospiel force-pushed the backup_impl_2 branch 2 times, most recently from 80beb93 to 9694287 on December 30, 2024 12:16
@zerospiel changed the title from "Backup impl 2" to "Backup implementation: controller, plugins, collection, webhooks" on Dec 30, 2024
@zerospiel force-pushed the backup_impl_2 branch 9 times, most recently from 56f0846 to dd516c2 on January 6, 2025 16:01
@zerospiel force-pushed the backup_impl_2 branch 5 times, most recently from d8b1204 to e581f96 on January 8, 2025 17:52
@zerospiel marked this pull request as ready for review on January 8, 2025 18:07
@zerospiel force-pushed the backup_impl_2 branch 3 times, most recently from dca5c63 to d95cd1c on January 9, 2025 13:33
@zerospiel (Contributor, Author) commented:

The logic is valid, but I'm still struggling with the mutation. In terms of restoration it works only partially: precisely two restores are required and both end up PartiallyFailed, but afterwards the restoration as a whole is indeed successful. Trying to figure out how to mitigate this.

@zerospiel (Contributor, Author) commented:

OK, I've figured out how to mitigate the issue with the mutatingwebhookconfigurations and the template resources being rejected on creation: it has to be done on the client side during the restoration process (we cannot control it), namely by patching the velero deploy to add the providertemplates and servicetemplates resources to the high-priority list so they are applied before the k0rdent deployment (the problem does not affect the clustertemplates, since those sort alphabetically before deployments.apps).

The patch is as follows:

$ kubectl patch deploy -n velero velero \
--type='json' \
--patch='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--restore-resource-priorities=\"customresourcedefinitions,namespaces,storageclasses,volumesnapshotclass.snapshot.storage.k8s.io,volumesnapshotcontents.snapshot.storage.k8s.io,volumesnapshots.snapshot.storage.k8s.io,datauploads.velero.io,persistentvolumes,persistentvolumeclaims,serviceaccounts,secrets,configmaps,limitranges,pods,replicasets.apps,clusterclasses.cluster.x-k8s.io,endpoints,services,providertemplates.hmc.mirantis.com,servicetemplates.hmc.mirantis.com,-,clusterbootstraps.run.tanzu.vmware.com,clusters.cluster.x-k8s.io,clusterresourcesets.addons.cluster.x-k8s.io,apps.kappctrl.k14s.io,packageinstalls.packaging.carvel.dev\""}]'

First commit message:

* rename Backup to ManagementBackup
* remove Oneshot parameter from the Spec
* reconcile scheduled backups
  (collect statuses, create schedules, etc.)
* reconcile backups
  (collect statuses, create velero backups)
* collect the required velero backup
  spec for the whole backup
* label Credential references (clusterIdentities)
  in order to include them in backup
* backup validation webhook
* backup controller watches velero resources
* amend backup controller logic
  to better handle scheduled and
  non-scheduled backups
* set velero maintained plugins settings
* add custom plugins set via mgmt spec
* reconcile all the velero plugins either
  during the installation or based
  on the BSL objects existing in a cluster
* rename k0smotron related provider labels
  to the correct ones from the k0sproject
A Collaborator commented on this part of the diff:

return fmt.Errorf("failed to create uncached client: %w", err)
}

if err := r.config.InstallVeleroCRDs(uncachedCl); err != nil {

we should install velero as part of the hmc chart as a dependency (similar to cert-manager)

A Collaborator commented on this part of the diff:

}
}

veleroSchedule := &velerov1api.Schedule{

I think we need to control execution ourselves (discussed separately)

A Collaborator commented on this part of the diff:

q.Add(getManagementNameIfEnabled(ctx))
},
}).
Watches(&hmcv1alpha1.ClusterDeployment{}, handler.EnqueueRequestsFromMapFunc(func(ctx context.Context, _ client.Object) []ctrl.Request {

we won't need all of the watchers in that case
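
A sketch of the trimmed-down watch setup that suggestion implies, with the controller reacting only to the Management object and the velero Backups it creates; the hmcv1alpha1 import path, the reconciler type name, and the stub bodies are assumptions added only to keep the snippet self-contained:

package backup

import (
    "context"

    hmcv1alpha1 "github.com/Mirantis/hmc/api/v1alpha1" // assumed import path
    velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/handler"
)

// ManagementBackupReconciler is a stand-in name for the PR's reconciler.
type ManagementBackupReconciler struct {
    client.Client
}

// Reconcile is stubbed out; the real logic lives in the PR.
func (r *ManagementBackupReconciler) Reconcile(context.Context, ctrl.Request) (ctrl.Result, error) {
    return ctrl.Result{}, nil
}

// getManagementNameIfEnabled mirrors the helper visible in the diff above;
// only its signature matters for this sketch.
func getManagementNameIfEnabled(context.Context) ctrl.Request { return ctrl.Request{} }

// SetupWithManager keeps only two event sources: the Management object itself
// and the velero Backup objects, dropping the ClusterDeployment (and similar)
// watchers shown in the diff.
func (r *ManagementBackupReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&hmcv1alpha1.Management{}).
        Watches(&velerov1.Backup{}, handler.EnqueueRequestsFromMapFunc(
            func(ctx context.Context, _ client.Object) []ctrl.Request {
                return []ctrl.Request{getManagementNameIfEnabled(ctx)}
            })).
        Complete(r)
}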

Second commit message:

* install velero via flux rather than code
* TODO: code removal due to the chart installation
* adjusted roles for the velero chart
* removed unnecessary controller values
* fix bug in providertemplates ctrl
  when ownerreferences are being updated
  but requeue is not set
* TODO: actually remove the code
* TODO: rework controller to ticker
  but watch the mgmt events and manage schedule
  instead of velero schedule
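
A minimal sketch of the ticker-style direction the last TODO points at, where the reconciler computes the next run itself and requeues accordingly; robfig/cron is an assumed choice for parsing the schedule string, and the actual rework may look different:

package backup

import (
    "time"

    "github.com/robfig/cron/v3"
    ctrl "sigs.k8s.io/controller-runtime"
)

// nextBackupResult turns the schedule string from .spec.backup.schedule
// ("@every 5m", plain cron syntax, ...) into a requeue decision, so the
// reconciler itself acts as the ticker instead of a velero Schedule object.
func nextBackupResult(scheduleSpec string, lastRun time.Time) (due bool, res ctrl.Result, err error) {
    sched, err := cron.ParseStandard(scheduleSpec)
    if err != nil {
        return false, ctrl.Result{}, err
    }

    now := time.Now()
    next := sched.Next(lastRun)
    if now.Before(next) {
        // Not due yet: wake up exactly when the next run is expected.
        return false, ctrl.Result{RequeueAfter: next.Sub(now)}, nil
    }

    // Due now: the caller would create the velero Backup object, record the
    // run time in status, and requeue for the run after this one.
    return true, ctrl.Result{RequeueAfter: sched.Next(now).Sub(now)}, nil
}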