baremetal: Implement "destroy cluster" support #2005
Original issue: openshift-metal3/kni-installer#74
kni-install does not yet support "destroy cluster" for baremetal clusters.
See pkg/destroy/baremetal/baremetal.go for the stub, and other implementations under pkg/destroy/ for examples of implementations on other platforms.
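To make the shape of the missing piece concrete, here is a minimal sketch of what an implementation might look like, loosely modeled on the stub in pkg/destroy/baremetal/baremetal.go. The names (ClusterUninstaller, Run, the Hosts field) are illustrative assumptions, not the installer's actual API.

```go
// Hypothetical sketch of a "destroy cluster" entry point for the
// baremetal platform. None of these names are taken from the real
// installer code; they only illustrate the structure.
package main

import (
	"errors"
	"fmt"
)

// ClusterUninstaller holds whatever state is needed to tear the
// cluster down; for baremetal that would include the hosts' BMC
// endpoints (field contents here are placeholders).
type ClusterUninstaller struct {
	ClusterName string
	Hosts       []string // hypothetical: BMC addresses of cluster hosts
}

// Run drives the teardown. The real stub is unimplemented; this
// sketch just reports what a full implementation would need to do.
func (u *ClusterUninstaller) Run() error {
	if len(u.Hosts) == 0 {
		return errors.New("no hosts recorded for cluster " + u.ClusterName)
	}
	for _, h := range u.Hosts {
		// A real implementation would deprovision each host via Ironic.
		fmt.Printf("would deprovision host %s\n", h)
	}
	return nil
}

func main() {
	u := &ClusterUninstaller{
		ClusterName: "demo",
		Hosts:       []string{"192.168.111.1"},
	}
	if err := u.Run(); err != nil {
		fmt.Println("destroy failed:", err)
	}
}
```

The implementations under pkg/destroy/ for cloud platforms follow a broadly similar pattern: collect identifying details for the cluster's resources, then iterate until everything is gone.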
Having the baremetal-operator drive Ironic to destroy itself is not ideal, as we can't ensure that the cluster is actually fully destroyed. In particular, we can't drive all of the nodes through cleaning.
One way to do this would be the reverse of how Ironic moves in the cluster deployment process. We can copy all of the host information out of the cluster, shut down the baremetal-operator, and then re-launch Ironic on the provisioning host. The installer could then drive the local Ironic to ensure all hosts are deprovisioned.
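The sequence above can be sketched as a simple ordered flow. This is not the installer's real code; the hostRecord type and every step string below are hypothetical placeholders standing in for real operator shutdown and Ironic API calls.

```go
// Sketch of the destroy flow described above: export host records,
// stop the baremetal-operator, start a local Ironic, deprovision
// every host, then power everything off. All names are illustrative.
package main

import "fmt"

type hostRecord struct {
	Name string
	BMC  string // BMC address copied out of the cluster beforehand
}

// destroyCluster returns the ordered steps a real implementation
// would perform, given host details already copied out of the cluster.
func destroyCluster(hosts []hostRecord) []string {
	var steps []string
	// 1. Host information was already exported (the hosts slice),
	//    so nothing further is needed from the cluster itself.
	// 2. Shut down the baremetal-operator and its hosted Ironic.
	steps = append(steps, "stop baremetal-operator")
	// 3. Re-launch Ironic on the provisioning host.
	steps = append(steps, "start local ironic")
	// 4. Drive the local Ironic to deprovision (and clean) each host.
	for _, h := range hosts {
		steps = append(steps, fmt.Sprintf("deprovision %s via %s", h.Name, h.BMC))
	}
	// 5. Leave every node powered down.
	steps = append(steps, "power off all hosts")
	return steps
}

func main() {
	hosts := []hostRecord{{Name: "master-0", BMC: "ipmi://192.168.111.1"}}
	for _, step := range destroyCluster(hosts) {
		fmt.Println(step)
	}
}
```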
@hardys: May 9-10

This is an interesting one. I'd assumed we'd run Ironic on the bootstrap VM in the deploy case (where there's no external Ironic, e.g. on the provisioning host), but since there's no bootstrap VM on destroy, that approach won't work. So I wonder if we should just run the Ironic pod on the host via kni-installer in both cases?
This is actually quite tricky to implement the same way as other platforms, because they all rely on tagging resources, then discovering and deleting everything with those tags; that won't work for us unless we have a single long-lived Ironic to maintain the state/tags.
I think we'll have to do one of two things: either scale down the worker machineset, kill the BMO (and its hosted Ironic), then spin up another Ironic to delete the masters (using details gathered from the externally provisioned BareMetalHost objects); or grab all the BareMetalHost details up front, kill the BMO/Ironic, then use another/local Ironic to tear them all down.
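The "grab all the BareMetalHost details first" step could look roughly like the snapshot below. The field names echo the BareMetalHost CRD's spec.bmc.address and spec.bmc.credentialsName, but the types here are hand-written stand-ins, not generated client code.

```go
// Rough sketch: copy out just the fields a later, local Ironic would
// need to deprovision each machine after the in-cluster operator and
// its Ironic are gone. Types are illustrative, not the real CRD client.
package main

import "fmt"

type bmcDetails struct {
	Address         string // e.g. an ipmi:// URL, per spec.bmc.address
	CredentialsName string // name of the secret holding BMC credentials
}

type bareMetalHost struct {
	Name string
	BMC  bmcDetails
}

// snapshotHosts extracts the minimal per-host details before the
// BMO/Ironic are killed, keyed by host name.
func snapshotHosts(hosts []bareMetalHost) map[string]bmcDetails {
	out := make(map[string]bmcDetails, len(hosts))
	for _, h := range hosts {
		out[h.Name] = h.BMC
	}
	return out
}

func main() {
	snap := snapshotHosts([]bareMetalHost{
		{
			Name: "master-0",
			BMC: bmcDetails{
				Address:         "ipmi://192.168.111.1",
				CredentialsName: "master-0-bmc-secret",
			},
		},
	})
	fmt.Println(snap["master-0"].Address)
}
```

In the real cluster the credentials secret would also have to be copied out, since the secret disappears along with the cluster.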
@russellb: May 10
I agree with this.
@hardys: May 10
OK, so I think we should solve this by first fixing issue 68 (run Ironic on the bootstrap VM) so we can optionally launch Ironic on the bootstrap VM via an injected manifest provided by Ignition. Then, on destroy, we launch a similar VM with the same configuration (but without the bootstrap configuration).
This should mean some reuse, since we'd use the exact same pattern/config to deploy the masters and to deprovision on destroy. It also avoids the potential complexity of running the Ironic container on the host directly (where we may want to support multiple OS options, and may not want to require host access, e.g. to modify firewall rules).
If that sounds reasonable, I'll take a look at enabling the bootstrap VM to run Ironic, ideally using the same/similar configuration we enable for worker deployment in metal3-io/baremetal-operator#72
@dhellmann: May 10
How much cleaning is really involved? Could we just launch a DaemonSet to trigger wiping the partition table and then reboot the host?
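The DaemonSet idea could be sketched roughly as below. This is an illustrative manifest, not a tested one: the image, the device path /dev/sda, and the reboot mechanism are all assumptions, and a real version would need to enumerate disks rather than hard-code one.

```yaml
# Hypothetical sketch of a disk-wiping DaemonSet. Privileged + hostPID
# so the pod can touch host block devices and reboot via PID 1's
# namespaces. Image and device path are placeholders.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: wipe-disks
spec:
  selector:
    matchLabels:
      app: wipe-disks
  template:
    metadata:
      labels:
        app: wipe-disks
    spec:
      hostPID: true
      containers:
      - name: wipe
        image: registry.example.com/wipe-tools:latest  # hypothetical image with sgdisk
        securityContext:
          privileged: true
        command:
        - /bin/sh
        - -c
        - |
          # Zap the partition table of the root disk, then reboot the
          # host by entering PID 1's namespaces. /dev/sda is an
          # assumption; real hosts vary.
          sgdisk --zap-all /dev/sda && nsenter -t 1 -m -- systemctl reboot
```

Note this only destroys partition metadata; it is much weaker than Ironic's cleaning, which can zero or securely erase the disks.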
@russellb: May 13
That would be simpler for sure, but the downside is the lack of any out-of-band components to verify that the cluster really has been destroyed and the process is complete.
@hardys: May 13
This may be something to discuss with product management downstream, I guess, but FWIW we've already seen issues redeploying Ceph on boxes where the disks weren't cleaned of metadata from previous deployments, and I was assuming there would be security/compliance reasons to prefer cleaning all the cluster data from the disks.
I also assumed we'd want all the nodes (including the masters) powered down after the destroy operation, which is probably most easily achieved using Ironic; at that point, enabling cleaning on deprovision becomes easy too.