
Use a server group to ensure anti-affinity for control plane nodes #1256

Open
mkjpryor opened this issue Jun 6, 2022 · 32 comments · May be fixed by #1912
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@mkjpryor
Contributor

mkjpryor commented Jun 6, 2022

/kind feature

Describe the solution you'd like
We should put control plane nodes in a server group with the soft anti-affinity policy to decrease the risk that a hypervisor failure takes out the whole control plane.

This becomes more important once #1252 is merged, but is something that I think we should do even when using explicit AZs.
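For context, this is roughly the Nova-level operation being described: create a server group with the soft-anti-affinity policy and attach its ID to each control plane server. A minimal sketch using gophercloud v1; the function and naming scheme are illustrative, not CAPO code:

```go
// Sketch only: create a Nova server group with the soft-anti-affinity policy
// and return its ID, which would then be set as serverGroupID on each
// control plane machine. Naming and error handling are illustrative.
package sketch

import (
	"fmt"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/compute/v2/extensions/servergroups"
)

func ensureControlPlaneServerGroup(compute *gophercloud.ServiceClient, clusterName string) (string, error) {
	sg, err := servergroups.Create(compute, servergroups.CreateOpts{
		Name:     fmt.Sprintf("%s-control-plane", clusterName), // hypothetical naming scheme
		Policies: []string{"soft-anti-affinity"},
	}).Extract()
	if err != nil {
		return "", fmt.Errorf("creating control plane server group: %w", err)
	}
	return sg.ID, nil
}
```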

Anything else you would like to add:

@k8s-ci-robot added the kind/feature label on Jun 6, 2022
@mkjpryor changed the title from "Use server groups to ensure anti-affinity for control plane nodes" to "Use a server group to ensure anti-affinity for control plane nodes" on Jun 6, 2022
@jichenjc
Contributor

jichenjc commented Jun 7, 2022

I know OCP seems to have this by default. Are you suggesting we should default it in our template, or in the code directly?

https://github.com/openshift/installer/blob/master/docs/user/openstack/README.md#using-a-server-group

@mkjpryor
Contributor Author

mkjpryor commented Jun 7, 2022

@jichenjc

IMHO, it would be nice if we could manage a server group for the control plane nodes. It would be better for the "Infrastructure-as-Code" approach to have the whole thing defined in Kubernetes resources, rather than having to make the server group ahead of time.

@mkjpryor
Contributor Author

mkjpryor commented Jun 7, 2022

We should probably at least document it in the CAPO book as well.

@jichenjc
Contributor

jichenjc commented Jun 8, 2022

Yes, we already have such a definition in api/v1alpha5/openstackmachine_types.go:

    // The server group to assign the machine to
    ServerGroupID string `json:"serverGroupID,omitempty"`

I think we can follow other implementations, like what we did for security groups: a boolean to control whether we create the server group for the customer.
If the boolean is true, we create the server group and apply its ID to each control plane node.
If it's false, the customer decides whether they need one.
During the deletion phase, the server group is deleted last, together with the rest of the cluster.
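Purely as an illustration of that shape (the field names are hypothetical, not an actual CAPO API):

```go
// Illustrative only: one possible shape for the flag described above.
// If ManagedServerGroup is true, the cluster controller creates a server
// group, injects its ID into each control plane machine, and deletes the
// group when the cluster is deleted.
package sketch

type ControlPlaneServerGroupSketch struct {
	// ManagedServerGroup controls whether CAPO creates and owns a server
	// group for the control plane nodes. (hypothetical field)
	ManagedServerGroup bool `json:"managedServerGroup,omitempty"`

	// ControlPlaneServerGroupID is populated by the controller when
	// ManagedServerGroup is true, and copied into each control plane
	// machine's ServerGroupID. (hypothetical field)
	ControlPlaneServerGroupID string `json:"controlPlaneServerGroupID,omitempty"`
}
```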

@mdbooth
Contributor

mdbooth commented Jun 10, 2022

For the control plane we certainly have the opportunity to create and manage this server group: the server group will have the same lifecycle as the cluster, so it can be managed with the cluster.

For machines in a machineset it is not currently possible to do this in CAPI as far as I'm aware. I have previously mooted the possibility of some CAPI object whose lifecycle spans multiple machines. The best suggestion so far is something along the lines of MachinePools, however this would be an abuse of this interface IMHO as OpenStack has no such concept. An object with the same lifecycle as a MachineSet would probably be sufficient: you'd need a new machineset for every server group, but this doesn't sound unreasonable to me.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Sep 8, 2022
@tobiasgiese
Member

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Sep 8, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Dec 7, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jan 6, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot closed this as not planned on Feb 5, 2023
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dalees

dalees commented Nov 10, 2023

I'd like to re-open this issue and help implement this.

We are a cloud provider implementing OpenStack Magnum-managed CAPO clusters, and these must have anti-affinity applied to control plane nodes (use case: a region with a single AZ).

The preferred defaults would be anti-affinity for the control plane, to ensure separation, and soft-anti-affinity for worker machines, to allow scaling beyond the number of available compute hypervisors.

Proposal:

  • Add a ServerGroupAffinity string `json:"serverGroupAffinity,omitempty"` field to OpenStackClusterSpec, with valid values of anti-affinity and soft-anti-affinity (affinity and soft-affinity are valid in Nova, but not sensible here)

Note: This would need to apply to control plane nodes only, so perhaps it should be named with a CP prefix?

I note the restrictions CAPI places on creating the server group for MachineSets. Perhaps this is not in the first pass, but if ServerGroupAffinity were accessible for the MachineSet, we could create the server group when the first Machine is created (if it doesn't already exist) and clean it up when the last machine is removed from it. This would still leave orphaned server groups in some cases.
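A rough sketch of the proposed cluster-level field, assuming kubebuilder-style enum validation; names and markers are illustrative rather than final:

```go
// Sketch of the proposed control plane affinity field (hypothetical names,
// not the final CAPO API).
package sketch

type OpenStackClusterSpecFragment struct {
	// ControlPlaneServerGroupAffinity selects the policy of the server group
	// that CAPO would create for control plane nodes. Only anti-affinity
	// policies are sensible here.
	// +kubebuilder:validation:Enum=anti-affinity;soft-anti-affinity
	// +optional
	ControlPlaneServerGroupAffinity string `json:"controlPlaneServerGroupAffinity,omitempty"`
}
```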

@dalees

dalees commented Nov 10, 2023

Related issue: #808

@dalees

dalees commented Nov 10, 2023

/reopen

@k8s-ci-robot
Contributor

@dalees: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jichenjc
Contributor

/reopen

@k8s-ci-robot reopened this on Nov 10, 2023
@k8s-ci-robot
Contributor

@jichenjc: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mkjpryor
Contributor Author

@dalees Using pre-existing server groups is possible in the Helm charts already, so we could actually implement this fully in the Magnum driver if required.

I still think it would be nice to do in CAPO though.

For the CP nodes it is fairly obvious that we can just add a control plane affinity policy to OpenStackCluster and manage a server group for the CP nodes.

For the workers it is less obvious, as what you probably want is a server group per MachineDeployment (MD), but we only get to deal with machines in CAPO, and we aren’t supposed to care which MD they are in.

@mdbooth Didn’t you have ideas for reworking all the AZ logic? Maybe server groups could be worked into that?

@mdbooth
Contributor

mdbooth commented Nov 10, 2023

Server groups won't live in the AZ logic when that (hopefully) happens. The reason is that AZs are basically a templating mechanism: they define a set of things which vary between machines. Server groups don't (must not) vary between members.

I would like to see us support server groups better than we do currently, which as Matt says is to just create one externally and reference it in the machine spec. I believe we will need a solution for this soon for OpenShift integration, so it will likely be on my TODO list... some time in the next 6 months, maybe? If somebody else wanted to implement it sooner I'd be delighted.

I don't want to add any more 'cluster-magic' to the machine controller. i.e. My preference is that the machine spec entirely specifies the machine, without reference to the cluster spec. So I'd be happy for the cluster controller to create the server group, I'd still want the control plane machine spec to reference it explicitly.

@mkjpryor
Contributor Author

So OpenStackCluster would have a list of server groups to create, which are then referenced from OpenStackMachine?

Could this be a good time to have an extra CRD for an OpenStackServerGroup maybe?
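To make the idea concrete, a minimal sketch of what the Go types for such a CRD might look like, following the usual controller-runtime conventions; all names are hypothetical:

```go
// Minimal sketch of possible Go types for an OpenStackServerGroup CRD.
// All names are hypothetical.
package sketch

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// OpenStackServerGroupSpec describes the desired server group.
type OpenStackServerGroupSpec struct {
	// Policy is the Nova server group policy, e.g. "anti-affinity" or
	// "soft-anti-affinity".
	Policy string `json:"policy"`
}

// OpenStackServerGroupStatus reports the server group created in OpenStack.
type OpenStackServerGroupStatus struct {
	// ID is the UUID of the Nova server group, once created.
	ID string `json:"id,omitempty"`
	// Ready is true once the server group exists and ID is populated.
	Ready bool `json:"ready"`
}

// OpenStackServerGroup is the Schema for a managed Nova server group.
type OpenStackServerGroup struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   OpenStackServerGroupSpec   `json:"spec,omitempty"`
	Status OpenStackServerGroupStatus `json:"status,omitempty"`
}
```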

@EmilienM
Contributor

/remove-lifecycle rotten

@k8s-ci-robot removed the lifecycle/rotten label on Nov 22, 2023
@EmilienM
Contributor

EmilienM commented Nov 22, 2023

@mkjpryor @dalees

Hey, I work with @mdbooth and we will need this one to be addressed soon so I'm very likely going to work on that.

A few questions:

  • Can we settle on having a new CRD for OpenStackServerGroup? I think it's a great idea and indeed an opportunity to do it well from the start.
  • Can you confirm that we'll want a new controller as well, as @nikParasyr suggested in [Feature] Support for anti-affinity/affinity rules for the created machines #1378 (comment)?
  • I propose that we don't touch ServerGroupID; however, when it is used, CAPO wouldn't manage an OpenStackServerGroup for it.
  • Are we focusing only on the control plane for now (fine by me)? or workers too?

End goal: CAPO needs to be able to create and manage the Server Groups (at a minimum for the Control Plane).

@nikParasyr
Contributor

I would like to see support for worker nodes as well if possible (or a design which allows support for it to be added easily at a later stage).

Currently, we have clusters that:

  1. have a single serverGroup used by all nodes (control plane + workers)
  2. have a serverGroup for the control plane, and a second SG for all the worker nodes (even with multiple node groups)
  3. have an SG for the control plane and a different SG per node group

Most clusters follow pattern 2, but there are some use cases where pattern 1 or 3 is preferable.

@dalees

dalees commented Nov 24, 2023

@EmilienM - Great, if you'd like to take this forward and progress the CRD and controller for this, that'd be welcomed.

I am prepared to write the control plane part of this, but not ready to dedicate the time to write the CRD/controller proposal and implementation. I may do the control plane part for my own learning and add a PR in the next few weeks. I don't expect this to be merged if a CRD-based replacement is in the pipeline :)

Happy to follow, give feedback, and test. Having this functional for node groups would be great.

I can see a good use case for a different server group per node group, matching @nikParasyr 's case 3.

@mkjpryor
Contributor Author

mkjpryor commented Nov 28, 2023

@EmilienM

Personally, I really like the idea of an OpenStackServerGroup CRD, with the corresponding OpenStackMachine.spec.serverGroupRef field. This would make managing a server group per MD a breeze using our Helm charts.

@mkjpryor
Contributor Author

mkjpryor commented Nov 28, 2023

Just to clarify, I think users of CAPO should create instances of the OpenStackServerGroup CRD and assign them to machines using a spec.serverGroupRef, rather than CAPO managing the creation of OpenStackServerGroup CRD instances itself.

So the order of precedence for a machine would be:

  • If spec.serverGroupID is set, use that server group
  • If spec.serverGroupRef is set, use the server group created for the CRD instance
  • If neither are set, don't set a server group

CAPO will manage the actual OpenStack server group backing an OpenStackServerGroup CRD instance, but will not create such instances automatically for clusters that do not request them (not every cluster will need or want server groups).

Does that make sense? @mdbooth?
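A minimal sketch of that precedence, with hypothetical stand-in types rather than CAPO's real machine spec:

```go
// Sketch of the precedence described above; types and names are hypothetical.
package sketch

// MachineServerGroupFields holds only the two fields relevant here.
type MachineServerGroupFields struct {
	ServerGroupID  string // explicit Nova server group UUID
	ServerGroupRef string // name of an OpenStackServerGroup in the same namespace
}

// resolveServerGroupID returns the server group UUID to pass to Nova, or ""
// if the server should not be placed in any server group. resolvedRefID is
// the ID taken from the referenced OpenStackServerGroup's status.
func resolveServerGroupID(spec MachineServerGroupFields, resolvedRefID string) string {
	if spec.ServerGroupID != "" {
		return spec.ServerGroupID // 1. an explicit ID wins
	}
	if spec.ServerGroupRef != "" {
		return resolvedRefID // 2. fall back to the referenced CRD's server group
	}
	return "" // 3. neither set: no server group
}
```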

@dalees

dalees commented Feb 8, 2024

This issue is now important to my current work. I'm planning to work on it next week and will propose a PR implementing the new OpenStackServerGroup CRD and controller as @mkjpryor has outlined; I believe it covers the cases discussed in this issue so far.

@EmilienM please let me know if you have progressed this, otherwise I'm happy to pick this up now.

@dalees

dalees commented Feb 16, 2024

Quick update: I plan to upload a PR for feedback next week.

I have a working version of the OpenStackServerGroup CRD and controller. OpenStackMachineSpec gains a serverGroupRef field in v1alpha8, which resolves once the server group is created. I'm now working on ensuring tests and edge cases are reasonable.

@EmilienM
Contributor

I'm honestly not sure we want a new CRD & controller for this, as it might be too complex for our needs, but I'm happy to be proven wrong.
I'll wait to see what @dalees proposes.

dalees added a commit to dalees/cluster-api-provider-openstack that referenced this issue Feb 28, 2024
Implements new CRD for OpenstackServerGroup in v1alpha8 to allow managed
Server Groups with standard policies, and adds ServerGroupRef to OpenstackMachine
that references the new CRD and uses it for VM creation.

Closes: kubernetes-sigs#1256
@dalees linked a pull request on Feb 28, 2024 that will close this issue
@dalees

dalees commented Feb 28, 2024

I've uploaded a first pass at this; there are a few TODOs listed, but it is functional.

On the topic of it being a CRD & controller, I can see this is fairly heavyweight for a simple feature, but it does provide a solution to all use cases in #1256 (comment) and deals with the lifecycle cleanly. That's the main benefit of this.

Since our discussion earlier in this thread, I noticed ServerGroupID has been changed into ServerGroup (a filter that allows name or uuid). One option is to extend this further to include the expected policy, and create the group if the named group doesn't already exist. The deletion criteria become less obvious if it is used for worker nodes, however.

Can you think of other options? How do we decide the way forward? I'm happy to join the next CAPO meeting to discuss.
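One possible shape for the extend-the-filter option, purely illustrative and not the existing CAPO ServerGroup filter:

```go
// Hypothetical extension of a server group filter so the controller could
// create the group when no existing one matches. Field names are illustrative.
package sketch

type ServerGroupFilterSketch struct {
	// ID or Name identify an existing Nova server group.
	ID   string `json:"id,omitempty"`
	Name string `json:"name,omitempty"`

	// Policy, if set, would let the controller create a server group with
	// this policy when neither ID nor Name matches an existing group.
	// (hypothetical field)
	Policy string `json:"policy,omitempty"`
}
```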

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on May 28, 2024
@dalees

dalees commented May 28, 2024

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on May 28, 2024