CNV-46779: Udn for virt workloads #1673

Open
wants to merge 33 commits into base: master

Conversation

Contributor

@maiqueb maiqueb commented Aug 30, 2024

This PR provides an OpenShift enhancement describing the virtualization-specific integrations required for the primary UDN feature, which was designed in #1623.

It builds on top of enhancement #1456, in which we design how to provide persistent IPs for the secondary interfaces (of layer2 or localnet type) of VM workloads.

The overall goal is to give OpenShift Virtualization workloads the ability to live-migrate on UDN networks. UDN networks are interesting for virtualization users because they are managed (they have IPAM) and are closer to what a virtualization user expects - isolated networks - rather than following the Kubernetes networking model, in which all pods can communicate with each other and connectivity is restricted via Network Policy.

@openshift-ci openshift-ci bot requested review from abhat and trozet August 30, 2024 13:42
Contributor

openshift-ci bot commented Aug 30, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign danwinship for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@maiqueb maiqueb changed the title Udn for virt workloads CNV-46779: Udn for virt workloads Aug 30, 2024
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Aug 30, 2024
@openshift-ci-robot

openshift-ci-robot commented Aug 30, 2024

@maiqueb: This pull request references CNV-46779 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

In response to this:

This PR provides an OpenShift enhancement describing the virtualization-specific integrations required for the primary UDN feature, which was designed in #1623.

It builds on top of enhancement #1456, in which we design how to provide persistent IPs for the secondary interfaces (of layer2 or localnet type) of VM workloads.

The overall goal is to give OpenShift Virtualization workloads the ability to live-migrate on UDN networks. UDN networks are interesting for virtualization users because they are managed (they have IPAM) and are closer to what a virtualization user expects - isolated networks - rather than following the Kubernetes networking model, in which all pods can communicate with each other and connectivity is restricted via Network Policy.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@maiqueb maiqueb marked this pull request as draft August 30, 2024 13:44
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 30, 2024
@openshift-ci-robot

openshift-ci-robot commented Aug 30, 2024

@maiqueb: This pull request references CNV-46779 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target either version "4.18." or "openshift-4.18.", but it targets "CNV v4.18.0" instead.

In response to this:

This PR provides an OpenShift enhancement describing the virtualization-specific integrations required for the primary UDN feature, which was designed in #1623.

It builds on top of enhancement #1456, in which we design how to provide persistent IPs for the secondary interfaces (of layer2 or localnet type) of VM workloads.

The overall goal is to give OpenShift Virtualization workloads the ability to live-migrate on UDN networks. UDN networks are interesting for virtualization users because they are managed (they have IPAM) and are closer to what a virtualization user expects - isolated networks - rather than following the Kubernetes networking model, in which all pods can communicate with each other and connectivity is restricted via Network Policy.

TODOs:

  • add a link to the seamless migration epic. Maybe as a risk / mitigation combo in case the migration loses too many packets
  • have a ticket for the OVN team to ensure we can request multi-chassis when using an IC deployment. Link it in the enhancement
  • model the IPAMClaim conditions
  • add some mermaid diagrams explaining how the whole thing works from the user perspective / controllers

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


This requires an enhancement on passt.

TODO: link the ticket to get an enhancement
Contributor Author

@EdDev let me know once we know what we require from passt, and have an opened ticket we can track.

@oshoval oshoval left a comment

Partial review, will continue soon where I left off.
Top notch, thanks

Comment on lines 60 to 61
to use the primary network to which the VM is connected for north/south, while
still being able to connect to KAPI and consume Kubernetes DNS.

Missing probes besides KAPI / DNS?

In short, KubeVirt (or a component on behalf
of KubeVirt) creates an `IPAMClaim` resource for each interface in the VM the

maybe it's worth not mentioning KubeVirt as the one that deploys it, because that is not true and is not going to change

worth saying that a component is responsible for creating .... and mentioning ipam-extensions

same for "and KubeVirt instructs the CNI plugin" a few lines below

Contributor Author
@maiqueb maiqueb Sep 2, 2024

I'm trying to protect people who are not KubeVirt-savvy from having to understand our shenanigans. And save myself from explaining how we got there, what that project is, etc.

I think it doesn't matter much: the point is that whatever component creates IPAMClaims understands the lifecycle of VMs.

do not require to update the KubeVirt API, which is something we have faced
extreme resistance in the past.

Keep in mind that up to now the IPAMClaims would be created by KubeVirt

same question as above (I can't link it, because it is in the middle of the "review" so there are no links yet)
kubevirt is not the one that creates it

the desired IP addresses for the VM, and when creating the VM, point it to the
respective `IPAMClaim` using an annotation (on the VM).

KubeVirt (or a component on behalf of it) will see the annotation on the VM,
@oshoval oshoval Sep 2, 2024

here it is even an action item, not just something that already exists (so the question is a bit different from the ones above)
the ipam-ext should do it, right? I think it would be better to be precise here and not mention KubeVirt
(even though we have "or a component on behalf ...")

Contributor Author

Yes, it would be that component. I just don't understand what anyone gains from having to know there's an extra moving piece.

What is important is that it is something that understands what a VM is.
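
To make the flow being discussed concrete, here is a minimal sketch of a VM pointing at a pre-created `IPAMClaim` through an annotation. The annotation key and claim name are invented for illustration; the proposal only states that "an annotation (on the VM)" is used.

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vm-a
  namespace: blue-ns
  annotations:
    # hypothetical annotation key - the proposal does not name it
    network.kubevirt.io/ipam-claims: '{"podnet": "vm-a.podnet"}'
spec:
  runStrategy: Always
  template:
    spec:
      domain:
        devices: {}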

@openshift-ci-robot

openshift-ci-robot commented Sep 2, 2024

@maiqueb: This pull request references CNV-46779 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target either version "4.18." or "openshift-4.18.", but it targets "CNV v4.18.0" instead.

In response to this:

This PR provides an OpenShift enhancement describing the virtualization-specific integrations required for the primary UDN feature, which was designed in #1623.

It builds on top of enhancement #1456, in which we design how to provide persistent IPs for the secondary interfaces (of layer2 or localnet type) of VM workloads.

The overall goal is to give OpenShift Virtualization workloads the ability to live-migrate on UDN networks. UDN networks are interesting for virtualization users because they are managed (they have IPAM) and are closer to what a virtualization user expects - isolated networks - rather than following the Kubernetes networking model, in which all pods can communicate with each other and connectivity is restricted via Network Policy.

TODOs:

  • add a link to the seamless migration epic. Maybe as a risk / mitigation combo in case the migration loses too many packets
  • have a ticket for the OVN team to ensure we can request multi-chassis when using an IC deployment. Link it in the enhancement
  • model the IPAMClaim conditions
  • add some mermaid diagrams explaining how the whole thing works from the user perspective / controllers

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


#### Pre-requirements - configure the primary UDN for a namespace
The user will need to perform the usual steps to configure a primary UDN for
their namespace - that is, to provision a `UserDefinedNetwork` of layer2

Does primary UDN only support the layer2 topology? I remember the UDN enhancement also talking about layer3.

Contributor Author

for the virt integration, we only care about L2.
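
For reference, a minimal sketch of what such a layer2 primary `UserDefinedNetwork` could look like; the field names are assumptions based on the UDN design in #1623, not a final API reference:

apiVersion: k8s.ovn.org/v1
kind: UserDefinedNetwork
metadata:
  name: blue-network
  namespace: blue-ns
spec:
  topology: Layer2
  layer2:
    role: Primary
    subnets:
      - 192.168.0.0/24
    ipam:
      lifecycle: Persistent   # assumed knob that keeps IPs stable for VM workloads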

// The network name for which this persistent allocation was created
Network string `json:"network"`
// The pod interface name for which this allocation was created
Interface string `json:"interface"` // *new* attribute
@RamLavi RamLavi Sep 4, 2024

Maybe it's just my experience, but putting Network and Interface at the same level kind of suggests that they refer to the VM.Spec's Network and Interface fields (which are connected in a 1:1 relationship anyway, AFAIU). They are not the same though, right? The Interface field here refers to the interface inside the pod. Perhaps we should make it more explicit then?

Suggested change
Interface string `json:"interface"` // *new* attribute
PodInterface string `json:"podInterface"` // *new* attribute

Contributor Author

It says right there in the comment:

The pod interface name for which this allocation was created
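
As an illustration of the two fields being discussed, a hypothetical `IPAMClaim` could look as follows; `interface` names the pod interface (not the VM spec interface), and all values shown are invented for this sketch:

apiVersion: k8s.cni.cncf.io/v1alpha1
kind: IPAMClaim
metadata:
  name: vm-a.passtnet
  namespace: blue-ns
spec:
  network: passtnet      # the network for which this persistent allocation was created
  interface: ovn-udn1    # *new* attribute: the pod interface this allocation is bound to
status:
  ips:
    - 192.168.0.10/24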

@maiqueb maiqueb marked this pull request as ready for review September 4, 2024 10:05
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 4, 2024
@openshift-ci-robot

openshift-ci-robot commented Sep 4, 2024

@maiqueb: This pull request references CNV-46779 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target either version "4.18." or "openshift-4.18.", but it targets "CNV v4.18.0" instead.

In response to this:

This PR provides an OpenShift enhancement describing the virtualization-specific integrations required for the primary UDN feature, which was designed in #1623.

It builds on top of enhancement #1456, in which we design how to provide persistent IPs for the secondary interfaces (of layer2 or localnet type) of VM workloads.

The overall goal is to give OpenShift Virtualization workloads the ability to live-migrate on UDN networks. UDN networks are interesting for virtualization users because they are managed (they have IPAM) and are closer to what a virtualization user expects - isolated networks - rather than following the Kubernetes networking model, in which all pods can communicate with each other and connectivity is restricted via Network Policy.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Comment on lines 27 to 32
Live-migration is the bread and butter of any virtualization solution.
OpenShift Virtualization does not currently support live-migration over the
OpenShift default cluster network, for a variety of reasons. The scope of this
enhancement is to define how to use the existent
[primary UDN feature](https://github.com/openshift/enhancements/pull/1623/) to
enable virtualization workloads to migrate over the primary network.
Contributor

Also proper virtual machine restart is important and we are covering it too.

Contributor Author

Done. Please resolve if you're OK.

Comment on lines 405 to 409
It makes more sense that multiple primary networks will be used in the hosted
cluster in order to provide tenants with better isolation from each other,
without the need for network policy.
There should be no hypershift platform-specific considerations with this
feature.
Contributor

In case HyperShift moves to primary UDN + passt, the fact that passt does not pass the hostname over DHCP will break HyperShift hosted clusters, since there is an OpenShift check that ensures nodes are not named "localhost".

Contributor Author

This feature was not requested for HyperShift. I can list this, though; but for that, I'd need an open issue to track.

which makes use of the multi requested-chassis OVN feature, in which a single
logical switch port can be bound to multiple chassis (and have traffic
forwarded to them).

Contributor

Should we also add the passt issues here?

  • live migration breaks connections
  • no DHCP hostname

Contributor Author

the latter is a non-issue for the requirements we have.

the former has been analyzed in the proposal section, and an alternative was provided. I think that's more than enough.

N/A

## Alternatives

Contributor

An alternative to annotating the virt-launcher pod would be labeling it, so it's easier to filter the IPAMClaim by pod. Now that we are going to reconcile them, that would be possible, but I'm not sure it adds value.
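
For illustration only, the labeling idea could look like the sketch below; the label key is invented here and is not part of any existing convention:

apiVersion: v1
kind: Pod
metadata:
  name: virt-launcher-vm-a-abcde
  namespace: blue-ns
  labels:
    # hypothetical label mirroring the IPAMClaim name, to ease filtering
    ipamclaims.k8s.cni.cncf.io/claim: vm-a.passtnet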

- would currently only work for IPv4. We would need to advertise routes either
from OVN-Kubernetes, or from within the pod (which would require CAP_NET_ADMIN,
which we do **not** want to reintroduce).
- integrating with pod network ecosystem (service mesh, metrics, etc) difficult
Contributor

Do you remember if the MAC spoofing mechanism was not working with linux bridge, or was it the opposite? I don't know if it would work with passt though.

Contributor Author

I don't know what you mean.

Contributor

There is a MAC spoofing mechanism implemented in the KubeVirt bindings; I don't know whether we are missing that with passt or not.

Contributor Author
@maiqueb maiqueb Sep 4, 2024

No, there's MAC spoof protection implemented in the bridge CNI.

There's also MAC (and IP) spoof protection implemented in OVN-K, so we do more. This is orthogonal to the binding used.

This commit adds the summary, motivation, and user stories sections.

Signed-off-by: Miguel Duarte Barroso <[email protected]>
Signed-off-by: Miguel Duarte Barroso <[email protected]>
…m the subnet range

Signed-off-by: Miguel Duarte Barroso <[email protected]>
Signed-off-by: Miguel Duarte Barroso <[email protected]>
Signed-off-by: Miguel Duarte Barroso <[email protected]>
Comment on lines 28 to 29
OpenShift Virtualization does not currently support live-migration over the
OpenShift default cluster network, for a variety of reasons. The scope of this
Member

This is not true. We do support live migration over the pod network with the use of masquerade. I would say we "allow live migration over the cluster network, but it requires a compromise in the form of NAT, which makes the solution unacceptable for many use-cases".

You may argue that the TCP disconnect means it is not "live", but except for that one issue we tick all the boxes of a live migration, and it is called live-migration by developers and users alike. So to avoid any confusion, I suggest changing this sentence.

Contributor Author
@maiqueb maiqueb Sep 4, 2024

I argue that it's not only TCP disconnect, but TCP disconnect without any chance of ever reconnecting unless the clients go and learn the new IP to contact the VM.

I can compromise for "OpenShift Virtualization live-migration over the default network currently requires a compromise (in the form of NAT) which makes the solution unacceptable for any use case that involves network connectivity".

Is that wrong ?

Jaime asked to be removed as a reviewer.

Signed-off-by: Miguel Duarte Barroso <[email protected]>
This other alternative has the advantage of not requiring a KubeVirt API update
but adds a lot more work in OVN-Kubernetes, and requires updating the de-facto
standard.

Member
@phoracek phoracek Sep 10, 2024

Another option:

Suggested change
Another option is to manage IP assignments on `UserDefinedNetwork`, where IPs from the excluded subnet could be assigned to a VM or a Pod by matching its namespace and name.
The benefits of this are most obvious when looking at `ClusterUserDefinedNetwork`. Keeping all the mappings in a single place would make it easier to get a full picture of free and assigned IPs. Keeping the mapping on the UDN may also be considered more secure - the network belongs to the person defining the subnet, in the case of `CUDN` to the cluster-admin. With this approach it would always be this owner giving away static IPs. This mimics the workflow in traditional networks (reserving IPs in the DHCP server configuration), and also the API used by VMware.
The implementation could look like this:
* We keep IPAMClaims as they are
* UDN now has a new section where the cluster/project admin can assign IPs from the excluded subnet to Pods/VMs
* When a Pod is created, OVN checks if it has a reserved IP on the UDN; if it does, it uses it
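
Purely to illustrate this alternative, a reserved-IP section on the UDN might look like the sketch below; the `reservedIPs` stanza does not exist in any current API and is invented here for discussion only:

apiVersion: k8s.ovn.org/v1
kind: UserDefinedNetwork
metadata:
  name: blue-network
  namespace: blue-ns
spec:
  topology: Layer2
  layer2:
    role: Primary
    subnets:
      - 192.168.0.0/24
    # hypothetical: admin-managed static assignments from an excluded range
    reservedIPs:
      - ip: 192.168.0.200
        target:
          namespace: blue-ns
          name: vm-a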

Contributor Author

This is interesting; it approaches the problem in a different way: I had assumed we would want the VM owner to be able to specify the IP; this approach moves that responsibility / ability to the admin.

Without getting into the technical limitations we would face for it (now), it is quite clear we need to clarify which persona we want to be able to indicate the VM's IPs - the admin, or the VM owner.

I've reached out to some stakeholders / field people. I'll clarify once I have heard back from them.

Until we know exactly who the "user" is for that, there is no point.


I think the suggested alternative is not limited to a cluster admin; it is just a matter of how it is eventually implemented. IMO the concept itself is the most important part: managing IP pools and ranges in a centralized location and treating individual per-VM control as an exception (e.g. a pool with a single IP in it, but there are probably many other ways to approach the need).

@EdDev EdDev left a comment

I am splitting the review between the overview & motivation part and the solution proposal.

I am posting my feedback on the first part now.

@@ -0,0 +1,638 @@
---
title: primary-user-defined-networks-for-virtualization-workloads

Mentioning UDN in the title, summary and motivation mixes a solution proposal with the need. It embeds the solution in advance, which is limiting and blocks alternatives (even though that may not be practical at this stage).

I think it will be beneficial to make sure the need does not mention any specific solution, and that the proposal adds the other options that could have been explored, even though it may be too late.

Contributor Author

This enhancement is about how to integrate with the existing UDN solution.

While a document such as the one you're suggesting makes sense, studying and identifying what could have been is not our goal.

Comment on lines 38 to 41
Virtualization users are commonly used to having layer2 networks, and to having
their tenant networks isolated from other tenants. This is something that
opposes the Kubernetes networking model, in which all pods can reach other pods,
and security is provided by implementing Network Policy.

This part assumes that Virt users require access to the pod/primary network for their needs.
If there is such a need, can you please add it before this paragraph?

In general, the tenancy, isolation and L2 expectations are needs that can be solved through secondary networks. Therefore, the suggestion to use the primary pod network is a solution, not the motivation.
But there may be specific needs for Virt users to use the primary pod network, so let's list them.

Contributor Author

Being specific, AFAICT, you won't be able to use secondary networks in cloud environments (AWS / Azure / ...) for anything other than east / west.

Hence, this type of user (requiring cluster egress in cloud environments) must use the primary network.

Contributor Author

@phoracek can you point out some more usages of the pod network?

Contributor Author

Done.

### User Stories

- As the owner of an application running inside my VM, I want to have east/west
communication without NAT to other VMs and/or pods within the same network.

From later user stories we can understand why a single subnet is needed (e.g. for migration).

Is there some other reason why NAT or different subnets are rejected by such users?

Contributor Author

this is literally what the users are requesting. @phoracek maybe you can provide more insight into their motives?

Member

It is to meet the expectations of the VM workload that users want to migrate to OpenShift.

Users exploring masquerade on Pod network often raise one of these two issues, preventing them from using it:

  • Their existing workload uses the IP seen in the guest as the ID of the system. With masquerade, this IP is always the same, breaking the system
  • Their existing workload would be a distributed application, where each member runs on its own VM. It takes the IP seen inside the guest and reports it to a central system saying "find me here". Of course, this IP is not reachable by others (or goes to loopback)

In short, an application inside a VM expects its IP to be A) unique and B) addressable by its neighbors.

I cannot name the exact applications, but I remember distributed databases and Windows suffering from these problems.

Comment on lines 68 to 77
- As a VM owner, I want to be able to specify the IP address for the interface
of my VM.

Can we please examine whether this is a real story from the field, or whether the need makes sense?
Managing a pool of VMs at medium or large scale makes this story less likely to exist in this form.

E.g. if the need is to import VMs from a different platform, the story is different from the need to define IPs per VM individually.

I would expect operators to control the subnet used for a specific group of VMs, or the sticky IPs that VMs are assigned to be retained when converted to OCP-Virt. These stories may end up with a different implementation.

Contributor Author

Maybe. As indicated in #1673 (comment), we need to clarify this user story. Especially since we don't know who the user in the story is.

Comment on lines 84 to 93
- Provide a configurable way for the user to define the IP addresses on a VM's
interface.

I think this is something we need to discuss further.
We need strong reasoning to aim for this per-VM solution vs. a more general pool-based management.

I would expect users to be allowed to set the subnet in terms of an IP range and then be assigned sticky IPs from that pool. For scenarios where VMs are imported from other platforms and the IP must be preserved (do we have a ref for such a request?), I would recommend a more centralized, general solution. The goal would be to allow such an import, with the implementation listed later.

Contributor Author

ditto. This needs clarification from PMs; waiting on it.


KubeVirt uses a concept called bind mechanisms to extend networking to the VM;
we plan on using the
[passt](https://passt.top/passt/about/#passt-plug-a-simple-socket-transport)

Following the recent talks about splitting the efforts to:
4.18: using tap/macvtap
4.19: using passt
Can you please update the plan?


IMO it is enough to summarize here only that a dedicated binding is chosen, without providing too many details. This should allow us to change the details without the need to update this proposal.

But this is up to the author how to manage this.

Contributor Author

I'll just link the binding enhancement. Simpler.

Comment on lines 630 to 638
While this would be a native way to integrate with the rest of the persistent
IPs feature, it would be a clumsy flow for primary UDNs, since the user would
be expected to use network selection elements to configure the primary UDN.

Furthermore, we currently not allow primary UDNs to be attached via network
selection elements. We would need to break this convention.

This route could be interesting in the future though, since we expect this to
become the way to request SR-IOV resources for primary UDNs.
Contributor
@jcaamano jcaamano Sep 12, 2024

I think this section is only highlighting the negatives of this alternative, but not the positives (or the juxtaposed negatives of the main proposal).

It might look weird and confusing that there are two different ways to use IPAM claims: the NSE for secondary networks but a different annotation for primary networks, when we could just go with what we are already doing.

It is highly probable that we will need the NSE to customize the primary UDN network in the future, just as the default network can be customized today. When that day comes, everything will look even weirder because then you would be passing an NSE and you will be asking yourself which of the two annotations the ipam claim reference has to go on. Best case, we support and are forced to maintain both options.

Moreover, I think there is really no specific reason we don't allow NSE for primary UDNs at this point.

Contributor Author

I think this section is only highlighting the negatives of this alternative, but not the positives (or the juxtaposed negatives of the main proposal).

I'll try to list the benefits.

It might look weird and confusing that there is two different ways to use IPAM claims: NSE for secondary network but a different annotation for primary networks; when we could just go for what we are already doing.

I'm not so sure it fits, or that it wouldn't look weirder than the current proposal. Please keep reading.

It is highly probable that we will need NSE to customize the primary UDN network in the future, just as the default network can be customized today.

I agree it might be needed to customize the primary UDN network.

I also agree there is currently a way to customize the cluster default network - via the multus annotation.

Maybe I just don't understand how this would look like - I'm assuming the NSE would have to request the attachment of the default network. For clarity, could you paste how you envision the NSE for - let's say - a primary UDN requesting an IPAMClaim and SR-IOV device ?

Assuming I've understood this correctly (and the NSE we need to configure will point to the default network attachment), I think it is weird to put an NSE customizing the default network, and for it to really apply / affect the UDN attachment (the IPAMClaim reference, SR-IOV device). After all, you'll need to pass the data corresponding to the default network, but pass the parameters you want to customize for the primary UDN. That, IMHO, doesn't make for a good fit in this particular case for the API. I can argue it's more confusing.

Still, we're evaluating it, and PoCing how it would really look.

When that day comes, everything will look even weirder because then you would be passing a NSE and you will be asking yourself on which of the two annotations the ipam claim reference has to go. Best case, we support and be forced to maintain both options.

Hopefully it won't come to that; when we release this feature in 4.18 we'll have picked one option, and IIUC we would support just that one. I.e. whatever option we end up choosing will be the way to specify this feature.

Moreover, I think there is really no specific reason we don't allow NSE for primary UDNs at this point.

I think we simply wanted to make it impossible for users to specify a primary UDN as a secondary network.

I guess this could be relaxed to allow it when said NSE is selected as the default via the multus annotation.

@qinqon / @tssurya do you have an opinion on this subject ?

@jcaamano

Contributor

Maybe I just don't understand how this would look like - I'm assuming the NSE would have to request the attachment of the default network. For clarity, could you paste how you envision the NSE for - let's say - a primary UDN requesting an IPAMClaim and SR-IOV device ?

Assuming I've understood this correctly (and the NSE we need to configure will point to the default network attachment), I think it is weird to put an NSE customizing the default network, and for it to really apply / affect the UDN attachment (the IPAMClaim reference, SR-IOV device). After all, you'll need to pass the data corresponding to the default network, but pass the parameters you want to customize for the primary UDN. That, IMHO, doesn't make for a good fit in this particular case for the API. I can argue it's more confusing.

It doesn't look weirder than using the default network name to refer to the primary network status in the network status annotation, as we plan to do, from ovn-org/ovn-kubernetes#4671:

[
  {
    "name": "ovn-kubernetes",
    "interface": "eth0",
    "ips": [
      "10.244.1.6",
      "fd00:10:244:2::6"
    ],
    "mac": "0a:58:0a:f4:01:06",
    "default": false,
    "dns": {}
  },
  {
    "name": "ovn-kubernetes",
    "interface": "ovn-udn1",
    "ips": [
      "10.128.0.3"
    ],
    "mac": "0a:58:0a:80:00:03",
    "default": true,
    "dns": {}
  }
]

So at this point we have to come to terms that from the perspective of this API, when addressing the default network name, we might actually be addressing the default network for the namespace whatever that might be.

I am not really sure there is a valid argument to say it's fine to do it in network-status but not in the NSE, in terms of API (or UX).

Contributor Author
@maiqueb maiqueb Sep 12, 2024

It doesn't look weirder than using the default network name to refer to the primary network status in the network status annotation, as we plan to do, from ovn-org/ovn-kubernetes#4671:

[
  {
    "name": "ovn-kubernetes",
    "interface": "eth0",
    "ips": [
      "10.244.1.6",
      "fd00:10:244:2::6"
    ],
    "mac": "0a:58:0a:f4:01:06",
    "default": false,
    "dns": {}
  },
  {
    "name": "ovn-kubernetes",
    "interface": "ovn-udn1",
    "ips": [
      "10.128.0.3"
    ],
    "mac": "0a:58:0a:80:00:03",
    "default": true,
    "dns": {}
  }
]

I disagree; to me, it totally does - they are in different elements in the array. The only thing confusing there is the network name; and you can still differentiate using the interface name.

IMHO that's totally different from using an NSE like this one:

[
    {
        "name":"default",              # relates to a NAD with the default cluster network
        "namespace":"blue-ns",  
        "ipam-claim-reference": "as-we-have.now" # customizes the primary UDN attachment
    }
]

In case I am making an error with the NSE contents, please let me know; I've asked for it in #1673 (comment), but you probably didn't see it.

So at this point we have to come to terms that from the perspective of this API, when addressing the default network name, we might actually be addressing the default network for the namespace whatever that might be.

I will give it another go at the interface reporting PR to see if I can actually reply with the network name of the NAD for the primary UDN. If I manage to, I'd say your comment wouldn't apply any more.

EDIT: I don't think the above is doable.

I am not really sure there is a valid argument to say it's fine to do it in network-status but not in the NSE, in terms of API (or UX).

Contributor

I disagree; to me, it totally does - they are in different elements in the array. The only thing confusing there is the network name; and you can still differentiate using the interface name.

Then we agree to disagree.

That's not the only confusing thing. The fact that "default": true is set for the network that is not the default network is also confusing.

The fact that the array has two elements and therefore it's fine is tunnel vision specific to your use case; it doesn't mean it is fine API-wise.

Using the interface name to distinguish relies on the unspecified convention that eth0 is the main interface; that wouldn't be good either.

We agreed we had to redefine what default means, and maybe the NSE on the default network affecting the primary network fits under that new definition.

However, the NSE might as well be defined addressing the primary UDN network name; nothing would prevent us from handling it. As long as OVNK is handling the CNI call at the end, it doesn't really matter whether the CNI call is for the primary or the default network: we can be aware of it and handle it.

Then the network status would be the opposite, and would report the primary network twice. I guess that is fine as well, if it is fine the other way around.

Contributor Author

I'm not saying you couldn't; I'm pointing out that we'd need to reverse the flow of operations. At least in that we can agree.

Contributor Author

Replying in this thread to some of your comments.

That's not the only confusing thing. The fact that "default": true is set for the network that is not the default network is also confusing.

This aligns with your next comment - there is a need to redefine what "default" means.

We agreed we had to redefine what default means, and maybe the NSE on the default network affecting the primary network fits under that new definition.

The work-group discussed "default" would be up to the implementation to define; that's the reason why it is now proposed for the interface order to matter in that array - you want to have a different default ? Put that first in the list.

@dougbtv could you keep me honest in this last comment ?

Contributor

Maybe to keep this discussion short, I think it is feasible to use the NSE and reasonable to highlight the positives of that option in the alternative. That would be my whole ask.

Contributor Author

Maybe to keep this discussion short, I think it is feasible to use the NSE and reasonable to highlight the positives of that option in the alternative. That would be my whole ask.

I agree that it is feasible, and have committed to experiment with it already. We're currently PoCing it.

I've also said I would highlight the positives of it in the alternative section.

Contributor Author

Done.

@jcaamano
Contributor

The work-group discussed "default" would be up to the implementation to define; that's the reason why it is now proposed for the interface order to matter in that array - you want to have a different default ? Put that first in the list.

That's fine, but if you don't see the confusion in saying that the default is the one that comes first in the list while it has default: false and the second in the list has default: true; and then say it is confusing to use the network name "ovn-kubernetes" and having us define it un-distinctively as the default or the primary network in the NSE; while we are already conveniently doing that same thing in the network-status, and also now that we agreed the definition of "default" is implementation defined; then I would think you are conveniently driving your train of thought toward your specific use case.

I hope I am getting my point across here. It's like, yeah, it's fine to do this here but not there, because we can't apply the same interpretation that makes it fine in the former to make it fine in the latter.

I'm not saying you couldn't; I'm pointing out that we'd need to reverse the flow of operations. At least in that we can agree.

Not sure if I understand what you mean. The way I am thinking about it is that once we recognize that we got a request for a primary UDN, then we could just do the exact same thing we do now.

@maiqueb
Contributor Author

maiqueb commented Sep 12, 2024

The work-group discussed "default" would be up to the implementation to define; that's the reason why it is now proposed for the interface order to matter in that array - you want to have a different default ? Put that first in the list.

That's fine, but if you don't see the confusion in saying that the default is the one that comes first in the list while it has default: false and the second in the list has default: true; and then say it is confusing to use the network name "ovn-kubernetes" and having us define it un-distinctively as the default or the primary network in the NSE; while we are already conveniently doing that same thing in the network-status, and also now that we agreed the definition of "default" is implementation defined; then I would think you are conveniently driving your train of thought toward your specific use case.

Wait - no. There's some confusion here. The default network would be the first interface reported in the CNI result, not in the network-status annotation. Hence, the implementation has full control over what it considers to be the default - just return whatever interface you consider the default as the first interface in the CNI result. Maybe I just don't follow your initial and last premises, which are (for clarity):

  • saying that default is the one that comes first in the list while it has default: false and the second in the list has default: true
  • we agreed the definition of "default" is implementation defined

Then (maybe it's just me ...) I find it confusing to define in the NSE information for the primary UDN interface but point to the default cluster network (via the name attribute). Yes - I find that confusing: it's within the same element in an array. Its very structure makes me think it points to the same object.

And I find it not as bad (which is not to say I find it absolutely fine) to return multiple interfaces with the same name in the status, because those "live" in different elements in the array and can be identified by the name of the interface. In this scenario, the structure of the data makes me interpret it as belonging to different things. Maybe I'm silly for thinking this is not as bad.

I hope I am getting my point across here. It's like, yeah, it's fine to do this here but not there, because we can't apply the same interpretation that makes it fine in the former to make it fine in the latter.

I think I understand your point. You see reporting multiple interfaces with the same name as exactly the same as customizing the primary UDN while pointing to the default cluster network in the NSE.

I'm not saying you couldn't; I'm pointing out that we'd need to reverse the flow of operations. At least in that we can agree.

Not sure if I understand what you mean. The way I am thinking about it is that once we recognize that we got a request for a primary UDN, then we could just do the exact same thing we do now.

Yes.

is responsible for configuring networking up to the pod interface), we still
need to extend connectivity from the pod interface into the VM.

During live-migration - once it is is scheduled - and the destination

typo: double is.

the *destination* pod). Once this pod is ready, the *source* pod transfers the
state of the live VM to the *destination* pod via a connection proxied by the
virt-handlers (the KubeVirt agents running on each node) to the *destination*
pod.

I would recommend adding here some points about how this works, as it has implications for the solution:

  • The destination pod is pretty much identical to the source pod, and both pods are expected to be in a "running" state for the duration of the migration (which may take tens of seconds).
  • During the migration, only the source VM is active and running; the destination one is being prepared. At the very end, the source VM is paused and the destination started.
  • The source pod may still be around for some time before finally being removed after a successful migration. I.e. the VM is already stopped on the source, even removed, but the pod may still be in a teardown process waiting for other dependencies to be cleaned up.



To allow the user to specify the IP address for the workload we plan on adding
an attribute to the KubeVirt API
[interface](https://kubevirt.io/api-reference/main/definitions.html#_v1_interface)
named `ipAddress`. This approach suits the product, and meets the user's
expectations because the user can configure the MAC address for an interface in
that same `Interface` resource.

Regarding the implementation, OpenShift Virtualization would extend what it
does for the MAC address: define it in the k8snetworkplumbingwg de-facto
standard as an IP request, and the OVN-Kubernetes CNI would attempt to honor
it.
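
A sketch of how the proposed attribute could surface in a VM spec, next to the existing `macAddress` precedent; `ipAddress` is the new attribute proposed here and does not exist in the KubeVirt API today:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vm-a
spec:
  runStrategy: Always
  template:
    spec:
      domain:
        devices:
          interfaces:
            - name: defaultnet
              binding:
                name: passt                     # existing network binding plugin mechanism
              macAddress: "02:00:00:00:00:01"   # existing field
              ipAddress: "192.168.0.10"         # proposed *new* attribute
      networks:
        - name: defaultnet
          pod: {}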

Kubevirt is agnostic to IPAM in terms of definition. It only reflects the status of the interfaces' IP details.
Unlike the MAC address, which controls the emulated VM vNIC, the IP is an external service.

The current solution with the IPAMClaim is also an external add-on to Kubevirt, and I would prefer this to be continued.

There is also the topic that was recently raised about the necessity and priority of this requirement. If we can add this as an optional future enhancement, it would be great.


Comment on lines +437 to +489
If the tests show that the traffic disruption takes too long, we should
prioritize the
[seamless live-migration epic](https://issues.redhat.com/browse/CNV-27147)
which makes use of the multi requested-chassis OVN feature, in which a single
logical switch port can be bound to multiple chassis (and have traffic
forwarded to them).

From my perspective (representing Kubevirt), this is a must for a seamless migration.
The number of packets that may be dropped depends on the time it takes for the source pod to be removed, something that depends on many factors in the Kubevirt context.
If you are interested in testing, I suggest examining this with setups that are expected to cause delays in the pod teardown.

The Kubevirt design proposal that suggests the binding to be used mentions this one as a pre-requirement.

Contributor Author

Seamless migration is not a goal - live migration is.

I cannot prioritize that epic; PMs must.

FWIW: I think we can tackle this feature in the 4.18 time-frame if we have the SDN team's help.
It would be especially interesting to know how the "feature" would be configured, since the exact same concept can be used for port-mirroring.


### Dev Preview -> Tech Preview

There will be no dev or tech preview for this feature.

This is a very big risk. How about we mention this in the risk section?

Contributor Author

I can, but I don't think this is debatable.

And I feel stupid listing something as a risk without providing any mitigation for it.


## Alternatives

### Binding mechanism

This section will need to sync with the U/S Kubevirt proposal.
Hopefully that proposal will be posted to the Kubevirt community in a few days with the plan that fits the technical aspects, maintainability and the timeline limitations.

integration. This would leave service mesh / ACS / metric integration as the
drawback of this option.

### VM interface IP address configuration

I would suggest that this, and the need for it, be reconsidered or mentioned as a future option.

If this need is still targeted for 4.18 or a future version, I think there are other options to manage the IP pools that are closer to an add-on service, rather than using the IPAMClaim to solve per-VM-vNIC manual IP allocation.

Contributor Author
@maiqueb maiqueb left a comment

Thanks for the reviews so far.



Provisioning the VM in the system will be "watched" by two components; KubeVirt
`virt-controller` and KubeVirt `ipam-extensions`. These components will then:
- virt-controller: template the pod where the VM `vm-a` will run



api-server -->>- kubevirt-ipam-extensions: OK
end
rect rgb(0, 255, 255)
Note over kubevirt-ipam-extensions: VM / VMI controller
Contributor Author

These are not the KubeVirt VM / VMI controllers.

We have a controller in the ipam-extensions project that tracks the VM lifecycle (it creates IPAMClaims on behalf of the user, for instance).

I'm just trying to point out that this does not run in a webhook, unlike the section just above this one.

Comment on lines 286 to 275
- cloudInitNoCloud:
networkData: |
version: 2
ethernets:
eth0:
dhcp4: true
Contributor Author

Thanks; good suggestions from both of you.

Signed-off-by: Miguel Duarte Barroso <[email protected]>
We realized we do not understand the use case for allowing the user to
specify the IP address for their VMs. Thus, we have postponed that goal
to a future release, while re-scoping the goal for this one to be about
persisting the IP address of the VM migrating from another
virtualization platform into OpenShift Virtualization.

Signed-off-by: Miguel Duarte Barroso <[email protected]>
Contributor

openshift-ci bot commented Sep 20, 2024

@maiqueb: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels
jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
8 participants