
Network interface hotplug for Kubernetes


Formal Metadata

Title
Network interface hotplug for Kubernetes
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
Design and implementation of dynamic network attachment for Kubernetes pods and KubeVirt VMs. Immutable infrastructure is the law of the land in the cloud native landscape, promising benefits to software architectures run in Kubernetes. … except sometimes the rules must be broken to achieve certain use cases; take for instance the dynamic attachment of L2 networks to a running VM: to hotplug an interface into the VM running in a pod, you first need to hotplug that interface into the pod. This feature is particularly of interest (required, actually) to enable scenarios where the workload (VM) cannot tolerate a restart, or when the workload is created prior to the network. When thinking about strategies for tackling this problem, we faced a recurring question when trying to come up with a modular design to provide this functionality: "should the changes be located in KubeVirt, and thus solve this issue for Virtual Machines, or should we take the longer path and address this issue also for pods?" We chose the latter, which unlocks dynamic network attachment for pods, thus also benefiting the Kubernetes community. This talk will provide the audience with a basic understanding of KubeVirt, CNI, and Multus, and then propose a design to add (or remove) network interfaces from running pods (and virtual machines), along with the changes required in Multus and KubeVirt to make it happen. It will also factor in a community perspective, explaining how we pitched and got both the Multus and KubeVirt communities involved in a working arrangement to deliver this functionality.
Transcript: English (auto-generated)
Good afternoon, my name is Miguel Duarte, I'm a software developer working for Red Hat on the OpenShift Virtualization networking team. OpenShift Virtualization is the downstream distribution of the KubeVirt project, essentially a virtualization plugin for Kubernetes, allowing users to run and interconnect both pods and virtual machines in the same orchestration engine. I'm here today to present a talk about Network Interface Hotplug for Kubernetes and for KubeVirt in FOSDEM's Virtualization and Infrastructure as a Service Developer Room.
Let's start with the agenda for this presentation. To prepare the audience for this talk, the introduction section must feature brief explanations of the CNI, Multus, and KubeVirt projects. Once these concepts are clear, we can then explain what our motivation is for hotplugging network interfaces into Kubernetes pods and KubeVirt VMs, and from there be able to specify the problem and set clear goals for the implementation section. Afterwards, we will briefly describe how this proof of concept was developed and explain the changes required in Multus and in KubeVirt.
We will then demo the feature and finalize with the conclusions and the next steps for this work. To provide some context to the audience, we first need to address the Kubernetes networking model. The Kubernetes networking model is quite simple, and according to it, all pods can communicate
with all other pods across different nodes, even when directly connected to the host network. Furthermore, the Kubernetes agents can communicate with any pod on the node where they are located. In order to implement the networking model, Kubernetes relies on CNI, which
stands for Container Network Interface, and is a Cloud Native Computing Foundation project. CNI is a plugin-based networking solution for containers, and it is also orchestration-engine agnostic. This means that Kubernetes is in fact just another runtime for CNI.
CNI implements the Kubernetes networking model by reacting to the following events. Whenever a pod is added, it will create and configure a networking interface in the pod and connect it to the cluster-wide network.
On the other hand, whenever a pod is deleted, it will perform cleanup of the allocated network resources. It is also interesting to say that Kubernetes chose to use CNI in a very minimalistic way to implement its network model: it configures a single interface on the pod, which essentially means there
is a single cluster-wide network connecting all the pods across the cluster. Regarding how it works, the CNI plugins are simply binary executables hosted on the host file system. They are spawned by the runtime, in this case Kubernetes, upon certain events, whenever a pod gets added or removed, as we've discussed previously.
The input configuration is passed via standard input and is basically a JSON-encoded string, and the structured result is reported via standard output, cached on disk, and is also a JSON-encoded string.
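As a rough illustration of this contract (paths and identifiers below are made up; this is what the runtime does on your behalf, not something you would normally run by hand):

    # Hypothetical manual invocation of a CNI plugin, mimicking the runtime.
    # CNI_COMMAND selects the operation (ADD / DEL), the JSON configuration is
    # passed on stdin, and the plugin prints a JSON result on stdout.
    CNI_COMMAND=ADD \
    CNI_CONTAINERID=example-container-id \
    CNI_NETNS=/var/run/netns/example-ns \
    CNI_IFNAME=eth0 \
    CNI_PATH=/opt/cni/bin \
    /opt/cni/bin/bridge < /etc/cni/net.d/10-example.conf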
In this slide, we can see a very simple example of a CNI configuration for a known plugin type, which I'll use to explain how the runtime knows which CNI plugin to invoke. The type attribute in the CNI configuration must match the name of a binary executable located in a well-known directory on the host file system, whose default location is /opt/cni/bin.
It is also interesting to say there are standard keys in the configuration, for instance name, cniVersion, type and ipam, but there are also plugin-specific keys; in this example, the kubernetes key is one of those, used to indicate the path to the kubeconfig.
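To make this concrete, here is a minimal sketch of such a configuration. All values are made up for illustration: bridge and host-local are standard reference plugins, and the kubernetes/kubeconfig block stands in for the plugin-specific key mentioned above (it is only honored by plugins that understand it):

    {
      "cniVersion": "0.3.1",
      "name": "example-network",
      "type": "bridge",
      "ipam": {
        "type": "host-local",
        "subnet": "10.10.0.0/24"
      },
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/example.kubeconfig"
      }
    }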
As indicated before, Kubernetes chose to only provide a single network interface per pod to interconnect the entire cluster. If for whatever reason you require more than one, you need to search for answers outside the realm of Kubernetes. This brings us to Multus. Multus is a meta CNI plugin, meta in the sense that it will in turn invoke other CNI plugins, named delegates.
It enables a pod to have more than one interface. It even allows for a many-to-many interface-to-network association, meaning you can have multiple connections to the same network or connect to many different networks,
each implemented by a different CNI plugin. After having Multus deployed in your system, requesting additional network interfaces from it is quite simple. You just have to specify a list of attachments using a special annotation on the pod. Its key is k8s.v1.cni.cncf.io/networks and its value is a JSON-encoded string featuring the list of network selection elements.
The featured example is quite simple. The attachments just state their name. But you can use this to specify more complex scenarios, like requesting a specific IP or MAC address for the particular attachment.
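As a sketch (network names are hypothetical, and the fixed IP assumes the delegate's IPAM supports static addressing), a pod requesting two extra attachments could be annotated like this; a plain comma-separated list of attachment names also works for the simple case:

    apiVersion: v1
    kind: Pod
    metadata:
      name: example-pod
      annotations:
        k8s.v1.cni.cncf.io/networks: |
          [
            {"name": "tenant-blue"},
            {"name": "tenant-red", "ips": ["192.0.2.10/24"], "mac": "02:23:45:67:89:01"}
          ]
    spec:
      containers:
      - name: app
        image: registry.example.com/app:latest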
The JSON-encoded string featuring the CNI configuration is found within the Kubernetes datastore, in an object of type NetworkAttachmentDefinition. Multus will query the API server for the attachment whose name is indicated in the pod's annotations,
and then will use the CNI API to invoke the correct binary with the CNI configuration passed via stdin. The NetworkAttachmentDefinition object type is a Kubernetes API extension which is provided and installed by Multus upon its deployment.
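A NetworkAttachmentDefinition is essentially a named wrapper around a CNI configuration. A hypothetical one for a simple secondary bridge network could look like this (the name, bridge, and subnet are made up):

    apiVersion: k8s.cni.cncf.io/v1
    kind: NetworkAttachmentDefinition
    metadata:
      name: tenant-blue
    spec:
      config: |
        {
          "cniVersion": "0.3.1",
          "name": "tenant-blue",
          "type": "bridge",
          "bridge": "br-tenant-blue",
          "ipam": {
            "type": "host-local",
            "subnet": "192.0.2.0/24"
          }
        }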
In these diagrams, we have two scenarios. The left diagram represents a typical vanilla Kubernetes deployment.
The right diagram depicts a deployment with Multus. Multus is deployed as the default cluster network CNI binary and will, in turn, always invoke a common cluster-wide CNI plugin responsible for creating the pod's primary network.
If no other networks are specified in the pod's networks annotation, Multus is just a proxy between the original cluster network's CNI and Kubernetes. When additional networks are requested via the pod's networks annotation, Multus will query the Kubernetes API for the attachment information and then
proceed to invoke the correct CNI, passing the aforementioned configuration via stdin. The delegate plugin will then create and configure an additional network interface on the pod. Now that we've understood how CNI is used to implement the Kubernetes networking model and how Multus is
used to enable pods to feature multiple network interfaces, it is time to present the KubeVirt project. KubeVirt is essentially a virtualization plugin for Kubernetes that allows users to run virtual machines inside Kubernetes pods.
It gives users the ability to run, manage, and interconnect both virtual machines and containers within the same platform, Kubernetes, following its philosophy and semantics. A good example of this is that the VMs are described using the Kubernetes declarative API.
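As an illustration of that declarative API, a heavily trimmed VirtualMachine manifest might look like the following sketch (the name and disk image are placeholders, and many fields are omitted):

    apiVersion: kubevirt.io/v1
    kind: VirtualMachine
    metadata:
      name: example-vm
    spec:
      running: true
      template:
        spec:
          domain:
            devices:
              disks:
              - name: rootdisk
                disk:
                  bus: virtio
              interfaces:
              - name: default          # primary interface, connected to the pod network
                masquerade: {}
            resources:
              requests:
                memory: 1Gi
          networks:
          - name: default
            pod: {}                    # the cluster-wide pod network
          volumes:
          - name: rootdisk
            containerDisk:
              image: quay.io/kubevirt/cirros-container-disk-demo   # placeholder image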
Given these advantages, the most common use case is a migration path from virtualization workloads to a containerized, microservice-based solution, where you little by little decompose your existing virtual machines
into a microservices-based architecture by splitting the virtualized workloads into tinier pieces that fit containers. An obvious advantage of this approach is a single common platform for the development and operations teams. I will now use this architecture slide to reference and explain the most relevant actors of the KubeVirt architecture.
On the right side of the slide, we have N launcher pods, each encapsulating the libvirt plus QEMU processes for every provisioned virtual machine. In the middle, there is a dedicated pod per node, running the KubeVirt agent.
It will ensure the virtual machine's declarative API is enforced by making the VM's observed state converge to the declared state. Finally, on the left side, we have a cluster-wide virtualization controller pod that monitors all things related to virtualization.
This component is also responsible for owning, specifying, and managing the pods where the virtual machines run. Now that the audience understands what CNI and Multus are, and also has a basic understanding of KubeVirt's architecture, we can indicate the motivation for this feature, specify the problem we're trying to solve, and list the goals for the implementation.
The motivation for attaching new interfaces to running virtual machines without requiring a restart stems from the fact that some VMs run critical workloads which cannot tolerate a restart without impacting service.
A common scenario is when such a VM is created prior to the network. Imagine for some reason an organization's network topology is updated and the VM running the critical workload must connect to a newly created network. Furthermore, adding or removing network interfaces to or from running VMs is an industry-standard capability
available in multiple virtualization platforms with which KubeVirt wants to have feature parity. Given this, we can now define the problem as providing the dynamic attachment of L2 networks without requiring the restart of the workload, whether it is a pod or a virtual machine.
The goals for the implementation stage are then to add network interfaces to running virtual machines, to remove network interfaces from running virtual machines and, finally, to allow a virtual machine to have multiple interfaces connected to the same secondary network.
It is very important to highlight here that plugging an interface into a VM requires an interface to be first plugged into the pod where the VM is running. Now that we've explained our motivation, defined the problem and set clear goals, we can move into the implementation section, starting with the changes required on Multus.
Remember that Multus is a CNI plugin and, as a result, it is a simple binary executable on the Kubernetes node's file system, invoked by the runtime upon a certain set of events: adding or deleting a pod, for instance. Multus is now required to watch for updates to the pod's networks annotation and then trigger the correct delegate CNI whenever the annotation changes.
For instance, when a request for a new attachment is added to the pod's network annotation list, this control loop should reconcile the pod by invoking the delegate CNI with the add command.
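In other words, what this new control loop reacts to at the pod level is just an annotation update on a live pod; something along these lines (pod and network names are hypothetical) is what would drive the delegate CNI invocation:

    # Hypothetical example: add attachment requests to a running pod's networks
    # annotation; the Multus control loop would reconcile this by invoking the
    # delegate CNI plugins with the ADD command for the new entries.
    kubectl annotate pod example-pod --overwrite \
      k8s.v1.cni.cncf.io/networks='[{"name": "tenant-blue"}, {"name": "tenant-red"}]'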
On the other hand, whenever an entry is removed from the pod's network annotation list, the delegate CNI should be invoked, this time around with the delete command. The big question here is: where should we put this controller code? In order to host this control loop code that reconciles the workload pods, we first have to re-architect Multus as a thick CNI plugin.
A thick plugin is characterized by a client-server architecture where the client, the Multus shim on the picture, is just a binary executable on the host file system. It still implements the CNI API that we've shown previously, but all the
heavy lifting is executed by the Multus controller, also shown on the picture. The Multus server side will expose a RESTful API on top of a Unix domain socket that is bind-mounted into the host, thus enabling the client to contact the server.
The pod reconciliation loop, described previously, will be implemented in the Multus controller, thus allowing Multus CNI to react to custom events, in this case updates to the pod's networks annotation. Now that we've understood the changes required in Multus, which add or remove interfaces to/from the pod, we can proceed with the changes required in KubeVirt to extend this connectivity from the pod into the running VM.
To do so, I'll start by showing a network diagram of a pod running a virtual machine. As you can see, there is a pod interface created by CNI, connected to an in-pod bridge, which in turn has a connection to a tap device that QEMU uses to create an emulated network device for the VM.
A good API for interface hotplug for KubeVirt VMs would follow the same approach we described for pods, where updates to the VM spec, whether you add or remove interfaces, would trigger the interface hotplug or unplug.
Unfortunately for us, that is not possible, since updates to the VM spec are only allowed to the KubeVirt control plane entities. As such, we have to update the VM spec via a newly added subresource, which is triggered by the KubeVirt CLI. When a KubeVirt user triggers the add interface or remove interface command, it will send a REST PUT request to the
add interface subresource of the VMI, whose handler will in turn patch up the VM's interface and networks list on its spec. The cluster-wide virtualization controller is continuously monitoring the VMs. Whenever it sees a difference between the interface list on the VM's
observed interface status and its interface spec, it will recompute the pod's network annotations and update the pod's spec with this data. Once the Multus controller sees the update of the pod's networks annotation, it will mutate the pod's networking infrastructure
by adding another pod interface via its CNI delegate, whose networking must yet be extended into the running virtual machine. Once the cluster-wide virtualization controller's reconcile loop notices there are interfaces listed on the pod's status that are not reflected in the VM's interface status, it will
mutate the VM's interface status with these new interfaces, indicating their pod counterparts are already available. The control loop of KubeVirt's agent, which only focuses on the VMs running on the node it manages, will then see this update and act accordingly.
Acting accordingly in this context means doing two different things. The first is to create all sorts of auxiliary networking infrastructure to extend network connectivity from the pod interface into the virtual machine: this step will create another in-pod bridge that interconnects both the pod's interface,
which was previously connected by the delegate CNI plugin, and a newly created tap device. The second is that KubeVirt's agent will converge the VM's specification with its observed status: it will invoke attach interface for new networks
listed on the spec, and call detach interface for interfaces listed on the status but not present on the spec. Once libvirt processes the dynamic attachment operation, the newly created emulated network device will be available inside the running VM.
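In API terms, the net effect of the whole flow is that the VMI's spec ends up with an extra interface/network pair referencing the secondary network. A rough sketch of that fragment (the interface and network names are hypothetical) follows:

    # Fragment of a VMI spec after the hotplug request has been processed.
    spec:
      domain:
        devices:
          interfaces:
          - name: default
            masquerade: {}
          - name: hotplug-nic        # added by the add-interface subresource handler
            bridge: {}
      networks:
      - name: default
        pod: {}
      - name: hotplug-nic            # points at the secondary network's attachment
        multus:
          networkName: tenant-blue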
The last thing I want to address in the implementation section is related to QEMU's machine type. This attribute can be seen as a virtual chipset that provides certain default devices for the VMs: graphics card, ethernet controller, etc. QEMU supports two main variants of machine type for x86 hosts: the legacy PC chipset and Q35.
The most modern machine type available, Q35, has a limitation: by default, it supports a single hotplug operation. When users require more than one, they must prepare in advance by requesting an appropriate number of PCI Express root port controllers.
Our solution for this was to mimic OpenStack Nova's implementation, and expose a knob where the users can specify the number of root port controllers they want available on the VM.
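The exact field is not pinned down in this talk, so the snippet below is purely hypothetical; it only illustrates the idea of declaring the number of PCI Express root ports up front in the VM spec:

    spec:
      domain:
        devices:
          # Hypothetical knob (the field name is made up for illustration):
          # pre-allocate extra PCI Express root ports so that more than one
          # interface can later be hotplugged into a Q35 machine type VM.
          pciExpressRootPorts: 4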
The first demo we'll see is of a hotplug operation against a Q35 machine type VM. The first thing we'll do is start our scenario. It will need to update KubeVirt's feature gates, marking this non-generally-available feature as available to the users.
Secondly, it provisions a network attachment definition, holding the specification of the network which will be hotplugged later on. Lastly, it provisions a VM with a single network interface in it, connecting the VM to the cluster's default network.
The top right corner of the shell will be used to monitor the VM's interface status, while the bottom right corner will list the associated pod network annotations. We will need the pod name for that.
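For reference, both things can be watched with plain kubectl; the VM and launcher pod names below are placeholders for the ones used in the demo:

    # Watch the VMI's observed interface status.
    watch -n1 "kubectl get vmi example-vmi -o jsonpath='{.status.interfaces}'"

    # Dump the launcher pod's annotations, which include the Multus networks annotation.
    kubectl get pod virt-launcher-example-vmi-abcde -o jsonpath='{.metadata.annotations}'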
As expected, the VM's observed status features a single interface, and the pod's networks annotation features an empty list, meaning there aren't any secondary networks available to the pod. I will now request KubeVirt to hotplug an interface into our running VM. The add interface command
will be issued, requesting a new interface connected to the dedicated network to be made available on the VMI. As we can see, a new secondary network is listed in the bottom right corner, which triggered Multus to create a new network interface in the launcher pod.
After a while, KubeVirt proceeded to extend networking from the pod's interface into the VM, and this new interface is now listed as available within the virtual machine. If we try again to use the hotplug feature for this virtual machine, we'll see the plug operation fails in KubeVirt, since there aren't any available PCI slots.
In this second demo, we will present the exact same scenario, but this time, making sure to request more PCI Express root port controllers. It once again starts by provisioning the scenario. As before, the VM's interface status will be monitored
in the right shell. This time around, there is no need to monitor the pod network annotations. As you can see, the only difference between the current and previous scenarios, other than the
name of the VM, is a newly exposed attribute indicating the number of PCI root port controllers. As can be seen on the left side of the shell, there is a single interface available within the VM, and we will now request a new interface for it via the KubeVirt CLI.
Again, as you see on the right side of the shell, we can see the newly added interface after just a few seconds. Let's now try to hotplug another interface, this time with a different name. As you can see in the right side of the terminal, another interface status is listed corresponding to the newly attached network interface.
We finally log in again over the console to the VM, where we can see the three network interfaces. This concludes our demo. This last demo will feature the reverse flow: unplugging an existing network interface from a virtual machine.
As usual, the left side of the terminal will be used to interact with the virtualization workloads, while the right side will be used to monitor the virtual machine's interface status. As can be seen, both terminals show three different interfaces in the running virtual machine.
Let's now invoke the remove interface CLI command to remove one of those interfaces from the VM. As you can see in the right terminal, the hotplug status type changed to a pending unplug operation, and after a short while, the entry disappears altogether.
When we check the state over the VM's console, we see the corresponding interface, Ethernet1, was removed from the domain. As for conclusions, the first one should be pretty obvious by now: to plug an interface into the VM, it must first be plugged into the pod.
At the pod level, I'd like to highlight that unplugging the default cluster network interface from the pod is not possible, and is entirely out of scope for this feature. Furthermore, plugging and unplugging interfaces to/from the pod is implemented by Multus, which is of course a requirement for this feature.
Finally, some QEMU machine types require VM spec updates indicating the number of PCI root port controllers. Otherwise, the users will get a default which only leaves them able to hotplug a single network interface into the VM. To conclude this talk, let's quickly enumerate the required next steps for this feature.
The software we're running in this presentation is essentially a proof of concept. None of the code was actually merged at the time I recorded the video. As such, we first need to get the Multus code changes merged, and afterwards focus on the KubeVirt code and productize it.
And this is all, we've reached the end. I thank you for your time, I hope you learned something, and I will leave you with some interesting resources so you can get more information about this subject. Bye!
So I see here one question, I'm going to read it out loud.
We have Fruity Welsh asking if it is possible to implement multiple interfaces on the same CNI, so that you can have multiple IPs on the same VLAN, or multiple VLANs on a trunked interface. Yes, it is possible.
I have to figure, one second, I need to find a way first to stop this.
So again, it is possible to have that. One of the goals we had from the beginning was that you're able to have multiple connections to the same network.
So yes, this is possible.