Need to connect your k8s pods to multiple networks? No problem [with calico/vpp]!
Formal Metadata
Title: Need to connect your k8s pods to multiple networks? No problem [with calico/vpp]!
Title of Series: FOSDEM 2023
Number of Parts: 542
License: CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/61598 (DOI)
Language: English
FOSDEM 2023, 158 / 542
Transcript: English (auto-generated)
00:05
Hi everyone. So that's the last talk of the day. I'm going to talk about connecting Kubernetes pods to multiple networks. Doug already
00:20
spoke about this in a slightly different way. I'll take a slightly different approach. So first a few things about myself. I'm a software engineer at Cisco working on container networking things and I'm a maintainer of Calico VPP, which is going to be the topic of this talk. This talk is also a bit particular. It's a result of a collaboration effort with many
00:43
awesome people like Tigera, Intel mostly, and Cisco, and a direct collaboration with Maritika Ganguly, who is a PE at Intel, but she sadly couldn't be here today because it's quite far from the US where she lives. But I'll do my best to present her work. So first, a bit of the
01:04
background story of this work. So in the world of endpoint applications, Kubernetes has really become the solution of choice when it comes to deploying large-scale services in various environments, because it provides the primitives for scalability. So MetalLB, which we
01:24
saw in a previous talk, services, health checks, and so on. It also provides uniformity of deployment, and it's also platform agnostic, so you don't need to know what you're running on. But coming from the CNF land, trying to deploy network functions in this environment, the story is not the
01:43
same. So I'll define a bit more what I mean by CNF, because it's a bit different from the standard CNF use case, the 5G one. I'll take an example for the sake of this presentation. So typically, what
02:00
I mean by that is a WireGuard head-end. For example, you have a customer and you want to deploy a fleet of WireGuard head-ends to give that user access to a resource in a company network. So typically, a very private printer that everybody wants to access because people like to print. So
02:22
the particularity of this use case is that it's dynamic enough to benefit from the abstraction that Kubernetes brings. I've lost my mouse. So typically load balancing, scheduling, and those kinds of things. But it has a lot of specific needs. For example, ingress has to be done in a
02:42
peculiar way, because you have the WireGuard traffic, so typically you want to see which IP it's coming from. You also have constraints on how you receive traffic, and that's where you need multiple interfaces going into your pod. And you also require high performance because of the encrypted traffic. So typically you want those things to run
03:03
fast, and you have a lot of users using them. So not for that printer, but assuming it's a bigger use case. So we tried to design a solution for that. There are lots of components at play, and I'll try to go through them quickly. So on top we have our application, here the
03:25
WireGuard VPN head-end. We want to deploy it on top of Kubernetes, so we had to choose a CNI. We went with Calico, mainly because of the cuteness of their cats, but also because it provides a really nice
03:41
interface for supporting multiple data planes, and also a nice BGP integration that allows us to tweak the way we process packets. And for processing packets, we use FD.io's VPP as a data plane. That gave us more control over how packets are processed, and it allowed us to go deeper into
04:05
the way the network is actually managed, at a really low level. There are also other components that come into play, but more on this later. So I'm going to go quickly over Calico and VPP, because they have been presented many times.
04:23
So in short, Calico is a Kubernetes CNI providing a lot of great features: policies, BGP support for really huge clusters, and so on. The point that's important for this presentation is that it has a very well-defined control plane / data plane interface, allowing us to plug new performance-oriented
04:41
software underneath it without too much hassle, and that's what we are going to leverage. So we chose to slip VPP underneath Calico, first because we were originally contributors to this open source user-space networking data plane, so it was a natural choice. But it also has a lot of cool
05:03
functionalities built in, and it's extensible. So I'm doing a bit of publicity for the software I come from, but it was a good tool for this use case. It's also quite fast, so it really fits the needs of this application. So how did we bind them together? What we did is we
05:25
built an agent running as a DaemonSet on every node, so deployment is the same as for a simple pod, just with more privileges. We register these agents with Calico as a Calico data plane, using the gRPC interface and the
05:40
APIs it exposes to decouple the control and data planes. The agent listens for Calico events and then programs VPP accordingly. We also built a series of custom plugins for handling NAT, service load balancing, and so on, and we tweaked the configuration so that things behave nicely in a container-oriented environment.
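To make that agent pattern a bit more concrete, here is a deliberately simplified Go sketch of the idea: receive control-plane events and translate each one into data-plane programming. The event and VPP types are hypothetical stand-ins, not the actual Calico dataplane API or the calico-vpp agent code.

```go
package main

import "log"

// Hypothetical control-plane events. In the real integration these come from
// Calico over its gRPC dataplane interface; the structs here are stand-ins.
type podAdded struct{ namespace, name, podIP string }
type podDeleted struct{ namespace, name string }

// Hypothetical handle on the data plane. The real agent drives VPP through
// its binary API; here we only log what would be programmed.
type vppDataplane struct{}

func (v *vppDataplane) createPodInterface(ns, name, ip string) {
	log.Printf("vpp: create pod interface for %s/%s, address %s, program routes/NAT", ns, name, ip)
}

func (v *vppDataplane) deletePodInterface(ns, name string) {
	log.Printf("vpp: tear down pod interface for %s/%s", ns, name)
}

func main() {
	vpp := &vppDataplane{}
	events := make(chan interface{}, 16)

	// In the real agent this channel would be fed by the Calico event stream;
	// we inject a single event here just to show the translation loop.
	events <- podAdded{namespace: "default", name: "wg-headend-0", podIP: "10.0.12.7"}
	close(events)

	// The core of the agent: one loop turning control-plane events into
	// data-plane programming calls.
	for ev := range events {
		switch e := ev.(type) {
		case podAdded:
			vpp.createPodInterface(e.namespace, e.name, e.podIP)
		case podDeleted:
			vpp.deletePodInterface(e.namespace, e.name)
		}
	}
}
```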
06:05
With all this, we have all the bricks we need to bring VPP into the cluster, and so to have control over everything that happens in Kubernetes networking. So what happens exactly under the hood? What we do is swap all the
06:23
network logic that was happening in Linux over to VPP, so from this configuration to that one. The thing is, as VPP is a user-space stack, we have to do a few things a bit differently compared to what was
06:42
previously done in Linux. So in order to insert VPP between the host and the network, we grab the host interface, the uplink, and consume it in VPP with the appropriate driver. Then we restore host connectivity by creating a tun interface in the host's root network namespace, so that's the tun tap here. And we replicate everything on that
07:04
interface: the addresses, the routes. So basically we insert ourselves as a bump in the wire on the host, but it works pretty well in that configuration. And that way we restore pod connectivity as before, with tun taps instead of veths.
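As an illustration of that "replicate everything" step, here is a minimal Go sketch using the vishvananda/netlink library to mirror addresses and routes from an uplink onto the restored host interface. The interface names are placeholders, and the real agent does considerably more (link state, MTU, cleanup, error handling).

```go
package main

import (
	"log"

	"github.com/vishvananda/netlink"
)

func main() {
	// Placeholder interface names: the physical uplink that VPP has taken
	// over, and the tun/tap that VPP created back in the host namespace.
	uplink, err := netlink.LinkByName("eth0")
	if err != nil {
		log.Fatal(err)
	}
	hostTap, err := netlink.LinkByName("vpptap0")
	if err != nil {
		log.Fatal(err)
	}

	// Mirror every address of the uplink onto the tap so the host keeps its IPs.
	addrs, err := netlink.AddrList(uplink, netlink.FAMILY_ALL)
	if err != nil {
		log.Fatal(err)
	}
	for _, a := range addrs {
		a.Label = "" // address labels are tied to the original interface name
		if err := netlink.AddrAdd(hostTap, &a); err != nil {
			log.Printf("addr %s: %v", a.IPNet, err)
		}
	}

	// Re-point the uplink's routes at the tap so host traffic now goes through VPP.
	routes, err := netlink.RouteList(uplink, netlink.FAMILY_ALL)
	if err != nil {
		log.Fatal(err)
	}
	for _, r := range routes {
		r.LinkIndex = hostTap.Attrs().Index
		if err := netlink.RouteAdd(&r); err != nil {
			log.Printf("route %v: %v", r.Dst, err)
		}
	}
}
```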
07:24
We create an interface in every pod, and then everything runs normally. The Calico control plane runs normally on the host, and it configures the data plane functions in VPP via the agent. So now we have the green part covered, and all those components run neatly. What we achieve with that is that when
07:45
we create a pod, Kubernetes will call Calico, Calico will call VPP, and we can provide an interface that we fully handle at the network layer, directly in VPP. But for this specific WireGuard application, we need a little bit more than that. We need multiple interfaces, and we
08:04
also potentially have overlapping addresses, so we don't really control where the IPs are going to end up. So for the multiple-interface part, our choice was to go with Multus, which provides the multiplexing,
08:21
and we also chose a dedicated IPAM that we patched, Whereabouts, because it was quite simple to patch, and we brought those two pieces in. So when I say multiple interfaces, what does that exactly entail? The thing is, the typical Kubernetes deployment looks like this.
08:48
So each pod has a single interface, and the CNI provides pod-to-pod connectivity, typically with an encapsulation from node to node. But in our application, we want to differentiate the encrypted traffic from
09:03
the clear-text traffic, so before and after the head-end. But we still want the Kubernetes SDN to operate, we still want the nice things about Kubernetes, so service IPs and everything. So it's not only multiple interfaces, it's really multiple interfaces wired into Kubernetes, so it's
09:22
more like multiple isolated networks. Conceptually, what we needed was the ability to create multiple Kubernetes networks, each network behaving a bit like a standalone cluster, stacked on top of each other. With this, we require networks that provide complete isolation
09:42
from each other, meaning that traffic cannot cross from one network to another without going through the outside world. And so that means that we had to bind the Calico/VPP integration and Multus together to create a model where everybody is aware of that definition of
10:01
networks, have a catalog of isolated networks, specify the way they are going to communicate between nodes via VXLAN encapsulations, and have a way for pods to attach to those networks with annotations, so that in the end Kubernetes is aware of these networks and we can still maintain the SDN part of the logic. The way this works, quickly, is that normally the CNI
10:27
interface will call Calico once per pod; the thing is, Multus will call the CNI, Calico, once per (pod, interface) tuple, and we will in
10:41
turn receive those calls in our agent, and we can map them with annotations and do our magic to provide the logic. Having the IPAM patch also allows us to support multiple IPs and to have different realms where an IP lives and gets allocated from. So from a user's perspective, what we expose is a
11:02
network catalog, where networks are defined as CRDs for now. We are starting a standardization effort to bring that into Kubernetes, but that will probably take time. So right now we kept it simple: we're just specifying a VNI, using VXLAN by default, and passing a range, and we
11:20
also keep a NetworkAttachmentDefinition from Multus with a one-to-one mapping to networks, so that we don't change too many things at once. Then we use those networks in the pod definitions, so we reference them the Multus way. We can reference them as well in services with
11:42
dedicated annotations, and that way we tell our agents to program VPP so that the services apply only in a specific network. Policies work the same way. That also gives pods the ability to tweak the parameters exposed on the interface a bit more, so specifying the number of queues
12:05
we want, the queue depth, and also supporting multiple interface types. So that gives a lot of flexibility: first to get the functionality, the fact that we have multiple interfaces, and then also to size them so that the performance is appropriate for the use case we want to achieve.
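To make this concrete, here is roughly what such definitions could look like, held as YAML inside a small Go program. Only the Multus annotation k8s.v1.cni.cncf.io/networks is a standard, existing convention; the Network CRD group, its fields, and the network annotation keys on the NetworkAttachmentDefinition and Service are placeholders for whatever the actual integration defines.

```go
package main

import "fmt"

// Illustrative only: everything under the example.dev group, the annotation
// key calivpp.example.dev/network, and the images are placeholders.
const manifests = `
# Hypothetical entry in the network catalog: a VNI for the node-to-node
# VXLAN encapsulation and an address range for the dedicated IPAM.
apiVersion: example.dev/v1
kind: Network
metadata:
  name: encrypted-net
spec:
  vni: 4242
  range: 10.123.0.0/16
---
# Multus NetworkAttachmentDefinition mapped one-to-one to that network.
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: encrypted-net
  annotations:
    calivpp.example.dev/network: encrypted-net   # placeholder mapping annotation
spec:
  config: '{"cniVersion": "0.3.1", "type": "calico"}'   # illustrative CNI config
---
# Pod requesting the extra interface "the Multus way".
apiVersion: v1
kind: Pod
metadata:
  name: wg-headend-0
  annotations:
    k8s.v1.cni.cncf.io/networks: encrypted-net
spec:
  containers:
  - name: headend
    image: example.org/wireguard-headend:latest
---
# Service scoped to a single network via a (placeholder) annotation.
apiVersion: v1
kind: Service
metadata:
  name: wg-ingress
  annotations:
    calivpp.example.dev/network: encrypted-net
spec:
  selector:
    app: wg-headend
  ports:
  - port: 51820
    protocol: UDP
`

func main() { fmt.Print(manifests) }
```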
12:20
The last nice feature of this is that, as we have GoBGP support, we can peer those networks with the outside world, if we have a VXLAN fabric and if GoBGP supports it. That part is still a
12:41
big work in progress, and there are a lot of things to get right, but that's the end picture we want to get to. So if we put everything together, we would probably get something that looks like this. Basically, when a user wants to connect to this
13:02
hypothetical VPN and that hypothetical printer, the traffic will get into the cluster via GoBGP peering, so it's going to be attracted to the green network, hit a service IP in that network to get some load balancing across the nodes, and then it's going to be decrypted in a pod
13:25
that then decapsulates the traffic and passes it, for example, to a NAT pod running in user space. Here I put another type of interface that is more performance-oriented. The traffic then exits the cluster on a different VLAN
13:41
peered with the outside world. So some parts still need to be done, but the general internal logic of the cluster already works, and that brings the ability for container network functions to run unmodified, with their multiple interfaces, directly in a somewhat regular
14:05
cluster. So we spoke about improving performance of the network, of the underlying interface, but we can also improve the performance
14:22
with which the applications in the pods consume their own interfaces. The standard way applications consume packets within pods is via the socket APIs. It's really standard, but you have to go through the kernel, and that's a code path that wasn't designed for the
14:42
performance levels of modern apps; that's why things like GSO came up as network stack optimizations. But here, with VPP running, it would be nice to be able to bypass the network stack and pass the packets directly from VPP, without touching the kernel. Fortunately, VPP exposes two different ways to consume those
15:03
interfaces. We'll mostly go into the first one, which is memif, the memory interface. Basically, it's a packet-oriented interface standard relying on shared-memory segments for speed, and it can be leveraged by an application via a
15:20
simple library, so either gomemif in Go, libmemif in C, or DPDK, or even VPP itself, and it provides a really high-speed way of consuming that extra interface in the pod.
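To show why a packet-oriented, shared-memory interface is so much cheaper than a socket read per packet, here is a toy single-producer/single-consumer ring in plain Go. It is only a sketch of the mechanism: real memif negotiates over a UNIX socket, maps shared-memory segments between VPP and the pod, and exchanges descriptors rather than Go slices.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

const ringSize = 1024 // power of two, so indices can simply wrap

// ring is a toy single-producer / single-consumer packet ring in ordinary Go
// memory; memif works on the same principle over a shared-memory segment.
type ring struct {
	slots [ringSize][]byte
	head  atomic.Uint64 // next slot the producer will write
	tail  atomic.Uint64 // next slot the consumer will read
}

// produce enqueues one packet; returns false if the ring is full.
func (r *ring) produce(pkt []byte) bool {
	h, t := r.head.Load(), r.tail.Load()
	if h-t == ringSize {
		return false
	}
	r.slots[h%ringSize] = pkt
	r.head.Store(h + 1) // publish only after the slot is written
	return true
}

// consumeBurst dequeues up to max packets. Note there is no syscall anywhere:
// that is what makes the per-packet cost so low compared to reading a socket.
func (r *ring) consumeBurst(max int) [][]byte {
	h, t := r.head.Load(), r.tail.Load()
	n := int(h - t)
	if n > max {
		n = max
	}
	out := make([][]byte, 0, n)
	for i := 0; i < n; i++ {
		out = append(out, r.slots[(t+uint64(i))%ringSize])
	}
	r.tail.Store(t + uint64(n)) // hand the slots back to the producer
	return out
}

func main() {
	r := &ring{}
	r.produce([]byte("packet-1"))
	r.produce([]byte("packet-2"))
	for _, p := range r.consumeBurst(256) {
		fmt.Printf("consumed %q\n", p)
	}
}
```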
15:40
And the really nice thing about this is that it also brings the connection between the Kubernetes network, the Kubernetes SDN, and the pod into user space, meaning that now that this connection lives in a regular C program, it's easier to leverage CPU optimizations and new features. And that's where the silicon
16:05
re-enters the picture, and the work from Maritika and her team at Intel. They benchmarked this kind of setup, and also introduced an optimization that's coming in the fourth-generation Intel Xeons,
16:24
called the Data Streaming Accelerator (DSA). Basically, it's a way to optimize copies between processes on some CPUs. So what they did is compare the performance we get with a Kubernetes cluster, multiple interfaces, and a simple pod, so not bringing in the whole VPN logic, just
16:46
doing L3 forwarding and seeing how fast things could go between regular kernel interfaces, the tun, the memory interfaces (memif), and the memory
17:01
interfaces leveraging those optimizations in the CPU. That gives these graphs with a lot of numbers in them, but I'll try to sum up quickly what they show. There are two
17:22
MTUs here, 1500 bytes and 9,000 bytes. The performance for the tun interface is in dark blue, blue is the plain memif, and the DSA-optimized memif is in yellow. Basically, what this shows is that the
17:44
performance difference is really huge: throughput with DSA is 2.3 times higher than with regular
18:05
memif for the 1500-byte packets, and with DSA enabled it's 23 times faster than the tun tap; and with a 9,000-byte MTU, you basically get more than 60 times faster with the
18:22
memif that's optimized with DSA. The number that's really interesting is that bar: basically you get a single core doing 100 Gbps with that, and that without too
18:42
much modification of the applications. So basically you just spin up a regular cluster, and if the CPU supports it, you use a regular library and you're able to consume packets at really high speeds without modifying the application too much.
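As a rough sanity check on what "100 Gbps on a single core" means in packets per second, the small program below computes the packet rate needed to saturate 100 Gbps at the two MTUs used in the tests, assuming standard Ethernet header and framing overheads.

```go
package main

import "fmt"

func main() {
	const linkBitsPerSecond = 100e9 // 100 Gbps
	for _, mtu := range []float64{1500, 9000} {
		// 14 bytes Ethernet header + 24 bytes of framing overhead
		// (preamble, start delimiter, FCS, inter-frame gap).
		wireBytes := mtu + 14 + 24
		pps := linkBitsPerSecond / (wireBytes * 8)
		fmt.Printf("MTU %4.0f bytes: ~%.2f Mpps to saturate 100 Gbps\n", mtu, pps/1e6)
	}
}
```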
19:05
There is another graph looking into the scaling with the number of cores, both with small MTUs and large MTUs. Basically, it shows that we can spare cores: the tun tap does not scale very well, memif scales from one to six cores,
19:27
and DSA achieves the same results with two to three fewer cores than its regular memif counterpart. So basically you achieve 100 Gbps,
19:40
which was the limit of the setup, with a single core in the case of large MTUs, and three cores in the case of smaller MTUs. So that's all for the talk. Sorry, I went into a variety of different subjects, because this topic goes in a lot of different
20:03
directions. Basically, that was to give you an overview of the direction we are trying to go: bringing all those pieces together in a framework that allows us to make those CNFs run in a Kubernetes environment. This work is open source. The details of the
20:24
tests that were done are in the following slides. You can find us on GitHub, and there is also an open Slack channel where you can ask questions. And we have a new release coming up, in beta and aiming for GA, that's going to go out soon. So thanks a lot for listening,
20:45
and I'm open for questions if you have any.
21:01
Just one question, for the sake of it: have you ever thought about some shared memory between the different parts, to eliminate the need to copy the packets over? We thought of this, and there are different ways to do that.
21:24
There is VCL, which I haven't spoken about, which is a way of opening sockets directly in VPP. Basically, you do a listen in VPP for TCP, UDP, or a given protocol, like with the socket APIs, and that supports this
21:42
directly. Basically, the data never leaves VPP, and you can pass it between processes without having to copy, because everything stays in VPP in the end. For memif, we don't support that out of the box, but nothing forbids you from spawning two pods,
22:03
making them share a socket, and it's only shared memory, so you can do it directly without having to spin up the whole thing. You could even do that in any cluster, or directly on bare metal. Memif is really a lightweight protocol, so you can do that just with a regular socket.
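One way to realize that "two workloads sharing a memif socket" idea is sketched below: two containers of the same pod share an emptyDir holding the memif control socket (memif then shares its memory regions by passing file descriptors over that socket). For two separate pods on one node, a hostPath or similar shared mount would be needed instead; images and paths are placeholders.

```go
package main

import "fmt"

// Placeholder images and paths; the point is only the shared volume carrying
// the memif control socket.
const sharedMemifPod = `
apiVersion: v1
kind: Pod
metadata:
  name: memif-pair
spec:
  volumes:
  - name: memif-sock
    emptyDir: {}            # shared directory holding the memif control socket
  containers:
  - name: producer
    image: example.org/memif-producer   # memif master side (placeholder)
    volumeMounts:
    - name: memif-sock
      mountPath: /run/memif
  - name: consumer
    image: example.org/memif-consumer   # memif slave side (placeholder)
    volumeMounts:
    - name: memif-sock
      mountPath: /run/memif
`

func main() { fmt.Print(sharedMemifPod) }
```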
22:20
Thank you very much.