
Introducing FreeBSD VPC


Formal Metadata

Title
Introducing FreeBSD VPC
Subtitle
Introduction to Virtual Private Cloud
Alternative Title
Virtualized Networking for Cloud Computing in FreeBSD
Number of Parts
45
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
FreeBSD's use in virtualization workloads has been hampered by its lack of Virtual Private Cloud ("VPC") functionality. While the bhyve(4) hypervisor has proven to be a robust and performant Hardware Virtual Machine ("HVM") monitor, it has lacked the companion networking stack needed to serve as a first-class hypervisor for cloud computing workloads. The FreeBSD vpc(4) subsystem was designed to augment the capabilities of bhyve(4) to support the demands of cloud workloads. After experimenting with and extending the existing network interfaces (e.g. bridge(4), tap(4), ptnetmap(9)), it became clear that we needed to change course and implement a new networking subsystem custom built for virtualization workloads. We settled on implementing vpc(4) by extending the iflib(9) framework, a generalized NIC interface in the FreeBSD kernel. Using iflib(9) we created a suite of network services that allow FreeBSD to be used as a performant and flexible hypervisor for cloud workloads. Depending on configuration and policy, vpc(4) can also be used for desktop applications. We outline the initial performance achieved, both with ptnetmap(9) and iflib(9), the list of services in vpc(4), and how to deploy a cloud environment.
Transcript: English (auto-generated)
So FreeBSD VPC. I gave this talk at AsiaBSDCon. At that time, it was about 85 degrees in the room, and it was the last talk of the day.
And so I went through it reasonably quickly. Fortunately, this is still the last talk of the day, but in an air-conditioned room. So you're going to hopefully stay awake for it. So FreeBSD VPC. What's the state of virtualization in FreeBSD? And why do we care?
There we go. So right now on FreeBSD, the extent of what we've got in terms of virtualization is effectively compute isolation. We have bhyve, it's performant. We have disk.
We can do something with a zvol and pass that through. We can stripe and pass multiple zvols into a guest using lvcreate if you've got Linux guests. We've got really good CPU isolation, actually. I was pleasantly surprised by how good the CPU isolation was, and perfect memory isolation.
So we've got a solid story on the compute isolation side of things. But this leaves a fair amount to be desired. On the compute side of things, if you wanted to go do virtualization, you can put together and jerry-rig up a system that basically looks like this, where you can take two separate guests and you can put them on the same box using bhyve
and it works. And they can be potentially hostile and everything's great. The networking side of things, it's not so clear. On the networking side of things, we have VNET and that's it. The problem with VNET is that on the underlay side of things, on the underlay network, and I'm going to explain what that means in a second,
if you're not familiar with that terminology, you've got very limited options. And that's potentially problematic. So on the underlay side of things, and with a unified underlay network, it's very possible to go and have two customers communicate
to each other. You create two tap interfaces, you plug them into a bridge, and you've got network isolation for customer B over here. Customer A, you haven't done anything with. And there's no good way of keeping customer A's traffic separate from customer B. So on the provider side of things, if you're in the business of providing a hosting solution
or providing a cloud framework or whatever, there's no compelling way to go about doing this. Because if you look at what you're doing, as a cloud provider, you've got an underlay network where you have the IP address of the physical server, then you have VM workloads. There's no way to separate that traffic
and provide the necessary encapsulation easily. So if you've got two servers, however, and you move beyond the single server scenario, you've got customer B potentially spread across two different servers. How do you bridge these two networks and provide network isolation?
So you can do something where you jerry-rig a tap, you plug it into a bridge, and that allows the guest to talk out. But like I said earlier, we've broken network isolation. Tap 50 or customer A, those packets can get over to customer B's traffic, and that's exactly what you don't want.
So fine, we'll improve upon this. Because we have no isolation here, and everything is fully meshed together, bridged together, we can do better. And we can actually provide encapsulation. We can do that using this protocol called VXLAN.
And FreeBSD has a VXLAN interface. It's fantastic in the sense that it's protocol compatible. It works. And for simple things, it gets the job done. So that changes the structure of this. And you can plumb something together, where now instead of plugging bridges together,
you've got a tap to a bridge, bridge to a VXLAN, which you use to tag the traffic with the VNI. And now you've actually got isolation, because when the packet comes back through bridge two, packets from customer A and customer B won't be able to see each other because they've been
encapsulated appropriately. But this kind of isolation has a really big problem. If you're trying to do compute density, the performance is awful. It's atrocious. If you look at, and if you have done anything inside a bridge, you'll know what I'm getting at, or tap for that matter, there's a lot of locking and a lot of copying, and it's piss poor performance.
It's one to two gigabits. It's uninteresting. So tap's slow. Bridge is slow. VXLAN takes the packet, runs it through IP input twice. So you're running things through the net stack twice. VNET allows you to virtualize the underlay network, but that's not the problem that we have.
The problem that we have is we need to virtualize the overlay network, not the underlay network. So the first pass at trying to provide network virtualization was this program that we wrote called UVX Bridge.
It's based off of NetMap. NetMap's the hotness right now on the networking side of things. We really liked its design, the fact that we could iterate on it rapidly in user space. And that's exactly what it did. We were able to go from nothing to something in a matter of weeks. And that, because of developer workflow, was really important. UVX Bridge is all in C, and it worked quite well.
So in the new hotness here with UVX Bridge was we took all these tap interfaces and we plugged them into UVX Bridge. And UVX Bridge provided us with all of the necessary encapsulation and isolation guarantees.
We were able to get 21 gigabits per second, 15 across the wire. We were able to do AES between different hosts so we could actually send traffic to remote AZs or data centers with hostile network in between, and that was fine. But this was just our POC, and it got the job done.
And we were able to pass, on the VXLAN side of things, unencrypted traffic. We were able to pass traffic back and forth between illumos and FreeBSD, and it worked. So 15 gigabits across the wire, it's OK. It was not what we wanted. It got us across the goal that we had at the time. We had an internal benchmark goal of getting to 8 gigabits.
We got to 15. We did that, and it was like three weeks or something like that. So the problem was we really wanted to do better than that, actually. Amazon came through. This was around the time of re:Invent, and they're like, ha ha, 25 gigabits. And we're like, OK, so the bar got raised. And that's what happens in technology.
This was actually kind of exciting, because at that point in time, we knew what the limitations were in this piece of technology. We had to necessarily copy traffic from tap to the NIC. In our case, we were using Chelsio, not em. Sorry, that was supposed to be an inside joke, because em can't do what we needed.
So we needed to figure out a different way of doing this, where we were reducing the number of times that a packet was copied. We needed to reduce the number of context switches. We talked to the netmap author. I forget his name off the top of my head. I met him at H-U-B. Not Luigi.
I'm sorry. Who is it? Vincenzo something? Vincenzo? Yeah. Context switches is actually the problem there. You're going back and forth, context switching between customer traffic and the kernel. And you have to go to a userland process called UVX Bridge, which is actually running, to go and do that. And so you run into context switch limitations.
Who knew? So instead, what we wanted to do is revisit the problem and actually think about this in terms of hardware. We wanted to go to a device-centric model. So we came up with this totally new kernel subsystem. It's based off of iflib. And the abstractions that were there inside of iflib
allowed us to both iterate and develop really quickly. And we ended up with something that roughly looked like this. So it's back to conceptually what we originally wanted. It was very simple. And we were able to understand how to plumb things together and have these pluggable interfaces and do what it was that we wanted.
But we were able to get the performance that we wanted. And in theory, this is kind of conceptually what it looks like. But actually, what we did was we ended up extending it a little bit more to look closer to what we actually would have in the physical world where you have a switch with ports. You plug a port into a NIC. You give a NIC to a guest. And it's like the hip bone's connected to the toe bone,
and you're off to the races. So we were getting really close. And we wrote a new soft switch, a learning switch. You were talking about the VALE switch earlier today in the eBPF talk. This is potentially an interesting chunk of code. It performs pretty well. We started out with VALE as a reference implementation
and optimized it quite a bit. So we wanted to be able to pass traffic, obviously, because being able to do hosting on just a single box doesn't make any sense when you have tens of thousands of servers that you need to go in and run the software on. So we need to go and provide the network isolation so we'd use VXLAN. So in this case, more correctly,
who knows what VXLAN is? About half the room. So VXLAN is a UDP packet and encapsulates the ethernet frame effectively and allows you to virtualize the underlay network or reuse the underlay network.
It's not a very sophisticated protocol. But unlike VLAN tagging, where you're encapsulating within an L2 packet, we're now encapsulating using an L3 packet. So when I talk about the underlay network, we're passing traffic for the encapsulated Ethernet frame that would come from a guest.
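To make the framing concrete, here is a rough sketch of the VXLAN encapsulation being described, based on the standard RFC 7348 layout; it is illustrative only and not the actual vpc(4) structures.

```c
/*
 * Rough sketch of the VXLAN framing described above (RFC 7348), for
 * illustration only -- not the actual vpc(4) structures.  The underlay
 * carries: outer Ethernet | outer IP | UDP (dst 4789) | VXLAN header |
 * the guest's original Ethernet frame.
 */
#include <stdint.h>

struct vxlan_header {
	uint8_t	vx_flags;		/* 0x08 => the VNI field is valid */
	uint8_t	vx_reserved1[3];
	uint8_t	vx_vni[3];		/* 24-bit VXLAN Network Identifier */
	uint8_t	vx_reserved2;
};

/* Guests whose frames carry the same VNI can reach each other;
 * frames tagged with different VNIs are kept apart by the switch. */
static inline uint32_t
vxlan_vni(const struct vxlan_header *vh)
{
	return ((uint32_t)vh->vx_vni[0] << 16 |
	        (uint32_t)vh->vx_vni[1] << 8  |
	        (uint32_t)vh->vx_vni[2]);
}
```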
So everything that's here in the green came from a guest VM. Everything that's in the yellow came and comes from the hypervisor or the host. And then the thing that gets dropped on the wire is that red ethernet frame. So what we wanted to do was be able to pass VXLAN traffic.
And so that if you had a customer guest there passing traffic to a different customer guest, they would be encapsulated not using a VLAN tag, like if we were using VLANs. Instead, we're using VXLAN encapsulations. We pass and tag all the traffic using VNIs or VXLAN IDs. So customer A can only talk to customer A
because their VNIs match. Same thing with customer B. So simple, right? So in order to make this all happen, and the way that we were able to do this was we constructed a series of pseudo-interfaces that were all created using iflib. So the first thing we had to do was we created a switch because we
knew that we were going to have multiple ports that we would then plug NICs into. So the switch is one per customer. We don't have to do any isolation inside of a given switch. The isolation happened just by the virtue of the fact that we had different ports and different NICs plugged into the switch.
And we just assumed that traffic would be able to flow. We created lots of switches. Or we could have one switch per customer. Then we needed to create a new NIC. So this is actually one of the things that was really interesting. This was we didn't want to change the kernel interface for a guest. We wanted to have a guest be able to use a virtio-net guest
driver unmodified, because we were going to potentially take random Linux images. And we didn't want to do in the guest what we had to do with the VALE work that we did earlier, where we had a ptnetmap guest driver for Linux in order to get the top-end performance. We used virtio-net.
And we wanted to basically have something unmodified. Well, on the back side, on the host side of things, we had to implement a new kernel driver. So we did. That's the VNIC. We plug a VNIC into a vpcp; that is the switch port. And that's where we do our filtering. That's the equivalent of what Amazon would call a security group or firewall. It's implemented inside of the switch port right here.
Testing was an important thing. And this actually became interesting. And I think this is all of a sudden like the scope of what FreeBSD VPC is became really interesting. As soon as during the development, we decided that spinning up a VM to go test, it's slow. It's kind of a pain in the ass.
And I really don't want to work in Linux. So we created a different interface that bypasses the host interface and allows us to test things. And that's this VPCI. And that actually became really interesting because then we would go and use VPCI, pass it into a jail, and we could run iperf tests from FreeBSD using the new VPC framework. And we would get VXLAN encapsulated traffic.
So from a testing perspective, this was really neat. Question? That's basically what an epair is. But it's not implemented in terms of epair. Yeah, we got screaming performance out of this. We did, I think, the highest single-stream numbers on a 100 gig NIC,
which is what we were actually playing with. I think we got 94 gigabits a second. So these are real numbers. This was fun. So then, coming out the back end of things, the thing that's doing the actual encapsulation itself is the link interface: we did vpclink for doing the encapsulation, and then
ethlink if we wanted to test without doing VXLAN-tagged traffic. So small detour. Who has written cluster schedulers or knows what cluster schedulers are? A couple of people. Who knows that none of them are written in C?
So if you're writing something and you're a cloud provider and you want to go and distribute work, you have to have a way of being able to issue an API call, and have a scheduler take that API call and, just like a CPU scheduler, go and figure out where in your data center you're going to allocate a unit of work.
Kubernetes, Nomad, Mesos: Mesos is C++, which is close to C. But specifically, Nomad and Kubernetes, they're both written in Go. And if you were in my BoF talk earlier, you kind of heard a little bit of hints and rumblings about why you never want to use cgo. So from an interface design, user interface design, or userland interface design, we explicitly avoided libc.
And we went and spent some cycles trying to figure out how to integrate the kernel element of things, FreeBSD VPC, with the user land utility so we could integrate this with cluster schedulers. So having been around the block, the way that you would normally do this in FreeBSD
is you would just create a new ioctl. Because that's what everybody does. They create a brand new ioctl, and they just cram more things through the system interface. But it's a pain in the ass for basically everybody. Nearly impossible to secure. And it's just this generic dumping ground for input-output in and out of the kernel.
So then we were like, that's probably a poor way of going about this. So we could go do something that is storage-esque, or I don't know where exactly this kind of design primitive came from, or why people copy it. But we could go and do the /dev/whatever interface. And we were thinking about whether or not
that was an appropriate thing. And it was like, well, do we really want to mix and match the network with VFS primitives? You know, devd, who knows? I wish I was actually in Warner's talk earlier. But so we decided that mixing network primitives with VFS primitives was not OK.
So we moved on from that idea. And we decided that we were going to go and create a new set of system calls. So we created something that's very file-descriptor-like on purpose. So you have, and if you've done file I/O, this is going to be somewhat similar, except we're not doing file I/O. We're configuring iflib or VPC interfaces.
So the first thing we did in VPC is you can open an interface or create an interface. Those are different flags to vpc_open. Once you have an open descriptor, then you can go and manipulate and manage it. When you open a descriptor, you assign the capabilities and privileges to that descriptor, which means that from a sudo
or setuid perspective, you actually can build administrative tools around this and not have to worry about securing it. Because it's done inside of, you don't have to do anything different, more correctly. And then the last one is, at the time we were trying to be compatible with Triton, which
is the control plane that Joyent used to have. And so we used UUIDs for everything, which is kind of an interesting choice, because UUIDs in FreeBSD, they exist, but they're not widely used. So what does the vpc_open syscall look like?
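As a rough illustration of the flow described next, here is a hypothetical sketch of the open call; the prototype, constants, and types below are assumptions for illustration, not the real projects/VPC API.

```c
/*
 * Hypothetical sketch of the vpc_open() flow discussed here -- the
 * prototype, constants, and types are illustrative assumptions, not
 * the real projects/VPC API.  Userland generates the UUID-like ID,
 * passes the object type and flags, and gets back a descriptor it can
 * later drive with vpc_ctl() and close() like any other descriptor.
 */
#include <stdint.h>
#include <unistd.h>

typedef struct vpc_id {
	uint8_t	id[16];			/* UUID-like; object type encoded inside */
} vpc_id_t;

int vpc_open(const vpc_id_t *id, uint64_t obj_type, uint64_t flags);

#define	VPC_OBJ_SWITCH	0x01ULL		/* illustrative object type */
#define	VPC_F_CREATE	(1ULL << 0)	/* create rather than look up */
#define	VPC_F_WRITE	(1ULL << 1)	/* descriptor may mutate the object */

static int
create_switch(const vpc_id_t *id)
{
	/* capabilities and privileges are bound to the descriptor at open time */
	int fd = vpc_open(id, VPC_OBJ_SWITCH, VPC_F_CREATE | VPC_F_WRITE);

	return (fd);			/* caller close()es it when done */
}
```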
So you open up an ID, you generate an ID in advance in userland, because you're the one telling the kernel what this identifier is. The cluster scheduler knows where everything is, in theory, going to be laid out. So you generate the ID in advance, you pass in the object type along with the ID, and then whatever miscellaneous flags you have,
you pass those to it, just like you would anything else, and you've got these privilege separation semantics. One of the things that was interesting also, that we deliberately did on purpose because we're avoiding libc, was we encoded version information into the syscall. So in order to avoid the possibility of having to use libc in the future, we
took on ownership of versioning our own API. So we did that. And that actually ended up being really convenient, and hopefully will be reasonably future-proof. So what's a VPC ID? Because it's UUID-like. It's not exactly a UUID. We actually encoded the object type in the UUID.
So if you've ever done anything with UUIDs, they're opaque, random garbage, right? So from a usability perspective, we decided that was not OK. And we decided to hijack various bits inside of the UUID. The node field back here provides a default for the MAC address.
So we decided to reuse some of this stuff. You can override the MAC address from a UUID or from a VPC ID. But it turned out that there's some convenience to having duplicated information here and there. And then we also have the object type. This means that from an administrator and from a tooling perspective, you can write a rando tool that will take a VPC ID
and it will encode some information and it'll tell you what exactly is there. So we only hijacked the last couple of fields. And that was all that was necessary for the time being, knowing that if we needed to, we could potentially take more bits and steal more bits if we needed to.
I'm not going to go into any of that. I just said it. Because we took ownership of the interface, we also, when I said that we were going to create a new handle and get something out of this, where am I? Version number.
We have the ability. We padded this so that we can potentially expand the version number. We think that we only need 16 versions. If you've ever done API design, if you're getting close to 16, you probably don't know what you're doing. And you need to go and re-evaluate a bunch of things. Most of the time, hopefully, we'll get to version two or maybe version three. And that'll be it. But we wanted to take ownership of the future effectively.
And like I said, avoid having to do anything with libc. There we go. So object types. We took these particular constants, and we cooked them into both the VPC ID,
and we passed it in the open. This ended up being convenient because it ended up being really important as we were doing development, because this acted as basically a checksum for us. Anyway, the gist of it is we have 16 versions available. We can have 255 object types, and if we needed to go and chew into our padding,
we can get up to 4,080 objects. And that should be enough for everybody, right? There we go. So flags, they're nothing sophisticated here. It's just a bit field. The other thing that we did end up doing after I put this presentation together
was we began to add extended privilege capabilities for the object type so that we could cook that into the flags. I'm not going to get into that, but vpc_ctl. So this is actually the meat and potatoes. You do vpc_open, you get a descriptor back. The more interesting part is what happens in vpc_ctl.
This is how you interact with this. This is basically an ioctl-like syscall that you can do get and set operations with. Yeah, so there's a bunch of different operations.
Each set of operations is keyed to a particular object type. We did this so that we didn't have to go back. So if you have your VPC, it's just dying. That's what's going on. So the VPC op, that is per object type. The operations are tied to the object
that you're interacting on, which was determined at the time that you opened up the descriptor. So the number of operations is specific to the object type. In the case of a VPC switch, there's only a certain number of operations that we needed to do. We need to be able to add, update, change the link state, do some of the basics there.
So we could extend this if we needed to. There's no shortage of available bits. On the port side of things, we need to potentially go and set the VLAN tag. So if you wanted to have different subnets within the same customer, the way you would do that in a physical network is you would have a different subnet, which would probably be on a different VLAN.
So you can set the VLAN tags for a given customer. You push that information into the port. So you have to open up the VPC port on a switch and say that all traffic coming from this port is now a part of a different VLAN. On the NIC side of things, from a performance perspective,
which is where we originally started, the thing that was most important for us to be able to set and get was the number of queues. And at the end of the day, we had effectively an arbitrarily complex configuration that we needed to decompose and figure out how to interact with from both user land and then how to organize it inside of the kernel.
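A hedged sketch of how those get/set operations might look against such a descriptor; the prototype, op encoding, and constant names below are made up for illustration.

```c
/*
 * Hypothetical sketch of vpc_ctl() usage matching the description
 * above -- the prototype, op numbers, and constant names are made up
 * for illustration.  Ops are get/set pairs scoped to the object type
 * the descriptor was opened as: switch ops set the VNI, port ops set
 * the VLAN tag, VNIC ops set things like the queue count.
 */
#include <stddef.h>
#include <stdint.h>

int vpc_ctl(int vpcd, uint64_t op, size_t inlen, const void *in,
    size_t *outlen, void *out);		/* assumed prototype */

#define	VPC_SW_OP_VNI_SET	0x0101ULL	/* illustrative op numbers */
#define	VPC_PORT_OP_VTAG_SET	0x0201ULL
#define	VPC_NIC_OP_NQUEUES_SET	0x0301ULL

static int
configure(int swfd, int portfd, int nicfd)
{
	uint32_t vni = 123;	/* every frame on this switch rides VNI 123 */
	uint16_t vtag = 456;	/* this port's subnet is VLAN 456 */
	uint16_t nqueues = 8;	/* TX/RX queues mapped through to the guest */

	if (vpc_ctl(swfd, VPC_SW_OP_VNI_SET, sizeof(vni), &vni, NULL, NULL) != 0)
		return (-1);
	if (vpc_ctl(portfd, VPC_PORT_OP_VTAG_SET, sizeof(vtag), &vtag, NULL, NULL) != 0)
		return (-1);
	return (vpc_ctl(nicfd, VPC_NIC_OP_NQUEUES_SET, sizeof(nqueues),
	    &nqueues, NULL, NULL));
}
```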
Using these two syscalls, we were basically able to conquer that and put together a compelling interface that we could then integrate into either a cluster schedule or a user land tooling. But then really at the end of the day, because we did do this in terms of iflib,
performance was really nice because all of this was just a thin wrap around the hardware capabilities that were there. So at the end of the day, we had something that because we did this with both a stable API, KPI, we had something that we could then foist into Go, which provided us with a tool chain that allowed
us to do development on a Mac. Or we had some people doing Windows development, and they could just work together on different operating systems without having to have a physical FreeBSD box nearby. And this was really novel. This allowed us to iterate at a nice clip.
And if you've ever had to do cross compilation, this sucks. And it was easy for us to do in Go.
So ELI5, this is a Reddit term, explain like I'm five. VPC had a few assumptions. The first one was that the host was going to provide multiple TX/RX queues. We'd be able to plumb those through to a guest. The guest is going to be running Ubuntu or CentOS. Those were our primary two Linux operating systems. If you were outside of that, who knows?
In order to make this entire system work, the assumptions that we had were that all physical servers were able to route traffic. So on the underlay, we had a fully meshed network in the sense that we could pass traffic between any other underlay host. And that, in this particular case in the next diagram,
all the hosts are on the same subnet. We did not tackle routing to begin with, but we do have a VPC router construct that is on its way at some point in the future. So in this case, this is what I was showing you earlier. We have ports, NICs, switches, and then ethlink if we wanted to have untagged traffic.
So in order to create this, what's it look like from a userland perspective? So you have your awful UUID that is totally unusable, except for it's got your type information. 01, you see that?
I would, this is the laser pointer. There are no batteries. So the 01 there is the type, and then the FA, that's the default MAC address. Anyway, create a switch, and I say all traffic on the switch is part of a VNI, or has a particular VNI, which is the equivalent of a network domain.
So we created the switch. Then we went and created a series of NICs. So we said create the NIC, take the switch port, add the NIC to the switch port. You take a switch port, and you connect an interface, which is the VNIC that you created up there. So we've basically created a VNIC,
plugged it into a port, into a switch. This is where we are right now. Then we wanted to go and do the exact same thing for guest three in this case. So we did that. The only thing that's different between any of this stuff was we changed some of the IDs up here.
So the MAC address was just a little different. Other than that, it's all the same. That's the important thing. Same VNI and same VLAN ID. In order to drain packets out of the switch and basically drop those guest frames out to the underlay,
we went and created an uplink port. So switches have a designated port on the switch that's called an uplink port. And it's basically a magic default port for switches. And then you specify the ethlink or the drain ID. And there you go.
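Putting the pieces together, here is that same plumbing sequence sketched against the hypothetical vpc_open()/vpc_ctl() declarations from the earlier sketches; every constant name here is still illustrative, not the real interface.

```c
/*
 * The plumbing just walked through, sketched with the hypothetical
 * vpc_open()/vpc_ctl() declarations from the earlier examples; the
 * constants (VPC_OBJ_PORT, VPC_OBJ_VMNIC, VPC_PORT_OP_CONNECT,
 * VPC_SW_OP_UPLINK_SET) are illustrative, not the real interface.
 */
#define	VPC_OBJ_PORT		0x02ULL
#define	VPC_OBJ_VMNIC		0x03ULL
#define	VPC_PORT_OP_CONNECT	0x0202ULL
#define	VPC_SW_OP_UPLINK_SET	0x0102ULL

static void
plumb_guest(const vpc_id_t *sw_id, const vpc_id_t *port_id,
    const vpc_id_t *nic_id, const vpc_id_t *uplink_id)
{
	int swfd = vpc_open(sw_id, VPC_OBJ_SWITCH, VPC_F_CREATE | VPC_F_WRITE);
	int portfd = vpc_open(port_id, VPC_OBJ_PORT, VPC_F_CREATE | VPC_F_WRITE);
	int nicfd = vpc_open(nic_id, VPC_OBJ_VMNIC, VPC_F_CREATE | VPC_F_WRITE);

	/* all traffic on this switch belongs to one customer's VNI */
	uint32_t vni = 123;
	vpc_ctl(swfd, VPC_SW_OP_VNI_SET, sizeof(vni), &vni, NULL, NULL);

	/* connect the VNIC to the switch port; the guest just sees virtio-net */
	vpc_ctl(portfd, VPC_PORT_OP_CONNECT, sizeof(*nic_id), nic_id, NULL, NULL);

	/* designate the uplink port that drains tagged frames to ethlink/vpclink */
	vpc_ctl(swfd, VPC_SW_OP_UPLINK_SET, sizeof(*uplink_id), uplink_id,
	    NULL, NULL);
	(void)nicfd;
}
```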
But obviously, that didn't tag the traffic. So how do you do that? How do you figure out? How do you get from customer one guest A? How do you figure out that on the underlay network, the traffic needs to go to 1065.162? This is the hard part about VXLAN, is if you're customer one guest A, how do you do ARP?
It's not possible to do ARP. You can't broadcast to everybody. If you've got 20,000 physical servers in a data center, you can't broadcast to all other physical nodes, hoping that all servers with VNI 123 and VLAN tag 456
are going to somehow respond. That's an atrocious problem. You can't do that. So what do you do? You've got no multicast. I don't know of any cloud provider, actually, that has multicast. That's out of scope. And so we don't have multicast, we don't have broadcast.
How do you figure this out? Well, in order to do that, you have to capture broadcasts coming out of the switch. You have to make an up-call out of the kernel that says, I need to go and do what's called VTEP, or VXLAN tunnel endpoint, resolution. I need to figure out where the IP address and the MAC address
and all the necessary bits are to get guest traffic from the overlay network on host A to the overlay network on host B. So we added a kqueue interface and a kqueue filter, so that we could just listen like we would on any other kqueue interface.
We put a filter for VPC. We've just passed the raw packet up. We process and parse the packet in user space using Go. Google has released a bunch of very nice packet processing libraries in user space. So we use those. At that point in time, we rip apart the packet.
We figure out what it is that we need to do. We use vpc_ctl in order to push the response back down toward the guest. And then that reinjects it back into the switch and sends the response packet back to the guest. So now, for the guest, we've basically completed ARP here. And then the switch holds onto that so that we don't have
to perform that up call again. One of the things that we do that's a little tricky is we hold onto the MAC address permanently because we make MAC addresses immutable, minus one detail that I probably won't get into in this talk. But that's an important element. Where are we now?
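The up-call path just described, sketched as a userland loop in C; the VPC-specific kqueue filter and the read/push-down details are assumptions based on the talk, and the real resolver daemon is written in Go.

```c
/*
 * Hedged sketch of the VTEP-resolution up-call loop described above.
 * The VPC-specific kqueue filter and the read/push-down details are
 * assumptions based on the talk; the real resolver daemon is written
 * in Go, this is just the shape of the loop.
 */
#include <sys/types.h>
#include <sys/event.h>

static void
vtep_resolver_loop(int vpcfd)
{
	struct kevent kev, ev;
	int kq = kqueue();

	/* listen for broadcast/ARP frames the in-kernel switch cannot resolve */
	EV_SET(&kev, vpcfd, EVFILT_READ, EV_ADD | EV_ENABLE, 0, 0, NULL);
	kevent(kq, &kev, 1, NULL, 0, NULL);

	for (;;) {
		if (kevent(kq, NULL, 0, &ev, 1, NULL) < 1)
			continue;
		/*
		 * 1. Read the raw guest frame off the descriptor.
		 * 2. Look up (VNI, VLAN, guest MAC) in the metadata database
		 *    to find the underlay address of the destination host.
		 * 3. Push the resolved entry back down (e.g. via vpc_ctl()),
		 *    which reinjects the ARP reply and caches the mapping
		 *    in the switch so later packets never leave the kernel.
		 */
	}
}
```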
So go ahead. No, we don't have. So yes, I guess fair enough. So in this particular case, it's a good question. In this particular case, what happens is in order
to get traffic from the two hypervisors on these two hosts, we create the guest NIC. We pass it in. And in the NIC, the switch, everything along the way, we've basically plumbed this, I don't want to call circuit, but I'm going to use the word circuit. We've plumbed this circuit so that everything coming out from guest one customer A all the way through
is encapsulated with VNI 123 and VTAG 456. The ARP discoveries? They don't though. That was the point. That was the point. So I missed the detail, I guess, that I need to explain. So the switch, VPC switch zero plugged into guest one customer
A does a knote write that notifies a userland daemon that says: we've got a packet here, a broadcast packet, for a guest coming from VNI 123, VLAN 456, please satisfy this ARP request.
And so there is a program running on this top server on dot 161 that's running in the host that receives this up call, does a look up in our metadata database, and that's what has the information and tells us that the VM, let's see, that guest one on server dot 162, right?
Or this, I'm sorry, I've been saying something wrong here. So 1065 5161, that is the IP address in the overlay. That's the IP address of the guest, not the host.
Yes. Yes. So there's never a direct information, they just ask the database. Correct. Only for the first packet. Yeah, sorry. So that was a very important detail. I did gloss over that. I apologize.
It's VXLAN. So everything coming from guest one customer A on the host one to guest one customer A on host two, that is this stream of UDP packets. It doesn't.
So we do the exact same thing. So if you're sending a packet from guest one customer A, from red to red, on the egress, we do the look up on the top host. Yeah, go ahead. Thank you.
Yes, that's right. Sorry, there was a couple of assumptions that I skipped over. I apologize. Yeah, good. So it's not compatible with any other VXLAN implementation?
So this is kind of the semi-dirty secret. So the on-the-wire protocol is compatible. The mechanism for doing VTEP is implementation specific.
So we can interoperate and pass traffic between, in this case, FreeBSD and illumos, and that works. Now, what, for instance, illumos would do in this case, they use a totally different metadata lookup mechanism than what we did in FreeBSD. However, the traffic still passes. The reverse traffic, let's assume,
going back to the original example, where these are both FreeBSD hosts, in the return path, so if you send a ping, in order for the echo reply to come back, on the first packet, the switch on the second host has to do the same metadata lookup on its side of things in order to complete a symmetric route for the flow.
It's a Postgres lookup. It's a single select. And because we have the VNI, the host, the MAC address,
like the everything, it selects to a single row. We navigate a B-tree, and then we return a packet, and then we cache it permanently. So we don't do this per packet. We only do this for the first flow between two separate hosts. Yeah, yeah, go ahead.
You do. That's a more complicated thing. What we do is that we actually treat this closer to ARP, and we just expire the entries. We just age them out. It's hard for us to go and do this large distributed cache invalidation. There is something fancy we have to do for IP address moving, which I'm going to try and avoid if I can.
We have something that's kind of called Cloud ARP. We're not sure what the hell we're going to call it. So the thing, though, that is provider specific or implementation specific is this thing called VTEP. When you're not doing multicast or anything else
like that, the VTEP protocol is provider specific; the on-the-wire protocol is well-defined. I get that you're not doing it because you don't play it. I have zero interest in going down the multicast path anywhere, and I wouldn't wish that upon anyone.
Anybody interested in stable multicast? So, things that are outstanding: firewalling. That's going into the VPC port. Routing: how do you go and pass traffic between separate subnets? How do you go from a private IP address out to the internet? The work is not complete. As somebody that I work with told me,
there's a small speed bump, and we're working through it. So yeah. And integrating bhyve, which is the VM management and hardware isolation element of things, with the network isolation. So it's a single tool. What does the actual tool itself look like?
This is where we're going with this. So we're going to wrap bhyve with the VPC commands. So you just have VPC VM, because there's tight coupling between a VM and the network context that it's running inside of. So interesting.
Can I put a pin in that real quick? I'm almost done, and then we'll come back to the slides. Yeah, we'll go off the rails here shortly. True story. We will. I watch. Just watch. So one of the things that was really interesting, who has ever used Packer before, or VMware on your laptop? Everybody.
So when we were doing this, we were like, this would be a great mechanism to go and have VMware on a laptop, effectively. And when I say VMware, I mean I need on a laptop to go and spin up a VM, and I don't want to have a priori knowledge of the IP address space that I'm working in. That sucks. What I really want to do is I want to go and create an IP address, a VRF, or an IP address
space. I want to put an IP address or DHCP on it and just be able to spin up a VM and not have to configure anything. And you can actually do this now. This was one of the novel things that came out of VPC. As soon as we started talking about NAT, we were like, we can actually have something that's usable on the laptop to go and be able to spin up random sandboxed
environments and IP address spaces and provide program isolation. Right now, if you're using Vagrant or Packer, there's actually a hard dependency for both of those tools on pf and dnsmasq. And that's fine, but I don't want to live in that world. I definitely don't want to live in a world where I'm depending on pf. I will say that.
OpenBSD? So the kernel work is in FreeBSD; it's on the projects/VPC branch. The kernel bits for Go, so all of the Go interfaces, you'll see go.freebsd.org. I'm going to be talking to clusteradm again this trip, and I'm going to get that finished and pushed out
so that we can begin to vendor and merge and integrate some of this stuff and begin using it because this framework is being used by iflib to configure iflib devices and interfaces. Having seen the rate at which we were able to get things done with iflib, I'm a huge proponent of its use. OK, now I can go back to questions.
I was like, I'm almost there. So your question was packet checksum offloading? So we're not using any of the traditional software that was in FreeBSD. This is completely net new. So where this says em0, this is a Chelsio NIC,
we used a single PF for everything, and I didn't talk about the performance. So we ended up doing ethlink traffic.
We were doing 86 gigabits of iperf traffic from a single VM. On a single TCP stream, we were doing 56 gigabits, I think, on a single TCP stream with 1,500 MTU packets.
This was really good performance. When we went and cranked up the number of TCP streams, we were doing 86 gigabits. This was three times faster than Linux, using Mellanox and VFs and hardware offload. When we did VXLAN encapsulation using VPC link,
and we were VXLAN encapsulating here, we were doing 67 gigabits, and we screwed things up a little. And so that number is actually not right. At that point, we were doing 67 gigabits of encapsulated traffic, but we could have done more. The reason we didn't do more is because we didn't do the pacing right. So we were actually overrunning the physical switch here,
and it was dropping packets. And if we would have done a better job of pacing at the time, then that number would have been higher and probably much closer to the 86 gigabits that we were getting with just the VLAN tag and capped traffic. So this was phenomenal. And there's nothing that was preventing us from doing this over LACP.
Hold on, real quick. Yeah, yeah, sorry. We did, so we did do 1450 if we needed to, but for some of our environments,
we also did 1526 here. So we could do 1500 to the guest without fragmentation. I don't want to call them jumbo frames, but they're not 1500. Right, they're not 9K was the important part, right?
They don't actually, not until, like, it depends on your switch vendor, not until you hit like 1900 bytes, and then there's this huge jump to like a seven or 8K frame size. And that's because switch vendors have long had to deal with things like QinQ and other miscellaneous traffic that slightly bumps the size of the frame to be a little bit larger than 1500.
So that's not out of this world. So anyway, you can run it with a lot of people outside of something smaller. Yep, yep. All you have to do is change the guest here, the guest MTU size, to be 1450. If you did want to do full 1500 for the guest and you wanted to do IPv6 here, then you set the MTU on the underlay to 1576 or 1574.
I forget the number, but yeah, it works. Yeah, if you're a guest traffic here, or guest, it's how do you get your traffic out to the internet?
Like we did try, so we were able to saturate and it was less than a percent difference. We were really impressed by the performance that we were able to get, and jumbo frames didn't buy us anything. So, question? I have one more question.
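For reference, the arithmetic behind the MTU numbers tossed around above, assuming standard VXLAN framing; the exact figures quoted in the talk depend on what gets counted as part of the header.

```c
/*
 * Back-of-the-envelope VXLAN overhead behind the MTU discussion above,
 * assuming standard framing; the exact figures quoted in the talk vary
 * with what gets counted (inner Ethernet header, IPv4 vs IPv6 underlay).
 */
enum {
	INNER_MTU  = 1500,	/* what the guest sees */
	INNER_ETH  = 14,	/* guest Ethernet header */
	VXLAN_HDR  = 8,
	OUTER_UDP  = 8,
	OUTER_IPV4 = 20,	/* 40 for an IPv6 underlay */

	/* underlay IP MTU needed to carry a full 1500-byte guest packet */
	UNDERLAY_MTU_V4 = INNER_MTU + INNER_ETH + VXLAN_HDR + OUTER_UDP + OUTER_IPV4,	/* 1550 */
	UNDERLAY_MTU_V6 = UNDERLAY_MTU_V4 + 20,						/* 1570 */

	/* or leave the underlay at 1500 and clamp the guest MTU instead */
	GUEST_MTU_CLAMPED = INNER_MTU - (INNER_ETH + VXLAN_HDR + OUTER_UDP + OUTER_IPV4),	/* 1450 */
};
```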
You don't have a layer of VLANs. If I've got, if these customers here are hostile, I can't have them on the same VLAN. Yeah, you can't have them on the same VLAN, but you can. But I'm gonna run out of VLANs. I've only got 4096 VLANs. And VXLAN is a, what was that?
It's number of customers and the number of subnets per customer. So VXLAN is a 24-bit number, so that's 16 million. And then I throw a VLAN tag on top of that, and I've got another 4096 per customer. So I can have 4096 subnets per guest,
or per customer I mean, right? And then I can have 16 million customers. Okay, so it's not about the size of the data center, it's all about customers. Yeah, and customer configuration complexity. I don't care how many subnets really you have, because I will spill you over to a different VNI
if I have to. But we have account limits, like this is why you have to go talk to Amazon and get them to raise your account limits. What is this? You have the external 24-bit. So, yes, so the traffic coming into the guest is a normal 1500 MTU.
When it goes through the VPC switch, or it's actually the VPC port here, just one. There's one VLAN tag, and then there's one VXLAN tag. Okay, so the customer still gets all the normal VLAN ports. Yep, we don't change that at all. We take the entire packet and pass it through, right?
Now I get it, yeah, yeah, yeah, I get it. Can you pass through all four VLANs and all that kind of stuff? On the underlay, like the frame, the IP, the header, and inside of here, in the Ethernet header, we can put a normal 802.1Q tag, right?
Doesn't matter. And whatever our customer's doing on their encapsulated side, that's their business. I don't care if they're spewing garbage. I'll bill them. Self-correcting problem. Next question.
Yeah, good question. So, queues, NIC queues. So, we were using, in our case, we were using Chelsio. We had 1024 queues, I think, available. And what we were doing is we took eight queues, and we figured out that this is actually about the sweet spot for Linux, because, well, Linux.
We pass eight NIC queues through to each guest. And we would actually map the queues straight from the NIC to the guest. We didn't use a VF or anything like that, but we would use eight queues. And that worked out okay. We kept it as a configurable option inside of the VNIC. We did make the interfaces,
I kinda glossed over it, but I think I wrote it up on the slide: we made VNICs immutable, so once you set the number of queues, we didn't wanna have to deal with, like, notification of a guest or bouncing of a guest for a dependency change. We just forced the recreation of that. But eight by default works pretty well. And we just mapped those through.
Can I answer your question? Okay, other questions? Yes. So, we would push that into the VPC switch port
that was here. Yes, but we were aware of this. We didn't have to do programming in the physical underlay network, right? That was one of the important things is, this gave us, we took all of the control and configuration that would be necessary to push into hardware, and we offloaded it into software.
Yeah. Yeah, so we actually supported, this is an interesting point. So we actually supported VLAN tagged and un-VLAN tagged traffic coming in at guests,
because we wanted to support routing of guests, like the guests to be able to bridge their own networks. Yeah, so we allowed for that, but we did have a mask in here so that we could actually, like on the VPC port, just like you would have on a normal physical switch, you can say like, we're only allowed VLANs 10 and 11, and you tried passing something in VLAN 11,
and that's still on the overlay. It's not underlay here. Nothing customer facing could influence the operational characteristics of the underlay network itself. Yeah, next. Yeah.
So this is, so speaking of, going back to when I was talking about NAT and firewalling, I'm just gonna hand-wave with this. So what you can do is you can actually pass in a VNIC, and instead of em0, we actually have to create a second interface here with a different MAC address,
and that has the underlay IP address, and then we can create a VNIC that we plug into the cloned interface, and that would allow for public IP addresses potentially to be mapped directly to a guest, or for a provider-managed service to go from a public IP address, NATed by the host, and then passed through as what Amazon would call an ENI,
or, not an ENI, but an EIP. I totally didn't answer your question. I wanted exactly this question to come to us. How do we provide the customer access to the outside? Yeah. I mean, my use case would be different. I have a physical FreeBSD machine on which I would like to terminate the VNIC's VLAN,
like without any VMs on it. So that would be just a router process, like a router, and that's not finished. So you would take an unencapsulated frame that would hit the VPC switch. You would run it through the ethlink. It would hit a router,
and then it would drop an encapsulated frame, or an unencapsulated frame if necessary, to be able to talk to another host. It's really like we deliberately built it so that it was a matter of configuration and plumbing things in differently. iflib is basically a bunch of handler callbacks,
so that it's semi-arbitrary how you want to configure this. This, however, is like the common case. There's, you know, if you need to do portability, or you need to integrate with an existing infrastructure where you're getting untagged frames that's a part of a given subnet, that's a supported config. Go ahead.
Yeah. In theory, you should be able to. You can't, because you can't use bridge, if_bridge, because if_bridge doesn't play by nice rules in terms of performance. So you would use a VPC switch, but yes.
It does show up, if you do if config, you can see all of these VPC interfaces. It's a really big list if you get carried away. Yeah, so there is IP checking. So if you, when you put an IP address on a VM NIC,
inside of the guest, you have, like, I didn't get to DHCP, and please don't ask about it, but the guest will have an IP address, and then that's actually enforced at the port as well. Don't ask. Life's better that way. For DHCP, you have to do something special,
just like you capture broadcast for V4 ARP, or V6 neighbor discovery. We also have to capture DHCP. And then we up call that through the same interface that I was talking about earlier.
Now, we didn't get to DHCP yet, which is why I'm saying don't ask about it. There's only so many hours in the day. But yeah, we didn't get to that, but that's what effectively would have to happen. There's some, yeah.
Ask me afterwards about ARP. Or, not ARP, but moving IP addresses. How do you do that? So that's what we did to begin with. That was one of the things that we did do, is we captured everything and we had a DHCP listener
that masked out and prevented broadcast traffic from hitting the underlay, and then we had a sidecar Go process that was actually acting as our DHCP server.
Yeah. We could. Yeah, we could have done that. We already had a mechanism for parsing and ripping open IP addresses, and we already had a knote notification in order to do the up-call, so yeah. 6.01, not a bad idea,
because we did do something very similar in the past. There's a little bit of complication there when you're bridging between the encapsulated network and the unencapsulated network, but yeah. Authentication. The Postgres piece that actually does the VTEP.
Is that also out there, or no? Actually, the schema design is, yeah. Okay. Yeah, it's there. The story is, it's still proof of concept, and once you finish, it gets used, or is... That is... Find me later. Find me later.
What about the people that, what about outside of your work, the people that are dying to use it, or that are very interested, what... Very interested in seeing this move forward, and yeah, we should talk about that. There were a few other hands that were up a second ago, but this is a complicated story.
Other questions? I get the feeling you know some of this. I appreciate the troll, sir. Yeah, so the work is interesting.
It's moving forward. We did do a number of things. We went and created a new mbuf type, mvec, which we need to go and unify. I apologize, I didn't get back to you on email about that actually, now that I remember, and yeah, there's a substantial body of work. This was north of 100,000 lines of kernel code that showed up over three months.
Huge thank you, infinite thank you to Matt Macy for making a lot of this happen, and the community as well. There were a number of things that went into this; Navdeep in particular was helpful along the way. I know I'm gonna forget others, so yeah. With that, sold.