
Weave Net, an Open Source Container Network


Formal Metadata

Title
Weave Net, an Open Source Container Network
Subtitle
Five years with no central point of control
Number of Parts
490
License
CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
A tour of the internals of Weave Net, one of the most popular container networks: design challenges and lessons learned from five years in the wild. Including Kubernetes integration and how CNI was born. Weave Net is written in Go, using many Linux kernel features such as veths, bridges and iptables. Aimed at developers rather than network engineers, Weave Net tries to be self-configuring and find the best available transport between nodes. The control plane operates via gossip, with no central point of control.
Transcript: English (auto-generated)
Well, thank you for coming. My name is Bryan Boreham. I work for a company called Weaveworks. So let me just check who's in the room. Who had heard of WeaveNet before they read the schedule?
About half the room, that's OK. And who identifies as like a kernel developer, or a device driver or DPDK developer, or... oh, they've all left. OK, that's great. Because, as I put on this slide, I am not a networking expert.
I'm like a programmer, and I've been looking after this project for five years. It's been downloaded 250 million times; it's been starred 5,000 times. There are certain things to be proud of.
But fundamentally, I don't think I did anything clever. So I'm kind of glad that all the clever people have left the room. I put my smiling face up somewhat because I'm going to put a bunch of people up.
This is a talk somewhat about the technology and somewhat about the people and the history of this project. So hopefully that's interesting. So what is WeaveNet? It's a container network, and I'll talk about that in a minute.
The primary thing we were aiming for is that it's easy to install, it just works. It runs anywhere, and there's a little asterisk because we mean anywhere that is Linux. But Windows runs Linux now.
We have tried this. You can actually run WeaveNet on WSL, so nearly everywhere. It's open source, it's Apache licensed. We never made an enterprise version. So enjoy.
Yeah, what is a container network? We had one definition from Justin Garrison, who works for Disney. I'm an admirer of his work, and generally the work of Disney.
It's good stuff. But he said there's no such thing as container networking. So that was a bummer, because I've been working on it for five years. But actually it turns out he uses WeaveNet, and it just works.
So yes. So more seriously, what is a container network? The point of containers, or one point at least, is isolation. Through namespaces and kernel features, one thing sort of believes it's completely separate from another thing, including the network.
They have separate network namespaces. So now how do these things talk to each other? Well, whatever the answer to that question is, that's a container network. That's the definition I'm going to work with. Okay. Let's go back.
Okay. Let's go here. Conceptually, what does this mean? So I'm going to draw, I'm going to put up a large number of diagrams that look a little bit like this. So the meaning of the shapes, the big darker blue shape is a machine, whether it's bare metal or VM or something like that.
That's your kind of node in the network. And the light blue blobs are containers. Okay. So sometimes containers are talking to each other on the same machine. Sometimes they're talking to each other on different machines.
And by and large, we're going to have lots of these things, and they're all going to be talking at once in different amounts. Okay. So that's the kind of theoretical high-level model that we want to, that's what we want to do. Let's go back five years, five and a half years. This smiling face, Mr. Sackman, wrote the first version of what became WeaveNet.
He came out of RabbitMQ; in fact the founders of Weaveworks all came from RabbitMQ. So here's an Erlang programmer, and a lot of the code, although it's all written in Go, has quite a strong kind of Erlang flavor to it, which is kind of cool if you want to think about that. 3,400 lines was the first commit; spoiler, we're at like 30,000 lines now, so it's grown a bit.
Anyway, fundamentally what we do in that version is we put a bridge, a Linux bridge, on each machine, we connect all the containers on one machine to that bridge, and then we tunnel the packets from one bridge to another bridge. So that's the conceptual model taken to an actual implementation. Let's take that down a layer further.
So specifically we set up a bridge, and for each container we set up a veth, a virtual ethernet device: one end is inside the container namespace, the other end is attached to the bridge in the host, and we listen on that bridge using pcap. Whoa, the room went quiet.
No, seriously, we tested three different ways to do it, like tap devices and whatever, and as Go stood five years ago, pcap worked best. So there it is. So if you've got two containers on the same host, they're both attached to the same bridge, they'll just talk to each other over the bridge, and WeaveNet doesn't get involved.
If you have two containers on different machines, then we're going to pick up those packets, we're going to put each one in another UDP packet, it's kind of homegrown encapsulation, and we're going to send that over the network, and on the other side deliver it to the bridge again via pcap packet injection, and it's going to end up at its destination.
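To make that concrete, here is a minimal, hypothetical Go sketch of the slow-path idea: a captured Ethernet frame gets wrapped in a UDP packet and sent to a peer, which would inject it into its own bridge. The address, port and one-byte header are made up for illustration; the real encapsulation carries more metadata than this.

```go
// Package sleeve is an illustrative sketch, not Weave Net's actual framing.
package sleeve

import "net"

// forwardFrame wraps a captured Ethernet frame in a UDP packet and sends it
// to a peer, which would inject it into its own bridge on arrival.
func forwardFrame(frame []byte, peerAddr string) error {
	conn, err := net.Dial("udp", peerAddr) // e.g. "192.0.2.2:6784" (invented)
	if err != nil {
		return err
	}
	defer conn.Close()
	packet := append([]byte{0x01}, frame...) // made-up one-byte header, then the raw frame
	_, err = conn.Write(packet)
	return err
}
```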
This is what I like to call a distributed ethernet switch. WeaveNet implements an ethernet switch; it's a layer 2 network. It works purely in terms of MAC addresses, and it does what a dumb ethernet switch does: it learns MAC addresses by seeing a packet come in and observing the source address it came from, and later on, when it has a packet to send to that destination, it uses what it learned and delivers the packet that way.
A physical ethernet switch is going to deliver on different cables; we're in software, so we're going to send to different hosts, but it's exactly the same concept. We also have the same fallback behavior: if we don't know where it's supposed to go, send it everywhere. That behavior will come in useful later.
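A rough sketch of that learn-or-flood logic, with peers represented as plain strings; this is illustrative, not the actual Weave Net forwarding code:

```go
// Package macswitch is a hypothetical sketch of the "distributed Ethernet
// switch" idea: learn which peer a source MAC address lives behind, and
// flood to all peers when the destination is unknown.
package macswitch

import "net"

type Switch struct {
	macToPeer map[string]string // learned MAC address -> peer name
	peers     []string          // every known peer, used for flooding
}

func New(peers []string) *Switch {
	return &Switch{macToPeer: make(map[string]string), peers: peers}
}

// Learn records that srcMAC was seen in a packet arriving from peer.
func (s *Switch) Learn(srcMAC net.HardwareAddr, peer string) {
	s.macToPeer[srcMAC.String()] = peer
}

// Destinations says where a frame for dstMAC should be forwarded:
// the learned peer if we know it, otherwise everywhere (the flood fallback).
func (s *Switch) Destinations(dstMAC net.HardwareAddr) []string {
	if peer, ok := s.macToPeer[dstMAC.String()]; ok {
		return []string{peer}
	}
	return s.peers
}
```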
This is a real website; somebody was kind enough to make it (I didn't put their smiling face up), pointing out something that we did actually know. But if I step back... can I step back? Yeah, okay.
So yeah, the reason it's kind of slow is, we start off in user space here, in the program that's trying to get some work done, we go down into the kernel here, we go back up into user space through PCAP, we put it in a UDP packet, we go down into the kernel again, across the physical network, and then do the same thing again. We go up, down, up, down, up, down, up, down, and yeah, it's kind of slow.
It's terribly slow. We used to measure it, like five years ago, at 300 microseconds of extra latency per packet. Now you have to set against this what you are actually going to do with those packets: if the next thing that happens is they get delivered to a massive heap of PHP code,
then 300 microseconds is not your problem. But whatever, yeah, it's kind of slow. Okay, so next step in the evolution: we implemented what we call the fast data path, because we have no imagination when thinking of names.
So it's a kind of similar picture: the packet starts off in a container, and again we've attached a veth. The other end of the veth is now in a different device, which is an Open vSwitch data path. This is implemented by a kernel module from the Open vSwitch project.
This is the only piece of the Open vSwitch project that we use. So these daemon processes of ours are basically implementing our control plane independently of Open vSwitch, but we're using their kernel module.
So it takes the place of the bridge, at least in this version of the code. And we add a few kind of bridge-like behaviors to it to get everything we need out of it. But once a source destination MAC pair has been seen to be talking on the container network,
we set up a VXLAN tunnel, and that goes kernel to kernel. So the packets don't do this up, down, up, down, up, down thing. They are encapsulated, which costs you a little bit, but we used to measure this on a 10-gigabit network, which we thought was fast in 2015.
On a 10-gigabit network, we'd measure 8 gigabits of throughput. So it wasn't that bad. It's doing encapsulation, but it's kernel to kernel, and it's delivered to its destination pretty fast.
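For a feel of what a kernel-to-kernel tunnel looks like, here is a hedged sketch that creates a plain VXLAN device by shelling out to ip(8). Weave Net actually programs the Open vSwitch datapath rather than doing this, and the device name, VNI, addresses and port below are invented:

```go
// Package fastdp sketches a kernel VXLAN tunnel; illustrative only.
package fastdp

import "os/exec"

func createVxlan() error {
	// Once a device like this exists, encapsulation happens entirely in the
	// kernel, so packets no longer bounce through user space on every hop.
	if err := exec.Command("ip", "link", "add", "vxlan-demo", "type", "vxlan",
		"id", "42", "remote", "192.0.2.2", "dstport", "4789", "dev", "eth0").Run(); err != nil {
		return err
	}
	return exec.Command("ip", "link", "set", "vxlan-demo", "up").Run()
}
```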
So the person that did this was Mr. Wragg, Dr. Wragg, I should say. Almost everybody that worked on WeaveNet has a PhD, except me. Sorry. Yeah, so like I say, I like to put up the smiling faces.
So that's the fast data path. It fixed the main obstacle we had in the marketplace, which was that it was kind of slow. Let's talk a bit about how we set all these things up.
So right from the very beginning, we need a bridge. We need veths. We need to step into network namespaces and step out again. We need to set up some iptables rules. We need to set up some sysctls; there's a bunch of things we need to do. So how do we do all that?
In a shell script. We borrowed liberally from this project called Pipework by Jerome Petazzoni, who was at Docker at the time, I think.
And this project is a shell script. It turns out it's actually really concise to do the kind of ip netns blah, blah, blah stuff. So we had our own shell script, which is called weave. And it started off at 350 lines.
And it has these commands like weave launch and weave attach and so on. At peak, it got up to 2,500 lines. It's not a very nice place to be to maintain a 2,500-line shell script.
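The sort of plumbing that script did, sketched here in Go by shelling out to ip(8); the interface and bridge names are made up, and the real weave script does quite a lot more (MTUs, addresses, error handling and so on):

```go
// Package plumbing is a rough, illustrative rendering of the veth/namespace
// setup: create a veth pair, push one end into the container's network
// namespace, attach the other to the bridge.
package plumbing

import "os/exec"

func ip(args ...string) error { return exec.Command("ip", args...).Run() }

func attachContainer(netnsName string) error {
	steps := [][]string{
		{"link", "add", "vethwe-host", "type", "veth", "peer", "name", "vethwe-ctr"},
		{"link", "set", "vethwe-ctr", "netns", netnsName}, // one end into the container
		{"link", "set", "vethwe-host", "master", "weave"}, // other end onto the bridge
		{"link", "set", "vethwe-host", "up"},
		{"netns", "exec", netnsName, "ip", "link", "set", "vethwe-ctr", "up"},
	}
	for _, s := range steps {
		if err := ip(s...); err != nil {
			return err
		}
	}
	return nil
}
```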
So I sat down and reimplemented a lot of it in Go. Currently, as of the latest commit, it's at 1,600 lines or so. The features keep getting added; it works in all kinds of different modes and so on, so that bloats it. But one thing about recoding these things from shell script to Go is that the code gets like 50 times bigger, because Go is notoriously verbose. But there we are. So what else? We do encryption.
We do that both ways, with the fast data path and with the slow data path, which we renamed "sleeve" because "slow data path" didn't seem like a good branding position corporately. You know, a sleeve is a thing that encapsulates something. I don't know. Anyway, a cunning metaphor there.
In user space, we use the NaCl library to do our encryption. NaCl: sodium chloride, salt. Yeah. Okay.
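As an illustration of the user-space side, here is a minimal sketch using Go's x/crypto NaCl secretbox, assuming a 32-byte key has already been agreed between peers; Weave Net's actual key management and nonce handling are more involved than this:

```go
// Package sleevecrypto is a minimal sketch of sealing packet payloads with
// NaCl secretbox; key agreement is out of scope here.
package sleevecrypto

import (
	"crypto/rand"

	"golang.org/x/crypto/nacl/secretbox"
)

// seal encrypts a payload, prepending the random nonce so the peer can find it.
func seal(payload []byte, key *[32]byte) ([]byte, error) {
	var nonce [24]byte
	if _, err := rand.Read(nonce[:]); err != nil {
		return nil, err
	}
	return secretbox.Seal(nonce[:], payload, &nonce, key), nil
}

// open reverses seal; the bool reports whether authentication succeeded.
func open(packet []byte, key *[32]byte) ([]byte, bool) {
	if len(packet) < 24 {
		return nil, false
	}
	var nonce [24]byte
	copy(nonce[:], packet[:24])
	return secretbox.Open(nil, packet[24:], &nonce, key)
}
```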
In kernel space, we use the XFRM framework, and there's a wonderful explanation at that link at the bottom, with all the minute details of how we do this. One interesting tweak: we couldn't get this to work at all for months.
Essentially that's because the Open vSwitch data path doesn't provide any way to drive the packets through the XFRM framework; we can't set a policy that says everything on this data path goes through here. Eventually, the idea of how to fix this we stole from Docker: we put all the packets through an iptables rule which marks them, and then set a policy on that mark. So we have an iptables rule whose only function is to glue together two bits of software inside the kernel that otherwise don't play together.
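Roughly, the trick looks like the following; these are illustrative commands, not the exact rules Weave Net installs, and the device name, mark value and addresses are invented:

```go
// Package xfrmglue illustrates the mark trick: mark packets leaving the
// overlay device with iptables, then attach an XFRM policy to that mark so
// the kernel pushes them through an ESP transform.
package xfrmglue

import "os/exec"

func installMarkGlue() error {
	cmds := [][]string{
		// Mark everything sent out of the (hypothetical) datapath device.
		{"iptables", "-t", "mangle", "-A", "OUTPUT", "-o", "datapath",
			"-j", "MARK", "--set-xmark", "0x20000/0x20000"},
		// Tell XFRM that packets carrying that mark must use ESP.
		{"ip", "xfrm", "policy", "add", "src", "0.0.0.0/0", "dst", "0.0.0.0/0",
			"dir", "out", "mark", "0x20000",
			"tmpl", "src", "192.0.2.1", "dst", "192.0.2.2", "proto", "esp", "mode", "transport"},
	}
	for _, c := range cmds {
		if err := exec.Command(c[0], c[1:]...).Run(); err != nil {
			return err
		}
	}
	return nil
}
```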
And that's kind of... A lot of the history of this project has been sort of fighting with things that didn't quite want to do what we wanted them to do.
The history is there in the code and some of it I can remember. Anyway, so we encrypt the packets. We're doing key management up here. We did not roll our own crypto.
And yeah, some people like that feature. It's encrypted on this side; it's encrypted when it hits the underlying network. It's not encrypted here, so if you've managed to get onto the machine and you can sniff this veth, then you'll see the plain traffic there.
But I always reckon if you've got that much access, to be on the machine and sniffing a veth, then you've probably lost the game already. So who knows? What else? Oh yeah, Martynas wrote this. Martynas did all the gluing things together at the XFRM level. He now works on Cilium at Isovalent, which is the vendor behind it. So that's Martynas.
Okay, so change tack again. WeaveNet is a peer-to-peer network.
The title of this talk is No Central Point of Control. And it's a pun on the management style and the technology. We wanted it to just be install and run whether you're running it on your laptop or in the cloud or on a hundred hosts or whatever.
And what most people did to put together a container network was to rely on something like etcd to be a central, consistent store of what's going on: to hold all the container information, all the routes, all the whatever.
And we didn't do that. So WeaveNet is completely peer-to-peer. You can start with one peer and you can start adding more peers. They talk to each other via gossip. So I've given each one a little flag.
Each peer has an identity on the network. And that peer can be present on the network or it can go away; you know, you can close your laptop and take it on a plane and open it up again, and it'll still work on the network. The way we do that is that all the shared data structures are implemented as CRDTs, as eventually consistent data structures. They're specially designed so that we can do that: somebody can be absent for any number of hours and come back again, and the data reconciles. It all fits together.
That is incredibly hard work. Anyway, it has this property that you don't have to set up etcd before you get started, but it is very hard work. We do this for several things, and one of them is IP address management: we basically take an IP address space and map it onto a ring, like a distributed hash table type of ring, and then spread that across the network and gossip updates to that ring. That's how we do IP address management.
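A hedged sketch of that ring idea in Go; this is illustrative, not Weave Net's actual IPAM data structure, which does considerably more (range splitting, transfer between peers, CRDT-style reconciliation):

```go
// Package ipamring: the address space is treated as a circle, each peer owns
// the arc from its token up to the next peer's token, and once the ring has
// been gossiped around, any peer can decide locally who owns a given address.
package ipamring

import "sort"

type entry struct {
	token uint32 // position on the ring, e.g. derived from an IPv4 address
	peer  string
}

type Ring struct {
	entries []entry // kept sorted by token
}

// Claim records that peer owns the arc starting at token.
func (r *Ring) Claim(token uint32, peer string) {
	r.entries = append(r.entries, entry{token, peer})
	sort.Slice(r.entries, func(i, j int) bool { return r.entries[i].token < r.entries[j].token })
}

// Owner returns the peer responsible for pos: the entry with the largest
// token <= pos, wrapping around to the last entry when pos precedes them all.
func (r *Ring) Owner(pos uint32) string {
	if len(r.entries) == 0 {
		return ""
	}
	i := sort.Search(len(r.entries), func(i int) bool { return r.entries[i].token > pos })
	if i == 0 {
		return r.entries[len(r.entries)-1].peer
	}
	return r.entries[i-1].peer
}
```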
Yeah, okay. I wanted to talk about the community a little bit. I think I have a chart here. When it says installs,
we get a count from Docker of Docker pull operations. It's running at well over a million; it was up at two million. This is one year, the last year: two million a week, down to about one and a half million a week.
We see this software fire up a lot. As an open source project we don't have a very good idea of who's using it. People write in when they have a problem sometimes but they generally don't tell us just that they're using it and they're happy with it.
So this is one of the few bits of evidence we have: the thing gets fired up, in some sense, a million or two times a week. Compared to that, we get very few PRs.
We get lots of people coming along and saying things like this; this is just one I picked on because it came up recently, and it's over a period of a year and a half: people complaining about a setting and saying, why don't you change the default? It's one line. Send a PR.
People don't know how, maybe. So most of the work has been done by people being paid by Weaveworks. This is the GitHub contributors list. A fun statistic: after being the lead
on this project for five years, I'm the second-highest contributor. Matthias Radestock, who is also ex-RabbitMQ and was a co-founder of Weaveworks, is still the number one contributor. But all these people work for Weaveworks; Mike Bryant
is the biggest contributor who doesn't work for Weaveworks. We do have a long tail of people who did manage to come up with one or two PRs, which is great and I would like to encourage that, but it is a little bit dispiriting
when people just want to complain about the software and demand that it does something else. Kubernetes. This is what you were promised; this is the theme of this day.
So WeaveNet is quite popular with Kubernetes. I thought I'd just kind of run through what that means exactly, what it is doing there and how it works. Kubernetes doesn't just talk about containers, it talks about pods.
A pod is a collection of containers on the same machine, so in the Kubernetes world, conceptually, the blue blobs are pods, but the same stuff is going on: they're talking to each other. And Kubernetes has a very small set of rules, one of which is that any pod can talk to any other pod without going through NAT.
And funnily enough, the rules, the sort of model, the networking model of Kubernetes, matches very well to Google's network, so I don't know if we'll ever figure out how that happened.
But if you run on GKE, Google's commercial Kubernetes, then they have this thing with the bridges, and they just have routes, IP routes, layer 3 routes, from machine to machine. So they have the same thing with the bridges, but they don't have anything else other than the
Google network to transmit packets between machines. They just use Linux routing and let the underlying network deliver the packets to a bridge on the other side, and that just works if you're at Google. It pretty much doesn't just work anywhere else,
so there is a need for something to take that place, and Weave Net is one of the things that people sometimes choose to take that place. So, back around the time this was getting popular, which is about four years ago now,
the project rkt, which came out of CoreOS and was kind of a competitor to Docker, had this very simple model for network interfaces where they would exec a process that would add a network interface. So that became CNI:
essentially some people, including Weaveworks folks, got in a room and said, yeah, that should work, and it got named and it got turned into a project. And I am a maintainer of the CNI project,
but CNI is supposed to be really, really thin. I just thought I'd walk through what exactly that is. So CNI is not coupled to Kubernetes; like I said, it came from rkt. It's completely independent of both the network and what we call a runtime.
So Kubernetes is in the place of the thing we call the runtime in CNI-speak, and physically it's the bit of Kubernetes called the kubelet, which is the bit that runs on each node. So the kubelet calls a CNI plugin, and right now
the interface is exec: it execs a process in the host namespace, not in a container, and it supplies a JSON config which lists a few things out, like maybe which subnet you're supposed to be using, something like that.
Then, conceptually, you've got a network: somebody showed up with a network, you bought one from Juniper or you installed Weave Net or you're using Cilium, but somebody's got a network. So the job of the plugin is just to be that little bit of glue in between: to interpret this JSON spec and to cause the network to attach itself to a container.
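To show the shape of that interface, here is a hedged, minimal Go sketch of what a plugin invocation looks like. The CNI_* environment variables and the JSON network config on stdin are part of the CNI convention, but this is not Weave Net's actual plugin, and the struct is cut down to a few fields:

```go
// A toy CNI-style plugin: the runtime execs this binary on the host,
// passes parameters in CNI_* environment variables, and writes a JSON
// network config on stdin.
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

type netConf struct {
	CNIVersion string `json:"cniVersion"`
	Name       string `json:"name"`
	Type       string `json:"type"`
}

func main() {
	cmd := os.Getenv("CNI_COMMAND")   // ADD, DEL, CHECK or VERSION
	netns := os.Getenv("CNI_NETNS")   // path to the container's network namespace
	ifname := os.Getenv("CNI_IFNAME") // interface name to create inside it, e.g. eth0

	var conf netConf
	if err := json.NewDecoder(os.Stdin).Decode(&conf); err != nil {
		fmt.Fprintln(os.Stderr, "bad network config:", err)
		os.Exit(1)
	}
	// A real plugin would now create a veth, move one end into netns, attach
	// the other end to the network (the Weave bridge, say), assign an address,
	// and print a JSON result on stdout for the runtime.
	fmt.Fprintf(os.Stderr, "%s: attach %s in %s to network %q\n", cmd, ifname, netns, conf.Name)
}
```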
That's the idea of the CNI project, and I think it's worked fairly well in its goal of being agnostic and staying out of the way of people. I do quite often hear complaints that CNI doesn't do this and CNI doesn't do that. The unfortunate news is it's never going to do those things, because it's trying to be the thinnest possible layer that could work for everybody. This is JSON: if you want to say extra things in JSON,
just add them. Party on, just add fields. OK, that's CNI. How do we get Weave Net installed? So I just mentioned the plugin runs in the host, as a process on the host,
and everything we're talking about is containers, which are isolated. So we get that to run by devious trickery: we mount a directory off the host, and when Weave Net starts up, it copies the file into the host directory, so now it's on the host.
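A sketch of that trick in Go; the paths are illustrative (a hostPath-style mount of the host's CNI plugin directory is assumed at /host/opt/cni/bin), and the real start-up script does more:

```go
// Package install sketches the "copy yourself onto the host" trick: the
// container drops its plugin binary into a mounted host directory at
// startup, so the kubelet can exec it in the host namespace.
package install

import (
	"io"
	"os"
)

func installPlugin() error {
	src, err := os.Open("/usr/local/bin/weave-plugin") // binary shipped inside the image
	if err != nil {
		return err
	}
	defer src.Close()

	tmp, err := os.CreateTemp("/host/opt/cni/bin", "weave-net.tmp")
	if err != nil {
		return err
	}
	if _, err := io.Copy(tmp, src); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	if err := os.Chmod(tmp.Name(), 0o755); err != nil {
		return err
	}
	// Rename into place so the kubelet never sees a half-written binary.
	return os.Rename(tmp.Name(), "/host/opt/cni/bin/weave-net")
}
```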
As far as I know, I invented this trick, but everyone does this now, so maybe I copied it from someone else; tell me at the end if it was your idea. Yeah, so Kubernetes has this concept of a DaemonSet, which basically means: run the same thing,
one on every node, and restart it if it dies, that kind of thing. So that's how we fire up: we arrange for someone to ask for that DaemonSet, that fires up a copy of our software on every node, we do this trick with
copying a file onto the host, and now we're away. The kubelet is now going to call the plugin, the plugin's going to call back up into the daemon, and that's how that all works.
Earlier I observed that we don't have any kind of central, consistent idea of what's going on, and of course in Kubernetes you have exactly that: the central thing, what's called the API server, does know everything that's going on in a Kubernetes cluster.
And so a few times we thought about abandoning the eventually consistent stuff and just relying on what Kubernetes is telling us, which is what everyone else does, and we never quite got around to doing that. Anyway, it's an idea; if you want to submit
a PR, that'd be great. We implement Kubernetes network policy, which was mentioned if you were in a couple of the previous talks: so, like, saying who's allowed to talk to whom. We do that by relying on what Kubernetes tells us, because, you know, it's the only thing that knows all the labels on the different things.
And, somewhat excitingly, the network is implemented at layer two and the network policy is implemented at layer three, and they essentially have no connection between them; we just run them as two separate processes in the same pod. Anywho, is that all I wanted to say?
Skip over that. Yeah, that's pretty much what I wanted to say. Does anyone have any questions? Yes? ... I changed it from "weeb" at some point, because... OK, so, well, that's not a question, that's an observation. But the observation was... oh, OK. Do we have any plans to support IPv6? So WeaveNet has no support for IPv6, in two ways: it doesn't support IPv6 inside the overlay,
and it doesn't support IPv6 as a target on the underlying network. So which of those two did you want? You wanted both of them. And may I ask a question? I mean, the whole point of overlay networks, generally, is that you have some problem that stops you just routing across from one container
to another, and that problem is very often an addressing problem in IPv4. So do you know what problem in IPv6 you're solving? Like, why can't you just route between the two containers? OK,
so you said, your point was, that you need some pods. Now my suggestion is that all pods can have globally reachable IPv6 addresses, so you don't need anything else; you don't need
me to write any code, because IPv6 will solve your problem. Alright, I think we have to take that offline. Yeah, I mean, you know, bottom line:
nobody did the work. Why doesn't it support IPv6? Because nobody did the work. It's an open source project; Weaveworks as a company found something much more exciting to do, which is called GitOps, and you should all buy that. We never managed to
monetize Weave Net; we never made an enterprise version; we never found anyone that was, for instance, willing to pay us enough money to do an IPv6 implementation. Thank you for the question. Any more? One over here.
The question is: what was the conntrack race condition? So I should put Martynas's smiling face up... oh, I'm pressing the wrong button to put Martynas up.
So in particular, that link at the bottom is not the right one to look for; I should have changed that, sorry. OK. I'm just trying to see if I can give you a short explanation. Basically, it shows up when doing DNS requests, particularly on Kubernetes,
particularly using the musl C library. And what happens is it does two requests at exactly the same time, for the A record and the AAAA record,
notwithstanding the fact that we don't support IPv6. The two DNS requests go out from the same source address to the same destination address, same source port, same destination port, and they hit a race condition in conntrack, and one of them gets dropped.
No, it's fixed in the Linux kernel. Yeah, like I say, we spent most of our time not on our own software; we spent most of our time fighting other people's software, including Linux, and in some cases fixing it.
So, yeah, Martynas wrote two patches; he found three race conditions and wrote two patches to fix two of them. So you can Google something like "why do I see a mysterious five-second delay in my Kubernetes system".
This is not the only reason why people see mysterious five-second delays, but it's certainly a very popular one. The nature of the requests is that they are from the same source address to the same destination address, same source port, same destination port,
and conntrack does not know how to deal with that. Well, time's up. Thank you.