
FreeBSD Operations at Limelight Networks (part 1 of 2)


Formal Metadata

Title
FreeBSD Operations at Limelight Networks (part 1 of 2)
Subtitle
An Overview of Operating at Internet Scale
Number of Parts
41
Author
Kevin Bowling
License
CC Attribution - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

Abstract
In this talk, we'll look at Limelight's global CDN architecture and the practice of large scale web operations with FreeBSD. We'll investigate how FreeBSD makes these tasks easier and the strategies and tools we've developed to run our operations. We'll then look at why the engineering team chose SaltStack to further improve our operations capabilities and reduce deployment and fault handling times. Finally, we'll finish up with an overview of metrics and monitoring at scale with Zabbix and OpenTSDB. Limelight Networks is one of the "Big Three" CDNs and runs its edge using FreeBSD.
Transcript: English (auto-generated)
I'm going to go ahead and get started. Stay here if you want to hear about FreeBSD scale-out operations. Just a quick shout-out to the other Limelight folks here.
I'm the guy at the top, Kevin Bowling. Sean is here, a source committer; Hiren is somewhere, another source committer. Jason is back there, and Chris is back there as well. Various roles at Limelight on the engineering side.
And Johannes is a contractor for us doing some cool stuff with stats in the Linux ports. So that's actually more or less the totality of our BSD effort. We've got a couple other people, and I'll touch on that in a little bit.
So just an introduction to what Limelight is. We are a CDN, and this is a cute graphic our marketing folks came up with. Basically what we do is put servers close to users. So these are in data centers that are rich with eyeball networks and backhaul.
We run our own fiber backbone. This actually differentiates us from most other CDNs, which are generally going over Internet transit or some type of carriers.
If they're putting, for instance, their appliance in an ISP location, they have to backhaul over the ISP's network. So this kind of lets us get over the turbulence of the Internet. We can also accelerate non-cacheable content via our backbone.
We do have some other services aside from content delivery. We do video, so we've got a pretty comprehensive system around that. It's basically like a private YouTube that you can drop into a site. A lot of local news channels, for instance, use this.
Let's see. We've also got object storage. This is similar to S3. It's much more targeted to being an origin for our caching service, but people do use that as a generic storage, basically an S3 type of object storage.
We've got a DDoS attack mitigation that can either be used with our content delivery products or as a network defense, as long as we can take control of the front-end IPs. So as far as numbers go, we're somewhere north of 10 terabits of egress at this
point of actual bandwidth, and that's peering, transit, and paid peering. So we're pretty big in the CDN market. We're generally between one and three, depending. Well, I don't think we've ever been one, but number two or three, depending on the
time of year. And we have somewhere north of 100 data centers. Again, these are just POPs in large metro areas with lots of fiber and hopefully lots of eyeball networks. So a POP looks pretty plain; you know, there's not a lot going on inside of them in terms
of equipment. We've got DWDM gear. This basically runs a local fiber loop, because generally we don't go into just one data center in a metro area.
We'll have two or three. The DWDM gear lets us, you know, over a single pair of fibers, cram multiple 10 gigabit lines. So that creates a loop between them; we basically treat all of those data centers as one point of presence, and we do get a little bit of redundancy out of that, but that's how
that works. At the actual data centers, we have a pair of generally the largest routers you can get from somebody like Brocade with a full route table, and this is what our peers are coming into and our transit.
Behind that, we'll either have a couple or more large chassis switches. You know, these look just like the routers. They're like half, three quarters of a rack with tons and tons of 10 gig ports going out to the systems, or we're pulling 40 off to a spine network.
Generally, we're using 40 gig switches here, and those will go to top-of-rack switches. There are pros and cons to both approaches. Price usually dictates which we do, as well as the size of the POP. Then we've got a ton of servers that look just like this. A lot of people use Supermicro.
We're in that camp. We generally throw one CPU into these. This is good for FreeBSD because we don't have NUMA problems. There's just a single NUMA node. We're using all SSDs at this point on these edge boxes.
We've used some Samsung. I think we've evaluated Micron as well. All of those bays will be generally 480s at this point. We're looking at going up to terabyte class SSDs because that affects our cache retention time, which lets us, for long tail content, we can get faster throughput the more space
we have. On the back of this thing, it's actually two servers in the 2U. The reason we do this is we get four extra drives in the 2U versus 1U servers. It does cause some problems with asset management.
We've mostly worked that out, but for instance, if you pull one of those nodes and put a new one in, how do you handle that? It's a pain, but it's worth it for the four extra drives. On the back, generally, at this point, we're using Intel 10 gig fiber Ethernet. It drops into this little guy right here.
We're trying to work with Chelsio right now and see if we can get a Chelsio board to go into this thing, because if you don't populate the second CPU socket on these Supermicro boards, you don't get to use these, unfortunately.
I don't track that. Nobody on my team does either. We're a little bit higher level than that, but I would assume so. We're trying to get more and more efficient, so that will be part of the effort,
but at this point, it's purely performance driven. We can do so much more with the SSDs. It does, but SSDs have dropped to the point where they're big enough and cheap enough that it doesn't matter.
We're in colos. We only have a couple of our own data centers, so we don't care too much about that as long as the data center does a good job. Again, the point of this talk, what actually motivated me to do this was
a lot of people talk about embedded use. There's a lot of appliance vendors talking about FreeBSD, but I haven't seen a lot of people talking about large scale installations, and there are a few of those out there. I want to just show you what we do, and hopefully people can learn or be motivated to come and talk about their own stuff.
The main difference between an ops type of workload and an appliance workload is the systems are very fluid. These things are changing quite regularly in terms of software and in terms of configuration. We're pushing configuration several times a day, either for customer turn-ups or to test new packages
or whatever the case may be. This is very common. This is all of the hot stuff you see at startups and whatnot. This is large websites, API-centric companies, and service providers. They're all in this category of ops, I would say. And with that, the workload is basically internet-facing.
We're not a storage appliance that has to have 100% availability because a ton of servers are hanging off of it. We've got lots of cheap nodes, and we can deal with failure in different ways.
This is more or less the about me. I think it's kind of important before we get to the other slides. I was a Linux guy for 10-plus years and very deep into that culture. Although I was doing that professionally, I kind of played around with other operating systems.
I ran m0n0wall when I was still in high school, and that was a thing. I switched to pfSense when that started gaining traction. I would play around with other OSs just for fun. I'm kind of curious about the design trade-offs and why people do things. I also like old hardware.
I kind of played a role with those ones at the end. I start at Limelight Networks, and I'm intrigued by the BSD edge because this is our bread and butter. There's over 10,000 machines, and there's not a lot of people doing anything to make that happen. I'm curious because on the Linux side,
either at Limelight or other companies I've been at, there's a ton of people per whatever measurement you want to use per X number of servers. At Limelight, that wasn't the case. There was maybe a handful of people really involved in the design and implementation of the CDN,
and that kind of piqued my interest and got me going on this stuff. When I started digging, what I found was this BSD software and mindset were really responsible for that, and that sucked me in. I'll try and explain more of that in my talk as I'm talking about some of the tools we use,
and hopefully that makes a little more sense. One motif to keep in the back of your mind when I'm doing this, observability trumps everything else. This is kind of stolen, I think, from Brendan Gregg. He meant it, I think, in the context of tracing and figuring out how software works,
but I think it's even deeper than that. We were talking last night about how BSD pulls you into the source tree, and you, for instance, know your compiler, at least what it is and what it's calling out to in terms of other utilities. In the base system, you know what's part of your distribution.
It's not just this substrate that you're trying to fire up JVMs on top of and be done with it. You actually kind of get involved in your operating system. I'll dive into some of our tool choice. These are pretty airy slides, so feel free to interrupt me. We use Zabbix. We're generally happy with it.
It was somewhat hard to scale because it uses a relational database to keep track of all these incoming values, so the answer to that was Fusion IO. We run MySQL on top of Fusion IO, and it works well enough for the current workload.
The key insight here, though (as an aside, I wouldn't necessarily say use Zabbix unless you're a small or medium shop; it's a little bit pushing it for what we're doing), is: use an API-driven monitoring system. There are a couple out there, or more than that,
but make sure that the way you're interacting with your monitoring system isn't writing config files manually. You want to be pushing configuration into this, and that should ideally be part of your configuration management toolbox, and I'll get to that when I talk about Salt. Operationally, monitoring has to be part of your entry into production.
If you have people putting stuff customer-facing up without monitoring, you're going to have a bad time. I mean, you're going to have problems, and there's going to be this fire drill, and then you're going to wonder why you didn't do that to begin with. This is something we've learned a few times over. I think we've gotten a little bit better at it recently.
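Just to make that concrete, here's a minimal sketch of pushing a host into Zabbix through its JSON-RPC API instead of hand-editing config. The URL, credentials, hostname, group ID, and template ID are placeholders, not our actual setup.

    import json
    import urllib.request

    ZABBIX_URL = "https://zabbix.example.com/api_jsonrpc.php"  # placeholder

    def zbx(method, params, auth=None):
        """Make one Zabbix JSON-RPC call and return the 'result' field."""
        payload = {"jsonrpc": "2.0", "method": method, "params": params, "id": 1}
        if auth:
            payload["auth"] = auth
        req = urllib.request.Request(
            ZABBIX_URL,
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json-rpc"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["result"]

    # Log in, then register a new edge node with an agent interface, a host
    # group, and a template, so monitoring exists before the box takes traffic.
    # (Older Zabbix uses "user" here; newer releases renamed it to "username".)
    token = zbx("user.login", {"user": "api-user", "password": "secret"})
    zbx("host.create", {
        "host": "edge0001.example.net",  # hypothetical hostname
        "interfaces": [{"type": 1, "main": 1, "useip": 1,
                        "ip": "192.0.2.10", "dns": "", "port": "10050"}],
        "groups": [{"groupid": "2"}],            # placeholder group ID
        "templates": [{"templateid": "10001"}],  # placeholder template ID
    }, auth=token)

The point is that a deploy pipeline or a CM run can drive calls like this, so nothing customer-facing goes up unmonitored.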
And then the other thing where we want to go is getting monitoring as part of our testing in QA. A lot of people write QA toolkits or what have you to run unit tests or integration tests, but when you're doing ops, you actually need to think beyond just the piece of software.
You need to think about how it's deployed and how it integrates with other microservices or databases, whatever the case may be. The answer that we think is plug it into monitoring. That's what's going to tell you when something's wrong in production. If you can catch those errors as part of QA, then you have a nice little feedback loop.
Just as part of this, don't use Nagios anymore. It's not very good. We can do better than that as an industry. Opposite of monitoring is metrics,
and this is more or less time series data coming into some type of scalable database. We have OpenTSDB in place right now. I'm not really happy with it. I was involved in trying to un-FSCK it a few times
and didn't get very far, but I was talking to Sean Chittenden. He's a Groupon guy here at BSDCan, and he's like, so basically what you have is a metric dumping ground. We have something that's easy to put a ton of data into and not really anything to get good stuff out of it. So I think there are better answers here.
One of the things we've been experimenting with is a startup called Jut. It's kind of a hybrid hosted/onsite application. This guy in the back, Chris, can tell you all about it if you're interested, but it's actually pretty cool. It's a dataflow language, which is something that's been around; dataflow programming has been around for a long time,
but they kind of put it right here in your face. So if you've ever used Splunk, it's just next level beyond that. So, for instance, here they're showing querying an asset database. And basically the question was,
using these metrics like our average response time and our kilobits per second, how can we see how our different hardware models are influencing that? So in this example, this particular device is doing quite a bit better than these other devices. And somebody looking at this could make a case to say,
well we should deploy a lot of these and deprecate these because that wins us business or whatever. So metrics is a pretty important thing for making decisions at scale. I can talk a lot more about this or I can move on if anybody's interested.
So basically what we're trying to do, we feed a ton of just stats coming off a server. So our main ingest is a program called CollectD. It's just a C agent with plugins. And this is looking at things like your CPU usage, load average, Gstat on FreeBSD, memory.
And then we try and get application metrics too. This requires the application developers to get involved, but they can push up things like transactions per second or average, some type of percentile response or things like that.
Once we get it into one of these systems, then we can query it. This is actually the bare-bones OpenTSDB interface; there are some better ones, like Grafana. But basically then what you do is try and correlate things. So you can say, this is actually a brilliant example. It's like, can I correlate server model to response time?
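A sketch of that kind of question asked through OpenTSDB's /api/query endpoint: average response time over the last day, grouped by a hardware-model tag. The metric and tag names are the same invented ones as above.

    import json
    import urllib.request

    OPENTSDB = "http://tsdb.example.com:4242"  # placeholder endpoint

    query = {
        "start": "1d-ago",
        "queries": [{
            "aggregator": "avg",
            "metric": "app.response_time_ms",  # hypothetical metric
            "tags": {"model": "*"},            # '*' groups results by model
        }],
    }
    req = urllib.request.Request(
        OPENTSDB + "/api/query",
        data=json.dumps(query).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for series in json.loads(resp.read()):
            dps = list(series["dps"].values())
            print(series["tags"].get("model"), round(sum(dps) / len(dps), 1))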
But maybe I want to look at backbone saturation to response time or swap in versus response time, things like that. When you have the data, you can start asking questions. And with them in a scalable database,
you can ask them post facto. So you don't lose that after an incident. You can go back and say, why did we do something wrong or imperfectly there? Yes.
Sure. So I said it's not quite metrics because basically both of these things are taking in log data. For instance, you're pushing in syslog or app logs. Then they have indexers that can put that
into an efficient structure. So you can query it and roll it up into different things. A lot of times you can turn that back into metrics. So for instance, we can use Splunk to get metrics off of like an access log or something. ELK is more or less equivalent to Splunk, just open source.
I'm trying to think. The other thing you can do here is just query if you are looking for, for instance, a panic or something that's coming off a syslog, you can go into Splunk and try and make inductions
based off of that, try and correlate things, kernel version or things like that. Does that help? These two are more textual.
It's how do you deal with logs at scale? A person can't go view the syslog output of 10,000 servers. It's just overwhelming. So what you try and do is get it into here and then look for anomalies or create canned searches that know certain bad conditions, things like that.
You can use that to then feed an alarm into your monitoring system. But by itself, it's very freeform. It's like a search index for text.
I'll go ahead and move on then. So this is something we've invested a lot of work into in the past year. We were a CF Engine 2 shop and then we had some Chef through acquisitions. But we did kind of a bake-off and we looked at what was out there
and what would work for our implementation and we found Salt. And we've been pretty pleased with this decision. The key insight with Salt is that you have configuration management built on top of an orchestration bus. So rather than running your CM system on a scheduler or a cron, you actually have agents permanently running on the systems
and then they're always connected to these master systems. So this is kind of interesting. You can react to different events. For instance, when CM runs on one system and something changes, that can push something over the bus and make something else happen. For instance, add a host to a load balancer
or something in real time. You don't have to do this on synchronous schedules. So I gave a talk at SaltConf where we go really deep into how we deal with changes to the CM system itself. We basically have a workflow where we have a steady state CM
and then when somebody wants to change that policy, we spin up a new salt master in a container and then let them point their machines to that and verify it, you know, in a sandbox environment or even in production for certain changes. And when that's ready, that's then accepted and promoted into that steady state.
This has been pretty cool. So basically what you're trying to do with configuration management, if this is new to you, is move system state from something like shell scripts or interactive input into declarations. You want to describe what a machine is supposed to do
rather than step by step how it is to do it and then let the system figure out what's changed or what needs to be changed and what order it needs to happen and to make it do a thing. So basically policy is greater than implementation with configuration management.
With Salt, or with most systems, you can do things programmatically when you need to. One of the key insights is you kind of want to build those programmatic structures up so then you can use them in your declarations, and Salt makes this really easy. This is a state that deploys network time, or ntpd,
and using a map file it works on our FreeBSD hosts, our Red Hat hosts, and our Ubuntu hosts. So that's kind of what you can do with CM: you can abstract things out a little bit and make it easy to understand what a host is doing at an abstract level.
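In real life that state is YAML plus a Jinja map file; to keep the samples here in one language, here's roughly the same idea written with Salt's pure-Python (#!py) state renderer, with illustrative per-OS names rather than our actual map.

    #!py
    # ntp/init.sls -- a sketch of an OS-agnostic "run a time daemon" state.
    # __grains__ is injected by Salt when it renders the file on the minion.

    _MAP = {
        'FreeBSD': {'pkg': None, 'service': 'ntpd'},   # ntpd ships in base
        'RedHat':  {'pkg': 'ntp', 'service': 'ntpd'},
        'Debian':  {'pkg': 'ntp', 'service': 'ntp'},   # covers Ubuntu
    }

    def run():
        entry = _MAP.get(__grains__['os_family'], _MAP['Debian'])
        config = {}
        if entry['pkg']:
            config['ntp_pkg'] = {'pkg': ['installed', {'name': entry['pkg']}]}
        config['ntp_service'] = {
            'service': ['running',
                        {'name': entry['service']},
                        {'enable': True}],
        }
        return config

The declaration says what should be true (package present, service running and enabled), and Salt works out what has to change on each OS.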
The other thing we get with Salt is this orchestration bus. So, a kind of neat example we had recently: we ran into some weirdness in the TCP stack where we have a customer with a very bad network that's sending out-of-order packets in the initial burst. And then it's actually sending ACKs left of the window,
and there's actually an RFC, that none of us knew about, where this is supposed to be a good thing. So basically we wanted to see how prevalent this was in production to gauge the severity. So we wrote a DTrace script and actually ran it on 2,000 production machines
and just watched a counter for 10 minutes. And we found out it's actually very, very rare. So that helped us kind of triage a bug from, oh wow, we better get a handle on this real quick to okay, we can take our time and figure out what's actually going on here and how do we wanna fix that.
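For a feel of what that kind of one-off fleet survey looks like driven from the Salt master's Python API, here's a sketch; the target glob and the DTrace one-liner are placeholders, not the script we actually ran.

    import salt.client

    local = salt.client.LocalClient()

    # Count an event of interest for ten minutes on every matching minion;
    # the exit(0) in the tick probe makes dtrace stop on its own.
    dscript = ("dtrace -n 'tcp:::receive { @n = count(); } "
               "tick-600s { exit(0); }'")  # hypothetical probe and aggregation

    results = local.cmd('edge*', 'cmd.run', [dscript], timeout=700)

    for minion, output in sorted(results.items()):
        print(minion, output.splitlines()[-1] if output else '(no output)')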
Should I pause here on salt? Any questions or comments? Sure. Yes, we do. So we've got, I'm trying to think of a good example. So we sync SSH keys out to the edge.
This is just one I wrote, so it's on the top of my head. To do that, the module goes and makes an LDAP query for the SSH attribute in the directory and then pumps that to the master. Then the master can use the Salt file server to push that out to our edge nodes. It's just a way we log into our systems.
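A stripped-down sketch of what a custom execution module like that can look like; the LDAP server, base DN, and attribute name are hypothetical, and the real module does more error handling.

    # _modules/ssh_sync.py
    import ldap  # python-ldap

    LDAP_URI = 'ldaps://ldap.example.com'     # placeholder
    BASE_DN = 'ou=people,dc=example,dc=com'   # placeholder
    KEY_ATTR = 'sshPublicKey'                 # common OpenSSH-LPK attribute

    def authorized_keys():
        '''
        Return {uid: [ssh public keys]} for everyone in the directory, so the
        master can hand the result to the file server / ssh_auth states that
        manage logins on the edge nodes.

        CLI Example:
            salt 'master*' ssh_sync.authorized_keys
        '''
        conn = ldap.initialize(LDAP_URI)
        conn.simple_bind_s()  # anonymous bind, just for the sketch
        keys = {}
        for _dn, attrs in conn.search_s(BASE_DN, ldap.SCOPE_SUBTREE,
                                        '(%s=*)' % KEY_ATTR,
                                        ['uid', KEY_ATTR]):
            uid = attrs['uid'][0].decode()
            keys[uid] = [k.decode() for k in attrs.get(KEY_ATTR, [])]
        return keys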
We've also written modules to do different services. I can't think of, one of them is actually this workflow, like how this thing spins up containers. That's a module.
A couple screenfuls at most. It's easy. I'm really pleased with Salt. Everything's pretty straightforward. The docs are a little bit hard to get started with, but once you kind of grok it, it's pretty easy to keep going.
Very little. Very, very little. It's ZeroMQ underneath right now, and they're actually working to make an even more optimized transport. But as far as bandwidth, there's no noticeable overhead; I mean, I think the machine's been up for a while with a large client count, and it's done like 100 gigs over a couple months of,
sure. So we've got one master, just a single master right now with a total of like 2,000 hosts on it and we've got a couple other pools and that's doing fine. That's handling all the encryption and everything
and you'll see the CPU spike a little bit. You don't want to skimp on hardware there but for that I think you'll be all right. We've got dual, so we went dual CPU.
Whatever the current generation is, like eight core, dual eight core and then like 100, RAM actually didn't matter but we've got like 128 in there. We also did SSDs just because we didn't need a lot of space and they're affordable for us.
Sure, so I looked into Ansible on my own. We didn't consider it for work. We looked into Chef, CF Engine 3 and Puppet aside from salt. What I saw in Ansible was a lot of the same thing but it didn't do the bus thing that we really like.
That's kind of a key insight to us. I think it's a great configuration management system and it's really easy to get started. Their docs are fantastic. It just seemed to me that salt was a better fit for what we wanted to do.
We didn't see really a tremendous gain like from two to three. What we wanted was easy templating like this. That would actually be a lot more stuff in CF Engine 3 and then the ease of writing custom modules. We've got a lot of people that know Python
to varying degrees. So the entry to changing both the server and the client are pretty low. Most of us that have worked on the salt implementation have actually been contributing patches like drive-by patches to the salt upstream. You just go in and do it and you're done for the day. You don't have to have a huge learning curve.
I think that's a key win, actually, over some of the other systems, where they've started bifurcating: the agents are in Ruby and now you've got a Clojure server or whatever. It starts making things harder for casual development.
I'm going to go ahead and move on. How do we actually get FreeBSD onto our edge machines? This has changed in recent times. We're trying to get a little bit more formal with it because we've got source committers on staff now and we're starting to do more interesting stuff.
So basically we use Git at Limelight as our version control system. So we're using the... There's a semi-official GitHub mirror of the FreeBSD SVN tree. So we have two branches. We have head and stable. And these follow SVN head and currently 10 stable.
And what we're doing here is taking... We're deploying 10 stable, but we develop against head because we want to kind of stay ahead of the curve and make sure that what we're doing is going to be fine when the next release comes along. So we take these two branches
and we grind them through a Jenkins job. This produces our images that actually go out to the edge. And I'll kind of go off on a tangent here. We have this vagrant thing that... This is actually part of our salt deployment. We're taking these images and pushing them out as vagrant boxes
so developers can run this stuff on their laptop. The insight here is we want them to have a very low barrier to entry to writing configuration management and working with our actual production images. If you're developing against a vanilla image that might not have all of our customizations,
maybe you don't run into a problem early enough and it becomes a problem, that kind of thing. So with vagrant, we're able to actually get very low barrier to entry to our very production-looking environments. And this is all a big feedback loop. Packer is a thing that we use to make those box files.
It's a little bit more important in the Linux world because we just have these ISO images that we have to enhance with our changes to packages and config. But in FreeBSD, we've got the build system so we can do whatever we need to there. I'll go more into our source stuff in a bit.
So phase two, after I'd been at Limelight for a little over a year, what I kind of saw was that this BSD stuff was awesome and we needed to do more of it. We needed to be deliberate about it. So we brought on Sean,
and that's been awesome. He's been helping us upstream all the things. So we had a stack of patches, not a huge list, not like some of the appliance people, but enough to try and get that stuff either fixed upstream or at least reported upstream, so it could be fixed in perhaps a better way.
And we're trying to get better about how we actually use the ports tree and build packages. This is an ongoing thing, but the key here is Poudriere and pkg(ng). These are really awesome. I think they're kind of the best software packaging experience that I've seen on any operating system to date.
And again, this is all about just being very deliberate about what we're doing. A lot of things that up until this time were done just because they had to get done, and now we're trying to take a look at it and say, okay, here's how we should do it going forward and we'll be more efficient and better. So how did we start a source team?
So for instance, I found Sean on the jobs mailing list. This is pretty low volume, but you can either post your resume there or post a rec there. You can come to conferences like this and look for people that are doing stuff. And of course, if you do cool stuff sensibly,
generally people will come to you. We're trying to do that. I hope we're getting better at that. But there's plenty of people using BSD that are doing that. The benefits of starting the source team were we were on FreeBSD 8 when we started, and 9 had come out, 10 had come out.
And getting from 8 to 10 was actually a lot more involved than we thought. Even with this small patch stack as an operator, it was quite a bit of work, both because there were actually bugs in the 10.10 and 10.1 release
that we've had to work through, and then we have a binary blob that we actually deployed to production. We bought a pluggable congestion control algorithm before that was a thing in FreeBSD that does some network magic. So we had to kind of figure out how we could keep the interface consistent so we could keep using that in the 10 life cycle
while we figure out what we want to keep from that and implement ourselves where we can as source changes. Some of the other things we've done: Sean worked on this multi-queue em driver. The em driver is for a gigabit-class Ethernet controller
from Intel, and it only uses one NIC queue, and what we saw was that a lot of our machines were actually kind of stuck in the TCP path. So what he found through reading some of the ARC manuals was that you could split this out to at least two queues on some of the chips, and then now we can get two or more cores,
I think two to four cores, doing that TCP output path. And with the two-link lagg, this actually got us from like 1.1 gigabits reliably to where now we can more or less max those two interfaces out. So that was a really nice thing. We also started doing some profiling
with DTrace and PMCstat, and we found that we were paying actually a pretty hefty IPFW penalty on our outbound path, and we don't have any outbound rules. This is because by default, even if you don't have any rules, there's an accept rule, and then you have a bunch of setup
and teardown with IPFW. So Hiren added, I think it was like a two-line change, and we'll probably try and push this upstream if people want it. Just a sysctl to say, ignore any IPFW overhead on the outbound path. And we got an appreciable gain out of that as well.
Sean did this PLPMTUD implementation. This was basically for when people are blocking ICMP traffic.
So, do you want to? Blocking ICMP. So this was something that I think, I don't know if it was a customer request or something that we just noticed in production, but that was a cool thing that we got knocked out.
CalloutNG, this was really fun. For some value of fun. The callout system was broken up through 10.1 release, and you don't actually notice this on, you know, if you're running a small fleet of systems,
the panics that you'll see from this are rare enough, but when we had such a large number of machines, we could actually daily see machines panicking. So this was, we didn't actually develop the fix, but we were kind of following along in the review and poking people and testing the patches.
So we think this is fixed in what will, 10 stable, what will become 10.2. So that was actually quite a bit of work, just figuring that out. And again, Sean and Jason were key in doing that. We're looking into TCP customization.
A lot of this will go upstream where we can, but some of it might be where we're kind of deviating from the spec or whatever. And then we're also doing MFCs of stuff sometimes early or sometimes if somebody can commit something to current and they don't, for whatever reason, want to MFC it, we'll pull it back on the upstream project.
So some of the insights of working with source, we want to always develop against head. We don't want to get into this situation that other vendors have gotten in where they're married to a release, then they have to do this huge drill to get back to the current release.
We want to know what's changing in head while it's changing so we can influence that and kind of sound the alarm or hopefully prevent problems from happening. So this is our LL head branch. Then we pull those changes back to our LL stable, which is following 10 stable.
When we're ready to ship this, we do an internal release engineering process. Basically this is running our build job, doing some smoke tests, and then deploying it to Canary hosts. And then finally we'll release this to our systems over a longer period of time. So again, one thing I'm kind of reiterating here
is these feedback loops. This thing called the OODA loop is kind of an interesting way to think about it. Basically it's like observe, orient, decide, act. So we want to kind of see what's changing, get ready, position either the people or the machines to do what they need to do,
do the work, and then make sure what we did is effective. That's all we're doing on a lot of this stuff, either in operations or in development. So where I'm at now, what I want to do is kind of identify and support key features in the community at large.
So there's a couple of ways we're trying to do this. We're trying to kind of look out and see what features in FreeBSD we want to either push an agenda on or push our resources to implement. We want to support the community with finances, so we've made a donation to the FreeBSD Foundation.
Internally, we want to show the company that we're doing good work, that our BSD people are effective, and I think we're doing a good job of that. We've got a relatively small number of people versus the footprint and the impact of these systems, and I want to bring other people in the company
into that fold and help them use these tools to do the same. And how we'll do that, we want to empower our service owners to do cool stuff. The base system, again, it's incredibly observable. You can kind of figure out what it's doing and how you assemble it to make
whatever you're trying to actually do be efficient. Poudriere and packages are huge for developers when you're pulling in libraries or whatever. You don't get stuck on ancient versions or whatever. You have a ton of control in figuring out how you want to manage your dependencies in your programming language environments.
And then SaltStack has also been massive. This is something we want to push as a self-service out to the groups that are doing product development. So those four things are where we're at today. Where I'd like to go is really kind of around jails and iocage.
This is kind of stuff I've been playing around with on my own time, but what I think would be cool is to kind of detach the metal OS from the user land. So as a source team, we can start evolving this stuff that's touching the hardware faster than the product guys can validate their own changes.
The reason we want to do that is we're trying to test and minimize the amount of releases we have in production. So when we're doing driver work or whatever, these guys don't care too much about that. They just need it to work. But we need to keep their ABI compatible and everything. So for instance, I can envision in the near future,
in the next year or so, we'll want to start deploying 11 to production, and if we can do that without rebaking all of this user land stuff, that might be interesting for a migration period, and you can support that for a couple years or whatever.
ZFS is kind of instrumental to the jails thing. You want to be able to push jails around to work around hardware problems or data center migrations, things like that. So I already mentioned this. This was actually a lot of work in a corporate environment.
You have to figure out how you can make people understand that a good idea is a good idea. Luckily, we had a founding engineer at the company that was able to help us kind of make that case and get our name up here. So that's the end of my deck, and just the one thing I want to say is don't be afraid to push BSD to production
in these type of roles. It's fine. A lot of people are doing it. There's plenty of resources out there, plenty of mailing lists and things that you can go to to reach out for help. And if you're doing this, I hope to see more talks about stuff like this because I think it's something that we're kind of quiet about right now.