FreeBSD Operations at Limelight Networks (part 1 of 2)


Formal Metadata

Title: FreeBSD Operations at Limelight Networks (part 1 of 2)
Subtitle: An Overview of Operating at Internet Scale
Title of Series:
License: CC Attribution - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor, and the work or content is shared, also in adapted form, only under the conditions of this license.
Release Date

Content Metadata

Subject Area:

Abstract:
In this talk, we'll look at Limelight's global CDN architecture and the practice of large-scale web operations with FreeBSD. We'll investigate how FreeBSD makes these tasks easier, and the strategies and tools we've developed to run our operations. We'll then look at why the engineering team chose SaltStack to further improve our operations capabilities and reduce deployment and fault-handling times. Finally, we'll finish up with an overview of metrics and monitoring at scale with Zabbix and OpenTSDB. Limelight Networks is one of the "Big Three" CDNs and runs its edge using FreeBSD.
OK, I'm going to get started. You're here if you want to hear about FreeBSD at scale in operations. Just a quick shout-out to the other Limelight folks here: I'll be talking this morning. John is here, a source committer; Hiren is here somewhere, another source committer; Jason is back there, and Kris is back there as well. They have various roles at Limelight on the engineering side, and Johann, as a contractor for us, has been doing cool stuff with us at Limelight.
So that's actually more or less the totality of our BSD effort; we've got a couple of other people, and I'll touch on that a little bit. Just as an introduction to what Limelight is: we are a CDN, and this is a cute graphic our marketing folks came up with, but basically what we do is put servers close to users. These are in data centers that are rich with eyeball networks, and we backhaul over our own fiber backbone. This actually differentiates us from most other CDNs, which are generally going over Internet transit or some private carriers; for instance, they'll put an appliance in an ISP's location and then backhaul over the ISP's network. This kind of lets us get over the turbulence of the internet, and we can also accelerate non-cacheable content over our backbone. We do have some other services aside from content delivery. We have video, and we've got a pretty comprehensive system around that; it's basically like a private YouTube that you can drop into a site, and a lot of local news channels, for instance, use it. We've also got object storage, which is similar to S3; it's much more targeted at being an origin for a caching service, but people do use it as a generic object store. And we've got DDoS mitigation that can be used either with our content delivery products or as a network defense, as long as we can take control of the front end. As far as numbers go, we're somewhere north of 10 terabits of egress at this point, of actual bandwidth, and that's peering, transit, and paid peering. We're pretty big in the CDN market; we're generally between number 1 and 3, depending on what you measure and the time of year. And we have somewhere north of 100 data centers. Again, these are just PoPs in large metro areas with lots of fiber and, hopefully, lots of eyeball networks.
A PoP looks pretty bare; there's not a lot going on inside of them in terms of the equipment and gear we've got. We basically run a local fiber loop between data centers: generally we don't go into just one data center in a metro area, we'll have two or three. The DWDM gear lets us cram something like ten 10-gigabit lines over a single pair of fibers, and that creates a link between them, so we basically treat all of those data centers as one point of presence, and we get a little bit of redundancy out of that. At the actual data centers we have a pair of, generally, the largest routers you can get from somebody, with the full route table, and this is where our peers come in, and our transit. Behind that we'll either have a couple or more large chassis switches (these look just like the routers, like three quarters of a rack, with tons and tons of 10-gig ports going out to the systems), or we're pulling 40-gig off to a spine network, generally using 48-port switches that go at the top of the racks. There are pros and cons to both approaches; price usually dictates which we do, as well as the size of the PoP. We've got a ton of servers that look just like this; like a lot of people, we're in the Supermicro camp. We generally throw one CPU into these. This is good for FreeBSD, because we don't handle NUMA well yet, so there's just a single NUMA node. And we're using all SSDs at this point on these edge boxes: Samsung, and I think we've evaluated Micron as well. Generally those are 480 GB at this point, and we're looking at going up to terabyte-class SSDs, because that affects our cache retention time; for long-tail content we can get faster throughput the more space we have. On the back of this thing, it's actually two servers in the 2U. The reason we do this is we get four extra drives in the 2U versus 1U servers. It does cause some problems with asset management; we've mostly worked that out, but for instance, if you pull one of those nodes and put a new one in, it can be a bit of a pain. It's worth it for the four extra drives, though. On the back, generally we've got, at this point, an Intel 10-gig NIC that drops into this little slot right here. We're trying to work with Chelsio right now to see if we can get a Chelsio board to go into this thing, because if you don't populate the second CPU socket on these Supermicro boards, you don't get to use these slots, unfortunately.
[In response to a question about power efficiency:] Sure, I don't track that; nobody on my team does. They're at a little bit higher power level than that, but I would assume so. We are trying to get more and more efficient, so that will be part of the effort, but at this point it's purely performance-driven, given how much more we can do with the SSDs. And SSD prices have dropped to the point where they're big enough and cheap enough that it doesn't matter. We don't do colo; well, we only have a couple of our own data centers, so we don't care too much about that, as long as the data centers do their job. So again, what actually motivated me to do this talk: a lot of people talk about embedded uses, and a lot of appliance vendors talk about FreeBSD, but I haven't seen a lot of people talking about large-scale installations, and there are a few of those out there. So I want to show you what we do, and hopefully people can learn, or be motivated to come and talk about their own stuff. The main difference between an ops type of workload and an appliance workload is that the systems are very fluid. These things are changing quite regularly, both in terms of software and in terms of configuration, and we're pushing configuration several times a day, for customer turn-ups, or test packages, or whatever the case may be. This is very common; this is all the hot stuff you see in startups, large websites, API-centric companies, and service providers, and they're all in this category. And the workload is basically Internet-facing. We're not like a storage appliance that has to have 100 percent availability because a ton of servers are hanging off of it; we've got lots of cheap nodes, and we can deal with failure in different ways.
This is more or less the "about me"; I think it's kind of important before we get to the other slides. I was a Linux guy for 10+ years, and very deep into that culture. Although I was doing it professionally at work, I kind of played around with other operating systems; when I was still in high school, I switched to OpenSolaris when that started gaining traction, and I would play around with other OSes just for fun. I was kind of curious about the design tradeoffs and why people do things, and also what roles old hardware can play. So I start at Limelight Networks, and I'm intrigued by the FreeBSD edge, because this is our bread and butter: there are over 10 thousand machines, and there's not a lot of people doing anything to make that happen. I'm curious because, on the Linux side, at Limelight and other companies I've been at, there's a ton of people per whatever measurement you want to use, per X number of servers. At Limelight that wasn't the case; there was maybe a handful of people really involved in the design and implementation of this CDN. That kind of piqued my interest and got me going on the BSD stuff, and when I started digging, what I found was that the software and the mindset were really responsible for that, and that stuck with me. I'll try to explain more of that as I talk about some of the tools we use, and hopefully it makes more sense then. One motif to keep in the back of your mind while I'm doing this: observability trumps everything else. This is kind of stolen from Brendan Gregg; he mentioned it, I think, in the context of tracing and figuring out how software works. I actually think it goes even deeper than that. We're talking about how BSD pulls you into the source tree: you, for instance, know your compiler, at least what it is, and what it's calling out to in terms of other utilities. You know the system; you know what's part of the distribution. It's not just this substrate that you're trying to fire JVMs on top of and be done with it; you actually get involved in your operating system.
So I'll dive into some of our tool choices. These are pretty wordy slides, so feel free to interrupt me. We use Zabbix, and we're generally happy with it. It was somewhat hard to scale, because it uses a relational database to keep track of all the incoming values, and the answer to that was Fusion-io: we run MySQL on top of Fusion-io, and it works well enough for the current workload. The key insight here (as an aside, I wouldn't necessarily say use Zabbix unless you're a small or medium shop; it's a little bit pushing it for what we're doing) is to use an API-driven monitoring system. There's a couple out there, more than that even. Just make sure that the way you're interacting with the monitoring system is not writing config files manually; you want to be pushing configuration into it, and that should ideally be part of your configuration management toolbox. I'll get to that when I talk about Salt. Operationally, monitoring has to be part of your entry into production. If you have people putting customer-facing stuff up without monitoring, you're going to have a bad time: you're going to have problems, there's going to be this fire drill, and then you're going to wonder why you didn't do it right to begin with. This is something we've learned a few times over; I think we've gotten a little bit better at it recently. The other place we want to go is getting monitoring to be part of our testing and QA. A lot of people write QA toolkits, or what have you, and you have to run unit tests and integration tests; but when you're doing ops, you actually need to think beyond just the piece of software. You need to think about how it's deployed and how it integrates with other micro-services, or databases, or whatever the case may be. The answer to that, we think, is to plug in monitoring: that's what's going to tell you when something's wrong in production, and if you can catch those errors in testing instead, then you have a nice little feedback loop. And, just as part of this: don't use Nagios anymore. It's not very good; we can do better than that as an industry.
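As a concrete illustration of pushing configuration in through the API with Zabbix: its API is JSON-RPC over HTTP, and registering a host at turn-up time looks roughly like the sketch below. This is a minimal, hedged example; the endpoint URL, group ID, and auth token are placeholders, and in practice you would POST these bodies with an HTTP client and check the result.

```python
import json

ZABBIX_URL = "https://zabbix.example.com/api_jsonrpc.php"  # hypothetical endpoint

def jsonrpc(method, params, auth=None, req_id=1):
    """Build a Zabbix JSON-RPC 2.0 request body as a JSON string."""
    body = {"jsonrpc": "2.0", "method": method, "params": params, "id": req_id}
    if auth:
        body["auth"] = auth  # session token from a prior user.login call
    return json.dumps(body)

def host_create(auth, hostname, group_id, agent_ip):
    """Payload registering a new edge host, so monitoring is part of
    production turn-up rather than a hand-edited config file."""
    params = {
        "host": hostname,
        "groups": [{"groupid": group_id}],
        "interfaces": [{
            "type": 1,        # 1 = Zabbix agent interface
            "main": 1,
            "useip": 1,
            "ip": agent_ip,
            "dns": "",
            "port": "10050",  # default Zabbix agent port
        }],
    }
    return jsonrpc("host.create", params, auth=auth)
```

The same pattern (user.login, then host.create, item.create, and so on) is what lets monitoring registration live inside the configuration management pipeline instead of in manually maintained files.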
Metrics are the other half of this: more or less time-series data coming into some type of scalable database. We have OpenTSDB in place right now. I'm not really happy with it; I was involved in trying to get it running on FreeBSD a few times and didn't get very far. So basically what you have is a metric dumping ground: something that's easy to put a ton of data into, and not really anything to get good stuff out of. I think there are better answers here. One of the things we've been experimenting with is a startup product that's kind of a hybrid hosted and on-site application; I can tell you all about it afterwards if you're interested, but it's actually pretty cool. It's a dataflow language. Dataflow programming has been around for a long time, but they kind of put it right there in your face, and if you've ever used Splunk, it's the next level beyond that. For instance, here they're showing querying an asset database, and basically the question was: using these metrics, our average response time and our kilobits per second, how can we see how different hardware models are influencing that? In this example, this particular device is doing quite a bit better than these other devices, and somebody looking at this could make a case to say we should deploy a lot of these and deprecate those, because it wins us business, or whatever. Metrics are a pretty important thing for making decisions at scale; I can talk a lot more about this, so ask if anybody's interested, but let me move on. Basically, we feed in a ton of stats coming off a server. We're using just a program called collectd; it's a C agent with plugins, and it's looking at things like your CPU usage, load average, and memory.
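Samples like these typically reach OpenTSDB through its telnet-style line protocol, one "put" command per data point. A small sketch of the formatting (the metric and tag names here are just illustrative):

```python
import time

def tsdb_put(metric, value, tags, ts=None):
    """Format one data point in OpenTSDB's telnet-style protocol:
    put <metric> <unix_ts> <value> <tagk=tagv> [tagk=tagv ...]
    Tag keys are sorted so the emitted lines are deterministic."""
    ts = int(ts if ts is not None else time.time())
    tag_str = " ".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"put {metric} {ts} {value} {tag_str}"
```

A collector would write these lines to the TSD's TCP port (or batch them via the HTTP API); tags like host and data center are what later let you slice response time by hardware model or PoP.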
Then we try to tie application metrics into this. This requires the application developers to get involved, but they can push up things like transactions per second, or some type of average or percentile response time, things like that. Once we get it into one of these systems, we can query it. This is actually the bare-bones OpenTSDB interface; there are better ones, like Grafana. Basically, what you then do is try to correlate things. This is actually a brilliant example (can I correlate server model to response time?), but maybe I want to look at backbone saturation versus response time, or swap-ins versus response time, things like that. When you have the data, you can start asking questions, and with a scalable database you can ask them post facto, so you don't lose out: after an incident, you can go back and ask why something went wrong, or went perfectly. [On log analysis:] Sure. As I said, it's not quite metrics, because basically both of these things are taking in log data; for instance, you push in syslog or application logs, and then they have indexes that can put that into an efficient structure, so you can query it and roll it up into different things. A lot of the time you can turn that back into metrics; for instance, we can use these to get metrics off of access logs. One of them is more or less equivalent to Splunk, just open-source. The other thing you can do here is just query: if you're looking for, say, a panic that's coming off a server's syslog, you can go in and try to make deductions based on that, like trying to correlate against kernel versions, things like that. These two are more textual; really, they're how you deal with logs at scale. A person can't go view the log output of 10 thousand servers, it's just overwhelming. So what you try to do is get it all in here, and then look for anomalies, or create searches for certain conditions, things like that. You can use that to then feed an alarm in your monitoring system, but by itself it's very free-form; it's like a search index for text. OK, let's move on.
Something we've invested a lot of work into in the past year is configuration management. We were a CFEngine 2 shop, and then we had some Chef through acquisitions, but we did kind of peak off. We looked at what was out there and what would work for our implementation; we found Salt, and we've been pretty pleased with this decision. The key insight with Salt is that you have configuration management built on top of an orchestration bus. Rather than running your CM system on a schedule, or from cron, you actually have agents permanently running on the systems, and they're always connected to master systems. This is kind of interesting, because you can react to different events: for instance, when CM runs on one system and something changes, that can push something over the bus and make something else happen, such as adding a host to a load balancer, in real time. You don't have to do this on asynchronous schedules. I gave a talk at SaltConf where we go really deep into how we deal with changes to the CM system itself. We basically have a workflow where we have a steady state, and when somebody wants to change that policy, we spin up a new salt master in a container, let them point their machines at it, and verify certain changes in a sandbox environment, or even in production. When that's ready, it's accepted and promoted, and that becomes the new steady state. This has been pretty cool. Basically, what you're trying to do with configuration management, if this is new to you, is move system state from something like shell scripts or interactive input into declarations. You want to describe what a machine is, what a machine is supposed to do, rather than step by step how it is to do it, and then let the system figure out what's changed, or what needs to be changed and in what order, to make it do that thing. So basically, policy is greater than implementation in configuration management. With Salt, or with most of these systems, you can do things programmatically when you need to, and one of the key insights is that you want to build those programmatic structures up so you can then use them in your declarations; Salt makes this really easy. This slide is a state that deploys ntpd using a map file; it works on our FreeBSD hosts, our Red Hat hosts, and our other hosts. That's kind of what you can do with CM: abstract things out a little bit and make it easy to understand what a host is doing at an abstract level. The other thing we get with Salt is the orchestration bus. A kind of neat example we had recently: we ran into some weirdness in the TCP stack, where a customer with a very bad network was sending out-of-order packets in the initial burst, and then actually sending ACKs left of the window, and there's an RFC, which none of us knew about, where this is supposed to be a good thing. We wanted to see how prevalent this was in production, to gauge the severity, so we wrote a DTrace script and actually ran it on 2000 production machines; we just watched a counter for 10 minutes, and we found out it's actually very, very rare. That helped us triage it, from "whoa, we'd better get a handle on this real quick" to "OK, we can take our time to figure out what's actually going on here and how we want to fix it." I should pause here on Salt for any questions or comments. [On custom modules:] Yes, we do. I'm trying to think of a good example; we sync SSH keys out to the edge, and that's one I wrote, so it's at the top of my head. The module goes and makes an LDAP query for the SSH attribute in the directory, and then pumps it out to the master; the master can use the salt file server to push it out to the edge nodes. That's just the way we log into our systems.
We've also written modules for different services; I can't think of one offhand, but this workflow itself, how this thing spins up containers, is a couple of Salt modules. I am really pleased with Salt. Everything's pretty straightforward; the docs are a little bit hard to get started with, but once you kind of grok it, it's pretty easy to keep going. [On transport and bandwidth:] It's ZeroMQ underneath right now, and they're actually working on a dedicated transport to make it even more optimized, but as far as bandwidth goes, there's nothing noticeable; I mean, the masters have been up for a couple of months with these large client counts, and the overhead has been tiny. Sure: we've got one master, just a single master right now, with a total of like 2000 hosts on it, and we've got a couple of other pools. That's doing fine; it's handling all the encryption and everything. You'll see the CPU spike a little bit, so you don't want to skimp on hardware there, but beyond that I think you'll be all right. For the masters, we went dual-CPU, whatever the current generation is, and the RAM doesn't really matter much, but we've got like 128 GB in there. We also went with SSDs, just because we didn't need a lot of space and they're affordable for us. [On Ansible:] Sure, I looked into it on my own; we didn't consider it for our work. We looked into Chef, CFEngine 3, and Puppet aside from Salt. What I saw in Ansible was a lot of the same thing, but it didn't do the bus thing that we really like; that's kind of the key insight for us. I think it's a great configuration management system, and it's really easy to get started (the docs are fantastic); it just seemed to me that Salt was a better fit for what we wanted to do. [On CFEngine 3:] We didn't see a really tremendous gain going from 2 to 3. What we wanted was easy templating (that would actually be a lot more work in CFEngine 3) and ease of writing custom modules. We've got a lot of people that know Python to varying degrees, so the barriers to changing both the server and the client are pretty low, and most of us that have worked on the Salt implementation have actually been contributing patches, drive-by patches, to Salt upstream. You just go in and do it, and you're done for the day; you don't have to climb a huge learning curve. I think that's a key win over some of the other systems, where they start bifurcating: the agent's in Ruby, and now you've got a Clojure server, or whatever, and it starts making things harder for casual development.
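The map-file idea behind that ntpd state (one abstract declaration, with platform-specific names resolved by a lookup) can be sketched in plain Python. To be clear, this mimics the spirit of Salt's grains.filter_by; it is not Salt's actual state syntax, and the package and service names below are common defaults, not necessarily the exact ones Limelight uses.

```python
# Per-OS-family map, similar in spirit to a Salt map.jinja consumed by an
# ntpd state: the state declares "install the NTP package, run the NTP
# service" once, and the map resolves the platform-specific names.
NTP_MAP = {
    "FreeBSD": {"pkg": "ntp", "service": "ntpd"},
    "RedHat":  {"pkg": "ntp", "service": "ntpd"},
    "Debian":  {"pkg": "ntp", "service": "ntp"},
}

def lookup(os_family, default="RedHat"):
    """Mimic grains.filter_by: pick the entry for this host's os_family
    grain, falling back to a default when the family is unknown."""
    return NTP_MAP.get(os_family, NTP_MAP[default])
```

The payoff is exactly the abstraction described above: the declaration stays identical across FreeBSD and Linux hosts, and only the small map knows the per-platform details.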
actually get free BST onto our edge machines this is changed in recent times and we're trying to get a little bit more formal with that because we've got source committers on stuff that we're starting to do more interesting stuff on so basically we we use get at why as for all system so we're using our that there's a semi-official get part of the mirror of of the Free BSD as the entry so we have 2 branches we have had and stable and then these follow you know as head and and currently 10 stable and what we're doing here is taking out were deploying 10 stable but we developing its head because we wanna kind of stay ahead of the curve and make sure you know that what we're doing is going to be fine when the next release comes along and so we take these 2 branches and we grind them through a Jenkins job this produces are images that actually go out to the edge and all kind of go off on a tangent here and so we have this big thing that this is actually part of our soul deployment were taking these images and pushing them out as big boxes so developers can run this stuff on the laptop the insight here is we want them to have a very low barrier to entry to writing configuration management and working with our actual production images you know if you're developing against a vanilla image and that might not have all of our customizations maybe you don't run into a problem early enough and it becomes a problem that kind of thing so with figures were able to actually get very low very order entry to our very you know very production looking environments and this is all the big feedback loop of Packer is a thing that we use to make those box files so low but more important in the Linux world because we just have these ISO images that we have to enhance with our changes to packages and configured but in free BSD we've got to build systems so we can't do everything to their or go more into are parts or stuff in a bit but so phase 2 after I've been at 1 might for a little over a 
year, what I saw was that this BSD stuff was working, but we needed to do more of it and we needed to be deliberate about it. So we brought on Sean, and he's been helping us upstream all the things. We had a stack of patches, not a huge list, not like some of the appliance people, but enough that we wanted to get that stuff either fixed upstream, or at least reported upstream so it could be fixed, perhaps in a better way. We're also trying to get better about how we actually use the ports tree and build packages. This is an ongoing thing, but the key here is that poudriere and pkg(8) are really awesome; I think they're the best software packaging experience I've seen on any operating system today. Again, this is all about being very deliberate about what we're doing. A lot of things up to this point were done just because they had to get done, and now we're trying to take a look and say, OK, here's how we should do it going forward, and we'll be more efficient. So how did we start a source team? For instance, I found Sean on the freebsd-jobs mailing list. It's pretty low volume, but you can either post your résumé there or post your job req there. You can also come to conferences like this and look for people who are doing stuff, and of course if you do cool stuff yourself, generally people come to you; we're trying to do that, and we're getting better at it. There are plenty of people using BSD out there. As for the benefits of starting the source team: we were on FreeBSD 8 when we started, but 9 had come out, 10 had come out, and getting from 8 to 10 was actually a lot more involved than we thought, even with our small patch stack. As an operator it was quite a bit of work, because there were actually bugs in the 10.0 and 10.1 releases that we had to work through. And then we had a binary blob
that we actually deployed to production: we'd bought a pluggable congestion control module before that was a thing in FreeBSD. It does some network magic, and we had to figure out how to keep the interface consistent so we could keep using it through the 10 life cycle, while we figure out what we want to keep from it and implement ourselves where we can.
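For context, FreeBSD's pluggable congestion control framework exposes the algorithm as a loadable kernel module selected via sysctl, which is what lets a third-party module like the one described slot in. A minimal sketch of what that looks like; the module name cc_vendor here is a hypothetical stand-in, not the actual blob from the talk:

```shell
# /boot/loader.conf -- load congestion control modules at boot.
# cc_cubic ships with FreeBSD; cc_vendor is a hypothetical third-party module.
cc_cubic_load="YES"
cc_vendor_load="YES"

# At runtime, list what's loaded and select one system-wide:
#   sysctl net.inet.tcp.cc.available
#   sysctl net.inet.tcp.cc.algorithm=vendor
```

Because the selection is just a sysctl against a stable module interface, swapping an in-house implementation in later doesn't disturb the rest of the stack.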
As far as source changes, some of the other things we've done: Sean worked on multi-queue support in the em(4) driver. em(4) drives the gigabit-class Ethernet controllers from Intel, and it only used one queue, and what we saw was that a lot of our machines were getting stuck in the TCP output path. What he found, through reading some of the manuals, was that you could split this out to at least two queues on some of the chips, so now we can get two, I think up to four, cores driving that TCP output path. On our two-link lagg setups this took us from reliably about 1.1 gigabits to being able to more or less max both interfaces out, so that was a really nice win. We also started doing some profiling with DTrace and pmcstat, and we found we were paying a pretty hefty ipfw penalty on our outbound path. We don't have any outbound rules, but by default, even if you don't have any rules, there's an accept rule and then a bunch of setup and teardown in ipfw. So I added a thing; it's like a two-line change, and we'll probably try to push it upstream if people want it. It's just a sysctl that says to skip any ipfw overhead on the outbound path, and we got an appreciable gain out of that as well. Sean also did the PMTUD blackhole detection implementation: basically, if people are blocking ICMP traffic, path MTU discovery can't make progress, and blocking ICMP like that is unfortunately common. I don't know if it was a customer request or something we just noticed in production, but it was a cool thing that we got knocked out. The callout(9) stuff was really fun, for some value of fun: the callout system was broken up through the 10.
1 release. You don't actually notice this if you're running a small fleet of systems; the panics you'd see from it are rare enough. But with such a large number of machines, we could actually see machines panicking daily. We didn't develop the fix ourselves, but we were following along in the review, poking people and testing the patches, so we think this is fixed in what will be 10-stable and what will become 10.2. That was quite a bit of work just to figure out; again, Sean and Jason were key in doing that. We're looking into TCP customization; a lot of this will go upstream where we can, but some of it might be places where we're deviating from the spec or whatever. We're also cherry-picking stuff, sometimes early, or sometimes when somebody could commit something to current but doesn't for whatever reason, we'll pull it back from the upstream project. Some of the insights from working with source: we want to always develop against head. We don't want to get into the situation other vendors have gotten into, where they're married to a release and have to do a huge drill to get back to the current release. We want to know what's changing in head while it's changing, so we can influence it and sound the alarm earlier, or hopefully prevent problems from happening. So this is our ll-head branch; we pull those changes back to our ll-stable branch, which follows 10-stable, and when we're ready to ship we run an internal release-engineering process. Basically that's running our build job, doing some smoke tests, deploying to canary hosts, and then finally releasing to our systems over a long period of time. Again, one thing I keep reiterating here is these feedback loops. There's this thing called the OODA loop, kind of an interesting way to think about it: observe, orient, decide, act. We want to see
what's changing, get ready, position either the people or the machines to do what they need to do, do the work, and then make sure what we did was effective. That's what we're doing with a lot of this stuff, whether in operations or in development. So, where
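The flow just described (develop on ll-head, pull changes back to ll-stable, then build, smoke-test, canary, and roll out slowly) can be sketched as a toy script. Everything here, function names and host groups alike, is invented for illustration and is not Limelight's actual tooling:

```shell
#!/bin/sh
# Toy sketch of the staged release flow: ll-head feeds ll-stable, and a
# release is built, smoke-tested, sent to canaries, then rolled out wide.
# All names are illustrative stand-ins.
set -e

build_image() { echo "image built from branch $1"; }
smoke_test()  { echo "smoke tests passed"; }
deploy()      { echo "deployed to $1"; }

build_image ll-stable      # the Jenkins job in the real pipeline
smoke_test                 # basic sanity checks on the image
deploy "canary-hosts"      # watch a small set of machines first
deploy "full-fleet"        # then release over a long period of time
```

The OODA framing maps onto the last two steps: deploy to canaries (act), watch them (observe/orient), and only then decide to continue the rollout.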
I'm at now: what I want to do is identify and support key features and the community at large. There are a couple of ways we're trying to do this. We're trying to look out and see what features we want in FreeBSD, to push an agenda on or push our resources toward implementing. We also want to support the community financially, so we've made a donation to the FreeBSD Foundation. Internally, we want to show the company that we're doing good work and that our people are effective, and I think we're doing a good job of that: we've got a relatively small number of people versus the footprint and the impact of these systems. I want to bring other people in the company into that fold and help them use these tools the same way. How will we do that? We want to empower service owners to do cool stuff. The base system, again, is incredibly observable; you can figure out what it's doing and how to assemble it so that whatever you're trying to do is efficient. Ports and packages are huge for developers when pulling in libraries or whatever: you don't get stuck on ancient versions, and you have a ton of control in figuring out how you want to manage dependencies in a programming-language environment. And Salt has also been massive; that's something we want to push as self-service out to the groups doing product development. So those four things are where we're at today. Where I'd like to go is really around jails and iocage. This is the kind of stuff I've been playing around with on my own time, but what I think would be cool is to detach the metal OS from the userland, so that as a source team we can evolve the stuff that's touching the hardware faster than the product guys can validate their own changes. The reason we want to do that is that we're trying to test with, and minimize,
the number of releases we have in production. When we're doing driver work or whatever, those guys don't care too much about that; they just need it to work, but we need to keep their API compatible and everything. For instance, I can envision that in the next year or so we'll want to start deploying 11 to production, and if we can do that without really baking all of the userland stuff, that might be interesting for a migration period; you could support that for a couple of years or whatever. ZFS is kind of instrumental to the jails idea: you want to be able to push jails around to work around hardware problems, or do dataset migrations, things like that. And so I already
mentioned this, but it was actually hard work. In a corporate environment you have to figure out how to make people understand that a good idea is a good idea, but luckily we had a founding engineer at the company who was able to help us make that case and get it through. So that's the end of my
deck. The one thing I want to say is: don't be afraid to push FreeBSD into production in these types of roles. You'll find people doing it, and there are plenty of resources out there, plenty of mailing lists and things you can go to to reach out for help. And if you're doing this, I'd love to see more talks about stuff like this, because I think it's an important market segment that's kind of quiet right now.

