Skynet your Infrastructure with QUADS
Formal Metadata

Title: Skynet your Infrastructure with QUADS
Title of Series: EuroPython 2017 (part 11 of 160)
Author: Will Foster (Red Hat)
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared, also in adapted form, only under the conditions of this license.
Identifiers: 10.5446/33814 (DOI)
Language: English
Transcript: English (auto-generated)
00:08
This talk will be in English, thank God, because my Italian is terrible, as you can tell, and I apologize if I've offended anybody in the audience just now.
00:22
So my name is Will Foster, I'm a DevOps engineer at Red Hat, and today I'm going to be talking about an exciting new Python-based framework called Quads that we've developed in-house that solves some of the problems that we have, and I'm just going to get right into it. So before I kind of explain what Quads is and how we've built this sort of framework
00:45
to solve some of our infrastructure and automation problems, I want to explain what I do at Red Hat. I'm on a very small team of two people, and there's not enough car analogies on the internet. In fact, there's never been a car analogy used for open source.
01:02
It's never happened. So I'm going to use a car analogy here to kind of explain what I do and what my other colleague does on the DevOps side. So I want to talk about high-performance computer servers as race cars. Very high-performance, fully-specced, the latest Intel and AMD chipsets as race cars.
01:28
And high-performance networks would be the race tracks. And the race car races that run on these tracks would be performance and scale testing of various open source products that Red Hat develops and also upstream things like
01:44
OpenStack and OpenShift and Kubernetes and different types of technologies that we want to test and vet at a very large scale. And the actual drivers driving these servers, these race cars, on these fast tracks, the 40 gig and 100 gig networks, are the race
02:07
car drivers. They're the scale engineers. And that's a pretty cool analogy to have: if someone asks, what do you do? Well, I drive race cars. So it's kind of throwing a bone to the core performance and scale engineers by calling them race car drivers. But I look at myself and my colleague on kind of the DevOps side as we're the pit crew
02:24
and we're the track engineers. And our goal is to make as many races happen all the time as efficiently as possible without any wrecks or explosions. And those do happen, which I'll get to. And this tool, Quads, helps us automate the entire thing, including writing our documentation
02:43
for us, configuring VLANs on Juniper and Cisco switches. And the full life cycle of provisioning bare metal servers, spinning them up, passing them to an engineering group for product and scale testing, and then spinning them down when they're done. So if this was either a terrible analogy or an awesome one, I have a very simplified
03:04
one. And this is the Reader's Digest version: basically, we manage 300 or so high-performance servers and switches in a large infrastructure. And this infrastructure accommodates parallel product testing. And it's really composed of isolated sets of machines.
03:22
We refer to them as clouds, because we're not very creative, for the different workloads that happen simultaneously. And with Quads, we have basically automated our entire jobs. We've automated ourselves out of a job. And instead of spending the time being network engineers, being systems folks
03:41
that have to deploy servers, we've automated all of this with Python and we instead spend our time on actually improving the automation. So what is Quads and what isn't it? What sort of things does it do and not do? Well, it's not an installer. It's not a provisioning system. It bridges several interchangeable tools together.
04:03
I mentioned Foreman because that is the backend provisioning vehicle that we use. But we designed Quads in a way that if you have an existing provisioning system or workflow or anything that you're used to, you can plug that into Quads. Quads will simply call out to your provisioning system to handle re-kicking machines, re-provisioning
04:22
machines or pushing image-based deployments across a lot of servers. And it also helps us automate basically the boring things that maybe you do once or twice and it's exciting, but you never want to do again. I love network engineering. I love connectivity. I like switches and firewalls, but I don't want to do that for a living.
04:41
I would rather automation do it for me because it makes a lot less mistakes and it's frankly better at it. But basically our goal is to build a system that orchestrates and builds all the other systems and only spend our time maintaining that automation framework and trying to waste as little time as possible being hardware people or being network people or storage
05:04
people because, you know, it gets boring when you do just one thing. So what is Quads from kind of a high level? We drive everything with PyYAML. So the idea, the top-level idea is that for every asset in our infrastructure we
05:20
have a YAML-based schedule that tells it what it's supposed to be doing, from what start date to what end date, and for what isolated work group or assignment. And this way we can programmatically schedule things in the future. You know, in an ideal world you would know the development schedule of all of your engineering groups.
05:41
You would say, you know, in November team A is going to be releasing a beta of project Y and team C is going to be doing the same thing. And if you knew what people's needs were ahead of time, you can schedule compute and network resources in advance. The reality of it is none of that ever happens like you would like it to.
06:03
Deadlines shift, there's different hold-ups and blockers with bugs and different projects and how you perceive the world to ideally be is never actually how it turns out to be. So we've baked in a little bit of resilience into how we schedule things in the future.
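To make that concrete, here is a minimal sketch, not the actual Quads code, of what date-driven scheduling boils down to: each host carries a list of assignments, and the tooling only ever asks what a host should be doing at a given moment. The field names and the cloud01 spare pool are assumptions for illustration.

```python
from datetime import datetime

# Hypothetical per-host schedule entries: start, end, isolated environment ("cloud").
schedule = [
    {"start": datetime(2017, 6, 28, 22, 0), "end": datetime(2017, 7, 6, 22, 0), "cloud": "cloud02"},
    {"start": datetime(2017, 7, 9, 22, 0), "end": datetime(2017, 8, 6, 22, 0), "cloud": "cloud05"},
]

def assignment_at(schedule, when):
    """Return the cloud this host belongs to at `when`; fall back to the spare pool."""
    for entry in schedule:
        if entry["start"] <= when < entry["end"]:
            return entry["cloud"]
    return "cloud01"  # assumed default / spare allocation pool

# Shifting a deadline means editing one entry; the lookup logic never changes.
print(assignment_at(schedule, datetime.utcnow()))
```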
06:20
So a little bit more detail about how it actually manages this kind of programmatic, YAML-driven scheduling and provisioning. We set up the YAML schedule for server assets and then we automate basically the entire life cycle of a set of machines from beginning to end. So on a Sunday at 2200 UTC, if you're going to be receiving, say, 200 servers in what
06:47
we call a QinQ 0, a specific VLAN configuration that we support, automatically your machines would spin up, they would re-provision. Tooling would go out to each of the switches and configure your VLANs so you would have one isolated environment separate from the other engineering groups, and things like a Foreman
07:05
self-service account would be created, you would have your own IPMI credentials to get into the machines out of band, and then our documentation would be automatically generated to reflect the current state of what your machines are doing and who is
07:20
using all of the assets inside of an environment. So how do we use this internally at Red Hat? We have a large R&D environment called the scale lab, and this is where we test and vet all of our products: Red Hat Enterprise Linux, OpenStack, RHEV, Satellite, just
07:42
I don't even know how many products we have, but that's where we test them. And it's a very special place because all of the hardware is deemed high-performance scale gear, we use 100 gig networking across the board, it's all very high-end servers, and it's not a place for you to have a development test bed. It's a place for you
08:00
that if you are doing development work and you hit what could potentially be a scale issue with any sort of bits of the application stack and you're able to reproduce this issue on a smaller scale, this is where you go to run it at a very large scale so we can anticipate how customers using our software would do in the real world.
08:24
Ideally we would identify issues before customers hit them, but again, that's not always the case. So in our Red Hat scale lab we have about 300 servers, we have 40 or 50 high-performance Juniper switches, and right now we run about 16 to 20 different isolated scale and performance workloads on these systems for up to four weeks at a time.
08:44
And Quads helps us spin up these machines, hand them over to the appropriate people for a short-term lease, and then spin them back down again and put them into kind of an allocation pool. So I'll give you an example, and let's look at some pictures because everyone likes pictures. So this is an example of some of the scheduling that we do automatically.
09:03
So this is from February to May of this year and you can see how we've very efficiently done back to back scheduling of all of the machines in the environment. And you can see, you know, four to five parallel running workloads testing different products, different scale, different aspects of different products and all of this when
09:25
it's scheduled in advance all happens automatically. So we don't ever have to waste time. There's a two-person team, we got a lot of infrastructure so we don't have to waste time manually setting up any of this stuff, nor would you. Here's an example of, you know, some of the metrics that we've gotten from the lab.
09:43
I just, I kind of picked this at random. This is a storage workload, you know, and these are some of the results that we get out of the scale lab and quads kind of empowers us to do this sort of work. So we talked a little bit about the time savings and efficiency, but I want to drill down more into what problems we're actually solving here, besides the obvious.
10:05
So the first one is server hugging. Does anyone know what server hugging is? Okay, so server hugging is the idea. Once I explain it, you'll be like, oh yeah, I know server hugging. So server hugging is the idea that if you give someone a resource, they're going to hold on to it as long as possible until you pull it back from them.
10:25
Developers are very bad about server hugging, and there's usually more demand for resources than there are actual resources to give people. So if you're lucky enough that, you know, your manager is particularly savvy and he can fight for a budget to have this R&D dev test hardware and he's better than the
10:43
other managers at it, then you're going to have more gear to play on. But the sad reality is there's never enough bare metal hardware. Unless you're Facebook or Google or, you know, one of these behemoths, you're not going to have bare metal, high-end server hardware to run your code and test against any time you want it.
11:00
So server hugging is the idea that there's this tendency of people to hold on to things longer than they should, and it's a natural human thing. You're given something, you want to use it, and you kind of become protective over it. It's your pet, it's your area, it's yours, but it's not really yours. You got to share it with other people. So by having automated scheduling of server resources and network resources,
11:26
you sort of force people to be more efficient with their planning, you force them to maximize the time that they would have on a set of hardware, and you can save a whole lot of money and time by doing it this way.
11:41
So what are the other things that we solve when we automate? You know, there's less human error with more automation, which is good, and to a certain extent, you can kind of give control over to the machines, because, you know, what's the worst that's going to happen at that point?
12:09
There's obviously bugs in quads, and kill-9, terminate. There's obviously some rough edges in our software, as there are.
12:23
But, you know, the idea is that you automate as much as possible. The biggest area of devastation is on the network side. It only takes one typo to configure the wrong port and totally offline a machine or just cause straight-up havoc. So simply automating the network administration is a huge boon for avoiding errors.
12:45
The downside is that when automation fails, it normally fails in glorious ways. It's normally just a giant explosion, a slow-motion train wreck, when your automation actually fails, because it's so efficient at something, and if that something has errors, then it's going to be catastrophic.
13:04
So Dave Wilson, if you're watching this, I'm really sorry about your 50 machines that got eaten by our network bug, but it's fixed now. So what are the other things that we're solving here? We want to minimize idle machine cycles. Electricity is expensive, carbon footprint is always an issue,
13:21
and we want to automate and spin up machines only when they're... Oh God! Must reprovision human! Where were we?
13:40
So what we do is we power off machines when they're not in use, and only when they have an active schedule in their YAML config, do they actually come alive and participate in some sort of an automated workload, and sometimes they don't work like you'd like them to. But that's kind of the double-edged sword of automating things.
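A rough sketch of that idea, assuming the schedule shape from the earlier example and a stock ipmitool power-off; the management hostnames and credentials here are made up, and the real mechanics in Quads will differ.

```python
import subprocess
from datetime import datetime

def has_active_schedule(schedule, now):
    """True if any schedule entry (same shape as the earlier sketch) covers `now`."""
    return any(entry["start"] <= now < entry["end"] for entry in schedule)

def power_off(host):
    """Out-of-band power-off via ipmitool; hostname pattern and credentials are placeholders."""
    subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", f"mgmt-{host}",
         "-U", "admin", "-P", "********", "chassis", "power", "off"],
        check=True,
    )

def reap_idle_hosts(schedules_by_host):
    """Power off every host that has nothing scheduled right now."""
    now = datetime.utcnow()
    for host, schedule in schedules_by_host.items():
        if not has_active_schedule(schedule, now):
            power_off(host)
```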
14:03
So lastly, we want to solve the scheduling issues. We go with short-term reservations. So if you rent, say, 100 machines using Quads, and you get assigned those 100 machines in a particular VLAN configuration to do your development and your testing at scale, the maximum you can keep them is four weeks,
14:22
because we only have a couple hundred servers, but we have almost a month of queued-up wait time to get to those resources. So we want to be more like Airbnb and less like a hobo house. Airbnb has very defined guidelines. You can't stay in an Airbnb longer than four weeks, and it has uniform, for the most part,
14:43
kind of the things you would expect from an Airbnb. You know, you generally know what you're going to get. Maybe not, but it's a little more polished and professional than, you know, like, say, a hobo house. And the last thing that we really save here, and this is kind of the impetus for us to continue
15:01
to automate our jobs and then work on the automation, is the time savings and the cost savings. So we did some back-of-the-envelope math using 100 machines as an example: if 100 machines changed hands tomorrow and went from one development group working on something, or a set of developers with a specific scale
15:20
or performance problem they were trying to fix, what would be the cost and the time involved if someone was to do that manually? Now, granted in 2017, I hope no one is doing all of this by hand. I hope people aren't inserting an ISO into a server somewhere and someone's on an SSH console and a switch, and I hope people aren't doing it that way.
15:41
Maybe some people are. But assuming that you did do everything manually, it would take roughly 90 hours of work to provision 100 servers and pass them off to someone else. So for our current two-person team, that would be about 45 hours apiece, over a week of work, and if we tripled our team,
16:01
it would be about 15 hours or two working days. And if we had a 12-person team, we could maybe get it done in a day. So Quads does all of this on Sundays when most people aren't working and automates the entire thing in a span of two or three hours. So when Monday morning rolls around,
16:20
machines are already passed off, notifications are already sent to the users, they already have their own special credentials to access the machine that only they have, and then the clock starts ticking on the reservation. So I'm not going to drill into this, but this is kind of how we came up with this figure, and these are pretty conservative estimates of everything that we do.
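Reproducing that back-of-the-envelope math: roughly 90 person-hours to manually reprovision and hand off 100 servers, divided across the team sizes mentioned, versus the two or three unattended hours Quads takes on a Sunday.

```python
TOTAL_HOURS = 90  # rough manual effort to reprovision and hand off 100 servers

for team_size in (2, 6, 12):
    per_person = TOTAL_HOURS / team_size
    print(f"{team_size} people: {per_person:.1f} hours each "
          f"(~{per_person / 8:.1f} eight-hour working days)")

# 2 people: 45.0 hours each (~5.6 eight-hour working days)
# 6 people: 15.0 hours each (~1.9 eight-hour working days)
# 12 people: 7.5 hours each (~0.9 eight-hour working days)
```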
16:41
All right, so how does Quads actually do all of this? We've talked about the problems that we're going to solve with it, that we solved today. We talked about the level of automation and efficiency that we're able to yield, but how does it actually look on the back end? So I have made some rather grotesque topographical images for you.
17:00
This is not going to win any website awards, but we all remember Milton from Office Space. We're going to say he's your typical scale engineer and he needs hardware. So this is the kind of Quads architecture at a very high level. We have now a JSON API in front of it, but generally speaking there's also a daemon and also a CLI that you interact with.
17:22
At its very basic construct, there is a YAML schedule that is constantly modified by PyYAML, and I'll give you an example of that a couple slides later, but this is at the heart of how things are automated in the present and in the future. There's also some provisioning elements.
17:41
If we want to do any graphing, we have hooks into collectd and Grafana, and we can also send results to Elasticsearch as well, but this is more ancillary stuff that we would set up after the fact. Automated, of course. And then lastly, the consumable here, besides the actual machines, is the documentation.
18:01
At any point anyone can look inside of this Quads managed environment, they can see what all the machines are doing, what does the utilization look like, who has the machines, what are they working on, and for how long they'll be working on it. So when you provide transparency like this, it's easier for people, they don't have to ask you,
18:20
you got any spare machines, guys, or, I have this project a month from now, what's the schedule going to look like? All of this is already published and available to anyone who wants to look at it. So there are tie-ins to the actual provisioning, which is the next slide, and there's a plug-in to Quads, basically an open-ended command that we call move-hosts.
18:41
It's just a simple argparse option in Quads, but this is where you tie in your provisioning system. So however you run the lifecycle of re-provisioning an operating system or laying down an image over top of it, whatever your method is, you plug this into the move-hosts command; you define this in Quads, and it can be whatever you want.
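A heavily simplified sketch of that hook idea, not the actual Quads implementation: the command that rebuilds a host lives in configuration, so any provisioning backend (Foreman or otherwise) can be swapped in. The config key, script path, and argument order are assumptions.

```python
import subprocess
import yaml

# Hypothetical key-value settings; the real Quads configuration file differs.
CONF = yaml.safe_load("""
move_command: /opt/provision/move-host.sh
""")

def move_host(host, old_cloud, new_cloud):
    """Hand a host from one isolated environment to another by shelling out
    to whatever provisioning command the operator configured."""
    print(f"Moving {host} from {old_cloud} to {new_cloud}")
    subprocess.run([CONF["move_command"], host, old_cloud, new_cloud], check=True)

# Example: move_host("example-server01.example.com", "cloud01", "cloud02")
```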
19:01
In our case, we just use Foreman because that's one of the tools that we enjoy using and it saves us some time. So the Quads move-hosts command would basically spit out, if a set of machines is scheduled right now to change hands and go to another environment with another VLAN configuration,
19:21
we would get this printed out in our logs. We would say, this example server is moving from Cloud 01 environment to Cloud 02, as an example. On the back end, on the Foreman end, at least for us, this is our provisioning workflow. We would tie Foreman in to add and remove role-based access for the host.
19:41
We would change the IPMI passwords so that users are isolated in their own environment. We would do a full provision of the operating system. We would lay down any post-configuration stuff, and then we would actually move the VLANs on the physical switches, depending on the VLAN design that we support. And then lastly, and I think most importantly, there's automated network validation
20:01
that happens. We don't want to pass off a set of machines to people if they're not ready to use or if there's something wrong with them. So we run automated validation that checks and checks and checks; if any one of the machines doesn't pass the network validation, we get notified, and it continues to check at intervals until we fix the problem and it finally passes validation.
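A sketch of what such a validation gate might look like; the ping check stands in for the real per-interface and per-VLAN probing, and the notify helper stands in for the IRC bot and email mentioned just below.

```python
import subprocess
import time

def hosts_reachable(hosts):
    """Stand-in check: the real validation probes every interface and VLAN;
    here we only verify each host answers a single ping."""
    return all(
        subprocess.run(["ping", "-c", "1", "-W", "2", host],
                       stdout=subprocess.DEVNULL).returncode == 0
        for host in hosts
    )

def notify(message):
    """Placeholder for the IRC bot / email notifications."""
    print(message)

def release_when_valid(cloud, hosts, interval=600):
    """Keep re-checking at intervals until the environment validates, then
    tell the new owners; flag the first failure so an admin can go fix it."""
    reported = False
    while not hosts_reachable(hosts):
        if not reported:
            notify(f"{cloud}: network validation failed, holding the hand-off")
            reported = True
        time.sleep(interval)
    notify(f"{cloud}: validation passed, your systems are ready")
```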
20:21
Then the consumers of the hardware in that isolated environment get notified through a couple of different mechanisms, usually an IRC bot and email. So, we talked about the YAML schedule per host. This is what it looks like. This is, at its most basic, the construct of how PyYAML drives
20:41
current and future scheduling of machines and networks. So we see this defined schedule here. There's a very simple command, ls-schedule, that will list all of the past, current, and future scheduling that Quads knows about for a particular asset. So at the very bottom here, number five,
21:01
that would be like, say a current allocation. That would be something that started June 28th and it's going to end the 6th of July or has ended. And we keep a historical record of this in the YAML file because that drives the documentation and also the visualizations that we make.
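A hedged sketch of the per-host YAML idea: one document per asset with numbered schedule entries that are kept as history even after they end. The exact field names and the listing format in Quads differ; this only shows the shape and how PyYAML reads it back.

```python
import yaml

host_yaml = """
example-server01.example.com:
  schedule:
    4: {start: '2017-05-28 22:00', end: '2017-06-25 22:00', cloud: cloud07}
    5: {start: '2017-06-28 22:00', end: '2017-07-06 22:00', cloud: cloud02}
"""

data = yaml.safe_load(host_yaml)
for host, info in data.items():
    print(host)
    for number, entry in sorted(info["schedule"].items()):
        print(f"  {number} | start={entry['start']}, end={entry['end']}, cloud={entry['cloud']}")
```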
21:22
So, on to documentation. Now, I like writing documentation once. In fact, I don't even really like writing documentation, but it's one of those things that is so critical to any sort of project or any sort of endeavor, and yet it's also the most neglected
21:41
aspect. So we decided that one of the pillars of this Quads framework would be to automate all of the things that we either don't want to do or we are going to screw up at some point. Even outdated documentation is still better than no documentation, but it's still terrible. So our goal was to have
22:01
absolutely up-to-date, by-the-minute documentation. And the way that we do that is we query our provisioning source. In this case, it's Foreman. It could be anything else. It could be Ansible Facts, for example. And then we query Quads because it knows about the past, current, and future schedule of everything in our infrastructure.
22:20
We parse it into Markdown format and then we use an XML-RPC Python library to push it up to a wiki page. And then this is continually updated every minute. Anytime there's a change in the environment, anytime a bare-metal server is added or removed, this gets updated in the infrastructure documentation. In this case, we use WordPress. It's got a nice API for this,
22:40
but it could easily be MediaWiki or anything that supports programmatic Markdown. And again, it doesn't even have to be Markdown; that's just what we currently use. So this is an example of what it actually generates: the front infrastructure documentation for a set of our servers, which is continually regenerated and updated.
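A minimal sketch of that query-render-push loop. The wiki URL, page id, and credentials are placeholders, and wp.editPost is WordPress's stock XML-RPC method; Quads itself wraps this differently.

```python
import xmlrpc.client

def render_markdown(hosts):
    """Render a host table in Markdown from whatever inventory you queried
    (Foreman, Ansible facts, ...) combined with the Quads schedule."""
    lines = ["| Host | Cloud | Owner |", "|---|---|---|"]
    lines += [f"| {h['name']} | {h['cloud']} | {h['owner']} |" for h in hosts]
    return "\n".join(lines)

def publish(markdown, page_id, url="https://wiki.example.com/xmlrpc.php"):
    """Push the rendered page over XML-RPC to a WordPress wiki."""
    server = xmlrpc.client.ServerProxy(url)
    server.wp.editPost(1, "quads-bot", "********", page_id,
                       {"post_content": markdown})

print(render_markdown([{"name": "example-server01", "cloud": "cloud02", "owner": "rbryant"}]))
# publish(...) would then be run every minute from cron or the Quads daemon.
```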
23:02
We have your typical things you would expect from infrastructure docs, like a hostname, serial number, MAC address, IP address. We have a link to the out-of-band console. But what's different here from something that someone edits by hand is that on the right-hand side, we have the workload. So if you were to click on that workload link,
23:21
you would drill down to exactly what those sets of machines are doing. What is Cloud06, for example, doing? We know R. Bryant is the owner of these sets of machines. And then this is an older image, but the graph link would redirect you to a Grafana dashboard that has all the historical bandwidth throughput of all the interfaces per machine,
23:41
which is useful. And again, these serial numbers are actual servers, but they're out of support. So you're not going to gain anything by getting them for my talk, but you could certainly pay the bill for us if you feel inclined. So along with a general infrastructure layout of the documentation, we drill down into assignments.
24:00
Again, kind of what machines are doing right now. And this is just an example, a snapshot taken a few months back of various internal testing of products and things like that. We see a lot of OpenShift stuff in here, OpenStack, elements of OpenStack. We see some software-defined networking running, things of that nature. And then you can drill down further
24:21
into the workload, and you'll see how long they've had the assignment, how long it's going to run, and what's the remaining time. So again, this just gives people an added level of transparency to see there's no more black box. What's going on with this server gear that we have? You know, it's very clear. Is there servers available? I can request them,
24:40
and I'll get them. And then I get this nice generated printout. And then we also tag faulty machines. So if there's something wrong with the hardware, we simply assign a key value pair of faulty, and then it goes into the spare pool or the faulty pool that we can have the local lab people take a look at and fix. On top of the documentation,
25:00
we have visualizations as well. We generate a calendar. So at any point, you can see what kind of tests are running inside of the R&D scale environment. And then lastly, we have a heat map visualization. Now this doesn't look like it's going to win any website awards. It looks like the old Windows 95 defrag program. Do you guys remember that?
25:21
Where it's like the big blue grid and then the colors change. But it's incredibly useful from a scheduling perspective, because we can look at, and this is generated three months in advance, six months in advance, whatever you want to set it at. But we can very quickly see what's available from a day of the month or at a longer view, and then we can use that to schedule free servers
25:42
for people that have requested it. So again, and this is all automatically generated for you. Cool. So as you see earlier, we definitely need more testing, and CI is very important. So we do have testing. It's not good enough,
26:01
but it's getting there. We use Gerrit for code review, and then we use Jenkins for the CI. And we're now working on kind of a fully instantiated virtual sandbox using Open vSwitch and some other stuff to emulate switch ports. But right now, we're using Flake8. We're using ShellCheck for some of the shell glue
26:22
that we have in the project. And we need to probably get proper tests in. So we're getting there. Quads is about 11 months old, and it's been running our R&D environment for about eight of those months. Cool. So what's working right now? This long list of stuff
26:40
that I'm not going to read to you. This is available in all the documentation that we have. If you're curious, you can ask me after the talk, but it's going to do a whole lot of stuff for you. What are we working on? This is even more important. So what are some things that we have planned to introduce with Quads? The major thing right now
27:01
is we want to introduce the idea of a post-config. And what that would do is it's one thing to provision the networks, the storage, the servers for people, and hand them off, and document them, and then reclaim them when they're done. But you probably want to automate other stuff on top of that. We have a lot of folks testing OpenStack, for example.
27:22
And you might get, say, 50 servers to do an OpenStack deployment, and then you're testing a specific part of that OpenStack deployment. And you could easily burn one or two days getting OpenStack deployed and getting it just right. So we want to offer the option of kind of an open-ended model so that whether you're laying down
27:42
an infrastructure-as-a-service set of software, or you want to lay down some kind of container orchestration on your host, we want to offer the option that that is also automatically done for you. So when developers come in on Monday or Tuesday or whenever they want to work, they not only have their servers that are documented,
28:00
that they have their credentials, and they're ready to go, but they also have any ancillary software stacks that they need to test on top of already set up for them. And it's, again, it's about saving time and being as efficient as possible. We're also working on a Flask web interface for Quads to kind of enable some self-service scheduling. So if you're, you know, developer Jane Doe,
28:20
and you want to schedule yourself 100 machines for a week, and they're available, you can go to the Flask interface and request them. In a week's time, whenever you want to start, your machines will show up for you. So that would be a really cool feature to have. We just put in place a JSON API. So that's been pretty useful, but we haven't quite, we're not using it internally,
28:41
but it does work pretty well. That's going to kind of pave the way for the Flask interface. So we're kind of doing slow moving blocks that way. We just got in place the automated network validation that I talked about. And we also want to support more resource backends, besides Foreman. I'd like to have an Ansible backend
29:01
so that all you have to do is run Ansible against a set of hosts, you gather all of the facts from discovery, and then that is what formulates the information in the wiki that's automatically generated. So again, you know, the overall theme of Quads, of this kind of loose framework that we put together,
29:20
is being as efficient as possible with the few resources that you do have. And you'll kind of see this. So in our case, and this is really a construct of any sort of company that's larger than 20 or 30 people, if you have bare metal assets, or say you have machines in a data center, or rented resources,
29:41
is that over time, development groups tend to silo their resources. So, you know, like department A is going to have their servers, and they're going to be entirely different make and model than department B, and they're going to be bought at a different time, and their depreciation date's going to be different, and maybe their hardware profile's different. So a company ends up spending a lot of money
30:02
maintaining these little siloed areas and pockets of infrastructure. And the idea behind quads is that you put everything in one giant bucket, and then you let quads do all the provisioning, all of the scheduling, and take care of the whole thing for you. Now obviously there's downsides of automated scheduling and provisioning,
30:22
in that sometimes people's deadlines slip, or sometimes they have something come up, and they need to take vacation, or they can't use the hardware. So we've built things in on top of PyYAML, and PyYAML's very good at this, so that we simply just need to modify the schedule, and then the Quads framework does the right thing
30:41
from a provisioning perspective. But again, this is a lot of very little tools, little small things that do one thing and do one thing well, sort of sticking to the Unix KISS principle, if you will, that we just keep it simple. And selfishly, from kind of the DevOps operations side,
31:01
is that we don't want to do the same thing more than once. If it can be automated, we want to automate the crap out of it. If it's boring or we mess it up a lot, we want to automate it. And if it's something that we just don't want to do, obviously machines can probably do it better for us. So that was kind of the drive to initially get this thing going.
31:22
So the individual parts of Quads might be more useful to you than the whole. Some people like to use just the automated documentation, for example. The scheduling aspect might be useful. There are parts of the framework that you could consume yourself because it's not all tied together. It's very modular.
31:42
So we've also got some external interest into quads as well. There's some ongoing collaboration we're doing with Boston University and MIT in the Massachusetts Open Cloud. So we're kind of merging parts of quads with their scheduler that they've written called HIL.
32:01
And we've had some kind of large public company show interest in using it for their dev and test environments. So that's basically it. Thanks for coming to my talk. If I have any time left, I could open it up for questions. If anybody's got some.
32:24
And if you want to read more, everything's open source. It's all on GitHub. And we certainly welcome patches because we're just not that good at Python, but we want to be. So welcome for anybody to give us a shout and look at the code. Yes.
32:44
I'm curious, can we use this framework for failure testing of a cluster? For what type of testing? Failure testing. Failure testing. Failure. I mean, if some of the nodes go down,
33:01
can we use it to test such a scenario? I don't know. Okay. It's really aimed right now at bare metal servers and kind of scheduling, provisioning, and resources based on a future date. Okay.
33:20
For example, what happens if, like, one of the nodes doesn't start in the scheduler, what happens then? Does it just log it, and you go and fix it by hand? You're talking about OpenStack specifically, or? For example, in your cloud, I think, like, you schedule,
33:41
say, 50 nodes for your computation. And if one of them fails, what happens? I don't know. I'd have to think about it and get back to you. I mean, you can layer on any sort of post-provisioning automation that you like
34:03
and, you know, orchestrate that yourself. It's designed in a way that we don't want to dictate a use case for you. We find great utility and efficiency in having kind of an open-ended framework, and if there are certain parts that you can reuse,
34:20
it's designed in a way that there are inputs for that. So we ship, like, a YAML configuration file that's just key-value pairs, so that if you want to use a different wiki, for example, or if you don't care about the provisioning aspect, you can do only the scheduling or use only the documentation part.
34:44
Hi, thank you for the talk. You mentioned the way you schedule and provision servers. What happens when they go offline? Say you reserve 50 servers and then two of them go offline. Do you have any logic in terms of preparing spare servers? What exactly is the policy in a large-scale setup?
35:03
As far as, like, errors? You reserve 50 servers, right? And two of them go offline; you have servers break all the time. So how exactly do you manage that as a company policy? Technically they won't be in service anymore. Do you provision them from somewhere else?
35:21
Well, so things are always going to break. You can build in any level of validation, of pre-provisioning validation, but we found when you're doing very large deployments, like several hundred servers at once, there are always going to be one or two stragglers where maybe they don't PXE correctly to kickstart.
35:42
Maybe there's a failed disk that isn't picked up by the monitoring system. So the best that we can do is bake in as much automated validation as we can. And right now that's only on the network side, but we don't really do it on the system side. Generally, we have enough servers
36:00
that if there is a serious hardware failure, we can easily shuffle them out of the pile and throw another one into the mix, and then that one takes its place. And then we can always sort it out later. And that's why we flag the faulty servers in the documentation. But I don't know a good solution for that because you're fighting against so many factors at that point inside of a data center.
36:22
There's the network layer, you have power issues, you have anything you could even think of that's going to go wrong at a large scale. So our redundancy is built more into having at least one or two spares of that type that we can quickly shuffle in, and then dealing with it later, rather than trying to bake in
36:42
extremely long, tedious validation. But we can always do better on the system-side validation. Does that answer your question? Great, yes sir. I'm just wondering about the network side.
37:01
How do you actually deal with connecting out to your devices, say the Juniper, your Cisco? Do you have any issues with like vendor interoperability and how you connect? Do you use just like Netmiko or something or? I'm sorry, can you repeat? So do you have any issues with interoperability? So Cisco you might connect via SSH
37:20
or with Juniper you might have a better way, like NETCONF or something. We do, with certain vendors, because we have a pretty heterogeneous environment. We have a lot of Supermicro machines, a lot of Dell machines, and they're not created equal when it comes to out-of-band interfaces and things like that. So we do have to do some one-off things.
37:43
It's more apparent on the network side, where you would interface with Junos completely differently than you would interface with IOS. So we have per-vendor tooling that goes out and changes the VLANs, and we basically keep this flat file structure, done in some kind of a CSV format.
38:03
So there's one per server that has several fields, and it maps, say, the Ethernet interface to a MAC address, and there's a field there for the vendor ID. So if we were dealing with a Cisco switch versus a Junos switch, we would need to make sure that that vendor ID field
38:22
actually says Cisco, and then if it does, it knows to use the other tooling that does the equivalent IOS stuff. But it's not elegant in any way; it just gets the job done. Ideally, and this is more of an immediate-future thing, Ansible is working on network orchestration
38:40
but it's done at an abstraction layer. So you would have Ansible drive all of your switch changes but you would tell it, I don't care what the vendor is, I want this VLAN or this construct to go from this to this and if it's in its desired state, it does nothing. So it's idempotent but you don't need to care about the vendor semantics.
39:02
The abstraction layer in Ansible will take care of that, but I don't know how far along that is. I think it's very actively being worked on, and that's what we're going to move to, but right now we maintain per-vendor network automation scripts, and luckily right now we're almost all Juniper across the board. So that makes it easy.
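A sketch of the per-host flat-file dispatch described a moment ago: one record per server mapping interfaces to MACs and switch ports, with a vendor field selecting which switch tooling to call. The field layout and helper functions are illustrative, not the actual Quads format.

```python
import csv
import io

# Illustrative per-host record; the real files live one-per-server on disk.
RECORDS = io.StringIO(
    "interface,mac,switch,switch_port,vendor\n"
    "em1,00:11:22:33:44:55,sw01,xe-0/0/1,juniper\n"
    "em2,00:11:22:33:44:56,sw02,Ethernet1/1,cisco\n"
)

def set_vlan_juniper(switch, port, vlan):
    print(f"junos: {switch} {port} -> vlan {vlan}")  # would drive Junos (NETCONF/SSH)

def set_vlan_cisco(switch, port, vlan):
    print(f"ios:   {switch} {port} -> vlan {vlan}")  # would drive Cisco IOS over SSH

DISPATCH = {"juniper": set_vlan_juniper, "cisco": set_vlan_cisco}

for row in csv.DictReader(RECORDS):
    DISPATCH[row["vendor"]](row["switch"], row["switch_port"], vlan=1101)
```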
39:26
Any other questions? Well, feel free to find me after the conference and you can borrow my Terminator mask and attack me if you like. I'm into that I guess. So thanks for your time and I appreciate you coming to the talk.