8 Years of Config Management
Formal Metadata
Number of Parts | 95 | |
License | CC Attribution 4.0 International: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor. | |
Identifiers | 10.5446/32332 (DOI) | |
FrOSCon 2017 (15 / 95)
Transcript: English (auto-generated)
00:07
Hello and welcome to 8 Years of Config Management. My name is Torsten Rehn and I work for a company called Seibert Media, where I first started back in 2008. Now, why should you care about our config management?
00:21
Let me give you some numbers. We currently have 727 nodes under management, and our monitoring is checking well over 10,000 services. Importantly, at least in my opinion, last month we had 21 different people committing to our shared config management repository, and overall they are averaging about 38 commits a day to this same repository.
00:47
Getting to these numbers from zero was a very slow and steady process that took around 8 years, and many mistakes were made along the way, some of which I want to share with you today.
01:03
Our journey starts back in 2009. Now, what was the world like back then? Let me give you some context. In 2009, at the beginning, Michael Jackson was still alive. There was no iPad, there have been 15 different models of iPads ever since. Obama had just been freshly elected to the White House for the first time.
01:23
The Deepwater Horizon oil spill was still a year away, and I'm sure you all remember that it took a while to clean up. Osama bin Laden was still hiding in Pakistan and the world was kind of worried about swine flu. Personally, I had a lot more hair back then, had just dropped out of university, and started as a trainee in systems administration with Seibert Media.
01:44
It was during that time that I was given the task to build something that automates management of our web hosting customers. Simply enough, it kept me from breaking anything else. So I started figuring out, okay, we have these web hosting customers,
02:03
they were not making us a lot of money back then, but they had supported the company from the beginning in 1996, so we kept them around and still tried to take good care of them because they were really loyal to the company. So in the beginning, of course, there was a shell script. I tried to get that started to add a new customer, add a new website for an existing customer
02:25
and stuff like that, add a new database. But quite early on, I started running into the limitations of this. I wanted to improve on what we already had. I wanted to add stuff like statistics and more centralized management.
02:43
So it just took me maybe two weeks before I started looking into config management. And the first thing that I looked at was Puppet, which at the time still used this cute little fella as a logo. I couldn't find this anywhere on the internet today so that's just out of an old presentation of mine.
03:02
Kind of cute. But at the time, Puppet just kind of rubbed me the wrong way. I couldn't really get to grips with it. In my old notes from the time I said that the documentation was bad, and maybe it was just me being stupid, but I couldn't really make it work for me.
03:21
Then I looked further and I found Bcfg2. Yes, it's spelled B-c-f-g-2, but the developers all just call it bconfig for short. And that was a whole different story. The community was rather small, but in the best possible sense: they were really welcoming and they helped me out a lot on IRC, and eventually I even started contributing code back to the project
03:44
and even that logo you see up there. So that's kind of where I settled, where I could get started quickly and produce the results that I wanted. So I had my little web server set up here with two web servers and a database server and I added another virtual machine as a sort of controller that ran the config management on it
04:05
and then also had an LDAP server, because I was curious to learn more about LDAP and I figured it would be kind of nice if web hosting customers could use the same logins for their FTP uploads and for accessing their statistics on the web.
04:21
So that kind of worked very well for me and that was sort of my little kingdom that I had and around it was a vast sea of unmanaged systems that were really just administered in the traditional way as in someone SSHs into a machine, does something and it's done and nobody ever knows what he did there.
04:45
So that kind of worked that way for a while. Fast forward to 2011, and this part is a little emotionally challenging for me, because I tried to expand my cool management setup to cover our entire infrastructure
05:03
and even back at that time that meant different locations that were really strongly separated from each other and couldn't talk to each other. So what I eventually ended up with was this behemoth of a setup where a developer would commit to a master server, which you see all the way at the top,
05:22
that had the bconfig server on it and still had a hugely expanded LDAP database that not only included user accounts but now also the entire inventory of virtual machines and even websites. You could even configure website quotas in LDAP. It was all terribly over-engineered.
05:44
But to add insult to injury, we still had these separate locations that you see here right below and so I somehow needed to get the data out of the central repository to each location but preferably only that data that the location needed for itself.
06:04
So what I came up with was that with LDAP replication you can restrict that replication through access control lists and that was really probably the worst thing I ever did in my life. And now I had the data at the location but I'd already come up with the concept of a domain inside a location
06:28
(you could call it an isolated cluster of systems), and these still had the controller virtual machines that I had started with. So that was another step in the replication chain, and then we eventually were running LDAP on every production system
06:45
because why not, it just takes a couple of megabytes of RAM, right? And it was a great way to make changes to a user in the master server and have it automatically replicated down to each production system, in theory.
07:01
In practice this led to operational problems but the code for it was also really messy. This is from the very first commit that I found in this repository and it's atrocious, you don't need to read it all, I just highlighted some parts. If you look down in the lower third you recognize this is actually some kind of XML file
07:22
and then you have some Python code inlined at the top, which you can do with bconfig. And of course you have passwords in there, committed in plain text, like you do. And then down below you have this BoundConfigFile XML entity, which has a for loop embedded in it to create multiple config files.
07:44
It's a huge mess, terrible. Eventually we improved on this slightly by moving the LDAP code into a separate module, we got rid of the passwords but it still retained that horrible stink and was very unreadable. And while it seemed powerful to me at the time, really what I was making was a huge mess.
08:06
And eventually we ended up with this kind of workflow. I couldn't even make a pretty slide for that, but let me just run through it real quick. First, obviously, you'd commit your changes in Git and push that. That would trigger a post-update hook to rsync the contents of that repository to each bconfig server in this whole tree that you saw earlier.
08:28
But the LDAP replication wouldn't always work reliably; in fact, most of the time it wouldn't. So you had to SSH into each node running slapd that was between the master server and the node you actually wanted to do something on,
08:42
wipe the LDAP database and restart slapd to just retrigger the replication the hard way. Then you could finally SSH into the target node and run the local script there that would pull the configuration from the bconfig server and apply it to the machine. Now, at the worst of times this could take up to 70 minutes for a complex system like our monitoring,
09:04
which is terrible all in itself, but we also had no real confidence in the system, so we used the interactive mode that bconfig has, where it asks you about each and every change that it would make. And you could always tell that someone in the office was doing this because you'd hear them push N, Enter, N, Enter, N, Enter.
09:28
Very, very fast, very, very often, until they finally arrived at the change they actually wanted to make. And sometimes you'd hit N accidentally then, because you were kind of on a roll, and then you had to start all over again.
09:42
And then you could finally apply a change, notice that you had made a typo or something, and then you would start all over again. Now, it wasn't always this bad; this is kind of a worst-case scenario that I sketched here, but it could happen to you. And this was sort of the low point of config management that really made it clear that something had to change fundamentally.
10:08
And that brings us to 2012. That was the time when I really started looking hard at how can we get out of this situation. And during that time Chef was finally getting a lot of traction and had a lot of buzz around it,
10:23
so I looked into that and I remember seeing a statement on the Learn Chef website back then that said, instead of spending hours trying to get Chef installed, we recommend a free enterprise Chef account and we'll take care of the Chef server for you.
10:44
Now I get it, you're trying to sell a product, but when you're already assuming that I have to spend hours just getting this to work, that just didn't sit right with me. Would I be spending more time managing the management?
11:01
I really didn't want to deal with that. So I looked into it a little further and still didn't like what I was seeing at the time, so ultimately I wasn't really enthusiastic about going the Chef way. And nowadays that would have probably driven me right into the arms of Ansible,
11:23
but if you look at the first commit in the Ansible repository, that really just got started in early 2012 and wasn't really known to anybody yet. And this disillusionment with server-based config management systems kind of sent me on this path which many people frown upon,
11:46
but I was kind of interested in the challenge. How hard can it be? Can't I just make this work? Why is this so difficult? So I tried to do this myself. So in July 2012 I started my first project trying to solve config management on my own, foolishly.
12:09
And first it was around fixing bconfig because I liked a lot of the ideas in there. There was the idea that you had bundles which most systems have some kind of concept
12:21
where you take a collection of items like files and packages and whatnot and you bundle them together, and that all revolves around a software package like the Apache web server, where you have an Apache bundle. But then I wanted to have a clear distinction of metadata that you would attach to each node that handles the specifics for this node.
12:45
And combining bundles and metadata would yield the configuration for that particular node. I also liked and definitely had to keep an interactive mode because just culturally people expected this.
13:01
They didn't trust config management and they were scared shitless about just running this machinery on a production server and just hoping that it would do the right thing. So we had to have some kind of human-in-the-loop mechanism so you could still say, oh no, I don't want this change. Don't do that.
13:22
And I liked the idea of having Python anywhere because I was very familiar with Python already and I still wanted to retain that power now that I have learned where it can take you. There were also a couple of bad parts. I definitely wanted a better template engine. I didn't want to have to write any more XML just to describe my infrastructure.
13:47
I wanted to speed things up by applying multiple items in parallel and not just serially. And that of course meant I had to deal with dependency management.
14:00
Also one of the weird quirks of bconfig is it for some reason had very slow diffs. So after we generated a huge config file for monitoring, it took a bizarre amount of time for it to generate the diff. I still don't know why that is, but it was definitely annoying. And I kind of worked on this for a couple of months and then I realized I wasn't doing enough good.
14:27
I was just reinventing bconfig, taking some parts out. I even reused some of their code, but it still didn't feel worth it. So I figured my approach to solve this had to be way more radical.
14:41
I didn't want to deal with servers anymore. Do I really have to have that? Can I just take that out? I also didn't want to fiddle with agents running on each and every node and didn't want to sign certificates for them or anything. So there was this big disillusionment that I really wanted to take this thing apart
15:01
and only put those parts back together that I really, really needed. And somewhere during the time while I was reading a man page, as one does, I realized that for decades we've been writing shell scripts that use system utilities
15:21
and they were sort of an API for the whole system. And if you look at how stable these things have been for years, well, they're more stable than some proper APIs I've seen. So I was wondering, couldn't I use the existing authentication channel that we already had, namely SSH?
15:42
Everyone already had SSH access into our systems. And couldn't I use the existing utilities on the system to manage them? Would that be feasible? Would it really be as hackish as it sounds? I didn't think so, and that was kind of when all that vision for BundleWrap came together.
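The idea of treating SSH plus ordinary system utilities as the management API can be sketched in a few lines. This is only an illustration under assumptions, not BundleWrap's actual code: the function names are hypothetical, the host is assumed to be reachable with key-based SSH, and the remote system is assumed to be Debian-like so that `dpkg` is available.

```python
import shlex
import subprocess


def build_ssh_command(host, remote_command):
    """Return the argv list that runs remote_command on host over SSH."""
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    return ["ssh", "-o", "BatchMode=yes", host, remote_command]


def dpkg_knows_package(host, package, run=subprocess.run):
    """Use the exit status of `dpkg -s` on the remote host as a rough check.

    `run` is injectable so the function can be exercised without a real host.
    """
    remote = "dpkg -s {} >/dev/null 2>&1".format(shlex.quote(package))
    result = run(build_ssh_command(host, remote))
    return result.returncode == 0
```

Nothing has to be installed on the managed node for this to work: the existing SSH login is the authentication channel, and the exit status of a utility that is already there is the answer.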
16:05
And I just tried it and wanted to see how that went. And by the time I got to this point, we were in 2013. And in June 2013, so that's a full year after I started tinkering with this whole
16:23
"I'm doing config management on my own" idea, I started what BundleWrap is today. Now, from the get-go, there were a lot of things that were either impossible or hard to do with the existing solutions that I wanted to make possible.
16:41
For example, bw test. You can type this simple command in any BundleWrap repository, and what it will do is go through each node you have configured and actually render all the config files in there. So at least you know you don't have any syntax errors in them. There are some more internal consistency checks, and if you look at the last two lines here
17:04
you can also define hooks for that and do your own custom validation on top. Now this is awesome because it requires no setup for you. You can just run it locally, it's terribly easy to just plug into a CI server and now you can almost feel like a real developer, you know?
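The core of that check, rendering every config file for every node so template errors surface before anything is deployed, can be sketched as follows. This is not how bw test is implemented; it is a minimal stand-in that uses `string.Template` from the standard library, and the node names, metadata keys and file paths are made up for illustration.

```python
from string import Template


def render_all(nodes, templates):
    """Render every template for every node and return a list of failures."""
    failures = []
    for node_name, metadata in sorted(nodes.items()):
        for path, source in sorted(templates.items()):
            try:
                Template(source).substitute(metadata)
            except (KeyError, ValueError) as exc:
                # KeyError: the node lacks a value the template needs.
                # ValueError: the template itself has a malformed placeholder.
                failures.append((node_name, path, repr(exc)))
    return failures


# Hypothetical repository contents.
nodes = {
    "web1": {"hostname": "web1", "port": 80},
    "db1": {"hostname": "db1"},  # no "port" defined for this node
}
templates = {
    "/etc/motd": "Welcome to $hostname\n",
    "/etc/app.conf": "listen $port\n",
}

failures = render_all(nodes, templates)
for node_name, path, error in failures:
    print("FAILED {} {}: {}".format(node_name, path, error))
```

Because it needs nothing but the repository, a check like this runs the same way on a laptop and on a CI server.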
17:20
Because they have these tools, why shouldn't ops people? Another thing that I always found odd is that many config management systems will just run whatever commands they need to do and then just assume that they worked. Sure, they're idempotent and all that, but I really had the urge to make BundleWrap check everything that it did.
17:45
And that's why the code is set up in such a way that anything you can configure in BundleWrap can not only be set by BundleWrap, but it can also check that it actually worked and read that back. And that means that BundleWrap has a verify command that can really go through each node
18:03
and tell you the state of everything. And that always seemed more complete to me than the dry run options you can find in other systems because it really just reads out what is there and it doesn't just assume that the command fails
18:21
or makes too many assumptions on that part. And from that you get this nice table rendered in your terminal. It looks really great and tells you how many items there are on this node, how many of them are in a good state, and how many of them deviate from your config management. Something that was also very important to me was being able to compose groups really the way I wanted to.
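The "set it, then read it back" principle can be illustrated with a single file item. This is a sketch of the idea only, not BundleWrap's item implementation: the function names are hypothetical, it works on a local temporary file instead of a remote node, and it assumes a POSIX system where file modes behave as expected.

```python
import os
import stat
import tempfile


def read_state(path):
    """Read the actual state of a file from the system."""
    if not os.path.isfile(path):
        return {"exists": False}
    with open(path, "rb") as f:
        content = f.read()
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return {"exists": True, "content": content, "mode": mode}


def verify(path, desired):
    """Return {attribute: (wanted, found)} for every deviation."""
    actual = read_state(path)
    return {
        key: (wanted, actual.get(key))
        for key, wanted in desired.items()
        if actual.get(key) != wanted
    }


def apply(path, desired):
    """Write the desired state, then read it back instead of assuming success."""
    with open(path, "wb") as f:
        f.write(desired["content"])
    os.chmod(path, desired["mode"])
    return verify(path, desired)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "motd")
    desired = {"exists": True, "content": b"hello\n", "mode": 0o644}
    after_apply = apply(path, desired)   # empty when everything matches
    with open(path, "ab") as f:          # simulate a manual change on the node
        f.write(b"edited by hand\n")
    after_drift = verify(path, desired)  # now reports the content deviation
```

The same `verify` function serves both purposes: confirming an apply actually worked, and reporting drift on a node without changing anything.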
18:48
Now, this is an example of how you can compose groups in BundleWrap, where you have this group that we call important stuff, and it has a couple of members in it. Here we have just that old DB host that we set statically; that's possible, of course,
19:05
but you can also use Regex to include all nodes that start with cluster one. And you can even go further by defining functions if you really need that power to decide whether a given node should be in that group or not.
19:20
Here we just look in that node's metadata, have we marked this node as being a production system, and then we would add it into the group. And after you've done that, you can even remove nodes again when you have these little exceptions that you need to make, and just because we could, we're removing Debian nodes here, we still like Debian of course.
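The four ways of composing a group described here (static members, a regex, a function that adds nodes, and a function that removes them again) can be sketched like this. The key names in the `groups` dictionary are written from memory of BundleWrap's groups.py at the time and should be treated as an assumption, and `resolve_members` is a hypothetical stand-in for the resolution logic, not the library's code. The node names are invented.

```python
import re
from types import SimpleNamespace

groups = {
    "important-stuff": {
        # one host added statically
        "members": ["old-db-host"],
        # every node whose name starts with "cluster1."
        "member_patterns": [r"^cluster1\."],
        # add nodes marked as production systems in their metadata
        "members_add": lambda node: node.metadata.get("production", False),
        # finally, take Debian nodes out again
        "members_remove": lambda node: node.os == "debian",
    },
}


def resolve_members(group, nodes):
    """Return the sorted names of all nodes that end up in the group."""
    never = lambda node: False
    add = group.get("members_add", never)
    remove = group.get("members_remove", never)
    selected = []
    for node in nodes:
        included = (
            node.name in group.get("members", ())
            or any(re.search(p, node.name) for p in group.get("member_patterns", ()))
            or add(node)
        )
        # removal is applied last, so it wins over every way of being added
        if included and not remove(node):
            selected.append(node.name)
    return sorted(selected)


nodes = [
    SimpleNamespace(name="old-db-host", os="centos", metadata={}),
    SimpleNamespace(name="cluster1.web1", os="ubuntu", metadata={}),
    SimpleNamespace(name="cluster1.legacy", os="debian", metadata={}),
    SimpleNamespace(name="shop-prod", os="ubuntu", metadata={"production": True}),
    SimpleNamespace(name="dev-box", os="ubuntu", metadata={}),
]

members = resolve_members(groups["important-stuff"], nodes)
```

With these example nodes, `cluster1.legacy` matches the pattern but is removed again because it runs Debian, and `dev-box` never qualifies.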
19:43
So that's an example of how flexible you are when composing groups in your infrastructure, which is obviously very important when you have a lot of systems and a lot of diverse groups of nodes that you're dealing with. Another important concept is metadata and how it relates to groups.
20:03
Consider this example. You have a group that's called Germany, and in its metadata it says all nodes in Germany should use this particular name server. Then it also says that the Frankfurt group is a subgroup of Germany. In the Frankfurt group, you set a different name server and metadata,
20:23
and by default they will be merged. So node 1, which is a member of the Frankfurt subgroup, will have both name servers, except we can also override this name server again, and if you wrap it in atomic, you can make sure that the third name server doesn't get added to the list,
20:42
but it rather overrides these name servers. So that's an example of how you can use groups to assemble metadata for a node and really override defaults when you need to. You can take metadata even further. I don't want to get into this too deep, but every bundle can also define metadata processors,
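The merge-by-default, override-with-atomic behaviour can be sketched with a small recursive merge. The `Atomic` class and `merge` function below are illustrative stand-ins written for this example, not BundleWrap's implementation, and the addresses come from the documentation range 192.0.2.0/24.

```python
class Atomic:
    """Marks a value that should replace, not extend, what was inherited."""

    def __init__(self, value):
        self.value = value


def merge(base, update):
    """Merge update into base: dicts recurse, lists concatenate, Atomic replaces."""
    if isinstance(update, Atomic):
        return update.value
    if isinstance(update, dict):
        merged = dict(base) if isinstance(base, dict) else {}
        for key, value in update.items():
            merged[key] = merge(merged.get(key), value)
        return merged
    if isinstance(base, list) and isinstance(update, list):
        return base + update
    return update


germany = {"nameservers": ["192.0.2.1"]}
frankfurt = {"nameservers": ["192.0.2.2"]}

# A node in the Frankfurt subgroup inherits from both groups.
node1 = merge(merge({}, germany), frankfurt)

# Wrapping the value makes it override the inherited list instead.
node2 = merge(node1, {"nameservers": Atomic(["192.0.2.3"])})
```

Here `node1` ends up with both name servers, while `node2` ends up with only the third one because the atomic value replaces the inherited list rather than being appended to it.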
21:03
which are really just Python functions where you can mess around with that metadata and put stuff in a very dynamic fashion. And so I was working on this for a time, and I really liked the concepts
21:22
and the sort of structure that came along with it, and that brings us to 2014, when we were starting to use BundleWrap in production in earnest. Not that much yet, but we really tried to use it for real.
21:43
Now let's talk a bit about what our infrastructure looks like. I've prepared this chart here. What you see on the x-axis is each and every bundle that we have, they stretch all the way to the right, so it's just shy of 200 of them, and on the y-axis you have how many nodes these bundles are assigned to.
22:05
So you have the bundle for the Apache web server, and that's probably just around here somewhere, and you can see, OK, it's assigned to 500 nodes. That's just what this chart says. But from the average and the median that's noted here, you can see,
22:23
yes, we have a few bundles that are assigned to a lot of nodes, but the majority of bundles are assigned to just three nodes or fewer. Here I've put it on a logarithmic scale. It's the same chart, just on a different scale, where you can really see that around one third of our bundles just apply to a single node.
22:42
So the infrastructure is very diverse. We don't have a lot of clusters with 40 nodes that all look the same. We have to care deeply about each and every individual system, and often have to come up with special cases. BundleWrap, I think, does that very well. One way to inspect a node with complex configuration
23:06
is using bw plot, which generates dot output that you can pipe into Graphviz, and then it will render an image for you. Now, I've done this for a real node in our configuration,
23:21
as it is today, and this is the result. Now, you probably can't see anything here. What you're looking at is a 52 megabyte PNG file. Let me zoom in a little bit closer. Okay, now you maybe can make out some things. Maybe you can even see some of the lines.
23:41
You can zoom even closer, and finally we get to see at least a bit of what's going on here. What we're looking at is the apt bundle for package management on Debian and Ubuntu. It has different items in it, like files and actions and the apt packages themselves, and they're all connected through dependencies.
24:03
The arrows that you see going on there are all different kinds of dependencies. Now, for example, take installing apt packages. BundleWrap is inherently parallel, so we need to take special care with package managers
24:22
just because they use lock files. You can't install two apt packages at the same time on a system, so you need to make sure that we install them one by one. The way BundleWrap does this is by daisy-chaining all the package items in a sort of dependency chain,
24:41
and making sure every package depends on another package and so forth, so you can only apply them one after another, and all the other stuff can still run in parallel, like files, because you can obviously upload two files at the same time, for example. Now, this output for a whole node isn't terribly useful,
25:02
but what BundleWrap can do, if you end up with a dependency loop, is give you a trimmed-down version of this view, where you can really see what's going on and where you might have introduced a redundant dependency or something.
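The daisy-chaining and loop detection described above can be sketched with Python's standard `graphlib` module (Python 3.9 or newer). This is only an illustration, not BundleWrap's implementation: the `daisy_chain` and `has_loop` helpers and the item names are made up for the example.

```python
from graphlib import CycleError, TopologicalSorter


def daisy_chain(items, deps):
    """Make each apt package item depend on the previous apt package item."""
    chained = {item: set(deps.get(item, ())) for item in items}
    previous = None
    for item in items:
        if item.startswith("pkg_apt:"):
            if previous is not None:
                chained[item].add(previous)
            previous = item
    return chained


def has_loop(deps):
    """Return True if the dependency graph contains a cycle."""
    try:
        list(TopologicalSorter(deps).static_order())
    except CycleError:
        return True
    return False


items = [
    "file:/etc/apt/sources.list",
    "pkg_apt:nginx",
    "pkg_apt:curl",
    "pkg_apt:git",
    "file:/etc/motd",
]
deps = daisy_chain(items, {"pkg_apt:nginx": {"file:/etc/apt/sources.list"}})

# Packages come out strictly one after another; files stay unconstrained.
order = list(TopologicalSorter(deps).static_order())
```

Only the package items end up serialized. The two file items have no ordering between them, so a parallel executor would be free to handle them at the same time.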
25:24
And it makes nice office art, that kind of thing. Another thing that has been really important to me is understanding configuration as a Merkle tree. Now, if you've never heard of a Merkle tree, it's pretty easy to explain, actually.
25:42
I have a couple of items spread across two different nodes in this example, so I have a file, a directory, a service, and then on another node I have another package and a file. Now, what I can do is look at how these things are configured in BundleWrap, create a representation of that,
26:01
and run a hash function over it. So now, for each item, I have one hash that will tell me exactly how is this item configured. We'll see how that looks in a minute. I can then take all of these hashes and aggregate them into hashes for each and every node. By hashing all the item hashes for one node,
26:21
I get just one hash that represents the configuration of an entire node. And then, of course, I can take all my node hashes and hash them together, and now I have one hash value that represents the state of my entire repository. Let's see how that looks in practice.
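Before the command-line walkthrough, here is a minimal sketch of the three hashing levels (item, node, repository). It assumes SHA-1 over a sorted JSON representation; the serialization BundleWrap actually uses is not described in the talk, and the item names and attributes are invented for illustration.

```python
import hashlib
import json


def item_hash(attributes):
    """Hash a canonical (key-sorted) representation of one item's config."""
    canonical = json.dumps(attributes, sort_keys=True).encode()
    return hashlib.sha1(canonical).hexdigest()


def aggregate(hashes):
    """Combine a name -> hash mapping into a single hash."""
    lines = "\n".join(f"{name}\t{value}" for name, value in sorted(hashes.items()))
    return hashlib.sha1(lines.encode()).hexdigest()


def node_hashes(repo):
    """One hash per node, built from the hashes of its items."""
    return {
        node: aggregate({name: item_hash(attrs) for name, attrs in items.items()})
        for node, items in repo.items()
    }


def repo_hash(repo):
    """One hash for the whole repository, built from the node hashes."""
    return aggregate(node_hashes(repo))


repo = {
    "node1": {
        "file:/etc/hosts": {"owner": "root", "group": "root", "mode": "0644"},
        "directory:/srv/www": {"owner": "www-data", "mode": "0755"},
        "svc_systemd:nginx": {"running": True},
    },
    "node2": {
        "pkg_apt:git": {"installed": True},
        "file:/etc/motd": {"owner": "root", "mode": "0644"},
    },
}
```

Changing a single attribute on node2 changes node2's hash and the repository hash, while node1's hash stays the same.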
26:43
What we're doing here is we have a node, which is GCE's media1-netbox, and we're going to show what the item looks like that controls the file /etc/hosts. You see there's a content hash, and then you have all the ownership and permission attributes,
27:04
and that really is the entire configuration that BundleWrap has for this particular file. Then we just aggregate that into a hash. The hash you see at the bottom now is just the output of the previous command run through SHA-1.
27:26
Then you can go one step up. Now show me the hashes for all files or for all items on this node. I've just included three here. You see that's for each item for each file, and there's a package in there too.
27:40
They end up with one hash each. We go up another level. Now you can show me the aggregate hash for this particular node, and again that's just the output of the previous command run through a hash function. From there we go up even further.
28:01
Now show me the hashes for all nodes, and finally we end up with just bw hash. What that will do is generate your entire configuration, generate a hash value from it, and display it to you. Why is this important? I'm not aware that any other config management system does this. Suppose you're doing some refactoring
28:21
in your config management code. You're just trying to clean up some stuff, and it's all very complicated, but you don't want to make any actual changes to your nodes. You're just trying to produce the same result, but in a different way. What you can do is generate this hash beforehand and after,
28:41
and if they match you can be confident that you didn't cause any unintended changes on your node. That's a really powerful assurance to have. It can also be used for other things, like say you made some changes, but you're not really sure how many nodes are affected. By comparing the node hashes before and after,
29:01
you can really tell if a node has changed through different points in your git history. I love this feature and I think it's really powerful. I think one or two times it really saved my bacon when I made a huge change that impacted a lot of things, and there was some small detail that I missed.
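The before-and-after comparison can be as simple as diffing two mappings of node name to hash. In this sketch the `changed_nodes` helper is hypothetical and the hash values are shortened placeholders; in practice the two mappings would come from running the hash command at two different git revisions.

```python
def changed_nodes(before, after):
    """Return the nodes that were added, removed, or whose hash differs."""
    names = set(before) | set(after)
    return sorted(name for name in names if before.get(name) != after.get(name))


# Placeholder values standing in for real node hashes.
before = {"web1": "a1b2", "web2": "c3d4", "db1": "e5f6"}
after = {"web1": "a1b2", "web2": "c3d4", "db1": "0000", "db2": "1111"}

affected = changed_nodes(before, after)
```

An empty result after a refactoring means that no node's configuration changed.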
29:24
These are some of the more advanced features that BundleWrap has. You won't use them every day, but those are just the possibilities that arise from the internal structure that we came up with. With that we can make another jump into 2016.
29:43
Now that was a huge year for our config management and you can just see that by the number of commits to the repository. What you see here in the graph is not the total number of commits, but just how many commits were added each year. It's pretty easy to see that things started relatively slow in 2009
30:05
and slowly worked their way up to 2015, even though we had this horrible infrastructure behind it, but then in 2016 activity almost tripled. That was a huge deal for us. Now why was that?
30:20
I've prepared here a chart of the number of bundles that we managed. You can again see that bconfig had a slow start in 2009 when I was tinkering with it. Adoption increased between 2012 and 2013 when we made it mandatory to put all changes
30:41
that are done to infrastructure under config management. Then in 2014 we finally started using BundleWrap and it took us almost two years to migrate from bconfig to BundleWrap. It's a very long time. Config management systems always carry this huge moment of inertia
31:02
where you don't really want to rewrite your entire configuration, but to some extent you have to when you're switching. That's really painful to do, but once you get through that phase, as you can see, the rewards were quite visible because as soon as we switched off bconfig,
31:23
activity increased dramatically. That really felt liberating because we had collected a lot of experience as part of these past few years and now it really felt like we arrived where we wanted to be. Another important change that we made during 2016 was mandatory pull requests.
31:47
That's one of these changes where I have absolutely no idea how we ever lived without it, where everyone would just push into master and think it's probably okay. When you introduce a change like that, there's obviously a lot of hesitation.
32:07
How much time will we spend doing reviews? I can't make any changes immediately. Do I have to wait for this stupid review? What if nobody has time and can only review my changes tomorrow?
32:21
I need to do it now. That was a problem which we addressed with node locking. With BundleWrap you can say, okay, I want to lock this particular item on this particular node for the next three days. You can also lock entire bundles on a particular node
32:43
or you can lock the entire node. It's really quite flexible. We can again show what locks are present on each node and as you can see here, they are identified by these simple IDs. They always have an expiry date.
33:03
They affect certain items and what you can see on the slide here is another column at the end for a comment that you can leave so people will know why you locked this particular node and what you're doing there. Now when someone else tries to apply configuration to this node,
33:20
they will just skip that particular file and they can still do the other work they're doing on this node. That gives you time (in the example here, I've locked it for three days) to get that pull request review in. After that has happened and your change has been merged into master, you can remove that lock and everyone will be in the same state again.
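A rough model of how such locks could decide which items to skip is sketched below. Everything here is an assumption made for illustration: the `Lock` fields mirror what the slide shows (ID, expiry, affected items, comment), but the selector syntax and the `is_locked` helper are not BundleWrap's actual API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Lock:
    lock_id: str
    user: str
    selectors: tuple  # e.g. "file:/etc/hosts", "bundle:apache", or "*"
    expires: datetime
    comment: str = ""


def is_locked(item_id, bundle, locks, now, user):
    """True if someone else holds an unexpired lock covering this item."""
    for lock in locks:
        if lock.expires <= now or lock.user == user:
            continue  # expired locks and your own locks don't block you
        for selector in lock.selectors:
            if selector in ("*", item_id, f"bundle:{bundle}"):
                return True
    return False


now = datetime(2017, 8, 19, 12, 0)
locks = [
    Lock("ABCD", "alice", ("file:/etc/hosts",), now + timedelta(days=3),
         "testing a DNS change, pull request pending"),
]

# Bob applies to the same node: the locked file is skipped, the rest is not.
skip_hosts = is_locked("file:/etc/hosts", "network", locks, now, "bob")
skip_motd = is_locked("file:/etc/motd", "base", locks, now, "bob")
```

Once the lock expires or is removed, the item is applied again by everyone.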
33:46
Now this process of course isn't perfect. People still forget to lock nodes and then override their changes. We're still working to figure out how we can prevent that and make locking more intuitive and easy
34:00
so you will do it automatically. For me this is one of the strong points of being able to apply configuration directly from your machine where you control the state, where you can work on your own branch and just apply changes directly to a node because sometimes you need to make changes now because you're setting up a DNS change for a customer
34:22
or you need to react to some situation or you just need to keep a deadline, whatever. You can absolutely do that and still get the benefit of a code review later. Another important feature that we came up with is secrets.
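The derived passwords covered in this part of the talk boil down to a keyed hash over a descriptive string. The sketch below assumes HMAC-SHA256 and a base64 encoding; BundleWrap's actual derivation may work differently, and `derive_password` is a hypothetical stand-in for the password_for function mentioned in the talk.

```python
import base64
import hashlib
import hmac


def derive_password(secret_key: bytes, identifier: str, length: int = 32) -> str:
    """Derive a stable password from a repository key and a free-form string."""
    digest = hmac.new(secret_key, identifier.encode(), hashlib.sha256).digest()
    return base64.urlsafe_b64encode(digest).decode()[:length]


# Placeholder keys; a real key would live in the repository's secrets file.
old_key = b"old repository key"
new_key = b"new repository key"

# Both sides (database and web application) ask for the same identifier,
# so they get the same password without it ever being written down.
db_side = derive_password(old_key, "webapp database password")
app_side = derive_password(old_key, "webapp database password")

# Swapping the key rotates every derived password at once.
rotated = derive_password(new_key, "webapp database password")
```

Because the password is a pure function of key and identifier, rotating the key and applying to all nodes changes both sides consistently.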
34:43
When you create a new BundleWrap repository it will automatically generate these two keys for you. One neat way you can use this is for passwords you don't really care about. Let's look at what we're doing here. We have this file called /etc/secret
35:01
and we just want to write a password in there. We do that by just passing any string that somehow describes that password through this password_for function. And what it will do is take your string, which you can really make up any way you want, and derive a password from the key in that secrets file
35:25
and your string. This is very useful for situations where you can control both sides of how a password is used. Think of a web application that needs to access its database. You don't really care what the password looks like
35:41
and what is in there. You just want a secure password that's configured the same way on both sides in the database and in your web application. And the cool thing is when someone leaves and you switch out the secret and apply your configuration to all nodes you'll automatically rotate all these passwords
36:03
but they will still match on both sides even though they've actually been changed. So that's a nice little side effect that we have. And that brings us to 2017. We're almost there. And this chart that I have here
36:21
is probably the one I'm most proud of. It shows how many different people committed into our config management repository each month. You can see in the early years it was just me working off and on on this and then over time you had more people come on board but things really exploded at the beginning of 2017
36:43
and we peaked last month at 21 different humans committing into that repository and they come from seven different teams in our company. It's a really great kind of adoption to have. And the best way I think to pull this off
37:00
is by just getting one team really trained in it and then sending out ambassadors into those other teams and embedding them there for a while. You can call it DevOps if you want. And really enable these other teams to make these kinds of infrastructure changes in a way that's still reviewed. That's why we have pull requests.
37:22
And through pull requests and the conversations that go on in there you can really make those other teams aware of the challenges that the operations department has when it comes to maintaining all of this configuration. And of course the other teams also see everything else that's going on in the infrastructure
37:43
and get a better sense of where their application fits in. Now these charts wouldn't be complete if I didn't show the progress of the number of nodes, but as you can see the data for bconfig2 is rather spotty because remember we kept all this information in LDAP
38:03
and databases are not always great at telling you how their state was two years ago. So just the ballpark numbers I have there are taken from old LDAP backups that I found lying around.
38:21
With BundleWrap you can see the data is much more accurate because finally we had each node committed to the repository as text because we finally realized that we don't need to have those in the database. Now as we went past 700 nodes that also brought with it some more challenges
38:43
to how to do inventory for these and how to keep an overview of what's going on. And a pretty cool feature that's relatively new is metadata tables where you can just look at a certain metadata key for a certain group and it will give you a nice table.
39:01
Here I can see, of all our systems running on Google Compute Engine, which of these are in production and which ones aren't. Now these tables can also be restyled quite easily to be more grep-friendly and that lets you form some really powerful queries on your command line, so suddenly BundleWrap itself feels like a database again
39:22
and if you need to come up with any sort of list for any sort of purpose you can very easily do that in your shell using just grep and cut and sort and all those other good things. Now scale is an important issue mostly for us not in sheer numbers but in complexity and diversity.
39:46
Take a look at this picture. This is automatically generated from data in our BundleWrap repository and it shows what we call our site-to-site network. So all these nodes that you see here represent different locations that we have
40:03
in some way, shape or form. You can see we have our own data center in Frankfurt, we have several with ProfitBricks, we have some with Google Compute Engine, we have something at AWS. It's really spread out all over the place and of course also includes our office locations.
40:23
Now all of these locations are connected through IPsec VPNs with BGP doing dynamic routing between them and it's really all very complex and would take days and days to set this up manually but with config management once we have that all figured out
40:40
we can describe it at a very high level. This is taken from our metadata, where you have some notion of what a location is and then you just define which location you want to create a connection to and just write that as a pair in this particular piece of metadata
41:02
and then you describe each location in more detail. You can tell it which networks it should announce over BGP, which AS number it has and just from that we get all these IPsec connections, we get all the dynamic routing so if one location goes down for some reason
41:21
we can route around it. It has worked really well and this kind of thing just isn't possible without config management in my opinion because you'd be running in circles all day and never know where you need to make a change to fix your connection.
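The high-level description (pairs of locations plus per-location details) can be expanded into per-site peer definitions along these lines. The location names, private AS numbers, networks, and the `peers_for` helper are all made up for this sketch and do not reflect the actual metadata layout.

```python
# Each pair means: build one IPsec tunnel with a BGP session between the two.
connections = [
    ("frankfurt", "office"),
    ("frankfurt", "gce"),
    ("gce", "aws"),
]

# Per-location details: AS number and the networks to announce over BGP.
locations = {
    "frankfurt": {"asn": 65001, "networks": ["10.1.0.0/16"]},
    "office": {"asn": 65002, "networks": ["10.2.0.0/16"]},
    "gce": {"asn": 65003, "networks": ["10.3.0.0/16"]},
    "aws": {"asn": 65004, "networks": ["10.4.0.0/16"]},
}


def peers_for(location):
    """Return the peer definitions one location needs, derived from the pairs."""
    peers = []
    for left, right in connections:
        if location not in (left, right):
            continue
        other = right if location == left else left
        peers.append({
            "peer": other,
            "peer_asn": locations[other]["asn"],
            "local_asn": locations[location]["asn"],
            "announce": locations[location]["networks"],
        })
    return sorted(peers, key=lambda peer: peer["peer"])


frankfurt_peers = peers_for("frankfurt")
```

Because both ends of every tunnel are generated from the same pair, the two sides cannot drift apart, which is the property that matters most for IPsec.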
41:41
Config management here really shines because you can make sure that every one of these locations is configured the same and is talking to each other the same way which is of course especially important with IPsec. Now let's take a minute to talk about integrations. I previously said that talking to LDAP
42:01
was one of the huge mistakes we made back in the bconfig era because it created this dependency and at some point the thing was very inefficient and it ran tens of thousands of LDAP queries just during one single apply operation
42:20
and that of course took a lot of time. Now what can you do with the information that's contained in the repository? Well, you can obviously create images for documentation like the one I showed you earlier with the site-to-site network. You can also talk to your domain registrar or the DNS subsystem to see if you have configured any unused domains
42:43
and help with cleanup there. And since you're also configuring IP addresses for virtual machines you can also push into your IP management system. That's all easy because it's just a script that's in that repository and you can use it every now and then
43:01
to update information on the system so you don't need that to be live at all. But that's taking data out. Taking data in from the source is a little more tricky. And the classic use case is LDAP. We still want to put all users in LDAP in some kind of way because LDAP is still a great database for storing user accounts.
43:24
And the way we solve this is using a simple JSON dump. Now that's one of these ideas where you think is that really sensible? But it really works quite well in practice because you don't add users every day. At least we don't.
43:41
We still hired a lot of people last year but even then it's okay to just run this import script again every now and then and pick up the new users. So we take all the information we need out of LDAP, just dump it into a JSON file and commit that to the repository, and now BundleWrap can just read from that JSON file
44:01
in the repository and always has that information available. And that is a huge deal because now we can work offline on a train where we have no internet and you can't talk to this LDAP server. Another problem that can happen is what if your LDAP server goes kaput and you need to set it up again in a different virtual machine?
44:23
If your config management depends too heavily on LDAP then you end up with the chicken and the egg problem where you can't provision the LDAP server because the LDAP server is down. If you cache the information locally, you can always use it. Another case that we need to handle is secrets.
44:42
We keep all our different kinds of passwords that need to be read by humans and stuff like SSL private keys in a software called TeamVault. And that is a live connection where BundleWrap will really, during the apply process of the configuration,
45:02
talk to that system over an API and pull the data out. But I still don't want this connection to be mandatory so the way we solve this is you can switch it into a dummy mode with an env var and that will make this backend always return dummy values for passwords
45:21
and private keys and stuff like that. Because when you're developing stuff for this on a train you don't really need the actual certificate. You don't really need the actual password. You just need something to arrive that kind of looks like a password. And then you can still go about your day
45:40
and do most of your developing. And doing things this way has the very nice property of tying one git hash in your repository directly to one of the bw hash values that you can generate. If I switch off the TeamVault connection
46:01
and replace it all with dummy values I always end up with the same bw hash for the same git hash. So even ten years from now I can go back and be confident that I don't depend on any external database to reconstruct what I was doing ten years ago.
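A dummy mode like this can be modeled as a thin wrapper around the live lookup. In this sketch the environment variable name `SECRETS_DUMMY`, the `get_password` function, and the secret ID are all hypothetical; the point is only that dummy values are a pure function of the secret's ID, so the generated configuration depends on nothing outside the repository.

```python
import hashlib
import os


def get_password(secret_id, fetch_live, environ=os.environ):
    """Return a deterministic dummy value in dummy mode, else ask the backend."""
    if environ.get("SECRETS_DUMMY") == "1":
        # Same ID always gives the same dummy, so hashes stay reproducible.
        return "dummy-" + hashlib.sha1(secret_id.encode()).hexdigest()[:12]
    return fetch_live(secret_id)


def unreachable_backend(secret_id):
    """Stands in for the password manager while working offline."""
    raise ConnectionError("password manager not reachable")


# Offline on a train: no network call happens, yet a value still arrives.
offline_value = get_password("db-root", unreachable_backend, {"SECRETS_DUMMY": "1"})
```

With the flag unset, the same call goes to the live backend and fails loudly if that backend is down.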
46:22
And that's something that became very apparent to me when I was doing the research for this talk. So much data from the past was just gone because we were pulling it live out of another system that doesn't have the kind of history like git has. And I still find this very important
46:41
to this day to have that kind of parity between how does my configuration look and what's in the git repository. And I try to always keep it as complete and tied to each other as possible. And this will of course also let you do more interesting history spelunking
47:01
where you can use git bisect more effectively by going back, because now you can also roll back changes in git. You can't roll back changes in an external database when you're doing a bisect and trying to find a problem that was introduced when you added that new node or something
47:22
if you pull your nodes out of LDAP. Now having this kind of repository where everybody works on one giant repository with a huge history has the benefit of it being also a poor man's dropbox
47:40
because if everybody works every day using that repository it also means they're pulling that repository almost every day. So you can assume that every one of your ops people and even the developers that are using the system will have a reasonably recent checkout of that repository. And we abuse that
48:00
by putting more stuff in there that isn't directly used by BundleWrap but that we call the emergency kit. And that's just a really primitive list of phone numbers for example. Because if you have a large outage and you need to call all your colleagues to help you
48:20
chances are you just got a new phone and haven't synced your contacts yet. We actually had that problem at the time. So this way at least you can be confident that everyone can reach everyone at any time. We also put in a complete grep-friendly dump of our IP address management system there
48:42
so even if DNS completely goes down we can still find stuff and figure out which IP it has. It's often overlooked how much we rely on DNS and how we can't figure out what IP addresses we connect to if it goes down. So we kind of insure against that failure mode as well.
49:02
Critical secrets are also in there and they are encrypted of course with the secret keys you saw earlier. And that's just to prevent other chicken and the egg problems like, okay, so the NetApp filer that hosts the password management system just went down. What's the admin password for the NetApp filer?
49:22
Okay, I'll just check the password management system that I can't reach anymore. That's a nice place to just put those very low-level credentials that you need when everything goes dark. And just mundane data center contact info. Who do you call to get access?
49:42
There's also stuff like building security, facility management just what you could need if everything goes dark, where do you start? That's always a good question to ask yourself. If the power goes out in the entire world where do you start to turn things back on? Where do you need access?
50:01
How can you bootstrap your entire infrastructure again? So, let's summarize sort of what our experience has been. Having the right tools is important, obviously. But I think if you can't
50:20
find them, don't be too hesitant to build them. Not everyone should build their own config management system and as you can see it can very easily turn out pretty badly if you do. But if you find it interesting and like me
50:41
I've been doing this for more than four years now and I still very much care about the subject and there are still releases about every other month. And also definitely open source it. Why not? Not everybody has to use it. Nobody knows about BundleWrap really. It's just like
51:01
over a hundred stars on GitHub. It's nothing compared to Ansible or anything. Maybe this talk will change some of that. But I'm okay with it. Open sourcing what you're doing there not only invites other people to collaborate on that and maybe contribute back and stabilize your project
51:20
but it also forces your own mindset to be more open to other users and avoids kind of building these special bells and whistles that really only you need. So that creates sort of a firewall against that where you will be naturally hesitant to put these little things in there that will only
51:40
trip other people up and that nobody else will need. So I kind of use that as an insurance that what I do there will be generally useful for other people as well. And that usually creates better software of course. Now I've touched on this a few times already but treasure your Git history.
52:00
When I was doing the research for this talk every now and then there was a small crowd that gathered behind my screen and was looking at how we did things three years ago and how horrible it all was. It was really such a great experience for the team just being reminded of the way that we came together and how we survived all this
52:20
crappy infrastructure that we had built around it and how fast we are now at processing changes and how many pull requests we are doing each month. And that really put things in perspective for the team and was a nice experience. Also being able to go back and visualize
52:40
that sometimes creates new ideas. While I was preparing this talk I thought of some ways to clean up metadata processors for BundleWrap. Some really advanced stuff that you don't really need to know about but it created this new sense of, okay, when I look at my
53:00
history, what can I learn from that? And you can only do that if you have your history available and if it's as complete as possible. So I'm really glad that we kept this same repository since 2009 and didn't move to a new one when we introduced BundleWrap. We kept it all in the same Git repository because
53:20
that makes it a lot easier to inspect your history and see how much activity there was, what events and changes increased or decreased activity because people were more hesitant to use the system. I find that really important. Another point is allow a culture to evolve and
53:42
evolve is really the key part here. I can't really say for sure how much this applies to other people but coming through this painful journey that we had was really important to deeply root config management in our operations
54:01
culture. Going from scratch, from really what features do we really need and having to build them for us was a huge plus. If we had been starting out with a super powerful config management system chances are that there are some features that we would use just
54:21
because we can. And that's dangerous. Like you saw in the terrible way how I abused bconfig2. So, acknowledge that config management carries with it a huge moment of inertia
54:41
and that if you go too quickly you will accumulate way too much technical debt. I know everyone is sick of hearing about this. But really try to take things slow. Accept that it will take years to really establish this. You can't go from one day to the other to managing your entire infrastructure and that's okay.
55:01
Acknowledge that it's a process, that it needs to evolve on the technical side, that your processes need to evolve and also that the mindset in the heads of your people needs to change. And they can only change slowly by learning each step and realizing why we made that step. And that just
55:21
takes time for humans in my opinion. So, where do we go next? One thing that I will need to address very soon is speed once again. Right now I'm generating the entire configuration for our entire infrastructure on my
55:42
machine right here with the default concurrency settings takes 4 minutes and 11 seconds. Now that's 700 nodes and there are very few situations where I would really need the entire infrastructure. But when you do a bw hash that's exactly what you need to do. So faster software is always better software
56:01
so I'll try and see if we can squeeze a few minutes out of that. The other interesting topic that we'll probably address in the coming months is orchestration. One of my colleagues expressed it quite nicely just earlier. What we're doing right
56:20
now is mostly configuration as code. Infrastructure as code, that we don't really do. We only manage what's inside our virtual machines but the setup of the virtual machine itself, that we still do using old-fashioned scripts and using some
56:41
graphical tools even. There we can expand more and also create that in a way that is tracked in Git so we know when was this virtual machine really added who did it, why did they do it and how did it all end up here. I've
57:02
also started work on a server component for BundleWrap but that's very different from the other configuration management servers that you may know. Some things that I want this tool to answer are: who applied where and when? What was the state of the node before the apply happened?
57:20
What Git revision was applied and who did it? Maybe even allow them to leave a comment on why they did that particular change. Then I want to automate looking into bw hash and just very easily show for each commit how many nodes are affected. Which ones? Which files on these nodes were affected?
57:42
As we add more and more data to our metadata, like information about whether this is a production system or what kind of service level agreement it has, we also want to expose that data that we already put in for other systems to consume. So that's another really interesting thing
58:01
that you can do once you have a certain amount of data committed in your repository. And then we can really take things further and maybe start tinkering with automatic bw apply runs. What we do right now is some guys,
58:21
every Monday, will do a bw apply all just to make sure there's not too much configuration drift across our infrastructure. Automating that will be tricky. I think there are a lot of fine details that we'll need to get
58:40
right. But it will be interesting to see where that goes. And then there will also be automated commits for trivial things like changing the on-call rotation. Which is another thing that just happens every week and maybe we can automate that as well. And that concludes
59:01
my talk. Thank you very much. If you would like to learn more about BundleWrap you can do that on bundlewrap.org. If you'd like to learn more about our company there's the website. We also have a booth. It's the one with the obnoxious LED wall. You really can't miss it if you come here. I put the slides up on Speaker Deck and you can always
59:20
find me on Twitter of course if you have any questions. Now we've already hit the time limit here. If you have any questions I'll be hanging out at a booth for the rest of the day and I'll be happy to chat about any kind of config management issues you want to talk about. Thank you.