8 Years of Config Management

Video in TIB AV-Portal: 8 Years of Config Management

Formal Metadata

Title: 8 Years of Config Management: a journey through one company's challenges and learnings
CC Attribution 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Subject Area
Starting with a small Puppet deployment in 2009, followed by the spread of Bcfg2 and finally the development and full-scale adoption of BundleWrap, we explore how configuration management at //SEIBERT/MEDIA has changed over the years.
Keywords: Persistent Migration

Transcript

[…] and welcome to "8 Years of Config Management". My name's Torsten Rehn, and I work for a company called //SEIBERT/MEDIA, where I first started back in 2008. Now, why should you care about our config management? Here are some numbers: we currently have 727 nodes under management, our monitoring is checking [inaudible] services, and last month we had 21 different people committing to our config management repository. Overall they're averaging about 38 commits a day to that same repository. Getting to these numbers from zero was a very slow and steady process that took around 8 years, and many mistakes were made along the way. Some of them I want to share with you today. Our journey begins back in
2009. Now, what was the world like back then? Let me give you some context. At the beginning of 2009, Michael Jackson was still alive, [inaudible], and Barack Obama had just been elected to the White House for the first time. The Deepwater Horizon oil spill was still a year away, and I'm sure we all remember that it took a while to clean up. Osama bin Laden was still hiding in Pakistan, and the world was kind of worried about swine flu. Personally, I had just dropped out of university and started as a trainee in systems administration at //SEIBERT/MEDIA. It was during that time that I was given the task to build something that automates the management of our web hosting customers, and to keep me from breaking anything else. So, the situation was this: we had these web hosting customers. They weren't making us a lot of money back then, but they had supported the company since the beginning in 1996, so we kept them around and had to take care of them. In the beginning,
of course, there was a shell script that tried to get that started: you know, add a new customer, add a new website to an existing customer, add a new database, and stuff like that. But quite early on I started running into limitations of this. I wanted to improve on what we already had, [inaudible], and have more centralized management. So it took me maybe 2 weeks before I started looking into config management, and the first thing that I looked at was Puppet, which at the time used this cute little fellow as its logo. I couldn't find it anywhere on the Internet today, so that's just from an old presentation on my computer. At the time I just couldn't really get to grips with Puppet. My memory from that time is that the documentation was bad, and maybe it was just me being stupid, but I couldn't really make it work for me. Then I looked further and found Bcfg2, and that was a whole different story. The community was rather small, but in the best possible sense: they were really welcoming and they helped me out a lot, and eventually I even started contributing code back to the project. So that's where I could get started quickly and produce the results that I wanted. I had my little web server setup here, with
2 web servers and a database server, and I added a new virtual machine, sort of a controller, that ran config management, and also an LDAP server, because I was curious to learn more about LDAP and figured it would be kind of nice for hosting customers to use the same credentials for uploading data via FTP and for accessing their statistics on the web. Yeah, so that worked out really well for me, and that was sort of my kingdom. Around it was the vast sea of unmanaged systems that were really just administered in the traditional way: someone SSHes into a machine, does something, and it's done, and nobody ever knows what they did. So it went on that way for a while.

Fast forward to 2011, and this is where it got challenging for me, because I tried to expand my config management setup to cover the entire infrastructure. At that time, that meant different locations that were really strongly separated from each other and couldn't talk to each other. So what I eventually ended up with was
this scheme of a setup where a developer would commit to a master server, [inaudible], and the LDAP server was hugely expanded into a database that included not only user accounts but also the entire inventory of virtual machines and even websites; you could configure websites through LDAP. That was all terribly over-engineered, but to add insult to injury we had the separate locations that you see here, and so we needed to get the data out of the central repository to each location, but preferably only the data that the location needed for itself. What I came up with was OpenLDAP replication: you can restrict that replication with access control lists, and that was really probably the worst thing I ever did in my life. And not only did we have the locations, I also came up with the concept of a domain inside a location, holding a smaller class of systems and data, [inaudible] virtual machines, so that was another step in the replication chain. Then we eventually ended up running LDAP on every production system, because why not, it just takes a couple of megabytes of RAM, right? And it was a great way to make a change to a user on the master server and have it automatically replicated down to each production system. At least in theory; in practice it led to a lot of operational problems. The code for it was also really messy. This is from the very first commit that I found in this
repository, and it's atrocious. You don't need to read it; I just highlighted some parts. If you look down in the lower third, that's actually some kind of XML file, then you have some Python code inline at the top, which you can do with Bcfg2, and of course you have passwords committed in plain text, like you do. Down below you have this BIND config file and a massive template which has a for loop embedded in it to create multiple config files. It's a huge mess, terrible. Eventually we improved on this slightly by moving the Python code into a separate module, but it still retained that horrible way of thinking and was very unreadable. While it seemed powerful to me at the time, really what I was making was a huge mess.

Eventually we ended up with this kind of workflow. I couldn't make a pretty slide for it, so let me just run through it really quickly. First, obviously, you would commit your changes and push them. That would trigger a post-update hook to sync the contents of the repository to each Bcfg2 server in the locations that you saw earlier. But the LDAP replication wouldn't always work reliably; in fact, most of the time it didn't, so you had to SSH into each node along the way between the master server and the node you actually wanted, do something like [inaudible] the LDAP database and restart it, just to re-trigger replication the hard way. Then you could finally SSH into the target node and run the local Bcfg2 client there, which would pull the configuration from the Bcfg2 server and apply it to the machine. At the worst of times this took up to 70 minutes for a complex system like a monitoring server, which is terrible in itself, but we also had no real confidence in the tool. So we used the interactive mode that Bcfg2 has, where it asks you about each and every change it would make, and you would always know that someone in the office was doing this because you heard them hitting "no" and "no" and "no", very, very fast, very often, until they finally arrived at the change they actually wanted to make. Sometimes you'd accidentally skip that one too because you were on a roll, and then you'd need to start all over again. And then you could finally apply the change, notice that you had made an error or something, and then [inaudible]. It wasn't always this kind of worst-case scenario, but it could happen to you, and this was sort of the low point of config management for us that really made it clear that something had to change fundamentally.

And that brings us to 2012. That was the
time when we started looking at how we could get out of this situation. During that time Chef was finally getting a lot of traction and a little buzz around it, so I looked into that, and I remember seeing a statement on the Chef website back then: instead of spending hours trying to get Chef Server installed, we recommend a free hosted Chef account, and we take care of the Chef Server for you. Now, I get it, they're trying to sell a product, but when you're already assuming that I have to spend hours just getting this to work, wouldn't I be spending more time managing the management? I really didn't want to deal with that. So I looked further and still didn't like what I was seeing at the time, and ultimately I wasn't really enthusiastic about going the Chef way. Nowadays that would probably have driven me right into the arms of Ansible, but if you look at the first commit in the Ansible repository, it really just got started in early 2012, and Ansible wasn't really known to anybody at the time. This disillusionment with server-based config management systems kind of sent me down this path, which many people frown upon, but I was interested in the challenge: how hard can it be? Can't I just make this work? Why is this so difficult? So I tried to
do this myself. In July 2012 I started my first project trying to solve config management on my own, foolishly, and it was basically about fixing Bcfg2, because I liked a lot of its ideas. There was the idea of bundles, which most systems have some kind of concept of: you take a collection of items like files and packages and whatnot and bind them together, usually all revolving around a software package. Say, for the Apache web server we have an Apache bundle. But then I wanted a clear distinction for metadata that you would attach to each node, containing the specifics for that node, and combining bundles and metadata would yield the complete configuration for that particular node. I also liked that Bcfg2 had an interactive mode, because culturally people still couldn't trust config management. They were still scared shitless about running this machinery on a production server and just hoping that it would do the right thing, so we had to have some kind of human-in-the-loop mechanism where you could still say "oh no, I don't want this change, don't do that". And I liked that you could use Python anywhere, because I was very familiar with Python already, so I wanted to retain that power, now that I had learned where it can take you. There were also a couple of things I didn't like: I wanted a better template engine, I didn't want to have to write any more XML just to describe my infrastructure, and I wanted to speed things up by applying multiple items in parallel and not just sequentially, which of course meant I had to deal with dependency management. Also, one of the parts of Bcfg2 [inaudible] was very slow for some reason, so when we generated a huge config file for monitoring, it took a bizarre amount of time to generate. I still don't know why that is, but it was definitely annoying. I kept working on this for a couple of months, and then I realized I wasn't doing any good: I was just reinventing Bcfg2, taking some parts of it and reusing some of the code, but it still didn't seem worth it. So I figured my approach to solving this had to be way more radical. I didn't want to deal with
servers anymore: do I have to have one? Can I just take that out? I also didn't want to fiddle with agents running on each and every node and signing certificates for them and things like that. So this was the big disillusionment, and I really wanted to take this thing apart and put back together only the parts that I really, really needed. Sometime during that period, when I was reading a lot of man pages, I realized that for decades we've been writing shell scripts that use system utilities, and they were sort of an API for the whole system. If you look at how stable these things have been for years, they're more stable than some proper APIs, in my eyes. So I was wondering: couldn't I use the existing authentication channel that we already had, namely SSH, since everyone already had SSH access to our systems, and use the existing utilities on the system to manage it? Would that be feasible, or would it really be as hackish as it sounds? That was kind of when the whole vision for BundleWrap came together. I just tried it and wanted to see how well it went, and by the time I got to this point we were in
2013. In June 2013, a full year after I started on this idea of solving config management on my own, I [inaudible]. Now, from the get-go there were a lot of things that were impossible or hard to do with the existing solutions that I wanted to make possible. For example bw test: a simple command you run in any repository, and what it does is go through each node you have configured and actually render all the config files in there, so at least you know you don't have any syntax errors, plus some more internal consistency checks. If you look at the last 2 lines here, you can also define hooks to plug in your own custom validation on top. And because it requires no setup, you can just run it locally, and it's terribly easy to plug into a CI server. Now you can almost feel like a real developer, because you have these tools.
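
A minimal sketch of what a repository-wide test command does, under stated assumptions: config files are Python string.Template templates filled from node metadata, and a hook is any function that returns a list of error messages. Node, render_node, check_repository and no_empty_files are hypothetical names for illustration, not BundleWrap's API; the real command renders BundleWrap's own item types and runs more checks than this.

```python
from dataclasses import dataclass, field
from string import Template
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass
class Node:
    """Hypothetical node: metadata plus file templates keyed by path."""
    name: str
    metadata: Dict[str, str]
    templates: Dict[str, str] = field(default_factory=dict)


# A hook gets a node and its rendered files and returns error messages.
Hook = Callable[[Node, Dict[str, str]], List[str]]


def render_node(node: Node) -> Tuple[Dict[str, str], List[str]]:
    """Render every template of one node, collecting errors instead of stopping."""
    rendered: Dict[str, str] = {}
    errors: List[str] = []
    for path, source in sorted(node.templates.items()):
        try:
            rendered[path] = Template(source).substitute(node.metadata)
        except (KeyError, ValueError) as exc:  # missing key or malformed placeholder
            errors.append(f"{node.name}: {path}: {exc!r}")
    return rendered, errors


def check_repository(nodes: Sequence[Node], hooks: Sequence[Hook] = ()) -> List[str]:
    """Render all nodes locally (no SSH, no agents), then run custom hooks."""
    failures: List[str] = []
    for node in nodes:
        rendered, errors = render_node(node)
        failures.extend(errors)
        if not errors:  # only validate nodes whose files rendered cleanly
            for hook in hooks:
                failures.extend(hook(node, rendered))
    return failures


def no_empty_files(node: Node, rendered: Dict[str, str]) -> List[str]:
    """Example hook: flag files that rendered to whitespace only."""
    return [f"{node.name}: {path}: empty file"
            for path, text in rendered.items() if not text.strip()]


nodes = [
    Node("web1", {"hostname": "web1"}, {"/etc/hostname": "$hostname\n"}),
    Node("web2", {"hostname": ""}, {"/etc/hostname": "$hostname\n"}),
    Node("db1", {}, {"/etc/hostname": "$hostname\n"}),
]
failures = check_repository(nodes, [no_empty_files])
for line in failures:
    print(line)
exit_code = 1 if failures else 0  # a CI job would fail on a non-zero code
```

Because nothing here touches a remote machine, a CI job could run a check like this on every push and fail whenever the list of failures is non-empty.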

Another thing that I always found odd is that many config management systems just run whatever commands they need to and assume that those worked. I really had the urge to make BundleWrap check everything that it did, and that's why the code is set up in such a way that anything you can configure in BundleWrap can not only be set by BundleWrap, but BundleWrap can also read it back and check that it actually worked. That means there's a verify command that can go through each node and tell you the state of everything. That always seemed more complete to me than the dry-run options you can find in other systems, because it really just reads out what is there instead of making assumptions about what a command will do. After you run it, you get this nice table rendered in your terminal that tells you how many items there are on each node, how many of them are in a good state, and how many of them deviate from your config.
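
The read-back idea behind the verify command can be sketched as a comparison between desired and actual attributes. In this hypothetical version, read_actual stands in for whatever would really query the node over SSH (here a dictionary fakes the system), an item counts as good when every desired attribute matches what was read back, and the item ids and table layout are made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

State = Dict[str, object]


@dataclass
class Item:
    """Hypothetical item: where it lives and what the config asks for."""
    node: str
    item_id: str        # illustrative ids such as "file:/etc/motd"
    desired: State


# Reads the real state of one item; returns None if it does not exist.
Reader = Callable[[str, str], Optional[State]]


def verify(items: List[Item], read_actual: Reader) -> Dict[str, Dict[str, int]]:
    """Read every item back and count good vs. deviating items per node."""
    summary: Dict[str, Dict[str, int]] = {}
    for item in items:
        actual = read_actual(item.node, item.item_id)
        good = actual is not None and all(
            actual.get(key) == value for key, value in item.desired.items())
        counts = summary.setdefault(item.node, {"items": 0, "good": 0, "deviating": 0})
        counts["items"] += 1
        counts["good" if good else "deviating"] += 1
    return summary


def format_table(summary: Dict[str, Dict[str, int]]) -> str:
    rows = [f"{'node':<8}{'items':>6}{'good':>6}{'deviating':>11}"]
    for node in sorted(summary):
        c = summary[node]
        rows.append(f"{node:<8}{c['items']:>6}{c['good']:>6}{c['deviating']:>11}")
    return "\n".join(rows)


# Fake system state standing in for what would really be read over SSH.
FAKE_SYSTEM = {
    ("web1", "file:/etc/motd"): {"owner": "root", "mode": "0644"},
    ("web1", "pkg_apt:nginx"): {"installed": True},
    ("db1", "file:/etc/motd"): {"owner": "root", "mode": "0600"},
}
items = [
    Item("web1", "file:/etc/motd", {"owner": "root", "mode": "0644"}),
    Item("web1", "pkg_apt:nginx", {"installed": True}),
    Item("db1", "file:/etc/motd", {"owner": "root", "mode": "0644"}),  # wrong mode
    Item("db1", "pkg_apt:postgresql", {"installed": True}),            # missing
]
summary = verify(items, lambda node, item_id: FAKE_SYSTEM.get((node, item_id)))
print(format_table(summary))
```

Comparing only the attributes the config mentions means extra attributes on the real system are ignored; whether that is the right policy is a design choice this sketch simply assumes.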

Something that was also very important to me was being able to compose groups really the way I wanted to. Now, this is an example of how you can compose groups: you have this group of important stuff that has a couple of members that are just listed statically. That's possible, of course, but you can also use regexes to include all nodes whose names start with "cluster1", and you can go even further by defining functions, if you really need that power, to decide whether or not a given node should be in the group. Here we just look in the node's metadata to see whether it's marked as a production system, and if so, we add it to the group. After you've done that, you can even remove nodes again when you have those exceptions that you need to make, just because: you can remove them here at the end. So that's an example of how flexible BundleWrap is when composing groups in your infrastructure, which is obviously very important when you have a lot of systems and a lot of diverse groups of nodes that you're dealing with.
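
A sketch of the four ways of composing a group described above: static names, regular expressions on node names, a predicate on node metadata, and explicit removals applied last. GroupSpec, resolve_members and all node names are hypothetical; this is an illustration of the rules, not BundleWrap's groups.py syntax.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set


@dataclass
class GroupSpec:
    """Hypothetical group definition combining four membership rules."""
    static: List[str] = field(default_factory=list)       # names listed by hand
    patterns: List[str] = field(default_factory=list)     # regexes on node names
    include_if: Optional[Callable[[dict], bool]] = None   # predicate on metadata
    exclude: List[str] = field(default_factory=list)      # exceptions, removed last


def resolve_members(spec: GroupSpec, nodes: Dict[str, dict]) -> Set[str]:
    """nodes maps node name -> metadata; returns the group's member names."""
    members = {name for name in spec.static if name in nodes}
    for name, metadata in nodes.items():
        by_pattern = any(re.search(pattern, name) for pattern in spec.patterns)
        by_function = spec.include_if is not None and spec.include_if(metadata)
        if by_pattern or by_function:
            members.add(name)
    return members - set(spec.exclude)


NODES = {
    "cluster1-a": {}, "cluster1-b": {}, "cluster2-a": {},
    "web1": {"production": True}, "web2": {"production": False},
    "mail1": {}, "legacy1": {"production": True},
}
important_stuff = GroupSpec(
    static=["mail1"],
    patterns=[r"^cluster1"],
    include_if=lambda metadata: metadata.get("production", False),
    exclude=["legacy1"],  # production, but deliberately kept out
)
print(sorted(resolve_members(important_stuff, NODES)))
```

Applying the exclusions last is what lets them act as exceptions: legacy1 qualifies through the metadata predicate and is still removed.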

Another important concept is metadata and how it relates to groups. Consider this example: you have a group called "germany", and its metadata says that all nodes in Germany should use this particular nameserver. The "frankfurt" group is a subgroup of "germany", and it sets a different nameserver in its metadata. By default these will be merged, so node1, which is a member of the frankfurt subgroup, would have both nameservers. But we can also override this on the node: with atomic() in BundleWrap you can make sure that the other nameserver doesn't get added to the list, but is overridden instead. So that's an example of how you can use groups to assemble metadata for nodes and really override defaults when you need to.
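
A minimal sketch of layered metadata merging, assuming a group, then its subgroup, then the node are applied in that order, that lists accumulate by concatenation, and that a wrapper marks a value as replacing rather than merging. The Atomic class here is a hypothetical stand-in written for illustration, not BundleWrap's atomic() implementation, and the addresses come from the documentation-only IP ranges.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Any


@dataclass
class Atomic:
    """Hypothetical marker: this value replaces whatever was inherited."""
    value: Any


def strip_atomic(value):
    """Remove Atomic markers so the final metadata contains plain values."""
    if isinstance(value, Atomic):
        return strip_atomic(value.value)
    if isinstance(value, dict):
        return {k: strip_atomic(v) for k, v in value.items()}
    return value


def merge(base, override):
    """Deep-merge override on top of base without mutating either."""
    if isinstance(override, Atomic):
        return strip_atomic(override.value)          # replace, don't merge
    if isinstance(base, dict) and isinstance(override, dict):
        result = dict(base)
        for key, value in override.items():
            result[key] = merge(base[key], value) if key in base else strip_atomic(value)
        return result
    if isinstance(base, list) and isinstance(override, list):
        return base + override                       # assumption: lists accumulate
    return strip_atomic(override)                    # scalars: later layer wins


def node_metadata(*layers):
    """Apply layers from least to most specific: group, subgroup, node."""
    return reduce(merge, layers, {})


GERMANY = {"nameservers": ["192.0.2.1"]}
FRANKFURT = {"nameservers": ["198.51.100.1"]}      # subgroup of germany
NODE1 = {}                                         # inherits both
NODE2 = {"nameservers": Atomic(["203.0.113.53"])}  # opts out of merging

print(node_metadata(GERMANY, FRANKFURT, NODE1))
print(node_metadata(GERMANY, FRANKFURT, NODE2))
```

Keeping merging as the default and making replacement explicit matches the behavior described in the talk: inheritance accumulates unless a node deliberately opts out.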

You can take metadata even further, but I don't want to get into this too deeply: you can also define metadata processors, which are really just Python functions where you can mess around with metadata and put stuff in, in a very dynamic fashion. I worked on all of this for a while, and I really liked the concepts and the structures that came along with it, and that brings us to 2014,
when we were starting to use BundleWrap in production: not for that much yet, but we really tried to use it for real.

Now, let's talk a bit about what our infrastructure looks like. I have prepared this chart here: what you see on the x-axis is each and every bundle that we have, stretching all the way to the right, so just shy of 200 of them, and on the y-axis you have how many nodes these bundles are assigned to. A bundle for the Apache web server, for example, is probably right here somewhere, and you can see that this one over here is assigned to 500 nodes; that's just what this chart says. But from the average and the median noted here, we can see that there are very few bundles that are assigned to a lot of nodes, while the majority of bundles are assigned to just 3 nodes or fewer. If you put the same chart on a logarithmic scale, you can really see that around one third of all bundles apply to just a single node. So the infrastructure is very diverse: we don't have a lot of large clusters of nodes that all look the same. We have to care deeply about each and every individual system and often have to come up with special cases, and I think BundleWrap handles that very well.
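
The gap between average and median in the chart is typical of a skewed distribution: a handful of bundles on hundreds of nodes pull the mean up while the median stays small. The sketch below computes those figures, the share of single-node bundles, and a per-decade count that mirrors the logarithmic view. The bundle names and counts are made-up toy numbers, not the data from the talk.

```python
import math
from statistics import mean, median
from typing import Dict


def bundle_stats(assignments: Dict[str, int]) -> Dict[str, float]:
    """assignments maps bundle name -> number of nodes it is assigned to."""
    counts = list(assignments.values())
    return {
        "bundles": len(counts),
        "mean": mean(counts),
        "median": median(counts),
        "single_node_share": sum(1 for c in counts if c == 1) / len(counts),
    }


def log_buckets(assignments: Dict[str, int]) -> Dict[int, int]:
    """Count bundles per order of magnitude of their node count (1, 10, 100, ...)."""
    buckets: Dict[int, int] = {}
    for count in assignments.values():
        decade = 10 ** math.floor(math.log10(count))
        buckets[decade] = buckets.get(decade, 0) + 1
    return dict(sorted(buckets.items()))


# Toy numbers only: a few widely used bundles and many niche ones.
TOY = {"apt": 500, "users": 300, "apache": 120, "postgres": 4, "ldap": 3,
       "vpn": 2, "mailman": 1, "legacy-crm": 1, "backup-agent": 1}

print(bundle_stats(TOY))
print(log_buckets(TOY))
```

With these toy numbers the mean is above 100 while the median is 3, which is the same pattern the chart shows: the average says little about the typical bundle.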

One way to inspect a node with a complex configuration is bw plot, which generates DOT output that you pipe into Graphviz, and then it'll render an image for you. I've done this for a real node in our configuration as it is today, and this is the result. You probably can't see anything here, because what you're looking at is a 52 megabyte PNG file, so let's zoom in a bit. OK, now you can make out some things; you can see some of the lines. Let's zoom in even closer, and finally we get to see at least some of what's going on. What we're looking at is the apt bundle for package management on Debian and Ubuntu. It has different items, like files and actions and the packages themselves, and they're all connected through dependencies: the arrows that you see are all different kinds of dependencies. Take installing apt packages, for example. BundleWrap inherently runs things in parallel, so we need to take care with package management: because apt uses a lock file, you can't install two apt packages at the same time on the system, so you need to make sure that you install them one by one. The way BundleWrap does this is by daisy-chaining all the package items into a sort of dependency chain, making sure that each package depends on another package and so forth, so they can only be applied one after another. All the other stuff can still run in parallel, files for example, because you can obviously upload two files at the same time. Now, this plot of the whole node isn't terribly useful, but BundleWrap can also give you a trimmed-down version of it, where you can really see what's going on and where you might have introduced a redundant dependency or something.
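
The package daisy-chaining can be sketched with Python's graphlib (Python 3.9+): add an edge from each package item to the previous one, then pull batches of ready items from a topological sort, where every batch could run in parallel. chain_packages, parallel_waves and the item ids are hypothetical, and chaining packages in name order is an assumption; this is not BundleWrap's code. If the extra edges contradict an existing dependency, TopologicalSorter.prepare() raises CycleError.

```python
from graphlib import TopologicalSorter
from typing import Dict, Iterable, List, Set

Deps = Dict[str, Set[str]]  # item id -> ids it depends on


def chain_packages(deps: Deps, package_items: Iterable[str]) -> Deps:
    """Return a copy of deps where each package also waits for the previous one."""
    chained = {item: set(needs) for item, needs in deps.items()}
    ordered = sorted(package_items)  # assumption: chain order = sorted by name
    for previous, current in zip(ordered, ordered[1:]):
        chained.setdefault(current, set()).add(previous)
    return chained


def parallel_waves(deps: Deps) -> List[List[str]]:
    """Batches of items whose dependencies are all done; a batch could run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


DEPS = {
    "file:/etc/apt/sources.list": set(),
    "action:apt_update": {"file:/etc/apt/sources.list"},
    "pkg_apt:nginx": {"action:apt_update"},
    "pkg_apt:postgresql": {"action:apt_update"},
    "pkg_apt:vim": {"action:apt_update"},
    "file:/etc/motd": set(),
    "file:/etc/hostname": set(),
}
PACKAGES = [item for item in DEPS if item.startswith("pkg_apt:")]

for wave in parallel_waves(chain_packages(DEPS, PACKAGES)):
    print(wave)
```

Without the chain, all three packages would land in the same wave; with it they run one after another while the independent file items still share the first wave.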
been really important to me is on understanding configuration as a Markov tree no on should never at the Markov tree it's pretty easy to explain
actually I have a couple of items spread across 2 different nodes in this example so I will file a directory of services and then another node of another package and 5 not what I can do is look at how are these things configured and bond greater representation of that and run a
hash function over it. So now I have, for each item, one hash that tells me exactly how that item is configured. I can then take all of these hashes and aggregate them into one hash per node, by hashing all the item hashes for that node again. That gives me just one hash that represents the configuration of an entire node. And then, of course, I can take all the node hashes and hash them together, and now I have one hash value that represents the state of my entire repository. Let's see how that works in practice. What we're doing here
is: we have a node here, and I'm going to show what an item on it looks like. This is a file. You can see there's a content hash, and then some ownership and permission attributes, and that really is the entire configuration BundleWrap has for this particular file. Then we just aggregate that into a hash, as you see at the bottom; that's just the output of the previous command run through SHA-1. Then you can go one step up: OK, now show me the hashes for all items on this node. It just goes through the items here, the files and the package, and they end up with one hash each. Go up another level: OK, now you
can show me just the aggregate hash for this particular node, and again, that's just the output of the previous command run through SHA-1. From
there, we go up even further: OK, show me the hashes for all nodes. Finally we end up with just bw hash, which will generate your entire configuration, generate a hash value from it and display it. Now, why is this important? I'm not aware of any other config management system that does this. Say you're doing some refactoring in your config management. You're just trying to clean up some stuff, and that's all very complicated, but you don't want to make any actual changes to your nodes; you're just trying to produce the same result in a different way. What you can do is generate this hash before and after, and if they match, you can be confident you didn't cause any unintended changes on
your nodes. That's a really powerful assurance. You can also use it for other things. Say you made some changes but you're not really sure how many nodes are affected. By comparing the node hashes before and after, you can tell exactly whether a node has changed between two points. I love this feature and I think it's really powerful; once or twice I think it pretty much saved my day, when I made a huge change that impacted lots of things. So these are some of the more advanced features of BundleWrap. You won't use them every day, but they show the possibilities that derive from the internal structure we came up with. Yeah, and with that we can make another jump, into 2016. Now, that
was a huge year for our config management, and you can see that just from the number of commits to the repository. What you
see here in the graph is not the total number of commits, but just how many commits were added each year. It's pretty easy to see that things start out relatively slow in 2009 and slowly work their way up until 2015, even though we had this horrible infrastructure behind it. But then in 2016, activity almost tripled, and that was a huge deal for us. Now, why was that? I have prepared here a
chart of the number of nodes that we managed, and again you can
see that config management slowly started in 2009 with us just tinkering with it. Adoption increases between 2012 and 2013, when we made it mandatory to put all changes to our infrastructure under config management. Then in 2014 we finally started using BundleWrap, and it took us almost two years to migrate from Bcfg2 to BundleWrap. That's a very long time. Config management systems always carry this huge moment of inertia: you don't really want to rewrite your entire configuration, but to some extent you have to when
switching, and that's really painful to do. But once we got through that phase, as you can see, the rewards were quite visible: as soon as we switched off Bcfg2, activity increased dramatically. That really felt liberating, because we had collected a lot of experience over the last few years, and now it felt like we had arrived where we wanted to be. Another important change that we made during 2016 was
mandatory pull requests, and that's
one of those changes where I have absolutely no idea how we ever lived without it, back when everyone would just push to master and think it's probably OK. When you introduce a change like that, there is obviously a lot of hesitation. How much time will we spend doing reviews? I can't make any changes immediately, I have to wait for this stupid little review; what if nobody has time and they only review my changes tomorrow? I need to do it now! That was a problem, and we addressed it with node locking in BundleWrap. You can say: OK, I want
to lock this particular item on this particular node for the next three days.
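Locks like this can be modelled quite simply. The sketch below is hypothetical and is not BundleWrap's lock format. The SoftLock class, the selector syntax and the rule that a lock never blocks its own owner are all assumptions. A selector is an item ID or glob, "bundle:<name>" for a whole bundle, or "*" for the whole node. Each lock has a short random ID, an owner, an expiry (three days by default here) and an optional comment. An apply skips every item covered by someone else's active lock.

```python
import fnmatch
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Tuple


@dataclass
class SoftLock:
    """Hypothetical soft lock on some items of one node."""
    lock_id: str           # short random ID shown when listing locks
    owner: str             # who placed the lock
    expiry: datetime       # locks always expire
    selectors: List[str]   # item IDs or globs, "bundle:<name>", or "*" for the whole node
    comment: str = ""      # why the lock exists, e.g. a pending pull request

    def is_active(self, now: datetime) -> bool:
        return now < self.expiry

    def covers(self, item_id: str, bundle: str) -> bool:
        # A selector matches either the item's bundle or the item ID itself (globs allowed).
        return any(
            sel == f"bundle:{bundle}" or fnmatch.fnmatchcase(item_id, sel)
            for sel in self.selectors
        )


def add_lock(owner: str, selectors: List[str], days: int = 3,
             comment: str = "", now: Optional[datetime] = None) -> SoftLock:
    """Create a lock that expires after `days` days (three by default)."""
    now = now or datetime.now()
    return SoftLock(secrets.token_hex(4), owner, now + timedelta(days=days),
                    list(selectors), comment)


def plan_apply(items: Dict[str, str], locks: List[SoftLock], user: str,
               now: datetime) -> Tuple[List[str], List[str]]:
    """Split a node's items (item ID -> bundle name) into (to_apply, skipped).

    Expired locks and locks owned by `user` are ignored; any item covered by
    another user's active lock is skipped, everything else is applied.
    """
    blocking = [lock for lock in locks if lock.is_active(now) and lock.owner != user]
    to_apply, skipped = [], []
    for item_id, bundle in sorted(items.items()):
        if any(lock.covers(item_id, bundle) for lock in blocking):
            skipped.append(item_id)
        else:
            to_apply.append(item_id)
    return to_apply, skipped


if __name__ == "__main__":
    now = datetime(2017, 5, 1, 12, 0)
    lock = add_lock("alice", ["file:/etc/motd"], comment="waiting for PR review", now=now)
    items = {"file:/etc/motd": "base", "pkg_apt:nginx": "nginx", "svc_systemd:nginx": "nginx"}
    print(plan_apply(items, [lock], "bob", now))
    # (['pkg_apt:nginx', 'svc_systemd:nginx'], ['file:/etc/motd'])
```

The expiry matters here. A forgotten lock stops blocking other people on its own once its time is up, without anyone having to clean it up.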
You can also lock entire bundles on a particular node, or even the entire node; it's really quite flexible. You can also show which locks are present on each node. As you can see here, they are identified by these simple IDs, they always have an expiry date, and they affect certain items. What you can't see on this slide is another column at the end for a comment that you can leave, so people know why you locked this particular node and what you're doing. And now when someone else tries to apply
configuration to this node, they will just skip that particular file, and they can still do the other work they're doing on that node. In the example, we have a lock for three days to get a pull request reviewed. Once that has happened and the changes are merged into master, you can remove that lock and everyone will be in the same state again. Now, this process of course isn't perfect: people still forget to lock nodes and then overwrite each other's changes. We're still working out how we can prevent that and make locking more intuitive and easy, so you will do it
automatically. But for me, this is one of the strong points of being able to apply configuration directly from your machine, where you control the state, where you can work in
your own branch and just apply changes directly to a node. Sometimes you need to make changes right now: maybe you're setting up a DNS change for a customer, or you need to react to some situation, or you just need to meet a deadline, whatever. You can absolutely do that and still get the benefit of code review. Another important feature that we came up with is
secrets. When you create a new BundleWrap repository, it will automatically generate these two keys for you. One neat way you can use them is for passwords you don't really care
about. So what we're doing here: we have this file, and we just want to write a password into it. We do that by just passing any string that somehow describes the password to the password_for function. It will take your string, which can really be anything you want, and derive the password from that string and the key in the secrets file. Now, this is very useful in situations where you control both sides of the password. Think of a web application that needs to access a database. We don't really care what the password looks like or what it is, as long as it's one secure password that is configured the same way on both sides, in the database and in the web application. And the cool thing is, when someone leaves and you switch out the secret and apply the configuration to all nodes, you automatically
rotate all these passwords, but they will still match on both sides even though they've actually been changed. So that's a nice side effect. And that brings us to 2017, so we're
almost there. And this chart that I have here is probably the one I'm most proud
of. It shows how many contributors, meaning different people, committed to our config management repository each month. You can see that in the early years it was just me, working on it off and on. Over time more people came on board, but things really exploded at the beginning of 2017. We peaked last month with 21 different humans committing to that repository, coming from seven different teams in our company. That's a really great kind of adoption to have. I think the best way to pull this off is to get one team really trained in it and then send out ambassadors to the other teams. Embed them there for a while as a colleague who can help, and really enable these other teams to make these kinds of infrastructure changes in a way that is still reviewed through pull requests. And through pull requests, and the conversations that go on in there, you can really make those other teams aware of the challenges
that the operations department has when it comes to maintaining all of this configuration. Of course, the other teams also see everything else that's going on in the infrastructure, and so get a better sense of where their application fits in. Now, these charts wouldn't be complete if I didn't
show you the progression of the number of nodes. As you can see, the data from the Bcfg2 era is rather spotty. Remember, we kept all this information in LDAP, in a database, not in the repository. So for the state two years ago and earlier, these are just ballpark numbers, taken from old backups that I found lying around. With BundleWrap, you can see the data is much more accurate, because finally we had each node committed to the repository as text; we had finally realized that we don't need that in a database. Now, as we went past 700 nodes, that also brought with it some more
challenges: how to do inventory for these, and how to give an overview of what's going on. A pretty cool feature that's
relatively new is metadata tables. You can just look at a certain metadata key for a certain group, and it will give you a nice table. Here you can see all our systems running on Google Compute Engine, and which of these are in production and which aren't. These tables can also be rendered quite easily in a more grep-friendly format, and that lets you form some really
powerful queries on your command line. So suddenly BundleWrap itself feels like a database again. If you need to come up with any sort of list for whatever purpose, you can very easily do that in your shell, using just grep and sort and all those other tools. Now, scale is also an important issue for us, mostly not in sheer numbers, but in complexity and diversity. Take a look at this
picture. This is generated automatically from data in our BundleWrap repository and shows what we call our site-to-site network. All these nodes that you see here represent different locations that we have in some way, shape or form. You can see we have our own data center in Frankfurt, we have several at ProfitBricks, we have some on Google Compute Engine, and
some on AWS, spread all over the place, and of course it also includes
office locations. Now, all of these locations are connected through IPsec VPNs, with BGP doing dynamic routing between them. That's really
all very complex, and it would take days and days to set up manually. But with config management, once we had it all figured out, we can describe it at a very high level. This is taken from our
metadata, where we just say: OK, you have some notion of what that location is. Then you just define which locations you want to create a connection to, and write that down in
this particular piece of metadata. And then you describe each location in more detail, so you
can tell it which networks it should announce over BGP and which AS number it has. Just from that we get all these connections and all the dynamic routing, so if one location goes down for some reason, we can
route around it. It has worked really well. In my opinion, this kind of thing just isn't possible without config management, because you'd be running in circles all day and never know where you need to make a change to fix your connection. Config management really shines here, because you can make sure that every one of these locations is configured the same and talks to the others in the same way. That is of course especially important with IPsec. Now, let's take a minute to talk about integrations. I've previously
said that talking to LDAP was one of the huge mistakes we made back then, because it creates this dependency. At some point the whole thing was also very inefficient: it ran tens of thousands of queries just
during one single apply operation, and that of course took a lot of time. Now, with the information that's contained in the repository, you can obviously create images for documentation
like the one I showed you earlier with the site-to-site network. You can also talk to your domain registrar or your DNS system to see what you've configured and whether there are unused domains, and clean up there. And since you're also configuring IP addresses for virtual machines, you can push them into your IP address management system. That's all pretty easy, because it's just a script that lives in that repository. You can run it every now and then to update information in other systems; you don't need that to be live at all. But taking data in from other sources was a little more tricky. The classic use case is that we still want to keep all users in LDAP in some way, because that's a great database for storing user accounts. The way we solved this is a simple JSON dump. It sounds like a really basic idea, but it works quite well in practice, because you
don't add users every day. We did add a lot of people last year, but even then it's OK to just run the simple script again every now and then and pick up the new users. So we take all the information we need out of LDAP, dump it into a JSON file and commit that to the repository. Now BundleWrap just reads from the JSON file in the repository and always has the information available. That is a huge deal, because now we can work offline: on a train with no internet, where you can't talk to the LDAP server, it's not a problem. And what happens when the LDAP server goes down and you need to set it up again on a different virtual machine, but your config management depends heavily on LDAP? You end up with a chicken-and-egg problem, where your config management can't run because the LDAP server is down. With the information cached locally, you can
always use it. There's another case where we need secrets: we keep all
kinds of passwords that need to be read by humans, and also things like private keys, in a piece of software called TeamVault. That is a live connection: BundleWrap will, during
the apply process, talk to that system over an API and pull the data out. But I still don't want this connection to be mandatory. So the way we solved this is that you can switch it into a dummy mode, and then it will always return dummy values for passwords, private keys and stuff like that. When you're developing on a train, you don't really need the actual certificate or the actual password. You just need something that's there and kind of looks like a password, so you can switch globally into that mode while you're developing. And doing things this way has a very nice property: I can tie one Git commit hash in your
repository directly to one of the bw hashes that you can generate. As long as I switch off the live connections and replace them with dummy values, I always end up with the same bw hash for the same commit. So even 10 years from now, I can go back and be confident that I don't depend on any external database to reconstruct what I was doing 10 years ago. That became very apparent to me when I was doing the research for this talk: so much data from the past was just gone, because we were pulling it in at apply time from some other system that doesn't have that kind of history. So I think it's important to keep parity between what my configuration looks like and what's in the Git repository, and to always keep the two as tightly coupled as possible. Of course, this also lets you do more interesting things with your history, where you can use git bisect
more effectively, because going back in Git now also rolls back the changes to that data. You can't roll back changes in an external database while you're bisecting to find the problem that was introduced when you added a new node or something, if you put that in LDAP. Now, having everybody work on one giant repository with a huge history has the benefit of it being a sort of poor man's backup. If everybody works with that repository every day, they're also pulling it almost every day, so you can assume that every one
of your ops people, and even the developers using the system, will have a reasonably recent checkout of that repository. We abuse that by putting more stuff in there that isn't directly used by BundleWrap, in what we call the emergency kit. That includes, for example, a really primitive list of phone numbers. In a large outage you need to call your colleagues to help you, and chances are you won't have all those numbers at hand when we actually have a problem, and
so this way you can be confident that everyone can reach everyone at any time. We also put in a complete
grep-friendly dump of our IP address management system there. So even if DNS completely goes down, we can still find stuff and figure out which IP addresses to use. It's often overlooked how much we rely on DNS to figure out which IP addresses to connect to, and what happens when it goes down, so we kind of insure against that failure mode as well. Critical secrets are also in there, and of course they're encrypted with the secret keys you saw earlier. That's just to prevent the
other chicken-and-egg problem, like: OK, so the node that hosts the password management system just went down. What's the admin password for it? OK, I'll just check the password management system... but I can't reach it. So that's a nice place to put those very low-level credentials that you need when everything goes down. Also the data center contacts: who do you call to get access, plus stuff like building security and facility management, just whatever you could need if everything goes down. Where do you start? It's always a good question to ask yourself: after a power
outage in the entire world, where do you start turning things back on? Where do you need access? How can you bootstrap your entire infrastructure again? So, to summarize our experience: having the right tools to
support you is obviously important, but I think if you can't find them, don't be too
hesitant to build them. Not everyone should write their own config management system, and as you can see, it can very easily turn out pretty badly if you do. But if you find it interesting, do it. I think I've been doing this for more than four years now, and I still very much care about the
subject; there's a release about every other month. And also, definitely open-source it. Why not? Not everybody has to use it. Hardly anybody knows about BundleWrap: it has just over a hundred stars on GitHub, which is nothing compared to Ansible. Maybe this talk will change some of that, but I'm OK with that. Open-sourcing what you're doing not only invites other people to collaborate on it, contribute back and stabilize the project, but it also
forces your own mindset to be more open to other users, and it keeps you from building special bells and whistles that really only you need. It's sort of a firewall against that: we are naturally hesitant to put little things in there that would only trip other people up and that nobody else needs. So I can use that as insurance that what I do there will be generally useful for other people as well, and that usually creates better software, of course. Now, I
have touched on this a few times already, but treasure your history. While I was doing the research for this talk, every now and then a small crowd gathered behind my screen, looking at how we did things three years ago and how horrible it all was. It was really such a great experience
for the team, just being reminded of the way we came and how we survived all this crap infrastructure that we built around it. It also showed how far we've come: how we process changes now, and how many pull requests we're doing each month. That really put things in perspective for the team; it was a nice experience. Being able to go back and visualize that also sometimes creates new ideas. While I was preparing this talk, I thought of some ways to clean up metadata processing in BundleWrap, some really fun stuff you don't need to know about. But it created this sense of
"OK, when I look at my history, what can I learn from that?" You can only do that if you have your history available, as complete as possible. So I'm really glad that we kept the same repository since 2009 and didn't move to a new one when we introduced BundleWrap. We kept it all in the same Git repository, because that makes it a lot easier to inspect your history. You can see how much activity there was, which events and changes increased it, and which decreased it because people became more hesitant to use the system. I find that really important. But the point is: allow a
culture to evolve, and "evolve" is really the key part here. I can't say for sure how much this applies to other people, but on the painful journey we went through, it was really important to deeply root config management in our operations culture, and to start from scratch: what features
do we really need? Having to build them ourselves was a huge plus. If we had started out with a super-powerful config management system, chances are there would have been some features we'd use just because we could, and that's dangerous. So acknowledge that config management carries a huge moment of inertia, and that if you go too quickly, you will accumulate way too much technical debt. I know everyone is sick of hearing about this, but really try to go slow, and accept that it will take years to really establish this. You can't go from one day to the next to managing your entire infrastructure, and that's OK. Acknowledge that it's a process and it needs to evolve. On the technical side, your processes need to evolve. The mindset of your people needs to change too, and they can only change slowly, by learning each step and realizing why we did this stuff. That just takes time. So, where do we go next? One thing
that I will need to address very soon is speed, once again. Right now,
generating the entire configuration for our entire infrastructure on my machine right here, with the default concurrency settings, takes 4 minutes and 11 seconds. Now, that's 700
nodes, and there are very few situations where I really need the entire infrastructure. But when you run bw hash, that's exactly what you need to do. Faster software is always better software, so I'll try and see if we can squeeze a few minutes out of that. The other interesting topic that we're not properly addressing in BundleWrap is
orchestration. One of my colleagues expressed this quite nicely just earlier: what we're doing right now is mostly configuration as code; infrastructure as code is something we don't
really do. We only manage what's inside virtual machines; setting up the virtual machine itself we still do with old-fashioned scripts and some graphical tools. We can expand there too, and do that in a way that is tracked in Git. Then we know when a virtual machine was really added, who did it, why we did it, and how it all ended up here. I'm also considering some work on a
server component for BundleWrap, but that would be very different from the servers in other configuration management systems that you may know. Here are some questions I want this system to answer. Who applied what, where and when? What was the status of the apply when it happened, which revision was applied, and who did it? Maybe it could even let them leave a comment on why they made that change. Then I want to automate looking into bw hash and very easily show for each
commit: OK, how many nodes are affected, which ones, and which files on these nodes were affected? We keep adding more and more data to our metadata, like whether this is a production system and what kind of service level agreement it has. We really want to expose the data we've already put in there to other systems to consume. That's another really interesting thing you can do once you have a certain amount of data committed to your repository. And then we can really take this further, and maybe do some
tinkering with automatic bw applies. What we do right now is that every Monday, someone runs bw apply on all nodes, just to make sure there's not too much configuration drift across our
infrastructure. Automating that will be tricky; I think there are a lot of fine details that we need to get right, so it will be interesting to see. And then there will also be
automated commits for trivial things like changing the on-call rotation. That's just something that happens every week, and maybe we can automate it. And that concludes my talk. Thank you
very much. If you'd like to learn more about BundleWrap, you can do that on bundlewrap.org, or if
you'd like to learn more about the company, there's our website. We also have a booth here: it's the one with the largest LED wall, and you really can't miss it if you come by. I'll put the slides up on Speaker Deck, so you can always find them there. And of course, if you have any questions: we're already at the time limit here, so if you have any questions, I'll be
hanging out at the booth for the rest of the day, and I'll be happy to chat about any kind of config management. Thank you.