Use case: Configuration Management in an enterprise Linux Team
License | CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers | 10.5446/32649 (DOI)
FOSDEM 2014 (35 / 199)
Transcript: English (auto-generated)
00:01
He's going to do a session about CFEngine and how they used it in his job to automate himself out of a job. So I guess it's also an inquiry for a new job. No, in fact, I'm starting Monday somewhere else. Oh, okay. That's not even starting. All right. So, Remi, how are you doing?
00:20
Thank you very much. So guys, welcome. Since a lot of people switched rooms: in the talk before this, Sean was talking about configuration management 101. And one of the cool things he showed, I also use when telling people about configuration management. It actually explains it in a very simple way, especially among technical people.
00:42
So go and delete it. So this is configuration management explained very easily. When you have to explain configuration management to managers and so on, you just tell them: you define the setting, and it will always stay like this. So that's pretty interesting. So what's the story today?
01:01
First, I'll present the use case of implementing configuration management in an already existing environment. And then we will zoom out a bit. And I'll tell you about the steps to take to implement configuration management in your organization or in your team when you're already working there.
01:24
It basically boils down to this: whatever happens, you should just use configuration management, because that's really the way forward. That's very interesting. So let me now introduce myself. My name is Remi Bergsma. I am an engineer who loves building and managing infrastructure based on open source technologies.
01:43
And it is my goal to automate what can be automated, so everything works rock solid and I have the time to explore new tools and techniques. So that's a win-win situation, because that usually leads to even better automation. Until last year, I worked at an internet service provider.
02:03
And my final project there was merging two high-speed companies onto a new shiny private cloud. And when I had automated everything around it, I decided to challenge myself, because I just wanted to do something new.
02:22
And I did that by accepting a job in a completely different environment than I had ever worked in before. So I joined an enterprise Linux team. So that was cool, right? So let me define enterprise, because that can be anything, right? In this case, this was a semi-government organization.
02:43
And there were around a thousand people working there, so quite big. And in fact, this organization was building its own software. At least they tried to do so. So that's why we had around 150 IT guys in the company. So this was pretty interesting.
03:01
It was a very classically defined organization. So of course, development and operations were properly separated in their own units. So no DevOps there. And even in the operations unit where I was working, there were many different teams. I mean, there were teams for everything.
03:21
All separated by specialty, mostly. So you've got the Windows team, the networking team, the storage team, everything. Application management. Actually, for every application, they had their own teams. And it was very interesting. And I joined the Linux team. So it was quite interesting to me. And before I tell you anything else about this team that I joined, let's just meet the team.
03:46
I'll show you the team that I joined. That will probably give you some idea of what the problem was that I needed to fix. So if you're still in there, fill the right slides. Don't worry, this is not a problem.
04:00
This is the team, more or less, that I joined. And we were very busy. We were working very hard to fix all kinds of problems. Actually, most work was done in response to some crisis or problem or whatever. So it was all reactive, more or less. And, well, we were pretty good at it.
04:21
I mean, look at the pictures, we're even smiling, right? And we were good at it because we had so many of the same problems over and over again that we knew the solutions already. So it was just: oh, this problem again, fix it. And then it was done, and on to the next one, etc. There was just one major problem with this approach.
04:43
And that was that our users weren't happy, because they didn't like having the same incidents again and again. They just couldn't understand why they had to keep asking us the same things. So let me give you a simple example of such a situation.
05:04
One of the cool things that had been automated properly was the deployment of new servers. So you could just spin up a new machine, you could just use PXE boot, and there was this menu where you could choose one of the five operating systems that we supported, press a button, and it was all done. So that was pretty cool.
05:22
And included in this installation was Postfix. This must be good. Everything was on an internal network, actually, and only some servers had access to the outside world through the DMZ. So in order to send an email, you just had to relay it somewhere, of course.
05:43
In this installation, there was no relay defined, so we would automatically deploy all kinds of servers, but the mailer was broken. So we got a lot of calls from our users, saying, hey, can you fix the mailer on this one? Hey, this is strange, I'm sending an email, and it gets stuck somewhere, so what's going on?
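The missing relay described here is the kind of setting configuration management is meant to pin down once. The following is a minimal Python sketch of the idea, not the CFEngine policy the team actually wrote: it makes sure a `key = value` line is present exactly once in Postfix-style configuration text. `relayhost` is a real Postfix `main.cf` parameter; the relay host name used in the example is hypothetical, and continuation lines in `main.cf` are deliberately ignored.

```python
def ensure_setting(config_text: str, key: str, value: str) -> str:
    """Return config_text with `key = value` present exactly once.

    Running it again on its own output changes nothing (idempotent),
    which is the property a configuration management tool relies on.
    """
    wanted = f"{key} = {value}"
    result = []
    found = False
    for line in config_text.splitlines():
        stripped = line.strip()
        # Name before the first "=", if any; a commented line such as
        # "#relayhost =" yields "#relayhost" and is left alone.
        name = stripped.split("=", 1)[0].strip() if "=" in stripped else None
        if name == key:
            if not found:
                result.append(wanted)  # rewrite the first occurrence
                found = True
            # later duplicates of the same key are dropped
        else:
            result.append(line)
    if not found:
        result.append(wanted)  # setting was missing: append it
    return "\n".join(result) + "\n"


if __name__ == "__main__":
    # Hypothetical freshly installed server with no relay defined.
    fresh = "myhostname = web01.example.org\n#relayhost =\n"
    print(ensure_setting(fresh, "relayhost", "[smtp.example.internal]:25"))
```

Because the function only describes the desired end state, applying it to a server that is already correct is harmless, which is what lets the same fix run on every machine.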
06:00
So that's what they really didn't like. We got many, many calls from them, almost on a daily basis, and it was kind of frustrating to our users. So when I was about three months in the team, something interesting happened. And that was that our team lead left the company, and I got to replace him. So we talked in the team, and I suggested changing the way we work.
06:26
And then we came up, while we were talking, with this mission. We wanted to go from firefighting, like we saw in the picture, to fire prevention. So referring back to the Postfix example, we would just want to fix the problems once and for all.
06:43
Let's just make it work, and don't frustrate your users with all kinds of problems. So I had used configuration management before, so I already knew there was a solution to this problem. I had used Puppet when I was working at the service provider, and it was interesting.
07:04
So the first idea was to use it again. But then we ran into some interesting problems. The problem was that most of the servers that we were running were running Java applications. Their memory had been tweaked a lot, so there just wasn't much room left between the operating system and the application.
07:25
So simply, we couldn't afford to run Puppet. So that was interesting. So what we did was review configuration management tools, and select the one that had the smallest memory footprint. So that's how we ended up with CFEngine 3.
07:41
It has a very small memory footprint, it's written in C, it's pretty quick. It was new to us, so that was interesting as well. And that became our first building block. In this solution, there were actually three building blocks: Git for version control, CFEngine 3 of course for configuration management, and then we used Vagrant to test it.
08:03
So about Git, I strongly believe you should always use version control, because it brings flexibility, and it allows you to revert to some earlier known-to-work state. So if something breaks today, you can always revert to yesterday's situation.
08:21
As the previous speaker already said, it is also very important to be able to see the history of what you're actually doing. In fact, all the version controllers also love this version control. Vagrant we were using for testing, because we had these five different operating systems.
08:41
Now that we wanted to use configuration management to control everything, we had better be sure it would actually work on all operating systems. So what we did was use Vagrant: you can just define a Vagrant box, which is like a template. You can just spin up new virtual machines and use them over and over again.
09:01
And those boxes were defined to be the same as if you had installed them using PXE. So we had clean servers to test with, and then used the boxes to run the tests and to spin everything up. This is the workflow that we were using.
09:24
Features, we would create on a separate branch, so you could work on multiple features. Multiple team members could work on multiple features. You could just commit whatever you wanted on that special feature branch and do any work that you wanted to do.
09:42
Once it was done, once it was tested using Vagrant, when it was working on all operating systems, we would merge this to the development branch. And everything in the development branch should at least work, should be tested using Vagrant. The only issues that may arise are some integration issues, because if two features are developed at the same time,
10:05
when you put them both in the development branch, they might not work together. So that's what we solved there. The next step was to add a little bit of testing. And that's where we were using the beta branch. Those were actually real servers that were running in the data center that
10:22
had the same characteristics as the other production servers that we would deploy on. But they were just not so important. So those were our own servers that we used to compile, for example, RPM packages or whatever. So we had like 10 servers or something to use for this beta branch. And in fact that gave us a fairly good view of how everything was working.
10:49
So then it's getting interesting, because there's the difference between pre-production and production. And in fact we understand. But what we did was try to involve our users in this whole process, because they weren't happy anyways.
11:01
We wanted to show them a more cooperative way of working. So we asked them, could you please assign us a few of your servers? And on these servers, which were production servers, we will deploy the configuration management code that we think is production ready, but we want you to verify.
11:22
So we ended up having like 25 pre-production servers that were selected by our users: you can use this one, if something breaks, it's less important. So that really built a lot of trust. And then we first deployed to pre-production. And then the thing worked. And we especially used it in the beginning to gain trust.
11:43
And later on the time gaps between the merges became smaller. Now actually I don't really think they're too interested in this anymore, because they trust us enough. They know we're doing a lot of great things. Finally we bring it into production.
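The promotion path described in this part of the talk (feature branch, then development, beta, pre-production, production) can be modelled as a simple ordered pipeline. The Python sketch below is only an illustration of that ordering; the stage names follow the talk, while the function names and the idea of a single `checks_passed` flag are assumptions made for the example.

```python
from typing import Optional

# Stages in the order a change travels, as described in the talk.
# Approximate server counts mentioned by the speaker:
# beta ~10 (the team's own servers), pre-production ~25 (picked by users),
# production ~450.
STAGES = ("feature", "development", "beta", "pre-production", "production")


def next_stage(current: str) -> Optional[str]:
    """Return the stage after `current`, or None if it is the last one."""
    index = STAGES.index(current)  # raises ValueError for unknown stages
    if index + 1 < len(STAGES):
        return STAGES[index + 1]
    return None


def promote(current: str, checks_passed: bool) -> str:
    """Return the stage a change ends up in after a promotion attempt.

    A change only moves forward when its checks passed (hypothetical flag:
    e.g. tests on all operating systems, or users verifying pre-production).
    """
    target = next_stage(current)
    if target is None or not checks_passed:
        return current
    return target


if __name__ == "__main__":
    stage = "feature"
    while next_stage(stage) is not None:
        stage = promote(stage, checks_passed=True)
        print("promoted to", stage)
```

Keeping the order in one tuple makes it hard to skip a stage by accident, which mirrors the point of the workflow: nothing reaches production without passing through the less important servers first.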
12:01
And we scaled it to around 450 servers at the moment, so it works pretty well. Let's have a look at what the Git repository looks like. It's a bit simplified. So you see all the branches over there from production all the way up to development. So everything at the bottom, so the FTP feature, is actually in every branch, and then you go up.
12:25
At the top, the Apache feature is currently being worked on. There are two commits that are currently being worked on. So if it's ready, if it's working properly, we would merge it to the development branch, right?
12:41
And we want to do that with a single commit. We want atomic commits, because later on if something breaks, in beta for example, we want to be able to revert just one commit, and not two, three, or five. So that's very important. We also shared this Git repository read-only with our users.
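One common way to land a feature as a single commit is a squash merge, and a single commit can later be undone with one `git revert`. The talk does not say which Git commands the team used, so the Python sketch below is an assumption: it only builds the command lines (it does not run them), and the branch name, commit message, and commit id in the example are hypothetical.

```python
from typing import List


def squash_merge_commands(feature: str, target: str, message: str) -> List[List[str]]:
    """Build the git commands that land `feature` on `target` as ONE commit.

    `git merge --squash` stages the combined changes without committing,
    so the following `git commit` creates a single atomic commit.
    """
    return [
        ["git", "checkout", target],
        ["git", "merge", "--squash", feature],
        ["git", "commit", "-m", message],
    ]


def revert_command(commit: str) -> List[str]:
    """Build the git command that undoes exactly one commit."""
    return ["git", "revert", "--no-edit", commit]


if __name__ == "__main__":
    # Hypothetical branch and message, for illustration only.
    for cmd in squash_merge_commands("feature/apache", "development", "Add Apache feature"):
        print(" ".join(cmd))
    print(" ".join(revert_command("abc1234")))
```

The commands could be passed to `subprocess.run` inside a real repository; they are kept as plain lists here so the sketch has no side effects.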
13:01
So they could actually see what we were building, what we were doing, and that was a very open way of working. So if they submitted the problem, they didn't see another entity. Involving users in this whole process is a very important thing, because they usually look at it from a different perspective.
13:22
So you can think of improvements yourself, but trust me, you'll get better ideas if you involve your users. And they'll like you for it, because now they finally understand that something is changing, and that we're building something that actually gets better and better. So in one of the meetings that we had, we were discussing what we wanted to do,
13:44
and we asked them, well, are there any things that you want us to add, etc. And then there was this guy who said, well, you know what, whenever I log into a server, since we have five different operating systems, I always have to find out which operating system this is running on,
14:01
before I can just do my work. And of course it isn't that hard, because I know them all by now, but it's pretty annoying. So, this is what we ended up with. Whenever you SSH to a server, a very simple program runs, only a small amount of code, maybe ten lines,
14:21
and it just prints all the information in there. And now everybody would just log into the server and immediately see: hey, this is Red Hat Enterprise, it's actually production, you should not be here. That's why we made it red. And some extra information is there as well. So that really made a lot of difference, even those small things.
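The login banner described here can be sketched in a few lines. The talk does not show the team's actual script, so the following Python version is an illustration under assumptions: it reads the operating system name from text in the `/etc/os-release` format (`PRETTY_NAME` is a standard key there, though older distributions may not ship that file), and the environment name is simply passed in as an argument.

```python
RED = "\033[31m"   # ANSI escape code: red foreground
RESET = "\033[0m"  # ANSI escape code: reset colours


def parse_os_release(text: str) -> dict:
    """Parse KEY=value lines (os-release format) into a dictionary."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        info[key] = value.strip().strip('"')
    return info


def login_banner(os_release_text: str, environment: str) -> str:
    """Build the one-line banner shown at login.

    Production servers get a red banner so nobody works there by accident.
    """
    info = parse_os_release(os_release_text)
    name = info.get("PRETTY_NAME", "unknown OS")
    text = f"{name} | environment: {environment}"
    if environment == "production":
        return f"{RED}{text} | be careful{RESET}"
    return text


if __name__ == "__main__":
    # Sample os-release content, for illustration only.
    sample = 'NAME="Red Hat Enterprise Linux Server"\nPRETTY_NAME="Red Hat Enterprise Linux Server 7.0"\n'
    print(login_banner(sample, "production"))
```

On a real server the text would come from reading `/etc/os-release`, and the environment would come from wherever the configuration management tool records it.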
14:45
So finally, our users became happier again. And we saw a significant drop in the number of incidents that we now have, because we were fixing problems proactively. So if anything happens, it would be fixed on all of our servers.
15:02
So the question is, how did we do that? How can you do this yourself? So how did we get started? Because if you want to start this and you're firefighting, then you're very busy, so how do you find the time to actually start working? So when I was looking back at this whole process, I found five different phases,
15:22
five different steps. So the first step: you should find out what to fix. So that's where we can be asked for, right? And of course, in this, there will be a lot of, I mean, this is a platform, this is, there will be a lot of what to fix. So if you're just entering a team and saying, wow, what are they drinking?
15:41
What's going on out there? So you should just do an investigation to find out why everything is working as it is, and make this list of problems you want to fix. So this Postfix thing was on our list. And why are we creating this list? Because you really want to find out what problems occur the most.
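Finding out which problems occur the most is a simple tally. As a minimal Python sketch, assuming incidents are available as a list of short category labels (the labels and counts below are invented for illustration, not data from the talk):

```python
from collections import Counter
from typing import Iterable, List, Tuple


def most_frequent_problems(incidents: Iterable[str], top: int = 3) -> List[Tuple[str, int]]:
    """Return the `top` most common incident categories with their counts."""
    return Counter(incidents).most_common(top)


if __name__ == "__main__":
    # Hypothetical incident log: one label per reported incident.
    log = [
        "mail relay", "disk full", "mail relay",
        "old package bug", "mail relay", "old package bug",
    ]
    for problem, count in most_frequent_problems(log, top=2):
        print(f"{count}x {problem}")
```

The categories at the top of such a list are the natural candidates for quick wins and, later, for the first configuration items.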
16:02
And the thing to do next is to find quick wins for those problems. So one of the things we did was quick-fix the Postfix trouble. We just used an ugly way to solve that issue and we knew we would be fixing it later on for good, but now at least we were rid of the problem. Another interesting quick win for us was installing updates.
16:24
It sounds crazy, but we were hitting a lot of bugs that had already been solved, but nobody cared to install the updates. So we did, and that saved a lot of time as well. And the thing here is you need to buy some time to invest in the final solution,
16:42
because you will want to implement configuration management. That's really important, because that's the way forward. So how do you start this? I would suggest, if you want to start with configuration management, first of all, just choose one of these. It doesn't matter which one, just go ahead and get started.
17:00
Find a peer that's already using one, and use the same tool. Then start building your baseline. And the baseline is your... I mean, try to find a maximum of five configuration items that are the same on all of your servers. So what we did was, we wanted to manage the root password,
17:20
we wanted to manage SSH to make sure you didn't log in through it, we wanted to make sure entropy was properly configured, and I guess we had to shoot only for that. So it's pretty easy to start building this solution. And then it's time to start scaling, because when you have this baseline, I would suggest scaling it out to all of your servers.
17:44
So go to the beta servers, then pre-production, finally production. So you end up having configuration management for only a small part, but on all of your servers, and this is actually very powerful. Because when you have this in place, every next configuration item you add to the mix
18:01
will be really powerful. So we added Postfix to the mix, and we just tested it on the five operating systems, and when it was done, everything was solved. So you can add every configuration item that you actually want, and it will really pay off quickly. And this is how we eventually found the time
18:22
to automate everything over there. So the final phase is to relax. We are now in control. Everything is in configuration management, everything works rock solid. There is a downside to this whole approach, because you might get bored.
18:42
Once this is done, I don't know what you should do next. You should look for your next challenge to begin with, and that makes it even more interesting. So one of the things I did was visiting conferences. So I went to the CloudStack Collaboration Conference in Amsterdam, and I attended a talk by Chris Deinhardt,
19:01
and he was talking about the future of system administration. It was a very nice talk. And he quoted someone from Google, who said: every 18 months, automate yourself out of your job. So that's pretty cool. Don't be afraid of the "out of your job" part,
19:20
because actually it means that whatever it is that you are doing today manually, you should not be doing 18 months from now. So that's very interesting. And that really made me realize: either look for a next challenge within this company, or move on to something more interesting, since the semi-government part isn't really my thing. It's not quite done.
19:41
Let me recap. If you want happy users, happy team members, happy stakeholders, happy everybody, you should really work with configuration management. I talked about five steps to get there. So find out what to fix,
20:01
implement quick fixes for the things that come to mind. It will buy you some time. Invest this time in configuration management. Work on a baseline and scale it up and out, so you will include everything, and you will be in a very relaxed position. There is just one extra thing I want to add to the slides.
20:25
So there are a lot of tools out there for configuration management. In fact, don't fight over which one is better, or which can do this and that, etc. The most important thing is that we all use configuration management.
20:40
So whatever happens, do use configuration management, because that is the software that will really help you, and it will bring stable environments. So that's what we all want, right? So tell everybody, spread the word about this, configuration management is really important. So if you guys want to get in touch,
21:01
I'll be around. I did not cover a lot of technical details, because I only have 25 minutes, but there are some of the technical details so don't ever look up there. Or come talk to me, we might have a few minutes for questions in a moment. The slides will be posted soon. In fact, I automated that part as well,
21:21
so we'll see anytime soon. There should be a tweet flying, so go have a look. Try to handle us over there. So thank you very much for your attention.
21:40
Anybody want to ask a question? We were in this very siloed environment where we were not allowed to touch any Windows. So we did that. But in fact, I think it's a good idea to have a district layer of your infrastructure.
22:04
So it would be a good idea to manage infrastructure and not just the infrastructure. So you guys can do it. Yes, you.
22:25
Yeah, that was part of the challenge. The question is, did you have any trouble getting everybody on board, the other team members? And the turning point really was when I became team lead. That's a little bit ridiculous, maybe, but that allowed me to just
22:40
involve everybody and have open communication, and that really helped getting people in. We asked for their ideas, et cetera, so everybody was actually quite enthusiastic. It was really interesting. Any other questions?
23:00
Yes? We did break things. So that's the thing: with great power comes great responsibility. But the good thing is, if you break things with great power, you can use the same power to fix them again. But you should be really careful
23:20
about implementing this. You should really use different types of branches and versions before you ever push something to production.
23:46
I guess I understand what you mean. The question is, if you're using the migration app, and you have enough time, can you implement it fast enough? Is that what you meant? Our users were requiring us to fix some things
24:02
and they were quite used to the firefighting approach, where we do it directly. Now we have the migration app, which takes more time. We thought of that as well. In fact, it turns out that not too many things need to be done immediately. So when we were up to speed,
24:22
we could release very often. In the event that we really did have some fire that needed to be fixed now, and you couldn't use configuration management, we just fixed it by hand and then implemented it in configuration management so it would be deployed. So you can still... Another thing to do is use a hotfix.
24:42
Maybe we can talk about it. Yeah, I think we're done. Thank you, Remi, for your talk. Thank you for the amount of nodding heads that I saw.