The Softer Side of DevOps

Formal Metadata

Title
The Softer Side of DevOps
Number of Parts
50
License
CC Attribution - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this

Content Metadata

Abstract
Previously I've spoken extensively at ChefConf about the technical aspect of devops. How to implement the technologies, controls, tools, code, etc. But over the past few years people have asked more and more about the social aspect. How did we get countless teams across a large company to do this? How do you get buy-in? How do you sell it? How do you handle the teams who you don't think can cut it? What about the teams that are stuck in the past? How do you build or transform your team/teams/department/company? Getting one team to do it is easy - but it doesn't get you where you want to go. You have to get everyone in. That's what this talk will focus on: the soft-skills side of devops.
Transcript: English (auto-generated)
Thanks guys, so this is going to be a culture talk which is a little bit new for me, so we'll see how it goes. I'm going to be talking about the culture side of DevOps, or as I like to call it, the softer side of DevOps. And before we jump too deep into that, I want to start really quickly with a few
words about who I am and how this talk came to be because I think it's relevant. So as you said, I'm a production engineer on the operating systems team at Facebook. I'm also a Chef client maintainer, and in fact I was the first Chef client maintainer outside of Chef. I used to work as an SRE at Google, and before that I hacked on a configuration management system at Ticketmaster called Spine.
And some of you may have seen three years ago at ChefConf I gave a keynote, and in that keynote I talked about how Facebook had engineered a move from CFEngine 2 to Chef, and I talked a little bit about how the reason we did that, in addition to wanting the new technology, was to build a cultural paradigm, to move the way in which we manage
systems both culturally and technically. And I mentioned that sort of in passing, but the talk was heavily focused on the technology, on the tools we wrote in open source at the time, the way we wrote our cookbooks, the way we were managing our Chef servers, it was very technical. And so I gave that talk, and afterwards, as you might imagine, I got a lot of questions
about how did you write this thing, and how did you implement that thing, and blah dee blah dee blah. So I take that talk and I go on the road, and I give it all around the world. And as I did, people started asking me different questions. People started asking me, how did you convince this kind of team? How did you convince that kind of team? How did you get buy-in from this manager, or that kind of manager? All these sort of cultural questions.
And for sure, I had been involved in the cultural discussions at Facebook and trying to give my opinions on how culture should work and working towards that culture. By the same token, it wasn't my focus. My focus was the technology. So when people asked me these questions, I had to think a lot harder about the answers. And truth be told, I may have been bullshitting a little bit in the beginning.
So I get asked these questions more and more, and then fast forward a year or two, and Chef, the software company, at the time Opscode, invites me to the Chef Leadership Summit. And I was a little surprised cuz I don't consider myself a leader. I mean, my title is tech lead, but I'm not a manager, I'm not a director,
I'm not a VP, I've certainly never owned a company. So I was a little surprised. And I realized with a little bit of terror that I was about to walk into a room full of venture capitalists and CEOs and vice presidents. And if there was anyone in the world who would know that I was bullshitting on these sorts of topics, this was the room full of them.
So I sat down and I thought really hard about all of these things I'd been saying to people for the last few years and made sure that I really had thought through them to their logical conclusion, talk to a bunch of people, walk into the room, have some conversations, goes well. Well, at this point, now it's on my mind.
So as I'm going to more conferences and more events and talking to friends and colleagues around the world, I'm now actually bringing this up, not just being asked about it. And I was talking to Adam Jacob about it and a bunch of other people. And Adam in particular kept telling me I should give a talk on this. Hence, DevOps the talk. So, as I said, this is gonna talk about the culture side of things.
And if I ask anyone in this room what DevOps is, you'll have an answer. Well, most of you. But you'll have different answers. John's gonna have a different answer than Alon. And that's because we all sort of inherently know what DevOps is, technically, culturally. We sort of have an idea, it's a thing that we do or we're trying to do.
We don't really know what it is cuz it's a lot of different things, right? There's no single definition. Some of you will say, Phil, it's right there in the title. It's Dev and Ops, come on. Developers working with operations people. Maybe you'll tell me about how your developers are on call.
Maybe you'll talk to me about how your operations staff knows a lot about the code and is embedded with the application's developers. You'll talk to me about how they've built a relationship, about how they work together and they have overlapping skill sets. Or, as we would have called it before the DevOps movement, human sacrifice, dogs and cats living together, mass hysteria. Turns out, it works.
Works pretty well. And if I were to ask others of you, you would talk about what Barry talked about in his keynote yesterday. Taking idea to customer as fast as possible. Continuous delivery pipelines that pull in code from operations and Dev and test it together and move it out to that customer as fast as possible. The velocity from idea to customer.
And all of the technology and ideas that go into making that a reality. And that would also be true. That's also part of DevOps. And the reality of it is most of you are gonna give me a definition somewhere on the spectrum, but I wanna go back in time. I would like to take you back to the late 90s.
Before DevOps was a thing, before anyone had ever said the word DevOps or thought the word DevOps. And for those of you who worked in this time, or who have ever seen a talk about how we got into DevOps, you've seen the classic dev wall ops fucking diagram, right? And it turns out that there's more than just people
throwing code over the wall. Developers felt that they were irreplaceable, because they knew how to mmap. They knew how to optimize applications. They knew how to write algorithms. And your sys admins, we felt that we were irreplaceable because we could configure storage and run useradd, and certainly no one else could.
And so you'd have a developer, and they'd write their application, and they'd throw it over the wall, and sys admin would try and deploy it. It wouldn't work, and he'd throw it back over the wall. And everything was terrible. Now it turns out it worked. I mean, in the late 90s and early 2000s we shipped billions of lines of software to people all over the world.
We made massive websites, lots of things happened. But it wasn't as good as it could have been. And so right around this time, I took a job. I was a young lad. I took a job as a wee junior sys admin, getting paid almost nothing because I wanted to learn.
This was me. And at the time I was working at a little dot-com with a guy who was really well known in the community, he was a senior sys admin. And I asked the guy, I said, well, what makes a senior sys admin? What makes you really good at your job? How can I be really good at my job?
And he said, well, I can think of two things. One is automation. That laziness that is inside of you that makes you want to automate the crap out of your job so you never have to do it again. I was like, I got that one. And so the second one is you gotta be a good enough software developer to be able to A, write that automation. And B, debug the applications you have in production and
talk to the developers about the problems you find and help them with the problems that they find. So I was lucky, I got a good answer. But what's interesting about that answer is that it's operations people knowing code, operations people having a good relationship with their developers,
having an overlapping skill set. I don't know if that sounds familiar to you. It's one of those things that we talk about when we talk about DevOps. And around that same time, Google made a team called SRE, Site Reliability Engineering. And their marching orders were to hire operationally minded engineers and embed them in application engineering groups.
And they would help those application developers to write more diverse, more reliable, redundant, distributed applications that had better telemetry, were easier to manage and easier to monitor. They would share on call with these people and they would be a member of that team. Man, that sounds a lot like some of the things we talk about in DevOps.
These ideas are not new. They've been around forever, but we didn't have a word for it. And the thing that DevOps gives us is a word to communicate with each other, to talk about this thing that we're trying to build and to improve upon it. DevOps today is better than DevOps last year, which is better than DevOps before that.
So you all are here cuz you use Chef, or you use Habitat, or you use something. Maybe you use Puppet or Ansible. And you're like, hey, so I got the technology, but I want, at least there's a few people in this room old enough to remember that. Yes, I know the technology, Phil, but how do I get people to act the right way?
How do I get people to embody this model, this culture that I keep hearing about? Well, I sat down to write this talk, and I thought, well, what are all the questions that I've been asked around culture? What are all the questions I've asked around culture? What are all the problems I've seen? And I started scribbling down all these post-it notes, and
I was trying to put them in categories that I could write a talk around. And then I realized 98% of them all were just two questions. They all boiled down to just two questions. One, what about that team that I don't trust to give access to my code, to build good relationships, to automate?
Maybe they mostly use GUIs, whatever your reason is, you don't trust them. And two is, what about that team that refuses? I've talked to them, I've tried to explain it to them, they just won't do it, they absolutely refuse. So I put these two questions in front of me, and then something else occurred to me.
They have something in common. And what they have in common is fear. Fear that you have of that team, whether you're a tech lead, whether you're an individual contributor, a CEO, a director, a VP, whether your company is two people or 2,000 people or 100,000 people. These two questions have fear in common. You're afraid of that team, or
that team is afraid of the future, of the tooling, of the model, of something. And how do you deal with fear? Well, we talked about this yesterday in the keynotes, trust. Trust is what breaks down silos. Trust is what makes these models work. And so then the question becomes, well, how do you build trust? And that's what we're gonna talk about today.
But before we do, one clarifying point. I do not mean blind faith. When I say trust, I mean trust but verify. Having an environment that everyone has access to do all of the things they need to do, and everyone else has all of the tooling they need to feel comfortable about that. So let's start with that first question. What about that team that they're not good enough,
they're not technical enough, they just sit in a GUI all day, they're not gonna get it. If this is the problem that you were dealing with, I would like you to ask yourself, regardless of what level you are or what your position is, four questions, and they are as follows. Do they have training and documentation available?
Seems easy enough. Is there a way for them to test? Is there a code review process? And finally, and most importantly, do you have a no-blame postmortem? So training and docs, teach classes, teach them regularly. If you can't, record them, let people watch them, have docs. But I don't mean point people to Chef's classes.
I don't mean point people to docs.chef.io. Certainly people need to know how to use the tools, but that is not sufficient. People need documentation that shows them the model you're trying to use. Why did you use attributes instead of resources, or resources instead of attributes? Do you use the Poise model or not?
Why do we build APIs this way as opposed to that way? Each one of you has a way in which you're building this technology to foster the model and the culture you want, and you have to not only document it, but you have to explain why. People don't like to do things if they don't know why they're doing them. You need to also talk about the model.
Why is it that I, as a software engineer, have to be on call? That's what we have operations people for. People are going to ask that question over and over again. Why isn't it documented? When we open sourced, and when I say we, I mean Facebook, when Facebook open sourced Taste Tester and Grocery Delivery, the first two major tools we open sourced around Chef, we included in that code repo a document called philosophy.md.
You can go check it out today. There's a link at the end of the slide. And now what it did was talk about why we had built these tools the way we did. It was a drastically different way of using Chef than almost anyone else out there. And we wanted to share it. We wanted to give it away, but if you just give it away, no one uses it. You have to tell people why it's beneficial.
In what ways will it work better in certain areas? Feel free to take it as your template if you'd like. You also need good examples, both technical and cultural. There's a team out there that's automated the crap out of their stuff. Maybe they've got cookbooks that are just fucking awesome. Point to them in your docs. People love, sorry.
People love to learn from example. This is why we have Stack Overflow and all of these sites. People love to learn from examples. Let's say you have a team that's built an amazing relationship between their developers and their operations folks. Point to them. Let people go and ask them how they did it, or why they did it, or what problems they ran into. And finally, give people a place to ask for help.
Your company has probably got many diverse services. And it turns out that no matter how many people there are following this model, there's gonna be a service that goes, hey, I looked at the docs, I looked at the code, it doesn't really fit my service. How do I fit this to my service? And if they don't have someone to ask, they're not gonna do it. Testing, make testing easy.
Remember that these tools are building culture. They're not just about finding the fact that someone put three spaces where they should have put two. These tools need to be a part of your culture, which means if I have to remember to run them, I'm not, and then it's not a part of your culture anymore. Make it automatic the way GitHub does. There's many tools to do this.
Production testing. Talked to a guy outside in the hallway literally 20 minutes ago, who was trying to engineer this entire thing to diff things and whatever. And I was like, what problem are you trying to solve? And it turned out that the only thing he needed to be able to do was see if this was gonna work in prod on one machine before he deployed it to
the rest of the world. Production testing also buys, imagine you are a team that has been managing your application or your systems or your containers by hand. And now you're being asked to automate it. Well, if you automate it all in test, you probably still don't feel comfortable just throwing it into prod. And even once that automation is in production, when you change that
automation, you wanna be like, is it gonna work with all the real traffic that we get, right? Production testing is really important, and I'm not saying it supplants other sorts of testing environments. It does not, but it's a good tool to have to make people feel comfortable about the model you're asking them to adopt.
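As a rough illustration of the "see if it works in prod on one machine before the rest of the world" idea, here is a minimal Python sketch. It is not how Taste Tester or any Facebook tool works; the function name `staged_rollout` and the `apply_change` and `is_healthy` callbacks are hypothetical stand-ins for whatever your own tooling provides.

```python
from typing import Callable, Iterable, List


def staged_rollout(
    hosts: Iterable[str],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Apply a change to one production host, check it, then do the rest.

    Returns the hosts the change was applied to. If the first host fails
    its health check, the change never reaches any other machine.
    """
    host_list = list(hosts)
    if not host_list:
        return []

    # Hypothetical choice: the first host in the list acts as the test machine.
    first, rest = host_list[0], host_list[1:]
    apply_change(first)
    applied = [first]

    if not is_healthy(first):
        # Stop here; only one machine saw the change.
        return applied

    for host in rest:
        apply_change(host)
        applied.append(host)
    return applied
```

The point of the sketch is the shape, not the details: the person making the change gets to watch it under real traffic on one machine, which is what makes the model feel safe to adopt.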
And of course, this applies to both dev and operations. There are no special snowflakes. You as an operations person or an applications person are not special or different. You need to follow all the same things. Code review, I cannot stress this enough. Every single line of code that you have in your production environment, operations or development or application should have been code reviewed before it got there.
There's no excuse for this, for not doing this. This catches things that other things cannot. Automation can catch problems that humans can't. But humans can catch problems that automation can't. This catches things like, hey man, you seem to be building an API in this kind of way, which is the only person in the world who's done that inside of our company, we do it this way. Or hey, if you factor these three lines out, you'll be able to reuse this and
so will seven other teams. That's a huge cultural win. But more importantly, it fosters a relationship and it allows for continued education. It turns out that as you deploy a chef or any other tool, you're gonna deploy it one way.
And then you're gonna be like, so close, change that a little bit. Because it turns out that you learn as you go. And as you change these things, you want your company to come with you. Which means doing code reviews allows you to constantly be re-educating people on the newest ways to do things, the newest cookbooks, the newest libraries, the newest whatever. And then they tell their teammates who tell their teammates who tell their
teammates, and you're no longer walking around playing whack-a-mole. At Facebook, we have two categories of cookbooks. We have core cookbooks, and we have other cookbooks. Now, core cookbooks run on every machine in prod. And they provide APIs to all the low-level things like sysctls and mounts and blah-di-blah-di-blah.
And then other cookbooks, cuz I suck at naming. And these other cookbooks get tacked onto the end of your run list by the service owner for that machine or that container. And what they can do is then reach back, twiddle APIs that change how those early cookbooks run, and also set up their service. Now, when we were in migration from CFEngine 2 to Chef, we not only required
code review for all core cookbooks, but we required code review for every new file that ever got added to the repo. We were undertaking a massive amount of code change, and we needed to be able to make sure that people were moving in the right direction. As soon as we were done, we changed that, and we said, look, in other, get a code review from any engineer out there. We trust you. Find a good engineer who knows what they're doing, get a code review.
In core, we're still gonna make sure that we're on the code review. But we made a suggestion. We said, hey, look, if you're writing something new, if you've never touched Chef, if you're implementing a new service, get a code review from us anyway. We're not gonna require it. We'll help you out. And what that meant is we're constantly setting the precedent. And this means that as a team, we do more code review than
any other single thing we do on a day-to-day basis. And you know what? Worth every second. Super high code quality, super quick dissemination of information. Code reviews like this are easy to implement. GitHub does it, Phabricator does it, Delivery does it, Rietveld does it.
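The core-versus-other review rule described above can be written down as a tiny policy function. This is only a sketch in Python: the `Change` fields, the label strings, and the assumption that the core team was the reviewer for every new file during the migration are all hypothetical and are not taken from any real review tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    cookbook_kind: str                  # "core" or "other" (hypothetical labels)
    is_new_file: bool = False
    author_is_new_to_chef: bool = False
    adds_new_service: bool = False


def review_policy(change: Change, migrating: bool) -> str:
    """Return who should review a change under the rules from the talk."""
    # Core cookbooks always get a review that includes the core team.
    if change.cookbook_kind == "core":
        return "core team required"
    # Assumption: during the migration, the core team reviewed every new file.
    if migrating and change.is_new_file:
        return "core team required"
    # After the migration: any good engineer may review other cookbooks,
    # with a core-team review suggested (not required) for newcomers
    # and for brand-new services.
    if change.author_is_new_to_chef or change.adds_new_service:
        return "any engineer required, core team suggested"
    return "any engineer required"
```

Writing the rule down like this, in whatever form your review tool accepts, is one way to keep "we trust you" and "we're still on the core reviews" from drifting apart over time.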
It's easy to do, there's no excuse. Postmortems. So lots of people have talked about postmortems at length. I don't wanna go too much into it. If you're really interested in how to build great postmortems, John Allspaw has spoken on this at length. Jay Parikh, the head of engineering at Facebook, has spoken on this,
as has Pedro, the head of production engineering. So I talk to people all the time about what goes wrong in their organizations, and when I ask people, what happens when you fuck up? They go, we have a postmortem. Sounds lovely, what happens in your postmortem? And the thing that makes me happy is that there are so many postmortems out there. The thing that makes me sad is that 80% of these people seem to miss the point.
Point is, as you can see, no fucking blame. I'll say that one more time, no fucking blame. The point of a postmortem is to not talk about who did what, but what happened and how, and how can we improve the system?
If something went wrong, is there a change to the tooling we can make that makes it easier to do it right the next time and harder to do it wrong? And I don't mean adding process. I don't mean adding a bunch of red tape so that I got 19 people have to approve my fucking change. What I mean is, that tool made it really easy to do the wrong thing? Cool, what if we changed the options? What if we update the docs, right?
You want to look at how you can make the system safer to move fast. It's not about slowing down the business, it's about increasing the speed of the business in a safe way. So the other thing that you need to look at when you're doing postmortems is have a postmortem that's organization-wide. And look for repeated patterns and behaviors.
So if you have an organization-wide postmortem, your entire tech organization, and there's a couple people in there that are there every week, and a dude comes in and goes, well, we had an outage. And I read the docs wrong for this new tool, we were migrating to the new tool. I told my team how it worked, I was wrong, we had an outage. Now, it was just a one-off, now I've read the docs
better, we all know, there's no action items here. Fair enough, off they go. Week three, somebody comes in and she says, hey, we had this outage and we were moving to this new tool and I read the docs and I was tired, I misread them, and we moved to the tool, we used it wrong and caused an outage.
Now I know, I told my team, no action item. You go, wait a minute, I've heard this before. This tool sounds like it sucks, we should fix it. The documentation is bad, or the options are bad, or whatever it is. That's your moment to find a thing that those teams individually could never have found.
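If your postmortems are written down with even loosely tagged contributing factors, spotting this kind of repeat can be partly mechanical. The Python sketch below assumes a hypothetical record shape, a team name plus a list of factor strings; nothing about it comes from Facebook's actual review process.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def recurring_factors(
    postmortems: Iterable[Tuple[str, List[str]]],
    min_teams: int = 2,
) -> Dict[str, List[str]]:
    """Find contributing factors reported by at least `min_teams` teams.

    Each postmortem is a (team, factors) pair. The result maps each
    recurring factor to the sorted list of teams that reported it.
    """
    teams_by_factor = defaultdict(set)
    for team, factors in postmortems:
        for factor in factors:
            teams_by_factor[factor].add(team)
    return {
        factor: sorted(teams)
        for factor, teams in teams_by_factor.items()
        if len(teams) >= min_teams
    }
```

A factor that each team wrote off as a one-off shows up here as soon as a second team reports it, which is exactly the signal no single team could see on its own.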
These are the two things you wanna do in a postmortem. I'm gonna tell you a story, and it's a slightly embarrassing story. But aren't those the best kind? So, before I was working heavily on Chef at Facebook, I was working on IPv6. And in the process, I brought down Facebook. And when I say I brought down Facebook, I don't mean like 50,000 or
100,000 people couldn't click like. What I mean is, when you typed in Facebook, nothing happened. And then a couple hours later, I did it again. And then a couple hours later, I did it again. I brought down Facebook three times in 24 hours.
As you might imagine, I had to show up to SEV review, which is our postmortem. That's what happens when you bring down a major website three times in 24 hours. And I walked into a room, and it was my first SEV review. And there's a bunch of my fellow ICs, and there's a bunch of managers, and there's a VP, and I was nervous. I mean, people had told me it's a no-blame postmortem.
You have nothing to worry about. It's gonna be a challenging conversation, but you'll be fine. And I walked in, and I described what happened. And what we talked about for the next 20 minutes was, how can we make sure that a similar engineer in a similar place can't make the same mistakes? And some of the things I did were just genuine mistakes. And some of the things I did were the right thing that had mitigating factors.
But it didn't matter. We talked about what tools worked poorly, what tools can we fix? What tools would we fix but are on their way out anyway? And are the replacements for those tools going to be better in this regard? And you know what didn't happen? I didn't get fired, obviously. I also didn't get written up. I didn't get yelled at. I didn't get told I was stupid.
There was no disciplinary action and no negative words whatsoever. That's what a postmortem is. So now let's say you still don't trust this team. You've done the postmortems, you've done the training, you've done all this culture stuff, and you're like, you still don't wanna give them access to my code. They're not, gives me the willies.
The question you need to ask yourself in that moment is, is it you? And there's two ways that I want you to think about this question. The first is, did you really support them? And I don't mean you gave them all of the things. I mean, were you rooting for them? Did you want them to succeed? Because if you did not, whether you're their manager,
whether you're an IC on another team, whether you're the director of the department, then you half-assed all of it and start over. And the second thing I want you to think about is your preconceived idea. As humans, if you see an individual or a team not capable of doing something for a long period of time, but then you change all the variables.
As humans, we have a really hard time letting go of that earlier performance. It's just part of human nature. And what our responsibility is, as ushers in of this culture, is to let go of that idea and come up with a new one.
Really honestly look at this individual, this team, in a new light. And let go of your preconceived ideas. But, let's face it, sometimes that may happen and you may look deep inside. And you may have really been rooting for this individual or this team, and it didn't work.
And you've taken a fresh look at them and they just can't do it. So, what then? Well, if you really did all those things and it's really not you, then the only answer is to fire them. And I can't emphasize this enough. It should not be your first resort. In fact, it should be your last resort.
It must be your last resort, otherwise you're an asshole. But, it has to be on the table. If this is not a tool in your toolbox, you will fail. You cannot keep toxic people like this in your environment. These are the people who drag you down. These are the people who kill the model. These are the people that hurt your business
and hurt your technology and hurt your infrastructure. Do not keep these people in your environment. Now, one caveat. The assumption here is that you have some people over here who are doing the DevOps, and then you got the people over here who are not doing the DevOps. That may not be your case. You may be at a small company or a very large company just starting this, and if there were zero people
following this new model that you want at your company, you can't just fire the entire fucking org. Please don't do that. And if you do, please don't tell them that I told you to do that. You need to bring in people who will foster this model, who will build this model, and what you'll find is more often than not that people who couldn't do it will learn how to do it by example.
And again, if you support them, it'll be great, and if they still can't do it, then fire them. Don't fax them like they did in Back to the Future, but fire them. So trust your engineers. I tell people this all the time. I tell VCs this, I tell CEOs this, I tell managers this, I tell other individual contributors this. Trust your engineers. You hired them for a reason.
You need to trust your engineers, and if you absolutely can't, go hire new ones, and then trust them. Remember, people only need to know what they don't know. If you ask people to step out of their comfort zone, do something new, most people who have half a brain will know when they hit the edge of their boundary and test harder, ask for help, read docs.
It's the only real skill you need. So moving on, what about that second question? So I was at the Chef Community Summit last year, and for those of you who don't know, it's an unconference, which means there aren't talks, there's just people going to rooms and bullshitting about things. And so there was a room on networking, and I love networking. So off I go trotting to this room,
and we're talking about new vendors that are publishing their cookbooks and tools, and it's a great conversation. This guy raises his hand and he says, I have a different problem. Can we talk about a different problem? And we said, yeah, sure, no problem. And he says, well, my networking team just refuses. Like, the ICs refuse, the director refuses, like I've talked to them multiple times, they just want nothing to do with automation,
giving other people access to their systems, and like, none of it. So that's what we're gonna talk about. And if this is the problem you're facing, I got three questions for you. The first is, have you shown them the way? The second is, have you provided them training? And the third is, have you provided them security? And once again, I saved the best for last.
Show them the benefits. That includes bringing them into other teams who have done this, who can explain to them how much better their life is. Maybe their monitoring is just auto-generated from their configs and they never have to worry about monitoring again, our dream. Maybe it's the fact that their world is super consistent, maybe their uptime is better, whatever it is.
Maybe they don't have to deal with the day-to-day bullshit and they get to work on their pie-in-the-sky projects. But you actually have to show them there's a benefit to them. More often than not, I see people talking about the benefit to the company, the benefit to the product. And people care about that, but it's not enough. When you're asking somebody who's ostensibly doing their job to now learn a whole new thing
and do their job entirely differently, you have to show them that there's benefit to them in their day-to-day life. The next two are in the same slide. So, you have to make people feel safe. And the easy one there is documentation and training. We talked about it earlier, I'm not gonna spend time on it. But people need to feel free to make mistakes
when they're trying something new. I don't know about you, but when I try something new, for the first time ever, I usually fuck it up. I'm good, but I'm not that good. People have to be free to fuck it up. See: no-blame postmortems. You're taking a team that's maybe managed their systems by hand, or been configuring routers by hand,
or configuring storage by hand, or whatever they've been doing for a long time, and they're good at it. They got the job, they've been keeping the site up. They've been writing the application, they've been doing whatever they do, and now you want them to do something entirely different. You want them to think about their entire world differently and learn a whole new set of tools.
Well, they have to have freedom to screw that all up, because that's a lot of new skills to learn. Mistakes have to be viewed as opportunities to fix and improve training and documentation and tooling, not as a chance to tell someone how stupid they are. Remember that you're asking people to step out of their comfort zone
so it is now your responsibility, our responsibility, to make people feel safe to give some counterbalance to the comfort that you're trying to take away. And lastly, you need to provide job security. If I'm on a team of six people somewhere, and I do a thing, let's say I manage some subset of prod, and you're like,
hey, Phil, you know, you and your team, you log in and you do the things, and it's great and it never goes down, but I really want you to automate the crap out of this. And also, while you're at it, could you give all the developers root access on all of your systems? Now, I'm gonna think about this for a minute, and I'm gonna, assuming I don't know anything about DevOps, I'm gonna do some math, and I'm gonna go, okay, so we're gonna automate a bunch of stuff.
That's gonna take the amount of work from here down to here, and I'm gonna give other people access and they're probably gonna do some stuff, and that takes my work down to here, so there's gonna be room for one person left on the team, so you're gonna fire me. I don't know about you, but getting fired has never been a good motivator to get someone to learn something new and do a bunch of work. It is our job, especially if you're in a position of management,
to let them know that that's not what's gonna happen. There will almost certainly be more interesting work on that team when they're done, and if there's not, you will find them a position somewhere else in the company because if someone can make this transformation, if someone can learn this model and encourage this model and teach this model, you want them. These are the people that we're all trying to hire,
so let them know that this is gonna make them more valuable and that they're gonna have a job. Excuse me. So, what if they still won't automate? What if you've done all the things, you've shown them the stuff, you've brought them into another team, you've done everything you possibly could
to try and show them the way, you've given them no-blame postmortems, you've sat down with each individual engineer, you've been like, oh, this is gonna be awesome, and they're like, eh, fuck you. Seriously, man, there's so many people I talk to who just won't fire people. You have to be able to do this.
These people are more toxic than the last group of people, like by far. These are the people who are actively and maliciously working against, eh, maybe not maliciously, but actively working against this model, which means they're gonna tell other people why that model is bad, and they're just adding an uphill battle for you, for your organization, for your business,
for your velocity, for all of it. Everything about that sucks. Don't let these toxic people drag down your organization and your company and your products and your speed. Remember that we're talking about people. People problems have two solutions. There are only two solutions that work.
There's a million solutions, but only two work. One is support people, and supporting people means lots of things. It means putting effort into them. It means training them. It means making them feel safe. It means giving them places to fuck up. It means all of these things. And if you support them,
and if you try and help them, and they cannot or will not be part of the movement you were trying to encourage, the company you were trying to build, the organization you were trying to build, then you should fire them. They have no place in your organization. Once again, hire good people and then trust them.
Trust your engineers. Trust your fellow engineers. Trust your subordinate engineers. Trust the engineers on the other side of the org. This is a link to the philosophy doc that I mentioned, as well as another philosophy doc that we put in our cookbooks repo to be more detailed, as well as some of the tools that I mentioned. And that is all I got for you today.
Thank you. Happy to take questions if people want. I got plenty of time. I spoke very quickly, apparently. Going once.
Going twice. Yes. One more time. What did I do to bring down Facebook? How do I do this without saying things I'm not supposed to say? So we were using a particular piece of vendor software to do geo load balancing that had a bug in it.
And when we configured the thing, it totally blew out all of our ratios, which caused cascading failures. And then it turns out that the reason I did it multiple times is because every time that happened, someone came running in being like, my deployment broke the site. And so I'd roll back and they'd roll back, and there were mitigating factors and blah, blah, blah.
And we'd do a bunch of research and they were convinced it was them. And so eventually I just did it at like four in the morning because that was when no one else was doing things. And then I was like, oh shit, it was me. So that's a much more involved story, but that's like the TLDR of it. Yeah, what's up?
So the question was, what about a team that can't improve because they're constantly under water? So that's a really good question. And this is actually where we started at Facebook. We had a team called SRE, which we rebranded as SRO and then eventually other teams. And they were sort of just underwater. And what we ended up doing was we grew the team slightly
and we carved out a bunch of people, a handful of people who were going to just work on like what was next and like getting these people out from underwater with the clear understanding, and this part's important, of continuing to bring people across that bridge. Like it wasn't like, oh, you're not good enough. It was like, okay, you two are gonna do the thing
and as soon as there's like enough room, then we're gonna move you over and then we're gonna move you over and then eventually that team just didn't exist. So it's important not to like single individual people out. The people we picked happened to have already been trying to work on automation in the corner and so like it was a logical thing. You have to hire a little bit and then you have to make some people who are gonna work on the next thing and as they automate more things and fix stuff,
this team gets less underwater and so on and so forth. Anyone else? Yeah, so the question was, do you have like a Chef team that does these reviews and stuff?
So the team that I built to do the conversion was originally called Chef team and all we did was write ungodly amounts of cookbooks to provide APIs and then we set up the Chef servers and we owned all of that stuff. Once the conversion was done, having a team dedicated to that was overkill and we ended up merging with another team and becoming the operating systems team.
So now we handle all sorts of things. We have a much wider range of things that we own, including like packaging tools and yum repos and Anaconda and blah, blah, blah. But we still do copious amounts of code reviews around Chef and also many other tools. So packaging tools and blah, blah, blah, blah and various deployment changes
and we're kind of a consulting team now as well so we go out to a bunch of other teams and help them automate. But yeah, essentially we're the team that owns like the core of Chef at Facebook. And there was somebody who raised their hand at the exact same time over there, yeah?
Sure.
Yeah, yeah, yeah. So the question was, he worked in an organization where getting post-mortems to happen was hard for two reasons. One was they didn't have time for the post-mortem and two was they were worried that that was gonna create more work for them. So the answer here is the same answer for like why should I automate my infrastructure?
Why should I do code reviews? Why should I codify my work? And the answer is a little bit more work now means a whole lot less work later. It means that you're gonna have less outages, you're gonna have better infrastructure. So for example, one of the things I like to say about post-mortems is if people don't feel comfortable talking about their fuck-ups,
then they don't talk about their fuck-ups and they hide them, right? Like a dog hiding like a bad thing they did. And then you don't ever fix the actual problem, right? You don't ever fix that tool, you don't ever fix that process, you don't ever fix that monitoring, whatever it is. And what happens is you create so much more work for yourself because you're constantly having all these people who like fuck the thing up
and then go and hide it. And the next dude fucks the thing up and then he goes and hides it, right? But if you have the post-mortem and you're like, spend the hour to fix the thing, that's like hours and hours and hours and hours and hours you save of people not breaking your fucking stuff, right? And so that's the one argument. The other argument is do you want to be the dude
taking down the site all the time? Because I don't, so I'd really like to learn from other people's mistakes. Those are the two arguments I tend to make. Yeah, blue shirt. Yeah, you mentioned that one of the results of post-mortems in some organizations is a more copious change control policy, to the point of slowing the velocity down,
things like that. Do you have any recommendations on loosening an organization's grip or desire for that kind of management as opposed to the post-mortem methodology? That's an interesting question. The question was, I had mentioned that some people have a poor outcome of post-mortems that involves change management and red tape
and that sort of stuff. And how do you work toward this less corporate grip, more post-mortem-y result? I think this is hard and I think that, I don't want to give a cop-out answer, but I think it depends a lot on your organization and where that pressure is coming from. So I would deal with that very differently
if it was coming from say like a CEO kind of level where they were not terribly involved and didn't understand versus if it was coming from say like the engineering managers who very clearly understood. I can have very different conversations with those people about what's going on and when and why and how. I can generally sit down with most engineering managers
and have the same kind of conversation that I'm having here with you guys and convince them that they're being stupid. With a CEO, I don't do that. I'm not gonna sit here and tell VPs or CEOs that like, hey, you're just doing it wrong. The approach that I think works better with higher level management is trying to show them
that the people underneath them, and I don't mean like the engineers, but the managers and directors, the people who are actually managing the technical organization are actually very good at their job and the things that they're doing are slowing the business down as opposed to making, like we're still having outages. So maybe we can try something else
and these people are very good at that. You hired those managers, so let those managers and directors do what they do best. You can convince those managers and directors to try something new. That's a bit of a cop-out answer. It entirely depends on the organization and where the pressure's coming from, but those are two examples that I hope are helpful. Anyone else? When you start a post-mortem, you tend to wanna start with the context
of what happens. Correct. Push the button that little thing, and even more positive, you talk about the technology or the timeline, it just involves impersonating. You need someone who's gonna lead the post-mortem.
So it can't just be 20 engineers in a room talking, because what's gonna happen is that it's gonna turn into a blame fest, more often than not. So we have a director who runs our post-mortems. There's a team of them. They switch out. But we always have one director that's pretty much in every one, and a few ICs, that is, individual contributors, that are in all of our post-mortems,
so they can do pattern recognition. And their job is to go, okay, tell us what happened, and you go through the thing, and then you go, okay, cool. What tooling could have been better? They have to lead that in a way that encourages people to not focus on who did the thing, but what happened. And don't get me wrong, I sat there and I said, I typed in the thing, and then I blew up the stuff,
and it was fine, because no one was like, geez, you brought down the site. No, that wasn't what happened. They said, cool. Why did that tool do that thing? I don't know yet. I opened up a bug. Okay, cool. Well, are we tracking that bug? Is there a way that we could have set this up in such a way that we would have detected it prior to the entire fucking site going down? I don't know. Let me think about that. Well, I suppose we could have added
this kind of monitoring. And it's those sorts of probing questions that are not, how could you have done it better? It's how could the system be better? And you have to have someone whose job it is to think about how to ask those questions and lead that meeting in a way that makes people feel comfortable. A group of people can't do it. You get a mob mentality, and it just breaks down too quickly.
You need one person who's focusing on it, can call people on it when they're doing the wrong thing, and can sort of lead that conversation. And once you have that, what you find is people in the room start following suit. People just get in the habit. People start asking those same sorts of questions. So when that guy who's leading the, sorry, not guy, that person who's leading the post-mortem,
when they're asking these sorts of questions, you get other people who start asking other similar probing questions. Oh, why didn't that system catch it? And should we use this system? You get those sorts of questions because people get in that habit. But you need someone who leads the way. And I don't think they need to be management. I don't think they need to be senior management. I don't think they need to be an engineer.
I don't think they need to be anything. Just a person who is dedicated to building that model. I feel like I'm focusing on this side of the room. Questions over there? Yeah, what's up?
I think it depends on a couple of things. So, more often than not, I will go and have a long, at Facebook it's very easy for me to go and have a long conversation with their manager. I will usually try and find several other people who have seen the same behaviors so that it's not just Phil being a dick, because I can do that. And it's very clear
that this isn't an issue with me and that person, because that also happens, but that it's this person who is actively dragging things down. And more often than not, that manager will go and have a conversation and that person gets better or that person gets managed out. But part of building this sort of culture is that the idea that people have to be fireable
and supported must start to disseminate. You can start in your org, but then you go and you talk to the other orgs and you're like, look how great this worked out. And it has to disseminate. And sometimes you just have to bite the bullet and deal with the asshole for a year while you're working on cleaning up this part so that you can go clean up that part.
It's not zero to 100. You start at 1% and when you get your 1% release squeaky clean, you move on to 2% and so on and so forth. It's an iterative process. And I would argue that there is no 100%. There is no end goal. We change our culture at Facebook constantly. We have all these internal discussions about, hey, we made this change because we wanted
to make it easier for people to telecommute or we wanted to make it not easier for people to telecommute or we're trying to be more diverse or we're trying to do this or we're trying to do that. And it had this negative side effect. How do we deal with that? And we have these long threads with ICs and managers and everyone's allowed to be in on these conversations.
And sometimes you see some really ugly sides of people. But at the end of the day, when you have an open conversation about your culture and everyone's working towards this sort of culture, you sort of converge on generally good things. And sometimes you converge on a thing that sucks but then you have the conversation and you go,
oh, that one sucked. And then you try something else. There's no perfect formula. It's gonna be different at every one of your organizations. These are just starting points. Anyone else? Going once, going twice. Cool, thanks guys. Thank you. Thanks.