Breaking Technology Silos with Chef
Formal Metadata
Number of Parts | 50
License | CC Attribution - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this
Identifiers | 10.5446/34616 (DOI)
ChefConf 2016, 39 / 50
Transcript: English (auto-generated)
00:05
So yeah, my name is Sean. I'm on the infrastructure team at the National Football League where we take care of all the web, anything internet facing, like from your mobile phones to the websites and video streaming. This is a story about how we've changed as an organization for the past two and a half years and how Chef's really been the catalyst to that change.
00:27
If you've never been to a DevOps presentation before, they are all required by law to have a picture of either the Phoenix Project or some silos, and this will not disappoint. But we've got silos. We call them NFL.com. We've got what we call our club sites, which is your team, your various sporting teams.
00:43
The Fantasy Football mobile application and website, it's its own silo. Previously, three years ago, the mobile app was its own silo, and there's another handful of miniature silos of independent websites that were brought in. So it was a lot of independent things, and when I say silo, I mean silo.
01:01
Three years ago, the tech stack for NFL.com and the club sites looked very similar, but it was all in their own gear. They each had their own nice big F5 pair of load balancers. They had their own VMware clusters that were not allowed to talk. They're all written in Java, but they deployed different ways. Some used OpenJDK, some used Oracle.
01:21
Fantasy was written on bare iron and PHP. The mobile guys wrote it in a completely different way of doing Java, first starting off in vSphere, another cluster of course, and then moved it to the cloud. And then if you look at it people-wise, it was just a bunch of walls in between groups. Every team had their own infrastructure team.
01:44
Every team had their own business people, had their own developers. They were on different floors. It was not. It was very, very siloed. And this all kind of came to a head. They were having trouble retaining employees, largely because you got on one team, it was impossible to move around.
02:01
And eventually we were brought in as an operations team, and we got in there and we looked around and said, I don't think these problems are technical. These are cultural problems. So how do we change a culture? Because the only way we're going to be successful as an operations team, the only way the NFL is going to be successful as an IT organization is to change the culture of the way they deliver software.
02:24
So I'm going to give you the bottom line up front. This story is told chronologically, but the answer is we're going to optimize everything for conversations. We're going to make a menu, and don't neglect your team. I'll get there. So we're a sporting league, and so it's very cyclical. We start off with kickoff in September.
02:44
We've got the regular season going until Super Bowl, which is beginning of February, end of January. Super Bowl is done. Then we have a few events in the offseason. We've got the Combine, which is kind of an athletic competition. You've got the draft, and then the preseason starts in July.
03:01
So because of this fragile infrastructure, they had built their whole development process around this season. So effectively, you would develop your stuff offseason. You'd realize the preseason's coming, so go faster. And then kickoff would come, so you've got to stop. And then we move into maintenance mode. So our busy time was actually the offseason because we moved from operations mode into development mode.
03:27
So let's look at offseason 2014. We decided we were going to build a product called NFL Now. It was called the Netflix of football. Essentially, it was a customized stream of videos and new channels.
03:41
It was a new way to engage with our fans. That's a screenshot from a commercial talking about the product that was aired before Super Bowl. And if you note the calendar, development doesn't start until after Super Bowl. So the scope was fixed, but we wanted to run this as an Agile project. So we've got our four silos, and now we've got another silo, even though we don't want one.
04:05
But hey, this is like a van. It's fast. It's moving. Maybe we can do better. So I had been with the NFL for a few months by then, and I was kind of put in to work on this project from the operations perspective. And I really wanted to fix three things. I wanted to get rid of environmental drift.
04:24
I want to fix the configuration file problem, and I want to get developers off the servers. When we talk about environmental drift, we mean QA doesn't look like staging, staging doesn't look like production, your development boxes don't look like production, and what you're building. As you move it through environments, you find all these problems. Files aren't where they're supposed to be. You're missing things.
04:43
But even within an environment, the old way of doing things when a developer wanted, say, a MySQL box, is they'd ask the operations team, please give me a MySQL box. The guy would go to VMware, clone a template, install MySQL. The developer would say, I actually wanted three of them. The guy would say, I'll shut it down. I'll clone it twice. Now you've got three staging boxes. And he'd say, I'm going to production tomorrow. Can I have five, please?
05:03
And he'd shut it down, clone it five more times, and that was it. And so we would get on production boxes, and one box would be different from another box. And so this was a big problem. So for NFL Now, we basically said Chef. Chef will be our savior. We will bring in Chef. We will template everything. We will have Chef do everything.
05:24
And for the most part, that was pretty good. I'll get into the things we learned. By the end of it, production was around 100 servers, depending on the features we were delivering. We had around 200 servers in total. It was one cookbook per app.
05:42
So each application was pretty much a snowflake. We would have some apps the developers wanted to write in a framework called Vert.x. And they did one app in Vert.x. And then some other guys liked it, and they did it in Vert.x too. But they had multiple apps focusing on different ports in the same Vert.x container. And then some guys wrote it in Tomcat, and then some guys wrote it in this.
06:00
So we had about a dozen-ish services, and they were all in their own cookbook to manage that. And then, because I really love this environment feature, the environment.json, or if you upload it to your Chef server, we put all the differences in the environment into the environment file so that we would just have a template variable, look for node.usermanager.database.
06:24
And then we would override. By the end of it, we had this gigantic-ass environment file that's all JSON. And it actually became a mess. So later on, we moved away from that and started extracting stuff to the recipes themselves and just had basically, if environment is this, then do this.
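The two approaches described here can be contrasted with a small sketch. It is written in Python rather than Chef's own Ruby DSL, and every attribute name and host name in it is a made-up placeholder, not the NFL's actual configuration. The first half models one large override structure per environment; the second half models keeping the "if environment is this, then do this" decision next to the recipe logic.

```python
# Illustrative sketch only: attribute names and hosts are hypothetical.

# Approach 1: every difference lives in one large per-environment override file.
DEFAULTS = {"usermanager": {"database": "localhost", "pool_size": 10}}

ENVIRONMENT_OVERRIDES = {
    "staging": {"usermanager": {"database": "db.staging.example"}},
    "production": {"usermanager": {"database": "db.prod.example", "pool_size": 50}},
}


def deep_merge(base, override):
    """Return a new dict where values in override win over values in base."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def node_attributes(environment):
    """Attributes a template would see after the environment overrides apply."""
    return deep_merge(DEFAULTS, ENVIRONMENT_OVERRIDES.get(environment, {}))


# Approach 2: the decision sits in the recipe, next to the code that uses it.
def database_for(environment):
    if environment == "production":
        return "db.prod.example"
    if environment == "staging":
        return "db.staging.example"
    return "localhost"
```

Both functions give the same answer for a given environment. The difference is where a reader has to look to find out why a value is what it is.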
06:42
And that made it a lot easier for us to really reason about the recipes. And we had the single source of truth for the configuration. Talking about configuration, the old way was you'd either bundle it into the WAR file if you were on one silo. One silo would do a for loop with secure copy of the files.
07:03
Another silo would NFS them, put them on an NFS share. It was accessible to all the boxes and symlink them in. We'd have deployments where a developer would say, Oops, hang on, I'm just going to go fix something. And that something was, I'm going to go on prod boxes and go fix configuration files. So leading up to then configuration files was a nightmare.
07:22
And I really want to fix that because one of the biggest problems we had as an operations group was trying to manage these config files and make changes. So a developer, like we would say, we need to add an Apache redirect. And where do you add the Apache? Well, in QA it's over here, in staging it's over here. On this environment, it's actually symlinked. On this environment, you've got to go, and it was just a mess.
07:43
So still in love with Chef, I use templates. The developer would modify on the development server, they would create the config file. Because you've got to remember at this time, there was rapid development on these services, so they would continue to be adding new configuration options.
08:00
So this wasn't like we had a standardized template. This was something that over and over is being developed. We would go onto the box, and we would do a Chef why-run, like chef-client -W, to see what changed. And we would go resolve those diffs and update the template file until we could do a clean Chef build and reproduce that template on the dev server.
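The reconcile loop described here (render the template, compare it with what the developer left on the box, update the template until the diff is empty) can be sketched roughly as follows. This is a Python illustration using only the standard library, not what Chef does internally, and the config keys are invented for the example.

```python
import difflib
from string import Template


def render(template_text, variables):
    """Fill in $placeholders in a config template."""
    return Template(template_text).substitute(variables)


def why_run_diff(on_box_text, template_text, variables):
    """Return unified-diff lines between the file on the box and the rendered
    template. An empty list means a converge would change nothing."""
    rendered = render(template_text, variables)
    return list(
        difflib.unified_diff(
            on_box_text.splitlines(),
            rendered.splitlines(),
            fromfile="on-box",
            tofile="rendered-template",
            lineterm="",
        )
    )


# Hypothetical example: the developer added an option by hand on the dev server.
variables = {"db_host": "db.dev.example"}
on_box = "db.host=db.dev.example\ncache.ttl=60\nfeature.flag=true\n"
old_template = "db.host=$db_host\ncache.ttl=60\n"
new_template = "db.host=$db_host\ncache.ttl=60\nfeature.flag=true\n"

drift = why_run_diff(on_box, old_template, variables)   # shows the hand edit
clean = why_run_diff(on_box, new_template, variables)   # empty once reconciled
```

The template is considered caught up with the developer's changes once the diff comes back empty.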
08:21
And then we'd promote it to staging. We'd go in and say, OK, we're going to need your database, and in staging it's actually this, so go edit the environment file. And then we would deploy it to the other environments. This was very problematic, because as you get closer to crunch time, you've got developers working 16-hour days, and I didn't want to work 16-hour days. They're in California, I was in Winnipeg,
08:41
so they're three hours behind. So they'd be calling for these changes at weird hours. We ended up coming up with service-level agreements between the groups that config changes will be done within 12 hours. And it was a mess. We were fighting with the developers over these stupid configuration files. And finally, the whole problem of developers being on servers.
09:01
So they were used to it, because that was the way it was always done. We didn't really have ways to provide them the information that they needed on that server, so developers would get on the servers. We finally got it so that they were only on development servers. And then we would do the Chef templating. We would help them with stuff so that we could manage it in staging.
09:22
Until you get to crunch time, when you turn off Chef, and then I spent hours resolving all these differences in staging and production. So that didn't work so awesome. Stuff that did work. I was building hundreds of boxes. We were just getting into Chef. We were just learning vSphere,
09:41
getting into vSphere at scale. I found this nice plugin called Chef vSphere, and what it does is it automates the vSphere API. So you can clone a VM, you can bootstrap Chef on it, you can take snapshots. Really, it tries to be anything that the vCenter GUI lets you do. We started contributing patches back to it.
10:00
I started going into the issues tracker on GitHub and helping people, and eventually the guy who ran it said, you know what, I don't actually use this anymore. Would you take over maintenance of it? So I've been maintaining this plugin for the past little while. If you guys have a vSphere environment, I really can't talk highly enough about it. I encourage you to download and try it out and file an issue and work with me to fix it
10:22
if you've got any problems. So NFL Now was happening. We were moving into the season. One problem I had with NFL Now was Chef could create this box, it could bootstrap it, we could get the binary on the box, but it still didn't have the load balancer configs,
10:41
so the load balancer had to know about it. And that was a big problem because I loved that I could create one command and I could have a new user manager server up. But I still had to go over to the load balancer and we'd forget it, and we'd have the settings be inconsistent, and Chef solved the server problem, it didn't solve the F5 problem. And there was some stuff out there
11:01
that didn't quite work the way I wanted it to. F5 talked about having a gem, but they later deprecated it. So I started off, first I wrote a thin wrapper around the SOAP API, and it's called F5 iControl. And really what I wanted to do about it, not only create a wrapper because there was already one out there, but I wanted a CLI tool that would let us get off of that horrible GUI on the F5
11:23
and do F5 pool status, F5 pool disable, so that not only could my operations teams manage it easier, but we could script it. So then during a deploy, we could have the deploy system bring nodes out of the pool, deploy the new binary, restart, health check,
11:41
bring it back up in the pool. And then with that Ruby library, I created a Chef lightweight resource that lets you do something nice like this. So if you have an app cookbook, you can have an F5 pool resource, and you can add it to the pool. So this isn't configuring it. As a new node is brought up, it goes and looks and says,
12:01
is the node in Chef or in the F5? No, it's not. Create the node. Is the pool there? No, it's not. Create the pool. Is the node in the pool? No, it's not. Add it. And it does a fairly decent job of automating all that so that after your first converge, you've got that node in the pool ready to take traffic. And that's a very simple example. It's got a few more options.
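The check-then-create sequence just described is a classic idempotent converge. A minimal Python sketch of the idea follows; it uses an in-memory stand-in for the load balancer and does not use the real F5 iControl API or the actual Chef resource, so all class and function names are hypothetical.

```python
class FakeLoadBalancer:
    """In-memory stand-in for a load balancer API (not the real F5 interface)."""

    def __init__(self):
        self.nodes = set()
        self.pools = {}  # pool name -> set of member node names


def ensure_pool_member(lb, pool, node):
    """Converge lb so that node is a member of pool.

    Each step checks current state first, so running this on every converge
    is safe. Returns the list of actions that were actually needed.
    """
    actions = []
    if node not in lb.nodes:
        lb.nodes.add(node)
        actions.append(f"create node {node}")
    if pool not in lb.pools:
        lb.pools[pool] = set()
        actions.append(f"create pool {pool}")
    if node not in lb.pools[pool]:
        lb.pools[pool].add(node)
        actions.append(f"add {node} to {pool}")
    return actions


lb = FakeLoadBalancer()
first = ensure_pool_member(lb, "sso-production", "app01")   # three actions
second = ensure_pool_member(lb, "sso-production", "app01")  # nothing left to do
```

The second call returning an empty list is the property that matters: after the first converge the node is in the pool, and later converges leave it alone.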
12:24
But we like Chef. So I said there was a few problems, but really, we were learning it. It was my first experience with Chef other than a bit of Vagrant and Chef Solo. We heard about this role, about having a base role in all your boxes where you have kind of your basic stuff, your basic packages.
12:40
We were moving everything to LDAP because some silos used LDAP. Some had local passwords. Some had whatever. So I eventually moved everything to LDAP. So I created a base role, and we used the role cookbook pattern so it's tied to a cookbook called role base so that we can version it and have a bit more code than just a traditional role. So that does LDAP.
13:00
It does access control. And some basic packages. I personally think it's a crime to not run a server with sar on it, so install sar. But we had a whole bunch of other boxes in our environment. The legacy, what we call the legacy, NFL.com stuff, all the old club stuff, all these old boxes. So I made a new role called role minimal. And all it does, its excuse is just to put Chef on there.
13:22
It does some basic stuff like put my SSH key on it because it makes things easier. The base packages that I want on that box. But really it doesn't try to change LDAP. It doesn't try to change any of the conventions. But to me that was the thin end of the wedge. So now we've got Chef everywhere. And if someone's got a problem that Chef can solve,
13:40
all of a sudden Chef's already there. And they can just write a cookbook to solve their problem. So we don't have to go and we don't have to go take over the entire legacy stuff at one go. We've got Chef on. We can now start taking it over piecemeal and fixing component by component. And the key was getting a minimal just Chef install on that box.
14:02
So looking at 2014, Chef, knife vSphere and the F5 stuff, that worked great. A lot of manual steps kind of here and there that we had to clean up. I still don't know why developers need access to servers. I want to manage config files asynchronously. It's a problem I've been thinking a lot about. And it's the asynchronous versus synchronous nature
14:21
I think at this point in time that I'm getting in my head that I think is a problem. And also this non-standardization of frameworks. We had Chef managing 12 different applications. They're 12 different snowflakes. So what we had were 12 automated snowflakes. We hadn't really solved the problem. So now we're in the off-season of 2015. Our goal is to replace the NFL mobile application
14:44
and to do some work on the front-end site moving from the traditional kind of Tomcat JSP model into React, a more single-page app, we'll say. So this involves making some microservices. Again, because of the silos, everybody had their own kind of microservices or SOA or whatever the term of the day was.
15:03
There was no unified API for all the NFL properties. So part of this was to create services that any current or future NFL.com app could use. So, you know, as was the style of the time, we went with microservices, and that has been a good decision so far.
15:21
Still, really, I want to fix this stuff. I want the developers off the damn servers. I want to standardize the frameworks. And finally, I want to fix that config file problem. So my coworkers and I started reviewing the conversations we'd had over the past year with developers. In case I didn't mention it, I actually don't live in the same city.
15:41
We don't live in the same city as our developers. We do everything over chat. I went down to LA and sat down with them. We talked about this stuff, and I was like, what are the problems we've been having? And really, from their perspective, they didn't trust that the config on the box was what they thought it would be because their app wasn't behaving the way they expected it to.
16:00
Is it a problem with the code? Is it a problem with the config? And to be honest, a lot of the times, it was config or something else because of the previous configuration file problems. They hadn't really been fixed. They needed to see logs. We had started playing with centralized logging through Graylog, but not everybody was using the same framework, so they couldn't log it correctly or they were missing logs or their framework of choice wasn't compatible with,
16:23
didn't have a library that would talk this GELF format. You know, this framework looks cool. I think I'm going to give it a shot. Just a lack of trust, and it went both ways. And again, going both ways, we didn't realize that the stuff we were doing was making each other's lives harder.
16:42
And I kind of realized all these things were related. All these things have the same root cause problems. It's all about the conversations we weren't having. It's about the trust. So in 2015, we tried to drastically change the way we were doing things.
17:01
We realized we're a restaurant. We have one dish. It's a very good dish. We will work at making it the best dish possible, and if you're a developer, you can have this dish, and it'll do everything you need, so that's kind of an analogy. So basically we made a menu, is really what it comes down to, trying to keep with this whole chef, you know, chef and stuff.
17:22
One of the problems was, you know, when we were talking about things, we weren't talking about the same things often. So we said, look, everything's going to have a name, an app's going to have a name, and everything we do will have that name in it. If we're talking about the load balancer pool, it's going to have that name in it. It's going to be a short tag. The name of the server will have that name in it. Your Jenkins jobs, everything you want to do,
17:42
when you say SSO, I know what SSO means. The QA team knows what SSO means. Ops knows what SSO means. We're going to start with a build pipeline from day one. So from day one, when you have your first commit, it'll go to a staging environment. It may not do anything, but as soon as we know about this thing, we can put that thing in staging, and we're ready for you to push button to prod.
18:03
You're going to start out your projects from a template. We talked for a long time with developers about what do you actually want to, like what kind of apps are you building? You're building services for the most part. Why are you using anything but Spring Boot? We're not trying to solve any massive problems here.
18:20
You guys are all familiar with Spring Boot. You really like it, so let's just start from that as a default, which means you can log with SLF4J, and don't worry about logging locally. We'll inject that later. So a lot of these things, we're deferring some decisions about the way we're going to run the app in production until later, so more of this asynchronous stuff.
18:41
Don't worry about logging. You just use that framework, which looks very natural to you, and on the server, we'll inject it. You don't have to have any code in your repo. Same thing with analytics, the instrumentation. We're going to put all that stuff on the server. We'll give you ways if you want to use it during development, but really, don't worry about it during development mode. We're going to go with a deployable fat jar
19:02
for anything Java, so we don't have to worry about carrying dependencies or having dependencies on the server. We're just going to build a fat jar. We don't have to worry about Tomcat. We don't have to worry about a whole bunch of stuff. You just run that jar file, and they're all the same. They all get run the same way. They've got the same command line parameters. Every server will look the same, just have a different app, and we're going to have one Chef recipe,
19:20
so for all of our microservices, we have one Chef recipe called NFL Apps, and once you have that name, I think I use SSO here. If you tag your instance with app:SSO, and we use tags for a lot of things. I really love tags. I know there's a bit of hate for tags, but I think they're awesome, so if you tag with app: something,
19:41
that is the name of the app, and when you include NFL Apps, the recipe in your run list, it calls that recipe. It creates a pool for every environment, so we've got integration. We've got production. We've got staging. You know your servers are going to be the location, the environment, SSO, and then some ordinal. You know that the service name you get on that box.
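The naming convention laid out in this part of the talk (one short app name that shows up in the tag, pool, host name, service, log location, deployable and Jenkins jobs) can be captured as a small helper. The Python sketch below is an illustration only: the talk gives the pieces of each name, but the separators, zero-padding, lowercase spelling and the `lax` location code are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AppNames:
    """Derive every name for an app from one short tag (formats are assumed)."""

    app: str          # short tag such as "sso"
    location: str     # hypothetical site code
    environment: str  # e.g. "integration", "staging", "production"

    @property
    def tag(self):
        return f"app:{self.app}"

    @property
    def pool(self):
        # One load balancer pool per app per environment.
        return f"{self.app}-{self.environment}"

    def hostname(self, ordinal):
        # Location, environment, app name, then an ordinal.
        return f"{self.location}-{self.environment}-{self.app}-{ordinal:02d}"

    @property
    def service(self):
        return self.app

    @property
    def log_dir(self):
        return f"/var/log/{self.app}"

    @property
    def jar(self):
        return f"{self.app}.jar"

    def jenkins_job(self, step, label):
        # Every job starts with the app name; step 0 is the pull request build.
        return f"{self.app}-{step}-{label}"

    @property
    def log_query(self):
        return f"{self.tag} AND environment:{self.environment}"


names = AppNames(app="sso", location="lax", environment="production")
```

With this shape, anyone who knows the app name and environment can work out where to look without asking.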
20:01
How do you restart it? service SSO restart. The logs, if you need them, are in /var/log/SSO. We eventually got to the point where in production we don't need to log to disk anymore. It all goes to Graylog because we get to inject that stuff, and I'll get into that, but we get to inject that stuff at build time. The deployable will always be called SSO.jar,
20:21
and you'll always know where it is, so our ops guys, if somebody complains about what version of SSO is running on this box, it doesn't look like it's the right one or something's not running, they can go. They know where the binary is. They can look at the symlink and see what SHA is running on that box. Graylog, we tag. Because we're modifying the server config on the fly,
20:40
we're tagging those logs with app:SSO, so you go to Graylog. If you want your app, you can go app:SSO and environment:production or whatever, and you've got your logs. We use AppDynamics for performance monitoring. We use it very heavily, so when you go in AppDynamics and drill down, it's going to be called SSO, and finally, your repo's going to be called SSO,
21:02
and really finally, all your Jenkins jobs will start with SSO, and they're all going to be consistent, so you know that your build and test of your diff, as you're doing a pull request, is going to be number zero. We have a build and test after it gets merged to master, and then we have a series of deploys,
21:20
so your deploy to stage. By default, when you land on master, it goes to staging. You manually push the button for prod, and then we have a series of smoke tests and such that run. So this logging thing, this asynchronous pattern we found was very helpful. We just drop a standard Chef template with a logback.xml,
21:40
which overrides whatever's in the jar file, and so we can then pass it. I guess the key to this one, I mean, the template's fairly easy, but we can pass the template information about that server, so we know which environment it's running in. We know which instance it is, so if we have 60 user manager servers, we know that it's number 42. We know the name of the app,
22:00
so that in the template, we can pass a bit of XML that's Logback-specific, but we can pass that information so that all their logs are tagged with that information, and they can easily search on it. Another thing we had was developers would often say, can you increase the logging of this certain package
22:20
to debug? I'm having a problem with it. I don't know what it is, and by default, it's set to info, so we made a shortcut for our ops team to be able to add an attribute to a node, either in a wrapper cookbook or anywhere that the attribute goes. If you put in an array called debug packages, under app dot debug packages,
22:41
it'll write out to Logback that package name at the log level debug, and then one of the features we have configured in Logback is every minute it checks for changes so that within a minute of that node converging, it's going to start logging at debug level. But this is something a developer can go into a chat room and pretty much ask anybody on the team, can you please enable logging for me at this level,
23:02
and we can do that. And if there's more features they need, like they want to decrease logging of a certain package or whatever, we have another way through a hash that we can do that. We make it easy for... We normally do that through roles,
23:21
which makes it easier to search on and to reuse. Again, we have this pattern we use where we will have a third-party agent on the box, and we have a few of these. And the third-party agent in this case is AppDynamics, which is our... It's like New Relic, if you're familiar with that one.
23:41
It's an application performance monitoring tool. So we have a cookbook called NFL AppDynamics. And its responsibility is to put that agent on the server based on what attributes it sees, configure that agent, but it knows nothing about the application. So it's not put in the startup scripts. It's the job of the app recipe
24:00
to do an include recipe and then adjust any startup scripts necessary. Which means on the legacy stuff, we can add NFL AppDynamics to the run list. It'll install AppDynamics, but because it knows nothing about the startup scripts, we can go manually change the startup scripts. So now if we want to roll out an AppDynamics upgrade
24:21
of the agent, we can modify just that one recipe, and it'll work on both legacy and non-legacy applications. And so we use this pattern of extracting to a separate cookbook, and keeping that kind of boundary around knowledge of the application, in a few places.
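The boundary described here can be pictured as two functions: one that knows only about the agent, and one that owns the app's startup command and asks the first for whatever flags it needs. This is a Python illustration, not the NFL AppDynamics cookbook; the install paths, the attribute names and both function names are hypothetical. The only real mechanism used is the JVM's standard `-javaagent:` flag.

```python
AGENT_JAR = "/opt/appdynamics/javaagent.jar"  # hypothetical install path


def agent_jvm_args(attrs: dict) -> list[str]:
    """Agent side: turns attributes into JVM flags, knows nothing about apps."""
    if not attrs.get("enabled", False):
        return []
    return [f"-javaagent:{attrs.get('jar', AGENT_JAR)}"]


def start_command(app: str, agent_attrs: dict) -> list[str]:
    """App side: owns the startup command and splices in the agent flags."""
    # The /opt/<app>/<app>.jar layout is an assumption for the example.
    return ["java", *agent_jvm_args(agent_attrs), "-jar", f"/opt/{app}/{app}.jar"]


print(" ".join(start_command("sso", {"enabled": True})))
```

With this split, an agent upgrade touches only the agent side, and a legacy app with a hand-edited startup script is unaffected because the agent code never writes to it.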
24:40
So now configuration was the last thing I really wanted to fix. We moved to Consul. So we don't write config files to disk anymore. The easiest way to solve this problem was just to pretend we no longer have that option. So now how do we solve it? So Chef again. We have an NFL Consul cookbook. It drops Consul,
25:00
joins it to the cluster. We have one cluster per Chef environment. So Chef manages the agents. The recipe itself manages the startup scripts. And then we have just conventions within the system on where an app can find its configuration. So at boot, it loads its configuration from Consul. It's all Spring Boot. So we're actually able to write a module
25:21
that brings Consul data into Spring Boot variables. And finally, that config is now in a repo that anybody can use. And I'm just going to preface this with: it took us a while to iterate on this. The first version I did required a Linus Torvalds level
25:43
of Git knowledge to merge between environments. And my coworker Paul just looked at that and ended up fixing it with something else. And then we iterated on that for a while. And now we're really happy with it. We call it consolation. We use the same pattern as before. I don't want to dwell on that.
26:01
But we use consolation. So basically, it's a bunch of YAML files. YAML, sure, it's good. But now it's text files they can commit. One repo, we do peer review on it. So a developer wants to make a config change. They submit it. We use Phabricator. So they submit it for review, and then they land it.
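One way to picture what happens to those YAML files on the way to a key/value store: nested settings get flattened into dotted names such as db.url. The sketch below is in Python and assumes the YAML has already been parsed into a dict; the `flatten` helper and the sample values are hypothetical, and the talk does not say how the real tool lays out its keys.

```python
def flatten(tree: dict, parent: str = "") -> dict[str, str]:
    """Flatten nested settings into dotted key names with string values."""
    pairs: dict[str, str] = {}
    for key, value in tree.items():
        name = f"{parent}.{key}" if parent else str(key)
        if isinstance(value, dict):
            pairs.update(flatten(value, name))  # recurse into nested sections
        else:
            pairs[name] = str(value)
    return pairs


# Stand-in for one parsed YAML file (values are made up).
parsed = {"db": {"url": "jdbc:postgresql://db.example/sso", "pool": 10}}
for name, value in sorted(flatten(parsed).items()):
    print(name, "=", value)
```

Because the source stays a plain text file, the change a reviewer sees in a diff is a one-line edit to a setting rather than an opaque write to the store.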
26:22
The other thing we do within Phabricator is when we see that diff, we'll run linting on it. We'll try to compile it. So this is all kind of interpreted at build time because we put the secrets in HashiCorp Vault. So at runtime, it's compiling all this stuff. And we'll do that on every diff just to make sure that if someone deploys it,
26:41
it's not going to break. And then finally, as soon as that's merged, we push it to all environments. So this is a problem when people are talking about distributed config. They've got this question: is your config locked to your particular version of the application? And that came up as a problem. And I think there's a couple of different ways to solve it. And one is to version your config, have a different copy of the config per version. And the other is to say, look, I'm
27:01
going to have a series of constraints on myself. That I make sure that everything is in consolation, there can exist no config elsewhere, and if a key changes its meaning, then I'll have to make a new key. So if I once had db.url and all of a sudden I'm splitting my databases into two databases, I now have to create a new set of two keys for those two different databases.
27:21
I can't reuse it because that is not backwards compatible. And the other thing is if you see a key in your config and you don't recognize it, it must be for a version ahead of you. So for backwards compatibility, just ignore it. And so if you follow these three rules, what you get is you don't have to really worry about an old version of an app picking up
27:42
a new config or a new version of the app picking up an old config. As long as you're really OK with things like, say, thread number tuning applying across different versions, you're fine. But generally, we're tuning those separately from the development train. So this set of rules has really worked well for us.
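The rules about never redefining a key and ignoring keys you don't recognize can be illustrated with a small reader. This is a Python sketch under stated assumptions: the `settings_for` helper, the two new key names and the URLs are hypothetical, and only db.url comes from the talk.

```python
def settings_for(known_keys: set[str], store: dict[str, str]) -> dict[str, str]:
    """Return only the keys this app version understands; ignore the rest."""
    missing = known_keys - set(store)
    if missing:
        raise KeyError(f"config keys not found: {sorted(missing)}")
    return {key: store[key] for key in known_keys}


# One shared store serves both versions: db.url keeps its old meaning,
# and the database split is expressed with two new keys.
store = {
    "db.url": "jdbc:postgresql://db.example/all",
    "db.users.url": "jdbc:postgresql://users.example/users",
    "db.orders.url": "jdbc:postgresql://orders.example/orders",
}

V1_KEYS = {"db.url"}                        # old app version
V2_KEYS = {"db.users.url", "db.orders.url"}  # new app version

print(settings_for(V1_KEYS, store))
print(settings_for(V2_KEYS, store))
```

Both versions read the same store at the same time, which is why the config can be pushed to every environment as soon as it is merged.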
28:00
So now we have this asynchronous config. We find that attributes are a good interface between, say, my group, which is really focused on the developer interactions, and the ops team, which is focusing on the day-to-day. But really, we try to write attributes thinking
28:21
of how it's going to be used in the field. So one common thing we have is an app's running for a few months and the profile of the JVM memory changes. So we want the ops team to be able to notice this and make changes. So when we started, we had this calculation it would do. But really, we noticed they either want to change the OS overhead. So if you've got a six gig box
28:43
and you want to give the OS, say, a gig, then the JVM will get five gigs. But sometimes they want to set the JVM memory. So you may have that six gig box, but they want to try it at four. So we expose attributes to let them set those parameters,
29:03
however it fits our workflow. And we can either do it in a wrapper cookbook, or you can test it directly on node attributes. Or again, what we use a lot is roles. And roles make it easy for our ops team to try these things out.
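The two knobs described here, OS overhead or an explicit JVM size, amount to a small calculation. The following Python sketch uses the numbers from the talk (a six gig box, one gig for the OS, or a heap pinned at four); the function and parameter names are hypothetical, not the cookbook's attribute names.

```python
from typing import Optional


def jvm_heap_mb(total_mb: int, os_overhead_mb: int = 1024,
                heap_override_mb: Optional[int] = None) -> int:
    """Pick the JVM heap size from one of two operator-facing knobs."""
    if heap_override_mb is not None:
        # Knob 2: the operator sets the JVM memory directly.
        if heap_override_mb >= total_mb:
            raise ValueError("heap override must be smaller than the box")
        return heap_override_mb
    # Knob 1: the operator reserves memory for the OS, the JVM gets the rest.
    if os_overhead_mb >= total_mb:
        raise ValueError("OS overhead leaves no memory for the JVM")
    return total_mb - os_overhead_mb


print(jvm_heap_mb(6144))                         # 6 GB box, 1 GB for the OS
print(jvm_heap_mb(6144, heap_override_mb=4096))  # operator pins the heap
```

Exposing both knobs matches how the question arrives in practice: sometimes as "leave more for the OS", sometimes as "try this app at four gigs".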
29:20
It's really easy. People can go in and look and see how it's done because it's fairly well described. I'll say, though, that we are now experimenting with data bags. So literally just started the past week or so. So this may change if I give this talk again. I mean, it often comes up when you say,
29:41
we only do things one way, is, what if I really do need something that's not covered by the menu? And yes, we are totally fine with people doing things off the menu. But it means we have to talk. Our goal is, if a developer wants to create a new app, that we can have that whole pipeline and everything built.
30:00
They have to make no decisions. They just have it. It's easy. It's done fast. If you want to go off the menu, yeah, we can do it. But we're doing some custom work now. And then we have to start talking about conventions. So if, for your app, you want to use some new framework, and it's deployable as a jar, runs with AppDynamics, logs to Graylog, this is an easy conversation.
30:20
We can probably do that. We'll see. What do you actually want to get out of it? If you want to try something super new, something crazy, we have to really talk. Are you all right that we have no logging anymore? That we have no performance monitoring on that thing? Really? So these conversations, turning these into conversations about what does your app do and what do you really need out of it,
30:40
and maybe we have to adjust the menu for the future, that has proved very helpful in reducing the snowflakes and making it much easier to maintain our services. So looking back on 2015, the config evolved, but in the end it worked great. I'm very happy with it. The standardized apps actually meant that we don't need dev boxes anymore.
31:02
They're running Spring Boot. They run it straight within their IDEs now. They didn't have to simulate five, six different environments on their box, several different JVMs. It just works. They run all their stuff within the IDE. It means they don't have to go ask somebody to go turn that dev box into debug mode.
31:20
Now because they weren't on servers, because they had centralized logging, because if they had a question about the app's performance, they could just go into AppD and find everything, they basically didn't have to go on boxes anymore. We got it down to about half an hour from the developer saying, I want a new service, to staging and production boxes built,
31:40
to a pipeline in Jenkins, that repo being created, load balancer config, AppD config, all that stuff, because we were able to automate it and really drive down the number of options. And now, when I started, we used to deploy at something like midnight on Tuesdays and Wednesdays,
32:01
because they were scared. Deploy of NFL.com, my first one took six hours. Now we deploy these things anytime. We can deploy right before a game, we can deploy in a game if we have to, because we have much more confidence. Now that we've got these pipelines, developers started testing, we're able to really get our deploys much faster
32:23
and much more reliable, because we've automated, say, the load balancer, this deploy goes along, it takes a pool, it carves it up, takes a quarter of them out of the pool, drains the connections, and it's reliable. It happens the same way every time, and it just works.
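The deploy loop described here (carve up the pool, take a quarter out, drain, deploy, repeat) can be sketched in Python as below. The load balancer and deploy steps are passed in as callables because the talk does not name the APIs involved; `batches`, `rolling_deploy` and the callback names are hypothetical.

```python
import math
from typing import Callable, Iterator

Step = Callable[[str], None]


def batches(pool: list[str], fraction: float = 0.25) -> Iterator[list[str]]:
    """Carve the pool into consecutive batches of about `fraction` of its size."""
    size = max(1, math.ceil(len(pool) * fraction))
    for start in range(0, len(pool), size):
        yield pool[start:start + size]


def rolling_deploy(pool: list[str], remove: Step, drain: Step,
                   deploy: Step, restore: Step) -> None:
    """Deploy one batch at a time so most of the pool keeps serving traffic."""
    for batch in batches(pool):
        for node in batch:
            remove(node)   # take it out of the load balancer pool
        for node in batch:
            drain(node)    # wait for in-flight connections to finish
        for node in batch:
            deploy(node)
            restore(node)  # put it back before touching the next batch
```

Calling `list(batches(["a", "b", "c", "d", "e"]))` gives batches of two, two and one, so the last batch is simply smaller when the pool does not divide evenly.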
32:40
Most of our conversations with developers are much higher value, so we can start talking about how these apps are going to change the business, how these apps are going to work in the wild. We can talk about much more interesting things than I want to try this new data store, or why does my config file not look right, because we just, all our conversations are now better,
33:01
because we've been able to get rid of the stupid stuff. But now that we think about Chef as automating our servers, automating our load balancer, we start to think, what else can we automate? We're looking at things like DNS, our AppDynamics configuration, our builds and deploys, and also we use a bunch of CDNs, but we've been really happy with Fastly because of how easy they've been to automate,
33:23
and I'll talk about that in a second. But if you consider the code the source of truth, if you can have code go through some kind of transformation and then go to an API to configure some service, then you can start configuring that service in code. Rather than it being, if you're dumping, say, that service into Git, that's just an audit trail.
33:42
But if you can go to making that configuration code, you can now do peer review on it. You can store that history in Git, and you can start making it easier to actually use, because Chef taught us that configuration doesn't have to suck. This is an example of Fastly. If you've used Varnish before, it's a reverse proxy.
34:02
It uses a language called VCL, which can get very big. But most of our configurations have got a lot of boilerplate. So we're able to take these configurations and use ERB and write a templating language for it so that we can generate these VCL files. But now that we're generating VCL files and putting them into Git and Phabricator and Jenkins,
34:23
we can also write unit tests for these. We can deploy these to an alternate location in Fastly. We can say these URLs should be cached, and it'll go, and it'll go through that list of URLs and curl them, curl them again, make sure that the cache headers are there. You can say this must have this arbitrary header.
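The "curl them, curl them again" check can be sketched as a small Python function. It assumes the CDN reports cache status in an X-Cache response header containing HIT; the talk does not say which header the team checks, and `http_headers` and `uncached` are hypothetical names. The fetcher is injectable so the check can be exercised without touching a real service.

```python
from typing import Callable, Mapping
from urllib.request import urlopen

Fetch = Callable[[str], Mapping[str, str]]


def http_headers(url: str) -> Mapping[str, str]:
    """Fetch a URL and return its response headers with lowercased names."""
    with urlopen(url) as response:
        return {name.lower(): value for name, value in response.headers.items()}


def uncached(urls: list[str], fetch: Fetch = http_headers) -> list[str]:
    """Request each URL twice and report those not served from cache."""
    failures = []
    for url in urls:
        fetch(url)           # first request warms the cache
        second = fetch(url)  # second request should be a cache hit
        if "HIT" not in second.get("x-cache", ""):
            failures.append(url)
    return failures
```

A pipeline job would call `uncached` with the list of URLs that should be cached, pointed at the alternate deployment, and fail if the returned list is non-empty. The same shape extends to "this must have this arbitrary header" by checking a different key.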
34:42
So now we've gone from some web interface where you're typing in code to code you're typing on your computer, compiling, running through like a development pipeline, and then it's getting uploaded to production. AppDynamics has got this ugly way to go through a wizard of four steps
35:01
to create a health rule, but in the end, you're only really worried about half a dozen options. So what we did there was made a little DSL in Ruby to upload those directly, and again, we've got Git, we've got peer review. But before you think about automating everything,
35:20
look at the things you're automating and make sure you're not automating away conversations. I could get that half hour down to like five or 10 minutes, but if I did that, that would get rid of that conversation I had with a developer about what are you doing, how is this going to benefit the company, how are my ops guys going to run this app, and so we don't automate everything because we want to keep those conversations.
35:40
So now we really want to turn our attention to using Chef better as an ops team. Again, conversations will help us. So we do our very best to try and do some level of testing. This is just kind of a screen cap from Phabricator, so when we upload a Chef recipe, it runs the specs on it, and we found one real good test
36:02
is will it converge, and if it converges, chances are you don't have any unnecessary coupling to other cookbooks. If it doesn't converge, you probably have a problem, and even in the absence of an awesome testing regimen, just a basic will it converge is very helpful. We like peer review, so this helps us get better.
36:21
We got a lot of different levels of skill on our team, and when I started, I was the only person committing to Chef, committing to our cookbooks, and now we've got a team of 20, and a lot of them will commit, and so peer review helps us talk about code, talk about best ways of doing things, learn new ways of doing things. Just some examples,
36:42
like just talking about different ways of doing things, and on the bottom, a guy wanted to try something new, so I was talking about, well, how are we going to communicate this change back to everybody else? Another thing that helped us keep things clean is auto-linting with Foodcritic and RuboCop. We try not to be too fussy about it,
37:00
but really, we're just trying to keep the level of errors from increasing. We know we got a lot of cruft in there. Actually, this is an example from this morning, where I just popped into the chat channel, and somebody uploaded a cookbook to role base, and they got a Foodcritic error, and so this is kind of keeping our kitchen clean. And also, just as a team, work on your workflow.
37:21
We have this recurring problem where someone will upload to Chef and not upload their changes to Git, and then the version numbers get out of sync, so this is something we constantly work on, and this one is an especially bad one that happened a few weeks ago. But trying to work on that, because that improves your cadence as a delivery team. And finally, when you're coding,
37:41
you're really coding for the least experienced member of your team, not necessarily like the junior developer, but the new person on your team. So you want to make your variable names descriptive. You want to expose stuff as an operations person would think about it, not as a developer would. So really, you're trying to make sure that what you've written
38:01
is very easily usable by your peers. Just want to quickly wrap this up. We think about things in terms of conversations. Are we having better conversations or worse conversations? And how do we use Chef and our automation to increase those good conversations? Chef is way more than just configs on servers. You configure your load balancer.
38:21
When I get back, I'm looking forward to doing some stuff on some of our network gear. Reduce complexity aggressively and do that by coming up with a shared menu and automating the hell out of that. The Chef way is that code becomes the infrastructure. See if you can apply that other places.
38:41
And finally, keep improving. My contact information, my Twitter handle. If you're going for one of your Linux certifications, I wrote a book that came out a little while ago. It's good. On that, thank you very much for your time.
39:02
All right, guys. We have about 10 minutes for questions. So if you have one, please raise your hand. I'll bring the mic to you. Thank you. Hi, Sean. I'm just curious about the knife vSphere plug-in that you have. Yeah. Can you compare or contrast to the Chef provisioning vSphere plug-in?
39:22
I don't know if they're complementary or they overlap. Have you used that, one versus the other, when you're using VMware, using Chef to automate provisioning of VMs within VMware hypervisors? Yeah, we generally don't build new environments
39:40
very frequently. So we're mostly building it box by box. And the knife vSphere, you can think of as a replacement for going into the GUI and doing stuff, whereas the provisioner is like, I want to go build an environment that's got five SSO servers, six user manager servers. Have at her. So under the hood, they're using the same library. But knife vSphere is more focused
40:02
on kind of the operations level of running a vSphere cluster. Does that make sense? Much of what you said around how you're dealing with developers is they can have it in any color so long as it's black. And so how did you develop the relationship with them?
40:23
How did you gain that level of control, trust, et cetera? Because that's really the crux of the matter in the end, isn't it? Yeah. So thank you for that question, because one thing I forgot to say was kind of in this 2015 season when we realized that the problem was conversations. We actually reorganized our ops team around this idea.
40:41
So we went from kind of having an ops team and a project team to having this devops team. I'm sorry for using the name. But there's like four of us, and our job is really just to work with developers and make sure their lives are easier. And then we've got project teams and ops teams, but with these dedicated people, they go from interacting with like 30 people to interacting with just a couple people
41:01
for their development work. So that one-on-one, it took us a while. I said two and a half years. And there's some people we're still working on.
41:23
You mentioned using the knife command triggered by automation. What are you using to orchestrate those? Are you manually running that knife command each time, or do you have some sort of automatic tool generating those commands and running them? We use basically shell scripts that wrap that long command.
41:44
I don't think I showed it. I ended up taking it out because it was long and didn't make sense. Yeah, we do a lot of tweaking. So we'll build like, for sake of argument, five production user manager servers. Then we'll go do a series of load tests and realize we've got to run like 30 of them. So we didn't really find value in something like provisioning
42:01
just because we don't tear down environments frequently. We're still very much in that old model of VMs and growing and shrinking. Got it. Thank you.
42:22
Working with the developers, did you give them access to the basic Chef repos that you were working with and encourage them to manipulate Chef in their development environments? So everything we do now internally is public. So all our Chef recipes are public.
42:42
So far there's been very little interest in doing that other than if they're curious about how something will look. It hasn't really been a thing. They're Java guys for the most part. They're not interested in Ruby. I see it the opposite way around myself, but not enough factories.
43:04
Hi, Sean. Is the Consul Spring Boot library you wrote for them to be able to integrate with that open source? It is not, but I'm now going to bring that back to my guy and see what we'd have to do. I don't see any reason why it shouldn't be. OK, awesome.
43:21
Thanks. So you mentioned you have some standard cookbooks which the developers can use.
43:40
Can developers write their own cookbook and execute on their environments? For the most part, they don't. So if they wanted to, they could. I'd be happy if they did, guys. You're watching the video. I'd be happy. Yeah, for the most part, they don't even use dev environments anymore. They'll run it on their laptops, just like within their IDE.
44:00
So it's really been a non-issue. So they do have access, but they don't do that. So what do you think? Yeah, we have an internal GitHub enterprise, and we open our cookbooks for everybody to look at inside the company. But if they do, will you allow them to execute it directly, or does it go through a review cycle of an ops team member has to review whatever cookbook
44:23
has been written to be pushed on? We do code review internally. I would want somebody from my team to look at it. But chances are we'd be talking about it with them first. So if they wanted to change, they wouldn't just go make it. They'd be saying, I would like to talk about changing this.
44:40
And whether or not they make the code, or I make the code, or Paul or Colin makes the code, at that point, that's not. It's a conversation we're trying to encourage about what are we not doing for you. OK. We've got time for one or two more questions. We've got one in the back.
45:01
Let's see if there's any others. From looking at the problem that you put on the screen, where the guy uploaded to the Chef server, well, we solved that by making people upload to Git. Then we got a hook that puts it into Chef. So they're not allowed to go straight to the Chef server.
45:23
Yeah, that's an interesting way of doing it. Yeah, either that or there's Chef Guard, I think, which just prevents you from doing it. I actually prefer your approach, so I have to give it some thought. Cool, and the last question.
45:42
I was just going to follow on from the thing you were saying with a useful tool we found called knife-inspect, which basically, you run it in Jenkins and it makes sure that what's in Git is in sync with what's on the Chef server, and then the job fails if it's out of sync. That's brilliant, yes. It's awesome. That's our problem, is people will leave,
46:01
and then also, I use Berkshelf myself, but I don't want to tell my coworkers how to do their work, so some of them will use Knife, and they'll forget to go --freeze. Like in that case, actually, that guy had actually stomped on an old version of a cookbook, but because we use version locks in staging and production, that saved them, but knife-inspect, looking it up when I get home.
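The kind of check being discussed, failing a job when Git and the Chef server disagree, reduces to comparing two inventories. The Python sketch below is not how knife-inspect works internally; it assumes the cookbook names and versions have already been collected into two dicts, and `out_of_sync` and `exit_code` are hypothetical helpers.

```python
def out_of_sync(in_git: dict[str, str], on_server: dict[str, str]) -> list[str]:
    """List cookbooks whose version differs between Git and the Chef server."""
    problems = []
    for name in sorted(set(in_git) | set(on_server)):
        git_version = in_git.get(name)        # None if missing from Git
        server_version = on_server.get(name)  # None if never uploaded
        if git_version != server_version:
            problems.append(f"{name}: git={git_version} server={server_version}")
    return problems


def exit_code(in_git: dict[str, str], on_server: dict[str, str]) -> int:
    """Return a non-zero status so a CI job fails when anything has drifted."""
    problems = out_of_sync(in_git, on_server)
    for line in problems:
        print(line)
    return 1 if problems else 0
```

Comparing version numbers alone would not catch the case mentioned here, where an existing version was overwritten with different contents; catching that would need a comparison of file checksums as well.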
46:22
Awesome, guys. Thanks for coming. Let's give one more round of applause for Sean.