Effective CI/CD for Large Systems
Berlin Buzzwords 2022
License: CC Attribution 3.0 Unported
DOI: 10.5446/67199
Transcript (English, auto-generated)
00:07
Good morning, everyone. My name is Josh Reed, and as was said, this is Effective CI/CD for Large Systems. I'm talking about this because it's my passion. I'm weird, I'm a build nerd. Fortunately, I get to work in release engineering, which is basically taking care of these kinds of things.
00:24
I work at Aiven. I'm not really wearing any Aiven swag except for these bad boys. So if you want some socks, you can't have these socks. These are out of print, but you can have other socks and they're down in the partner area. So come see us. This talk is not gonna be about any magic solutions
00:42
to your CI/CD problems. If you're interested in that, I don't know what to tell you. This stuff is hard. These are lessons we've learned at Aiven from trying to do this with thousands and thousands of VMs and systems that spin VMs up and down all the time and use a lot of resources. There's no easy way around this,
01:01
but as you scale, this stuff gets hard. We're also not talking about cool ways to test your system. I've heard talks about things like Jepsen or fuzz testing or fun testing tools. It's all great. There are things that will unplug network interfaces to try and verify the guarantees your system tries to make about consistency, and we're not doing that.
01:23
I would love to do that someday, but it's not this talk. This is more practical and grounded, I hope. This is not a pitch for a particular tool. If you use Jenkins, if you use GitLab, if you use CircleCI, I don't care. I mean, I have my opinions about those systems, but I'm not gonna tell you that one solution solves all your problems.
01:42
When I started this talk, when I was going to submit it, I actually wanted to call it Effective CI/CD for Large Complex Systems, but I kind of realized I hated that because I hate complexity. It's actually the bane of my existence. Testing a complex system is really hard, and really the goal of this talk
02:02
and what I hope you get out of it is that we wanna find simplicity wherever we can in order to tackle these problems. It's a constant battle because things get complex with scale, then things get complex with scope. It makes things hard, but we really don't win in the battle against complexity
02:20
because we have combinatorial explosions. We are fighting against complexity that will try to consume us, and we can never hope to scale our own resources up to meet that challenge without finding some simplicity in the middle of all of it. So what is a large system? Why am I talking about large systems? Oh well, one of the things that Berlin is buzzing
02:42
about right now is scale, and when we get big, things get complicated. When you have a lot of data, it takes a long time to move it around. When you have to manage lots of systems, things happen that wouldn't happen before, and there's mathematical reasons why that's the case. Things just get harder, and in general, a lot of things just were not designed with certain scales in mind
03:01
when they built them in the first place, and we take things to new heights, it gets harder. There's also the more classical way, maybe you heard of scope creep. We used to talk about this a lot more, I feel, than we do now. We talk about scale a lot these days, but scope is still a thing. As you add more features to your system, as you get that extra demand, oh, I wanted to do this, this, and that,
03:21
as well as the thing I originally told you, you add more conditions, you add more complexity to your system. That makes it larger. We also grow things in terms of lines of code. As someone who deals with CI/CD a lot, as someone who deals with growing code bases, this is a problem. It's more code to test. It's a greater liability.
03:41
We haven't done a lot of studies on the quality of code in general, but one thing we do know pretty well is that more code means more bugs, so it gets harder when you have bigger code bases. One thing that I think is really instructive when thinking about quality and thinking about how we get things shipped in software is to compare it to other industries.
04:03
In the automobile industry, what do you, let me ask you a question, what do you think it takes for the typical new model car to go from conception to, you know, it's in a showroom floor, it's in a dealership? How long do you think? It's about five years. It's about, actually, five to six years
04:20
is about the correct answer for that. So what if your CEO or your project manager or some major stakeholder for you came to you and asked you, hey, I want to ship this new software feature, and you told them, it'll take about five to six years? You'd be laughed out of the room, you'd be laughed out of the industry, they wouldn't give you a job again. Because that's not the expectation that we have in software.
04:41
Like things are just easier for us. We don't have to deal with the, you know, the physical logistics of real production. And that's awesome. It lets us do things that other industries could only dream of, but it comes with this problem of we can kind of outrun ourselves in terms of complexity pretty easily.
05:02
How many VMs can you spin up in 15 minutes? How many do you think? Thousands. How, I mean, how much data? And so there's a lot of data people here. How much data can you store in 15 minutes? You just generate random data. What do you think?
05:21
Terrible. I don't know the actual answer. At some point you'll be limited by either your network bandwidth or your quotas or your credit card. You know, like at some point there's a limit that's coming into play. No other industry gets this ability. So we have this crazy scaling ability that nobody else does. And it comes with other problems, because you can ship a feature
05:41
and you know, it doesn't take you five to six years. You can ship a feature in two or three days sometimes. You can also destroy a company in 10 minutes. So we have to worry about quality. We have to have some checks in place. And how do we maintain quality in systems that grow this big, this fast? Do we hire large QA departments?
06:02
Does anybody work in a company that has a very large QA department for a product? Anybody here? I'm curious. So one person in this room. It's not how we do things anymore. We used to, but for one thing, it's really slow. It's not always a bad call, but it's just not how we do things in software anymore.
06:22
We use CI/CD, which asks some really simple questions about the code we make on a regular basis. Can I build it? Is it okay to merge these changes? Can I merge them into, can I integrate them back into, that's the integration question in continuous integration. Can I integrate them back into my pipeline?
06:40
And can I deploy or deliver it to the next level? That's how we approach these things these days. These are fundamentally reductive questions. We take a lot of complexity about the state of our system and reduce it to very, very simple binaries of yes or no. And these questions are reductive because the actions we take are reductive.
07:02
It's you either have built your software or you have not. The build worked or didn't. You can merge it or you cannot. There's no half merge. I mean, you can rewrite the code and do less of it but that's still, there's no half merge. There's no real half deploy. There kind of is, but we'll get to that.
07:22
But you still have to ship it somewhere and it's either shipped or it isn't. So real systems, unlike that binary, tend to look more like this. This is just a, I scraped some Grafana page from the internet. But when I ask is it up, I don't get a simple answer. I get this. I get a bunch of different data and I get, oh,
07:40
maybe something's kind of in the red here and this looks okay. And this is just, I think, one cluster in a Grafana page here. But when we're talking about large systems, there's typically always kind of a more complex answer to the question of is it working? If you ever see a status page, status.insertyourfavoritesystem.io here,
08:02
there'll be a bunch of green dots. Those are lies. Those are lies that someone has put there to make you feel better because there's always something going on behind the scenes that's probably bad. It's just being taken care of and they haven't updated the page to say, okay, this is in a degraded state or there's an outage. But CI/CD is fundamentally pass-fail.
08:22
That's a much simpler picture than the real picture of a complex system. Now, you might think that's a bad thing. We're reducing, we're taking away information, but not really. Remember, the actions we take based on this are themselves reductive. So we need this simplicity. You know, we need red and green builds. Otherwise we end up with a bunch of brown
08:41
and dealing with this as an operator, when someone gives you incomplete, inconclusive information, well, it feels like the poop emoji. If someone gives you a test report and says, figure out whether this needs to be deployed or not, it's not fun. Unfortunately, in order to reduce this thing,
09:01
all the way down, we need some kind of reducing function. And typically that reducing function is: does test one pass, and does test two pass, and does test three pass, and does test four pass, on and on and on, do all my tests pass. Mathematically, this starts to become a problem because, the more tests we get,
09:21
any one of these tests can make the whole thing fail. We could do things like saying a certain number of things can fail, but that comes with its own set of problems. But typically this is actually what we want. But at, say, a 0.1% failure rate, say one out of a thousand times, you have a test that can fail, and any test can fail.
09:42
AssertTrue can fail if your computer turns off in the middle of the operation. Any test can fail; they just have different rates of failure, of uncaused failure. But at that 0.1% failure rate, it only takes about 220 tests before you start to see a 20% failure rate for the overall run.
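A quick aside on that arithmetic (the flake rate is the one he quotes; the rest is my own illustration): with an independent failure probability per test, the chance that an "all tests must pass" gate goes red compounds like this.

```python
# Probability that an "all tests must pass" run goes red, assuming each test
# independently produces a spurious failure with probability p.
p = 0.001  # a 0.1% flake rate, i.e. one uncaused failure per thousand runs

for n_tests in (220, 693, 1000):
    p_run_fails = 1 - (1 - p) ** n_tests
    print(f"{n_tests:4d} tests -> {p_run_fails:.0%} chance the whole run fails")

# 220 tests -> 20% chance the whole run fails
# 693 tests -> 50% chance the whole run fails
# 1000 tests -> 63% chance the whole run fails
```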
10:02
It only takes 693 before it starts becoming more likely than not that your test fails. So as you add more tests, you start to introduce this problem of flakiness over and over and over. So systems actually get much harder in that case. This can be exacerbated as well by tests
10:20
that rely on more moving parts, which makes that flake rate change quite a bit. It can go up. One of the ways that people have dealt with this traditionally, and this still is an okay way to deal with it sometimes, is to retry things that fail. This assumes that you can have retryable code, which is not always an easy assumption,
10:42
particularly when state is involved. And it can break test isolation, depending on how you set up your tests. But more crucially, it adds a lot of time. If you have to retry things over and over again, that can add a lot of delay, especially in cases when you have a legitimate failure that you need to be informing your developer
11:01
about. Retrying an expensive test can add minutes and even hours to a test run sometimes. It's not necessarily a bad solution, but it's not the only one we should be looking at, especially when you consider the number of tests we actually need to verify the behavior of software. This ugly picture is an example of a system with, say, four components.
11:21
Each one of these components has some interaction with the other component, some number of interactions with the other components. So it could be a method call. It could be a different branch that the code takes as a result of a different parameter. It doesn't matter. This is simplified for the purpose of an illustration. But although ugly, it's not that big of a system.
11:40
It's four components with three to six interactions each. But if I wanna test every possible combination of that, that grows very quickly. Just this small thing, I need 1,800 tests to exercise every possible combination of it. What percentage of tests are we actually going to write?
12:01
If you need to write 1,800 tests to test the system, how many are you gonna write? Zero, that's a good answer. Actually, I should've put 0%, somewhere in that range. Like, how many of those do we have? Do we know?
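The per-component interaction counts from the slide aren't in the transcript, but the shape of the problem is a product versus a sum. With made-up counts (chosen only so the product lands on 1,800), it looks like this; the contract-per-component idea that comes up later is what turns the product into the sum.

```python
import math

# Hypothetical interaction counts for four components -- purely illustrative.
interactions = [5, 6, 6, 10]

every_combination = math.prod(interactions)  # exercise every path through the glued-together system
per_component     = sum(interactions)        # exercise each component against its own contract

print(every_combination)  # 1800
print(per_component)      # 27
```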
12:20
Probably not. If we've glommed our systems together like this, we probably don't even know how much we have or how much we even need. You probably don't actually need all of these tests. Maybe some of these combinations are unreachable or so irrelevant they don't matter. But do you know? No, you just ship it. Whatever. There's some other ways we can deal with
12:41
this growing number of tests if you keep writing tests to try and fight this battle against combinatorial complexity. But then you can do things like parallelizing tests, which we do at Aiven and a lot of places do, because you need to in order to get things faster. But you start running into even more problems as a result of this. You have resource contention.
13:02
You ever had a test that failed because another test was eating all its memory? That's a problem we have. There are ways of dealing with it, but again, that's complexity. There are race conditions that can happen. Not all code is meant to be tested together. And getting good isolation between those things can be hard. What if you're using external resources?
13:22
I know one of the things that we do at Aiven is we have a distributed lock on systems that we test in order to maximize resource usage for tests that require lots of VMs. If you have a distributed lock in your tests, you have tests that themselves need test code,
13:43
which starts to get really complicated. Distributed locks are not simple pieces of software to write and they're easy to get wrong. So when you have test cases that themselves need test cases, you start realizing you're fighting complexity with more complexity and maybe there's a better answer.
14:01
The best way I know to escape this is actually to take a page from the auto industry and start using parts, use components. Physical industries have to do this because there's just no way to bring their products to market without doing this. If you tried to build a car the way we build software, then you'd run out of money really fast because you'd be making,
14:21
if you took a part that you made for a car and, as soon as you finished making it, you just screwed it right back onto the car and then you put it all together and tried to test it, that wouldn't work. And so it doesn't work in software either, but we do it all the time. I mean, we can kind of make it work because we're lucky in our industry.
14:42
But we tend to ignore this component thing until it's a little too late. I wanna be clear that components can be anything. I'm not here to tell anyone that you need to go start writing microservices so you can test. Don't take that message away. That's not what I'm saying. You can use functions, classes, modules, whatever facility for isolating things
15:00
that your language or environment has, those are all things that can be components. You can ship libraries internally within your company. You can do service-oriented architectures for this. That's another thing that can be considered a component. In fact, you can consider microservices architectures as a way of architecturally guaranteeing that there are some component boundaries,
15:22
but it's not necessary to do these kinds of things. An important element of components, at least in what I mean, is that they have contracts. In the same way that an auto manufacturer would expect that their suppliers build components to spec, we need all the constituent parts of our software to have specifications that we know what they can be tested to.
15:43
When I say contract, I don't necessarily mean something very specific like design by contract, which has specific ideas about pre- and post-conditions. That's cool. That's an interesting topic. You can Google that, search that, it's great, but I mean something more generic.
16:00
For a software component, it's just: what does this thing need to work? What are its inputs? What are its dependencies? What does it do, what are its side effects, and what are its return values? And what are its failure modes? How can it fail? That alone is a pretty powerful tool for simplifying our software because it means that I don't have to write a test
16:21
over combinatorial explosion of things because I don't have to put everything together. I can test each part to contract. And you might have to go on a leap of faith with me. This actually reduces a combinatorial problem to a linear problem because now I need to write a set number of tests per component that I write versus that huge combinatorial explosion
16:43
that we saw before. That doesn't mean I'm never going to write tests that cover the whole thing, but this really makes the whole thing much more tractable and it puts us on track to have the kind of rapid feedback that we want in CI. It also, having components, gives us this property
17:03
that I like to call localizability. I'm pretty sure I made that word up, but I hope you can figure out what I mean by it. It means that you can assign causes and find locations when you encounter problems. How many people here have a situation where some of your best devs spend a lot of their time
17:20
looking for, investigating code, looking for complex problems that result from interactions between weird parts of the code? Yeah, it happens all the time, and that's often your best developers that are spending a lot of time doing those kinds of things. Having clearer contracts on what things should and shouldn't do tends to make this easier.
17:41
It doesn't perfectly solve it. You're still going to have weird interactions that come up from time to time, but generally speaking, you'll see, oh, this component failed in its contract. That's either the failure of the implementation or it's an under-specification in the contract. With this in mind, we can start talking about kind of what a good CI/CD system is going to look like.
18:02
CI/CD is, of course, not Jenkins. A lot of people ask, what is CI/CD? Oh, it's Jenkins. Or things that have replaced Jenkins, like GitHub Actions or GitLab or whatever other solution you might be using. CI/CD is a set of practices, and that's about continuously integrating code
18:22
based on rapid feedback. Your system should be helping you get confidence to merge and confidence to know when to deploy. If it's not doing that, and letting you do that in a fast way, then it's failing in its job. Fundamentally, it is a system of gates.
18:41
It is, again, binary questions that happen at different points. What does it look like? Normally, when you have a commit, it's going to automatically trigger a continuous integration process that will happen before you merge. You're going to run linting on your code. You're going to say, does this look good? You're going to run your build.
19:00
If you, some of you may use dynamic languages which don't strictly need to compile, but often there's still build elements involved like installing dependencies or doing code generation or other things that might need to happen in order for your source to be ready to test. After that, you want to have fail fast tests. I really hesitate to use words like unit tests
19:21
because people will now spend hours arguing over what that means. I don't care. What I care is that they are fast and they are low resource usage. So whatever you want to call that is fine with me. Those things can run in 30 seconds to a minute. That gives you fast feedback to developers. Is this basically right?
19:42
Then after that, you can have some more thorough tests and that's where you have your integration tests, your tests against contracts. Even some light end-to-end testing that I would classify as smoke testing. Typically very, very simple cases where you can just see, does this work? Kind of where there's smoke, there's fire kind of situations.
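As a rough, tool-agnostic sketch of that gate ordering (he is explicitly not pitching Jenkins or GitLab, so the stage names and commands below are placeholders, not any particular system's configuration):

```python
import subprocess
import sys

# Cheapest, fastest feedback first; each stage is a hard pass/fail gate.
STAGES = [
    ("lint",           ["make", "lint"]),
    ("build",          ["make", "build"]),
    ("fast tests",     ["make", "test-fast"]),      # target: well under a minute
    ("thorough tests", ["make", "test-thorough"]),  # integration, contract, smoke
]

for name, cmd in STAGES:
    print(f"--- {name} ---")
    if subprocess.run(cmd).returncode != 0:
        print(f"gate '{name}' failed; stop here and give the developer fast feedback")
        sys.exit(1)

print("all gates green: okay to merge")
```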
20:03
Each one of these steps is designed to give you fast feedback. If things fail, that should be relevant feedback for a developer that's getting that information. If it failed because of irrelevant things, that's a problem in your system. You need to correct it. That's what I spent a lot of time doing on release engineering is trying to correct
20:21
those irrelevant failures that crop up from time to time. This whole stage, in an ideal world, takes five to 15 minutes. Why that number? Because that's about as long as you can go before things start leaving your brain. I mean, maybe you're different. Some people might be able to hold problems in their head longer, but I'm not.
20:42
I mean, I probably have two minutes. You know, what's the shiny thing going on? But no, five to 15 minutes, it's enough time to get coffee. It's enough time to have a smoke break. It's enough time to, you know, I don't know, check slack, although that may be more distracting, depends on how you deal with distractions.
21:01
But anyways, it's really critical that this thing stays very fast. And if it starts to creep above this number and you don't do anything about it, it can cause a lot of problems and it can become very, very, very difficult to get it back down again. After that, you wanna have a code review stage typically, well, which is a human stage.
21:20
This is not a, I mean, you can have automated processes that help you with code review, but typically this is a human looking at things guided by information they see from continuous integration. Maybe that's helping them make that decision. But, you know, of course you don't wanna merge things that have failing tests. But, you know, then after that,
21:42
we can talk about merging. We move on to our continuous delivery stage, which is typically seen as a more mature phase of this, but it requires a little bit more automation on your side but we start moving on to places like a staging environment, could be any alternate testing environment. From there, if we have problems,
22:01
say doing things like load testing or performance testing, we can roll back and question whether we really should have merged in the first place. You can always revert things. A merge is not permanent. Don't be afraid of merging. This whole system should be designed to help you have confidence, not make you be more scared because you have to merge.
22:21
From there, another cool thing we can do is something called canary deployments, where we actually start talking about deploying to different parts of our production environments, just little slices of it to see how does this work. It's one of the nice things about software as a service is you get to do stuff like that. And as we get more information, we use our monitoring and observability tools
22:43
to see how our code is doing. We can move it to the rest of production. As a side note, kind of at the end, I mentioned feature flags, which is a way we can actually change the behavior of software in production even without doing code changes. It's the way you build in configurability
23:00
into your code as you go so that, for example, you can dark launch features. You can say, I wanna ship this, but I don't wanna turn it on yet. Once I test it in a few places out in production, I turn it on and make it generally available. Betas are a good example of this, but you can actually get really sophisticated with this, and many companies that do this effectively
23:21
at scale do. This process, if you thought five to 15 minutes for CI was scary, then 15 to 30 minutes for deployment should be probably a little bit scarier. But there's good reasons why you want those times. In the case of continuous delivery,
23:40
why you want that in a short time is because you end up getting exposed to a lot of risk if you don't have it. If it takes you a long time to do a deployment, it'll take you a long time to do a rollback, and that will make you scared. That will increase a lot of fear in your system. Having easy rollbacks and having easy deployments will continue to give you confidence
24:01
that you can fix bugs, you can patch security issues, you can do things on the fly, and even though it's a lot of work, it inspires a lot of confidence. The last important part of this whole picture is that you need an overall meta-feedback mechanism from your continuous delivery process back into your CI. If you discover bugs later in the system,
24:23
that one of your first impulses should not be just to fix the bug, but also to write a fast test that can reproduce that bug in your CI system. Otherwise, you're gonna end up with regressions. Otherwise, you're never gonna be improving the CI system, and you'll lose confidence in it because it's not actually solving the problems
24:41
that you were aiming to solve in the first place. Like I mentioned before, the time thing, waiting is a time sink. You are burning money. I think if we understood how much money we actually lose just by waiting for builds to complete, by waiting for tests to come back, we would invest a lot more in these kinds of things.
25:03
Like I said, it increases recovery time. It increases exposure to your security bugs. It has a lot of negative effects on developers. Slowness tends to lead to complexity. Not only, like I said, developers, you're not gonna keep things in your head more than five, 10, 15 minutes.
25:20
So when you get the test result back later, you're thinking that must be someone else's problem. That can't really be my code that's causing that problem. You'll have unloaded things from your brain and context switched out. Everyone, I think, in here probably knows when you're developing, you'd like to lock in, put on the headphones, get a good bit of flow going,
25:44
because you're loading a lot of information in your brain and reasoning about a lot of complex pieces. Once you unload all of that from your head, it's work to get it back in. And if you're switching between tasks because you're waiting on CI to complete, you're just forcing this to happen all the time for your developers every day. In addition, it leads to a temptation to batch,
26:02
which by that I mean, you put unrelated changes together because you fear the long CI process, because you fear that it's gonna take a long time to get your changes through, and you don't wanna go through that process again. This really decreases localizability. Imagine if you're an operator and you found a bug in production,
26:22
and you're trying to isolate the cause, what that is, and you go to a commit and there's a bunch of weird unrelated things that have been lumped together and you don't know why, and there's no description for it, you're gonna have a hard time figuring out how to deal with that. And you might end up reverting whole changes to deal with it because someone batched
26:40
and they batched because they were waiting for CI to complete. Why do we get slow pipelines? The main reason is because we build our systems in ways that are hard to test and are hard to build in small increments. And that comes down to having a lack of contracts defining what your components are supposed to do.
27:01
Sometimes it's just because we don't, we're just trying to get something to work and we don't really care about this stuff. And we don't really consider it as a design constraint. Another element that plays a big part in this is having really coarse-grained dependencies. And what I mean by that is when you have, say, component A needs dependency X in order to work,
27:23
but you don't have A to X as a dependency in your system. You have all of A, B, C, D, E, and F needing X now, because you don't have a more fine-grained way of determining what belongs to what. So assigning meaningful dependencies and having smaller build units can actually speed these things up in a big way.
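A minimal sketch of the difference (component and dependency names invented): with per-component dependency declarations, a change to X only pulls in the things that actually declare X, instead of the whole A-through-F blob.

```python
# Invented fine-grained dependency declarations, one entry per component.
DEPENDS_ON = {
    "A": {"X"},
    "B": {"Y"},
    "C": {"Y", "Z"},
    "D": set(),
}

def needs_rebuild(changed_dependency: str) -> set[str]:
    """Components to rebuild and retest when a given dependency changes."""
    return {c for c, deps in DEPENDS_ON.items() if changed_dependency in deps}

print(needs_rebuild("X"))  # {'A'} -- only A, not every component in the repo
```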
27:44
I'll skip this for now. Since we were talking about having contracts and how you would divide things up, I wanted to go through kind of a quick case study of what this might look like, how we can do this. This is an example of something we do at Aiven.
28:01
This is a real thing we do. Sometimes, when we need to rescue machines from a situation where they filled up their disk, we'll actually use cloud technology to add more disk to that machine, extend volumes, get what we need done, and then move to the next machine
28:20
that has the right disk space again. It's an important tool for certain situations that come up in operations. So this process is really owned by our ops team, but it still crosses the boundaries of three different teams, the ops SRE team, the cloud team that manages cloud infrastructure, and the VM team that writes the software that sits on all our VMs.
28:41
This process works like this. The operator will issue the command add VM volume, which produces a work item in a work queue, which is consumed by the cloud resource manager, which actually makes the call to the cloud provider and says, hey, add some more disks to this machine. And then when the disk is dynamically added,
29:00
that machine can extend the logical volumes there, and then you have more disk space. Important use case for us, we have some tests that run on this. And we have, what, we have eight public clouds now, is that right? I think something like that. And all kinds of different, all kinds of different instance types
29:23
that end up going through there. So in order to run this effectively, we have to run tons and tons of this test. And it takes a long time because this is stuff that happens in the cloud and actually involves physical infrastructure. But in order to act, but really a lot of this stuff can be tested very well in isolation. We already have contracts here.
29:42
The work item itself, that's a contract. So all I have to test to know that this part of it works right, all I have to test to know that the command works right, can you create a work item? That doesn't change that often, so I don't have to worry about it. For the resource manager, the cloud API calls don't change that often. So I can just check, can I fake up a work item
30:03
and can you produce the right API call code from that? That's most of the check I need. I'm not saying I'm never gonna call this, I'm never gonna call the public clouds again, I'll test them some other times, but I don't need to test them every commit. That's a lot of resources and it's a lot of wasted time because that doesn't change that often.
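In test form, that might look roughly like the sketch below. Every name here (WorkItem, plan_cloud_calls, the fields) is an invented stand-in, not Aiven's actual code; the point is the shape: fake up a work item and assert on the cloud calls the component would make, with no queue and no cloud provider involved.

```python
from dataclasses import dataclass

@dataclass
class WorkItem:                     # stand-in for the real work-queue item
    machine_id: str
    extra_disk_gb: int

def plan_cloud_calls(item: WorkItem) -> list[dict]:
    """Pure function: turn a work item into the cloud API calls we would issue."""
    return [{"action": "attach_volume",
             "machine": item.machine_id,
             "size_gb": item.extra_disk_gb}]

def test_work_item_produces_attach_volume_call():
    item = WorkItem(machine_id="vm-123", extra_disk_gb=50)  # faked-up work item
    assert plan_cloud_calls(item) == [
        {"action": "attach_volume", "machine": "vm-123", "size_gb": 50}
    ]
```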
30:22
For the extending logical volumes, that one's a bit trickier because it's not obvious how to test that in memory, but we're working on it. These are some of the challenges that we face. But still, by dividing things like this up, we actually are able to get this into tests that can work at a CI level, and then we can take this whole test case
30:41
and we can keep running it on a daily basis or whatever and get more information out of it. But we're not blocking developers anymore just because the only way we knew to test this was to string it all together and see if it works. We actually have been able to divide this meaningfully into small pieces and assert that each piece works as it's supposed to.
31:02
There's more to quality than CI. I think a lot of people, when you start doing CI systems, you think that all my tests need to run all the time in CI systems. And once you realize that you don't have to stop testing when you're done with CI, you can test later. You can have cron jobs that run tests one at a time
31:22
and generate bugs for people, that's perfectly fine. You need a different, it's a different feedback mechanism. It's this test fails, create a ticket in your work tracking system and pass that to the developers in charge of that thing. It's a longer feedback mechanism and you need to make sure
31:40
you actually exercise those tickets. Otherwise, the whole thing falls apart. But it means that you get those long, expensive tests out of the way of your developer's day-to-day work. And you can use the longer running tests as a feedback mechanism for your shorter feedback mechanism. Other places for testing besides CI
32:01
are things like staging environments. You can set up a separate environment that's very, very, very close to your production environment and run continuous tests on that. Like I said, you can do canary deployments, which is deploying to just little parts of your system. Or what some people are doing more and more these days is something called multi-tenancy in production. And this is just using your production system,
32:21
but using data isolation, so you have a test tenant that owns this data and you can handle building and testing things through that. So you're just using your production system, and you use your production monitoring tools and things to monitor it and see if things work.
32:40
That's actually a really cool way to exercise your automated rollout and deployment mechanisms as well. So it's advanced, but it's a very powerful tool. One of the cool things about being in software and software as a service is that you get to test in production. In the old days when we shipped software and it was like, it went to the client
33:01
and then you had to configure it and then you had to leave them alone. You didn't get to do these things because once it was there, if there was a bug, it's a long process to fix it. But now that we actually manage our own systems, we can do this much easier. Like I mentioned before, many times canary deployments are really cool.
33:21
If you have enough traffic, you can benefit from scale, and 10% of your customer base will be enough to find a lot of things that you didn't find in CI. And usually, if the blast radius on your changes is small enough, it's fine. You can either just roll back or you can fix forward. You can make the choices you see fit.
33:42
This effectively replaces traditional QA. You just use your customers as your QA and you have to have really good controls in your system to do this properly, but it's a great way to save money and keep moving forward in confidence. If there's anything that I want you to take away about what CI is about, it's about these simple,
34:02
being able to answer these simple questions with confidence. Can I merge? Can I deploy? And are my gates doing their job of giving me this confidence? If they're not, if you have a pass in CI and you feel uncomfortable about merging, then your CI is not doing its job right. If you constantly don't know when you should deploy
34:22
or what you should deploy, then your CD is not doing its job right. So lastly, I want to talk of just an overview of a few tools that can allow you to be a little bit more effective in this. Nothing huge here, but just some options that you may want to explore.
34:41
Just as a note before all of this, the kind of rapid release ability and automatic deployment, automatic rollback mechanisms that we were talking about, this just takes hard work and it usually takes designing for it specifically. Retrofitting that into a system that wasn't built for it can be quite a daunting task that takes coordination from multiple parts of your organization.
35:02
So I want to convince you that this pays off in a huge way by not wasting your developer's time, by giving you a lot of confidence and not letting you fear things that happen in production, but it's not going to happen overnight and you need to convince people in your org that this is worth it. That being said, there's some of these practices that can really help along the way.
35:21
I'm a big fan of trunk-based development, which is often opposed to, say, feature branch development, where, in a feature branch, normally you're going to keep your feature going until you think it's good enough to merge and then merge it. To me, that's the opposite of what continuous integration means. Continuous integration means constantly integrating your code back into things as you can.
35:45
This is made a little bit easier by things like the feature flags I mentioned before, where you can start implementing a change, hide it behind a flag, and then keep implementing it and getting it better until you're ready to turn that flag on and say, okay, now this is ready to go. Feature flags can be as simple as configuration files.
36:03
You can just mock up a JSON file, say featureflags.json. You can put them in a database, or there are full-fledged feature flag frameworks out there that companies with edge devices will use to decide which devices actually run which lines of code.
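A sketch of the "as simple as a JSON file" version (featureflags.json and the flag name are just examples, not a real framework):

```python
import json

# featureflags.json might contain: {"new_search": true, "dark_launched_thing": false}
with open("featureflags.json") as f:
    FLAGS = json.load(f)

def is_enabled(name: str) -> bool:
    # Unknown flags default to off, so code can merge and ship dark before launch.
    return bool(FLAGS.get(name, False))

if is_enabled("dark_launched_thing"):
    print("running the new code path")   # shipped, but only where the flag is on
else:
    print("running the old code path")
```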
36:20
You don't need that to make this work. You can start as simple as a JSON file, like I said, but it's a really powerful way to keep merging, keep integrating without breaking things all the time. You'll also want to have smaller pipelines. If there's anything you take away from this, it's that one of the ways you can manage large systems
36:41
is by chunking them into smaller pieces. There's not really another good way to do it. So the easiest way to start making smaller pipelines is if you have something that's well isolated, just split it off into a smaller project. I know not every company wants to do this all the time, but it can work really well. You can also do things that detect changes by location.
37:01
Most CI systems offer a way to say, okay, the changes were in this group of files, run this pipeline. The changes were in this group of files, run this pipeline. Those are cool. Those are pretty simple. The problem with them is that you have to end up managing the interdependencies between those pipelines, which itself can be kind of complicated. If you have a lot of complex interdependencies, there's mono repo tooling,
37:21
which can do this in a very, very, very sophisticated way where you can say, okay, this part of my code base changed. That means I need to run this test and all of its transitive dependencies, because this depends on this, which depends on this, which depends on this. That's a sophisticated tool. Check out monorepo.tools to learn more.
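The "changes were in this group of files, run this pipeline" idea can start as small as a path-matching table (the patterns and pipeline names below are invented); the monorepo tools he mentions do the same thing plus the transitive-dependency bookkeeping.

```python
from fnmatch import fnmatch

# Invented mapping from file patterns to the pipeline that should run.
PIPELINES = {
    "services/billing/*": "billing-pipeline",
    "services/search/*":  "search-pipeline",
    "libs/common/*":      "run-everything",   # shared code: trigger all pipelines
}

def pipelines_for(changed_files: list[str]) -> set[str]:
    return {pipeline
            for path in changed_files
            for pattern, pipeline in PIPELINES.items()
            if fnmatch(path, pattern)}

print(pipelines_for(["services/search/query.py"]))  # {'search-pipeline'}
```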
37:42
If you're interested in this, you'll need to select one that's right for your language and your deployment environment. These are very complex tools that make a lot of kind of crazy assumptions about how you work. So this is a solution that you should, if you choose, you should choose it with care, but it can help solve some of these problems
38:02
if you already have a complex interdependencies. Observability is a real key issue here as well. When I know observability is another buzzword and maybe it gets overused as a marketing term, but it's actually a really cool as a developer. If you're able to, if you can meet that 30 minutes to production,
38:24
target time, that means you can merge your change and you can watch it go out in production on your observability systems and see, hey, did things go wrong? Are things going right? You can do your own like basic acceptance testing by just watching the graphs or watching the monitoring systems. It can also allow you to have things
38:41
like automated rollbacks if certain thresholds are met or whatever. So we'll talk with your SRE teams, talk with your DevOps teams about this. As I mentioned before, none of this works without heavy investment in deployment automation. If there's any part of your deployment that requires a manual step other than just pushing the button to do it, then you're doing it wrong.
39:02
It absolutely all needs to be automated. It needs to work really well all the time. And there's no better way to make sure it works well all the time than doing it really often. Likewise, rollback automation is the key to giving you real confidence to deploy. If you can roll back a change, you have no reason to fear making the change in the first place.
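A toy version of the "automated rollback if certain thresholds are met" idea; every function and number here is a placeholder for whatever your deployment tooling and observability stack actually provide.

```python
import time

ERROR_RATE_THRESHOLD = 0.05      # roll back if the canary slice exceeds 5% errors
OBSERVATION_WINDOW_S = 15 * 60   # watch the canary for ~15 minutes

def canary_error_rate() -> float:
    """Placeholder: would query your monitoring system for the canary slice."""
    return 0.01

def roll_back_canary() -> None:
    print("rolling the canary back to the previous version")

def promote_canary() -> None:
    print("promoting the canary to the rest of production")

def watch_canary() -> None:
    deadline = time.time() + OBSERVATION_WINDOW_S
    while time.time() < deadline:
        if canary_error_rate() > ERROR_RATE_THRESHOLD:
            roll_back_canary()
            return
        time.sleep(30)
    promote_canary()
```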
39:21
This is pretty easy for software generally. It's a little bit harder when you start talking about database schemas and things like that. So I'm not gonna say that's gonna happen overnight for every change. But the Googles, the Facebooks, the GitHubs of the world make heavy investments in this kind of thing. And it's because it gives their developers a lot of freedom to make rapid changes all the time.
39:45
If you are going into code that you don't trust and you wanna start adding these kinds of features to it and adding these kinds of tests, you may struggle with this particular conundrum which I have many times of when you approach some old code that you wanna make changes to
40:00
because you need to add the ability to release rapidly. You look at it and say, oh, I need to refactor this. I don't know what's going on here. You can't refactor without tests, but I can't really write tests on this code without refactoring it. Or maybe I can hack in some mocks or something. But that's a hard situation to be in.
40:20
But there are tools available. These are a couple of books. These are old books that talk mostly about Java and C++. Even if you don't work in those languages, Working Effectively with Legacy Code especially has really revolutionized the way I look at legacy code. Like, I can go in, I can be confident and figure out ways to actually test things.
40:42
There's a lot of tricks in there that are really practical. And some of it's gonna depend on the language you're using, but I definitely recommend those things. For new code, I'm a big fan of test-driven development. A lot of people don't like this because it has a terrible name. Really, I should be calling it something like example-guided development or something else.
41:02
It's not a testing methodology. It's a way of writing code that produces testable code. I actually don't write that many tests when I do TDD. I just, I write enough tests to know that my code is testable. And then if things fail later, I can always add more tests. And that's what's really great about it is I get to solve that problem of, oh, if I see a problem in production,
41:21
I wanna add a fast test. The spot is there and ready to accept it. I'll skip this for now. We're kind of running out of time. This is just a good movie, I couldn't find a better picture for this, but another really important thing is you'll sometimes have systems that you cannot really mock out.
41:41
So you'll need to invest in having quality fakes. Okay, we're kind of running out of time, so I wanna skip down to the last thing. The question I wanna ask you is: can the developers at your company accomplish things in your org that they couldn't accomplish other places? Do your processes enable them to move faster,
42:02
to build more than they could if they were somewhere else? If not, then maybe you need to be investing more in these processes and in these tools. My name is Josh Reed. You can find me on Twitter where I do nothing, but maybe if you DM me or something, I'll respond. I don't know.
42:20
LinkedIn, that's my contact information. I work in release engineering at aiven.io. We don't sell a CI/CD product, but we do sell some cool database stuff and data streaming technologies. If you're tired of managing that by hand, maybe you can free up some time to focus on your processes and less time focusing on your databases. Thank you.