
Solid Snakes


Formal Metadata

Title
Solid Snakes
Title of Series
EuroPython 2017
Number of Parts
160
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

Abstract
Solid Snakes [EuroPython 2017 - Talk - 2017-07-11 - Anfiteatro 1] [Rimini, Italy] No matter whether you run a web app, search for gravitational waves, or maintain a backup script: being responsible for a piece of software or infrastructure means that you either get a pager right away, or that you get angry calls from people affected by outages. Being paged at 4am in everyday life is bad enough. Having to fix problems from hotel rooms while your travel buddies go for brunch is even worse. And while incidents can't be prevented completely, there are ways to make your systems more reliable and minimize the need for (your!) manual intervention. This talk will help you to get calm nights and relaxing vacations by teaching you some of them.
Transcript: English (auto-generated)
All right, hi everyone. So first of all, I'm not a fan of Metal Gear Solid, I'm just a fan of cultural references. Yeah, as I said, I am Hynek, and I'm here to talk about last year,
about my last year, because after I've wrapped up PyCon US, which was in Portland, which is the Pacific Northwest of the US, I took advantage of being that far away from home, and took the classic road trip which people just do, usually in white cabriolets. I went from San Francisco down to LA, up to Las Vegas, through the mountains,
back to San Francisco, and a small detour to Hawaii. Now, this little tour, plus PyCon, took me more than five weeks, which means that I've been more than five weeks absent from work. And with every photo I've posted, people kept asking me how I do it,
and whether I still have a job. Now, it turns out I still do have a job, last time I checked. And in that job, I'm responsible for almost 70 projects. And it still worked out so well that I'm even gonna take seven weeks this year,
when I'm gonna spend a month in Cape Town, South Africa. So, let's see if that's gonna be another talk. But it also turns out that the answer is a lot longer than 140 characters. So that's how, here at the airport of Kauai, with a seven-dollar plastic cup of beer,
this talk was conceived. And I'm here to tell you my story, or more like the bits of it that fit into 45 minutes and that I think are vital for my sleep. And I'm gonna touch on a lot of topics. It's gonna range from DevOps-y things, which people usually expect from me, but I will also talk about general software engineering.
So there should be something in there for everyone. And each topic by itself is a talk of its own. So, as usual, there will be a link at the end to a page with all the links for you to study, if you wanna know more about a certain sub-topic. Now, with that said, let's get going.
Like every big change, it all starts with your attitude. You have to start prioritizing quality, which unfortunately means that you will spend a significant time on non-features. And that brings us directly to incentives.
Because nowadays, everything is top priority. If you ask your manager if quality is a priority, they will probably tell you that it's the biggest priority ever. However, how are the incentives? If you don't have time to write tests, to instrument your systems, it's kind of like a construction company that claims that safety is number one priority,
but forces their workers to buy their safety gear out of their own pocket. And it's true that reliable systems do take time. And if your only performance metric is to ship new features, you may have a conflict of interest here. However, the good news is that there is a solid business case to be made for spending time on quality and reliability.
And that is that the fewer urgent interruptions you have because something is on fire, the more you get to work on important stuff. Or if you put it the other way around, if you build a lot of unreliable systems,
unreliable programs, apps, whatever, there is an inflection point where you just stop being productive, because you and your coworkers are busy fighting fires all day and you do not get to ship a new feature at all. And we've seen it with a lot of startups that have a very fast trajectory and then for like two years nothing happens.
Well, this is what happens. And this is also how I ended up with 70 projects: because I'm at a very small company, I cannot just tell other people to fix my crap, so I'm forced to build quality stuff, and it just accumulates. This of course takes some long-term thinking.
Which is sadly not too common in our line of work. And unfortunately that is the part I cannot help you with. Yes, but yeah, let's get more practical. So the first thing: I'm gonna defer to authority, and to no one less than Tony Hoare, who gave us Quicksort
and was quite accomplished with all things concurrent systems. And what he says is that the price of reliability is the pursuit of the utmost simplicity. And if you prefer a Dutch sage, which at a Python conference is completely reasonable, then Dijkstra says basically the same.
So you need simplicity, otherwise there will be no reliability. And at this point it's important to stress that you must not conflate simple and easy. Because easy solutions usually are not simple. Easy solution is usually a hack.
And simple solutions, like solutions that will serve you well in the long term, are usually not easy to find. And I find a good way to approach simplicity is from the other side. So let's talk about complexity for a moment. So complexity in software is something I would say
is the number of concepts you have to keep in your mind when you are trying to reason about the behavior of a system, of a program, of a piece of something. And especially how many things you have to keep in mind when you're trying to reason what happens if you change a little corner of a system.
And humans are naturally limited in the number of things we can juggle, both literally and figuratively. And if you have too many things going on, too many balls in the air, you start dropping things. And that's when normal accidents happen. This term has been coined by Charles Perrow
in the wake of the Three Mile Island incident, which was a nuclear power plant incident in the United States. And it's called normal because they are inevitable in extremely complex and tightly coupled systems. Now, if your program is a contraption like this,
it's impossible to reason about changes. What will happen if you play with this knob? It's impossible to tell. And what breaks if the part with the knob breaks? I would say it's impossible to say, but actually it's very easy to tell.
The whole thing is going down in flames. That's the problem of having tightly coupled complex systems. Now, the irony is that if you try to make this thing safer, you are adding more complexity. What does it mean? The system's more likely to fail.
So you cannot fix a system or a program that looks like this. You can only rewrite it from scratch. Now, when we are talking about complexity, we have to differentiate between two types. The first one is the essential complexity, which is the good complexity. That's the complexity your customer or your boss is paying for you to solve.
It's inherent to the problem you're solving. The other one is accidental complexity. This one comes from using wrong abstractions, having cumbersome deployment procedures that stop you from deploying, or just using ancient and inadequate tools like Python 2.
Yeah, I cannot drop this mic, so. So, obviously, only in a perfect world can you work solely on essential problems. But still, you should always keep in mind: what problem are you solving here?
Is this essential or is this accidental? What is your ratio between those two? Now, what is simple software? And that, of course, is a talk by itself, if not a conference by itself. But given we are in Italy, I really like the ravioli metaphor. Now, given that normal accidents happen in tightly coupled and complex systems,
it follows that you should prefer to have simple objects that are self-contained and that have simple relationships, just like ravioli. Because if you look at them, they are small, they are self-contained, nothing is leaking out unless you overcook them.
And I heard in Italy, there's jail time for overcooking pasta. Yes, yes, yes, that's an adequate thing to do. Now, in other words, your objects, your functions should do as little as possible. But more importantly, they should know as little as possible
about other objects, other functions. And this gives them a clear interface, which you can program against. And having few assumptions helps you to have simple relationships. Because as many object-oriented mentors have said before, dependencies will kill you. And by dependencies, I do not mean things you install from PyPI.
Donald Stufft is doing great work. Install as much as you want. But dependencies between your objects. And especially, I really like to not have bi-directional relationships, which are almost always bad. And I like to think of an object graph like a family tree.
If you have loops in your family tree, it's a bad sign, right? So, think about your object graph similarly. Now, all of this, if you follow that, gives you system components that are really easy to test. Because you have clear interfaces that you have to either fake out or mock or whatever you're doing.
And they have few dependencies. This is good. Now, bad design, staying on the pasta theme, is the opposite. It's big classes, also known as God objects. And a God object is an object that knows too much and does too much. And those are obviously very hard to test.
Because just to instantiate such an object, you need a dozen other objects. And you end up fighting accidental complexity really, really fast. The sad thing though is that God objects are pretty common in the Python ecosystem.
And the reason for that is that writing classes in Python is a bit annoying. But I will touch on that later. Another mostly self-inflicted complexity is basing your design around subclassing. Or as Corey puts it, you will regret it.
And I'm not here to hate on subclassing per se, although I think it's well known that I personally try to avoid it. But subclassing is subject to a lot of misuse. So if you think back, subclassing was invented by its inventors for specialization, not code reuse.
If you're doing subclassing for code reuse, you're kind of making the inventors of subclassing sad. So don't do it. And there's a bunch of rules around it, like the Liskov substitution principle,
the open-closed principle, all these things. And if everyone followed those rules, there wouldn't be any regret Corey's talking about here. And it takes actually a lot of experience in design and modeling to do subclassing really well. Now, the thing is, subclassing makes your software
always more complex to understand. It's easier to write. It's less typing, no question there. But it is harder to understand because you end up with namespace confusion. So where is this attribute coming from? Can I name my attribute like this? Or is there some attribute up there in an inheritance tree that has the same name
and everything will break? You have to understand the MRO. Your logic is not only distributed in one level, like one method calls the other. No, you have to traverse hierarchies to understand where your calls are going through. So this is all not great.
But it is about making trade-offs. So I'm not saying you must never subclass, but you should be aware that you're doing a trade-off here between easiness and complexity. And more importantly, this is not where you should start. So subclassing may emerge as part of a design
out of practical reasons, but it shouldn't be the thing you build everything on. So don't start projects by inventing hierarchies. And as a reminder, many new and modern languages that came up in the past years, like Go or Rust, are doing just fine without having any subclassing at all.
And I'm gonna say that Swift only has subclassing because Apple needed to ensure compatibility with Objective-C. So relatedly, metaclasses. Very powerful, often overused. They're a bit of Python's monads because people tend to read into them, then they find them interesting,
and then they write a blog post explaining to everyone how great and simple it is. But it isn't. You're again paying with complexity for mostly syntactic sugar. So leave them to David Beazley. He will do something nice and depraved in his evil lair. And everyone is happy.
So just to be clear, I'm not talking about abstract base classes here, which are a metaclass and subclassing. It's just the way they are used. So that's something different. Although I personally prefer the approach of zope.interface, which uses class decorators for it. But yeah, ABCs are not what I meant here. Now, I mentioned that writing classes is tedious in Python.
So I decided to do something about that. And I wrote attrs. How many of you have heard of attrs? Ah, okay. It's getting better by the year. So those whose arms stayed low
are probably restless in their seats, dying to tell me about namedtuples, right? So it turns out I do know about namedtuples. And maybe I know too much about namedtuples. And it's also such a thing that people discover them and then start tweeting that they're the best thing since sliced bread.
So first of all, that's a low bar because carbs are really bad for you. But other than that, namedtuples have a history. At least that's what I've been told by some ancient Python sages. They were made for the standard library,
for functions that return a tuple, so they can attach names to the tuple and it's more readable. It makes sense, right? Namedtuple, a tuple with names. And they're great for that. But they are terrible as a class replacement, because then you end up with the tuple type in your inheritance tree,
which means that you have very odd rules for equality. You have accidental iteration. You can accidentally unpack your class, which gives you very weird errors and, yeah, bugs which don't make any sense. Or you want to attach a method to a namedtuple,
well, you have to subclass it. You want to influence the initialization, have fun implementing dunder new. But people do it because it's convenient. That's why I wrote attrs. Because namedtuples, if you ever wondered when to use them, it's very simple:
you have a tuple and you want to attach names to it, you use a namedtuple. Other than that, write a class, or use attrs, which will write the class for you. And it has many more goodies like validators, converters, and default values, including factories. So for the few poor souls who didn't know about it, check it out.
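A minimal sketch of what that looks like with attrs; the class and field names are just an illustration, not taken from the talk's slides:

```python
import attr

@attr.s
class Connection:
    host = attr.ib(validator=attr.validators.instance_of(str))
    port = attr.ib(default=443)
    tags = attr.ib(default=attr.Factory(list))  # a fresh list per instance

# __init__, __repr__, and __eq__ are generated for you:
print(Connection(host="example.com"))
# Connection(host='example.com', port=443, tags=[])
```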
And you may be asking, is this a serious project you should put into production? And I'm so glad you asked. Yes, it is. I have stickers, get them. Now, moving on: operational complexity. Which is the complexity of running your infrastructure
or parts of infrastructure. And I think it's not controversial to say that distributed systems are hard, right? Now let's look at a very, very simple one, which many wouldn't even consider a distributed system. We have a client that speaks to a content delivery network, which is not part of your infrastructure.
We have an application that has a work queue, Celery for example, a database, and a Redis cache. So far so standard, many run something like this, I run something like that. Now, the problem here is that every box you have there is a point of failure. If any box goes down, the whole thing is down.
But wait, there's more. Every arrow between those boxes is a point of failure too. So in this case, you come up with 10 independent points of failure. Which, if you know something about probabilities, almost add up. And the reason is simply that network partitions are a thing.
You may live in denial because it never happened to you before, but it will hit you too eventually. So it's better to accept your fate and learn from others. They are kind of rare if you don't run at a huge scale, but especially if you run at scale, rare occurrences are kind of common.
Now, all I'm saying is think twice before you add more boxes and more links between those because each of them introduces new exciting ways to fail. And with that in mind, let's talk about microservices. Microservices is about splitting up an application
in many small web services. There are many great reasons to do that, but you end up with many more red boxes and many more red arrows between them. So your big monolith that has certain annoying properties we all know about ends up being a highly distributed system,
which you may or may not be equipped to deal with well. Because now you have new decisions to make. For example, are you gonna go for this tangled mess, which is so tangled that I was too lazy to make this slide myself and took it from Andrew Godwin? Or are you gonna use a message bus? What message bus?
And all the new problems that come with it, like service discovery, you cannot really run this without service discovery. Which service discovery? Aggregate logging. You cannot just look at the files anymore. Tracing. You shouldn't even think about microservices without having a distributed tracing in place. Which, by the way, I heard there will be a talk
later today about it, so you should check it out. And of course, all the other fallacies of distributed systems that we like to forget and that come back and bite our butts later. Now what you really want and what you really need are boundaries. You want to define and adhere between your boundaries,
between your modules and your packages. That's what you really want. You do not need a network between your classes, okay? And then you can have also separate teams working on separate parts if you have clear interfaces. And once you have that, it's very easy to graduate into microservices
because the boundary's already there. Now you just put a network between them. And there is a place for microservices if you need to scale out because, as we know, we have a thing called the GIL in Python which is kind of annoying. Or if you live in a heterogeneous environment like I do. I need to interface PHP and Perl, so web services it is.
Now, having said that, I need to stress that complexity is not the devil by itself. But it's a price you pay to get things done. So you should rather consider it like a currency. But you have to be conscious of your budget. Because the budget depends entirely on you,
on the time you can invest in running something. And the human resources you can throw at this problem. And the price can be quite high. So if you take, for example, Kubernetes, which is an amazing product, but it's super complex to run. It's easy to set up.
I've set it up like five times by now and it's really, really easy. But do you really know how to keep it running properly? With all its etcds, Flannels, Prometheus, Docker: each of these parts is pretty complex by itself. So if you can afford it using your complexity budget,
great, get it, run it, put some people on it. But if you can't, if you put tech into production that you don't master because it looks cool, Dante has some opinions and predictions about your immediate future.
Now, speaking of stupidity, this is obviously a hyperbole, because things only seem stupid in hindsight. But people tend to act stupid. I tend to act stupid. Just today I forgot my wallet.
But you act stupid for good reasons too. For example, if you're sleep deprived because you just came over the Atlantic and are jet lagged, or the baby cried all night, or you're at a busy airport on your way to a conference and you need to quickly deploy a hot fix, or you are hustling really hard to satisfy some well-meaning venture capitalist
that has the best intentions for you and your future. Now, I have opinions on that, and my opinions are the same as those of John Allspaw, the CTO of Etsy. I don't believe in human errors. I believe that if a human causes an outage,
it's a system that failed them. And if you remember the big S3 outage earlier this year, there was a great postmortem written by AWS and you should really read it. Like any postmortem is worth reading, but this one is also very good. And it turns out that a human used a tool the wrong way
and it took down half of the internet. Now, the postmortem doesn't focus on that. It focuses on the tool and how to prevent this from happening again. Because people make mistakes and it's your job to prevent them from doing that. You need to protect them. Now, when building tools and APIs,
you should always assume that the operator is drugged up from the dentist, consoling a crying baby, or just sitting in a boring conference talk. There are many reasons to be distracted from the thing you do and still need to do it. So take it into account. It's kind of like products
for people with physical disabilities. It always turns out that those products are better for everyone. And it's true with this just as well. And if it takes just one click or one API call to lose all your data, someone will make that click and will make an API call. Maybe they were just cleaning their phone and just hit it by accident, but they will do it.
There was also the story of a junior developer who accidentally dropped the database of some startup, and he got fired and the CTO was threatening him with legal action and everything. And it turned out that setting up a development environment
consisted of running a bunch of commands you had to copy and paste from some document, replacing passwords and usernames. And if you have sharp edges like this, it's your fault if people cut themselves, or cut your company. So in this case, they should have fired
the CTO right away. But yeah, you know how it is. Anyhow, part of that is how you handle input. And it doesn't matter if the input is maliciously broken or by mistake, you still have to be careful
about what you let in. And I fundamentally disagree with Postel's law here. I think you should be conservative in both what you send and what you receive. And this, by the way, was neither malicious nor a mistake. It was literally the only thing I liked about San Francisco.
Now, invalid data. I like to think of it as a time bomb. And I'm talking about things like an invalid date string, for example. If you have something like that and you let it into your system, it starts wandering through your system. And the deeper it gets, the more damage it will cause.
And the sooner you catch it, the better your response can be. So just for an example, you let the user give you a date string. If you catch it ideally in the browser, you can tell the user: okay, this is wrong, you have to do it like this. If it reaches your ORM,
you will get an integrity error and the user gets a 500. That's not good. Other edges are, of course, your dunder init or command line parsing. So validate your data at the edges. Although, I think validation alone is not enough. I think you should always strive
for a simple canonical form of your data that the rest of your system can rely on, because your business code should not even know what JSON is. It should just take dicts or other native Python data types, or if you know the structure of the data, it should actually use a class.
And I find the decision between classes and dictionaries actually pretty simple. If you don't ever iterate over the keys of the dictionary, you probably should use a class, because a class is much better at catching typos and other little accidents while developing, while dictionaries sort of make it work
and just shadow some mistakes for you. Now, passing strings around in general is also not great, because if a function takes a string, it becomes a parser. It might be a simple parser, but it's still a parser,
and parsers are not that simple, and if they screw up, it can be really bad. So I don't like strings in APIs at all. And if you need a symbol or something, in Python 3.4 and later we have enums, and there's also a PyPI package that will give you the same classes, and they are great.
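A small sketch of what enums buy you over bare strings; the Color example is made up for illustration:

```python
import enum

class Color(enum.Enum):
    RED = "red"
    GREEN = "green"

def paint(color):
    # Typos fail loudly at the edge instead of wandering through the system:
    color = Color(color)   # Color("rde") raises ValueError immediately
    ...

paint("red")               # fine
paint(Color.GREEN)         # also fine: looking up a member returns the member
```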
Now, I could keep talking and talking. However, given that complexity leads to normal accidents, and computers are kind of complex, right? And distributed systems are kind of complex to the second power.
It kind of follows that failure is inevitable. Everything I told you before is about minimizing risks and you should do it, but eventually something will fail. So if you work in computers, failures are part of your life whether you like it or not. So in practice, your reliability will land on a spectrum. Somewhere between Twitter in 2007,
which had 98% uptime, which means a downtime of six days and NASA in 1969, who landed on the moon despite a human error on descent, but the software was robust enough to compensate for the stupid mistake of the astronaut.
And I think we can argue that astronauts generally are not stupid and yet they make mistakes. Now the problem with NASA's reliability is that you need an actual genius to write your software. A genius that will just invent software engineering on the side while writing the moon landing software.
No big deal. So you might have to scale down your expectations unless you have Ms. Hamilton on staff. In that case, I would like to apply for an internship. So failure is inevitable. Inevitable. All you can do is minimize risks. You can prepare for it to happen and then you have to deal with it.
Don't ask me how long it took me to get this done. So and I'm gonna give the rest of the talk away right now. A long vacation is the result of good failure containment and solid recovery. And let's shift gears here and talk about that.
So you've embraced failure. How do you expect failure? Well, if you need a system or a program to do anything reliably, then you need to monitor it. It's that simple. Because for all you know, it's down. If you need to have confidence in it working,
you have to monitor it. And so you need to check for outages, you need to instrument it, you need to instrument your system, and you need solid error reporting. Because a silent failure is terrible. I've talked about these topics in the past two years. My taste didn't change at all. I still love Prometheus for instrumentation and monitoring, and I still love Sentry
for error reporting. Full disclosure, David Cramer may or may not have a video of me singing Beyoncé in a karaoke bar in Bilbao. But both are open source. Sentry has reasonably priced paid plans which you can scale into or out of, so there's no lock-in either way.
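As a rough sketch of what instrumentation with the prometheus_client library looks like; the metric names and the handler are made up:

```python
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("myapp_requests_total", "Handled requests", ["status"])
LATENCY = Histogram("myapp_request_seconds", "Request latency in seconds")

@LATENCY.time()                          # observe how long each call takes
def handle(request):
    ...
    REQUESTS.labels(status="200").inc()  # count what happened, by status

start_http_server(8000)                  # expose /metrics for Prometheus to scrape
```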
Now, your code. How do you expect failures in your code? Well, if it's local, it's simple. You just try/except and that's it. Remotely, you're lucky if it's that simple. Because instant errors are rare and fortunate,
like connection refused or a 500 or something like that, where you know immediately, okay, something's going wrong. The first failure scenario is that nothing happens. So whatever you do remotely, you have to put a timeout on it. And one missing timeout in a database driver was enough to ground an entire American airline once.
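A small sketch with the requests library; the URL and the exact numbers are placeholders:

```python
import requests

# requests has no timeout by default, so a hung server can block this call forever.
try:
    resp = requests.get(
        "https://api.example.com/health",
        timeout=(3.05, 10),   # (connect timeout, read timeout) in seconds
    )
    resp.raise_for_status()
except requests.exceptions.Timeout:
    ...                       # count the failure; maybe feed a circuit breaker (see below)
```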
So always put timeouts on it. Now, if you have a timeout, what do you do? You've just hit a timeout. Do you carry on like nothing happens? That gives you very slow, very useless requests.
It's a bad experience for users because they have to wait 30 seconds to get a 500. And it's putting more load on a system that might already be overloaded. And for that, there's the circuit breaker pattern. How many of you have heard of it? Okay, I'm gonna make it really quick. It's very simple, very useful. So it's kind of a local proxy
between you and a remote system. And in a normal state, it's closed like a drawbridge. Okay, so the requests go to the remote API, the results go back, simple. Now, if something fails, it doesn't have to be a timeout, but timeout is the best example for that. And it fails more than once
because in distributed systems, let's face it, things fail once all the time. So if you have a certain amount of failures, it will switch from closed to open and the requests have to wait like at a drawbridge. At this point, it starts ignoring the remote system and it's just giving errors back to the client very, very fast.
So after a certain time, it will send out a probe, one lucky request from the client to check whether the remote is still broken. And if it succeeds, great. The drawbridge is closed again. Everything is like before. If it fails, well, it stays open
and we will try again later. This is very simple, very effective.
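A minimal, hedged sketch of that pattern; the thresholds and names are arbitrary, not from the talk:

```python
import time

class CircuitBreaker:
    """Trip open after a few consecutive failures, fail fast while open,
    and let a single probe request through once a cooldown has passed."""

    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means closed: traffic flows normally

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open -- failing fast")
            # Cooldown elapsed: fall through and let this one probe request pass.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        self.opened_at = None
        return result
```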
Now, I've said that adding more components to a system is usually bad. However, if it's the same component, and if you hide it behind another component that is way more reliable than your crappy code, then it might become something good. So for example, if you take something like HAProxy, which, and I don't say this lightly, is very good software. It's a piece of software that never let me down. So if you put multiple instances of your app behind HAProxy, your reliability is actually going up.
And in the military, they have a saying that two is one, and one is none. Meaning you should have at least two of everything. And if something is really, really important, you should have three of them. So you never even get into the situation that you have only one of something. Now, this principle made the internet almost unbreakable,
unless everyone uses the same DNS provider, or unless everyone uses the same storage system in the same region, including the monitoring panel of said storage service. And this works at any level of the network. You should have more than one switch, more than one uplink.
You should have more than one server. You should have more than one data center, that's like leveling up the game. Not everyone needs that, but you certainly should have more than one backup, because if this is all it takes for you to lose your data, you do not have backups.
And also if you do not test your backups, you also do not have backups. Ask GitLab, they have a story to tell. Now, if you want to be dispensable, you must not be a knowledge silo. So if people ask you something regularly, write it down.
If you have regular tasks and standard procedures that cannot be automated, write them down. And if you have something that needs to be done in case of emergency, write it down. Thinking that you will remember it when you need it is a very dangerous fallacy. And it has been a staple of aviation security
to have a checklist for everything. Like literally everything, ask any pilot. And having a contingency plan you can stick to if shit is hitting the fan is priceless. You can believe me. And this can be also communication. It doesn't have to be something you do.
It can be something you tell someone like your social media team. So they don't lie to your customers that everything is dandy and they should start to reboot their computers or whatever. So, finally, how to deal with a failure in your application. Well, your database is gone. A web service is returning invalid data
or the infamous timeout. So first, failure containment. Don't make it worse. Because cascading failures can happen and that means that everything just melts under you, even completely unrelated systems.
And in relation to that, let's talk about something very simple with a very big impact. And that is retries. Retries are essential in distributed systems. Because as you've seen before, transient errors are completely common. So you have to take them into account. But they're also very, very dangerous. Because if you use them loosely, you can DoS yourself.
Or even worse, you can DoS a third party. Which can mean that you land on a blacklist or even get into legal problems. So, what do you do? You back off. But how long do you back off? If you just ran into a deployment
or a packet got lost or whatever, one second is fine. If a system is overloaded and you're just waiting for your scheduler to provision more servers, one minute may be adequate. And if there's like a hardware failure, maybe a switch and some poor schmuck has to go into the DC and replace hardware, well, every five minutes just check
if he or she has done it, and try again. The trick is to do all of it. You start with a very short period which you then raise exponentially. And this is pretty good. But there's still a problem. Because if all your services do it at once, you still have a lot of load at the same moment.
And that's where jitter comes in. Jitter is just a random number that you add to the computed exponential back-off. And it will space out the retries between your systems so you don't get this thundering herd.
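A sketch of that combination, capped exponential back-off with full jitter; all the numbers are arbitrary:

```python
import random
import time

def call_with_retries(func, max_tries=5, base=0.5, cap=30.0):
    for attempt in range(max_tries):
        try:
            return func()
        except Exception:
            if attempt == max_tries - 1:
                raise                                # out of retries: give up loudly
            ceiling = min(cap, base * 2 ** attempt)  # exponential back-off, capped
            time.sleep(random.uniform(0, ceiling))   # jitter spaces the retries out
```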
Okay, so this is intuitive, I think. But let's dig deeper. Let's say you have a policy of three retries per client click, whatever. And let's assume that the backend is your responsibility. And because you read the orange news page, you want microservices. So let's do it. So at this point, one request in the front end,
like one click, means nine hits to backend A in the first case. Because three times three. But we are not even close to microservices. Let's add one more layer and you are DoSing yourself again. This is called the combinatorial retry explosion.
And one user click means that C gets hit with three times three times three, 27 requests. Now imagine if C was slightly flaky because it was overloaded before, now it's toast. And the simplest solution, the one I use, is that you retry only at the top, like in the client.
But this only works if you know where your top is, which may or may not be true. And then you have to do something more complicated like per request retry budgets. But then you have a state that has to wander around with your requests. That's not that simple anymore. And this is a good example for complexity
through microservices, because such a problem just doesn't exist in a monolith. And there's a lot more to be said. Back pressure: have it, have a way to signal to the app that you are overloaded. Unbounded queues do not have that. It's mathematically proven that an unbounded queue is worse than no queue at all. Do not use unbounded queues.
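As a tiny sketch of what a bounded queue gives you; the size and timeout are arbitrary:

```python
import queue

# A bounded queue gives you back pressure: producers are told "I'm full"
# instead of work piling up in memory without limit.
work = queue.Queue(maxsize=100)

def submit(item):
    try:
        work.put(item, timeout=0.5)  # wait briefly for room...
        return True
    except queue.Full:
        return False                 # ...then signal overload to the caller
```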
And there's a bunch of links on my page about that. Now, next one: insight into failures. If something fails, you want to know what happened and why did it happen? Because nothing is worse than a silently failing backup script. Again, ask GitLab. In the best case, you put something like Sentry into it, but at least send an email if something doesn't work.
And this is true in code too. And if you do something like this, especially in libraries, I have a lot of opinions about you. Because how am I supposed to debug this? I have no chance. The same is true if you do something like this, what I like to call a vanity exception, like an exception that is special
to your very special library, your very special application. The error detail is still lost. Don't do that. Instead, do exception chaining. So this is the Python 3 syntax. This will attach the original exception in the dunder cause, __cause__, and you can introspect what actually happened.
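A sketch of what that looks like; StorageError and the file reading are just an example, not the slide's code:

```python
class StorageError(Exception):
    """Hypothetical application-level exception."""

def load_config(path):
    try:
        with open(path) as f:
            return f.read()
    except OSError as exc:
        # Python 3 exception chaining: the OSError stays reachable as __cause__.
        raise StorageError(f"could not load {path}") from exc
```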
If you're using Python 2, you have my sympathy, and there's a function called raise_from in the six library, so you can use it too. But really, if you do not know what to do about an exception, just let it be. Chances are that the user knows how to cope with it better
than you do deep in the guts of an application. So the next one is a bit counterintuitive. If your application is unfit for work, like the classic thing is that your connection pool turns invalid because the database server had to be restarted, you could start adding complexity to reconnect the pool, but only in one thread,
and serve some intermediate error to your user, or you just add one line. And more often than not, this is just fine. And it's just a sys.exit, and sys.exit also runs your atexit handlers, which means that you get to clean up properly.
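A hedged sketch of that one line in context; the pool object and its API are hypothetical:

```python
import sys

def handle_request(pool, request):
    try:
        conn = pool.get_connection()   # hypothetical connection pool API
    except ConnectionError:
        # We don't know how to heal the pool in-process: crash loudly and
        # let the process manager restart us with a clean state.
        sys.exit("database connection pool is unusable")
    ...
```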
And this is called crash-only software. And it's not something I just made up. There's solid science behind that. There are a lot of papers which you can, and should, read. And there's a bit more nuance to it, like micro-reboots, but I found them a bit hard to do in Python. So to sum it up: fail fast, fail loudly. A crash is always better than a hang.
As a user, I prefer a 500 over waiting for 30 seconds and then getting a timeout or an error. And depending on your audience, you can just present a stack trace too. For example, Redis takes it to the next level. If they have a crash, they show you a stack trace. But because most of the crashes
are caused by faulty memory, they will also run a fast memory test to tell you that your memory is flawed before you open a bug on their GitHub, I think. So once you have that, you need to focus on recovery. And this is where the MTTR reigns supreme. The MTTR is the mean time to recovery.
And if you've accepted that failures happen, and maybe started writing crash-only software, it becomes much less important that something goes down, and it becomes much more important that it comes up fast. And that's why humans must not be required for restoring the service. Because if I pull the plug on your DC
and your infrastructure starts coming back up and a lot of systems need hand-holding to come up, how do you choose? No, this has to happen on its own. And the prerequisite for that is that your app does not expect that everything is already there, like the database. You try to connect, and if it doesn't work, you try again, again, again. Maybe you just crash, if your process manager allows you to add back-offs. But I just use loops and back-offs.
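A tiny sketch of such a start-up loop with back-off; the connect callable stands in for whatever your driver offers:

```python
import time

def wait_for(connect, start=0.1, cap=5.0):
    """Keep trying to reach a dependency at start-up instead of assuming it's there."""
    delay = start
    while True:
        try:
            return connect()
        except ConnectionError:
            time.sleep(delay)
            delay = min(cap, delay * 2)   # back off, but never wait too long
```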
Now, what is the secret to a long undisturbed vacation? Apparently, $9 Mai Tais and plastic cups.
But also, more practically, what you can do about it: build fault-tolerant systems that recover autonomously, and throw your phone into the sea. This is the promised link. The QR code will bring you there too. Follow me on Twitter, buy domains from Variomedia. Thank you very much.