
Designing for Failure


Formal Metadata

Title: Designing for Failure
Subtitle: Fault Injection, Circuit Breakers and Fast Recovery
Number of Parts: 490
License: CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
While we all work very hard to build highly available, fault-tolerant and resilient applications and infrastructures, the end goal these days is often something along the lines of loosely coupled microservices with zero downtime in mind. Upgrades are tied to CI/CD pipelines and we should be sipping piña coladas on the beach. Time to unleash the Chaos Monkey, because that is what Netflix does, so we should try it as well.
Transcript: English (auto-generated)
Alright, originally there was supposed to be a talk here about Metal Cubed, the project, but unfortunately the speaker couldn't make it. Then, as a secondary option, we had Ganesh and Abhijit speaking,
and the second half hour of that talk was going to be replaced by someone from my company. However, that person had family issues and couldn't make it either. So I would like to introduce our next speaker, Walter Heck. He will now present this session.
Thank you for the introduction. Before we get started, I would like you all to get up, please. Everybody stand up. Now, if you have never brought down a production infrastructure
or a production piece of software, you can stay standing. Everybody else, get back in your seats.
So, my talk is about designing for failure.
My session is also an example of failure. You never know when failure is going to happen, and as I said, I am the secondary backup for this session.
So forgive me if it's a little bit rough; I was only notified yesterday that I was going to do this. Designing for failure. It seems like something very simple, something that you do by default,
but many times when you try to explain this, the answer is, we're not designing for failure, we're designing for working software, so just make sure that works. However, we all know that the world is unfortunately not that simple, and the more complex your piece of software is,
the more likely it is to fail at some point. I learned this lesson a very long time ago from someone I was doing freelance work for, Arjen Lentz from Open Query. If you're watching, Arjen, thank you for teaching me this lesson.
Given infinite time, failure is definitely going to happen. It's not a question of if, but when. Therefore, if you're not designing for that failure to happen, you are setting yourself up for failure. If you're lucky, however, you have a couple of options to never encounter it.
A, you move to a different job or a different company before that failure happens, or B, the lifetime of your software or infrastructure, or whatever the piece of engineering is that you have designed or created, is shorter than the time it takes to fail.
However, as we've just seen, I think there were maybe ten people standing after I asked the question, so most of us have seen failure more times than we care to remember. I actually did rm -rf / on a production server
somewhere in the beginning of my career. Not a proud moment, but it's a good learning experience. Failure happens in real life lots and lots of times. Since I heard yesterday that I was going to give this talk,
I was looking at the real world around us and how many things are actually designed to fail. A very famous example is the Tacoma Narrows Bridge, built in the 1940s, where they forgot to account for a specific wind angle and speed,
and so right after the bridge was finished, it started wobbling and eventually collapsed quite badly. On the right, we see the Millau Viaduct in the south of France, a giant spanning bridge, and as you can see,
it's in a canyon and it's designed for hurricane wind speeds that they've never seen in that area. So it's much more resilient and designed for failure.
But there are many examples of where we design for failure. In software and IT in general, we also have lots and lots of things that are designed to at least partially fail without immediately giving problems.
Two examples, RAID technology designed for one or more disks to fail without actually giving you any data loss, and on the right-hand side, we see QoS. These days, with the growth of bandwidth and stability of networking,
it's not something that many of us encounter very often anymore, but it used to be much more important. Basically, it says we have different types of traffic and there are different priorities for that traffic. So at the bottom, we see web, email, and file transfer traffic,
and as important as that might seem that you can load your Reddit front page, it's actually much less important than the traffic that's at the top. So there, within the network world, QoS tries to make sure
that at least audio can continue even when other types of traffic are no longer available. Recovery-Oriented Computing is quite an interesting project
started by some very smart gentlemen over at UC Berkeley. I didn't have enough time to fully dive into it, but the website is full of interesting papers that talk about recovery-oriented computing, which means that instead of assuming that everything will always work,
let's talk about how do we recover from failure and how do we make sure that a failure is not immediately a disaster. So, for instance, by having not one but two backup plans for this room for this half an hour,
you have a degraded user experience, which is me, but that's still better than looking at an empty room with no speaker. And recovery-oriented computing dives into the concepts that are related to this subject.
Designing for failure can be done in many different ways. Unfortunately, failure is everywhere and ubiquitous and we have to deal with it. In code, we look at things like exception handling. There are some programming languages that don't do exception handling.
However, most languages allow you to handle exceptional situations, and very often this is not done correctly, but it's something that you should definitely think about. Fault tolerance and isolation. How do we make sure that when something is not running the way we expect it to,
the rest of our system is still functioning. Fallbacks and degraded experiences. I just discussed degraded experiences. Autoscaling. What if we have more traffic than we originally assumed? How do we deal with that? In our cloud computing world, it's a lot easier subject than in many other environments,
but autoscaling is something that can be very useful to deal with these things. Lastly, redundancy. We've all hopefully heard of the word single point of failure.
Try to reduce the number of single points of failure that you have by introducing redundancy. The fun exercise that I always like to do is try to look at an architecture and pinpoint the single points of failure.
There will always be single points of failure, even if the entire system is a single point of failure. So the question there is how much is it worth to you to make sure that that single point of failure does not fail?
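To make the earlier points about exception handling and degraded experiences concrete, here is a minimal sketch; the cache and database calls below are simulated stand-ins I made up, not anything from the talk's slides.

```python
# Fallback to a degraded experience: if the (fast) cache is unavailable,
# serve the request from the (slower) database instead of failing outright.
class CacheUnavailableError(Exception):
    pass


def get_from_cache(product_id: str) -> dict:
    # Stand-in for a real cache client; here we simulate an outage.
    raise CacheUnavailableError("cache cluster unreachable")


def get_from_database(product_id: str) -> dict:
    # Stand-in for the slower but still-working source of truth.
    return {"id": product_id, "name": "example product"}


def get_product(product_id: str) -> dict:
    try:
        return get_from_cache(product_id)          # fast path
    except CacheUnavailableError:
        # Degraded but working: the page loads slower, the user still gets data.
        return get_from_database(product_id)


print(get_product("42"))
```

The user sees a slower page instead of an error page, which is exactly the kind of degraded experience the talk argues for.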
As a real-world example, S3 offers eleven nines of durability, which means that if you store 10,000 objects, you can expect to lose a single object roughly once every 10 million years, which is great, except that it's only four nines of availability.
So that doesn't mean that your objects will always be available. It just means that they will be persistent and eventually become available again.
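As a back-of-the-envelope check, assuming the eleven nines are an annual, per-object durability figure:

```latex
\mathbb{E}[\text{objects lost per year}]
  = N \,(1 - 0.99999999999)
  = 10^{4} \cdot 10^{-11}
  = 10^{-7}
\quad\Rightarrow\quad \text{on average one lost object per } 10^{7} \text{ years.}
```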
From SRE theory comes the motto that hope is not a strategy. I think we've all seen one or more of those situations. I work in infrastructure consulting, so I see a lot of places where people say, oh, we just hope that this never happens. Unfortunately, hope is not a strategy. I quite like that motto.
We cannot just hope that that single thing will never happen. It will definitely happen, given infinite time. You just might get lucky. And to show that designing for failure is not something easy,
a company like Google has 100,000 employees, and still they're suffering failures. So if you're a small company, don't feel too bad. Failure happens, and it's good to be able to deal with that. When I talked about the person who taught me this designing for failure,
one of the things that he said was that in that consulting company, we didn't do any emergency support. The idea was that we told clients: we don't do emergency support, because we build everything in a highly available fashion, so that even if a failure happens,
we don't have to immediately wake up and get to action. We can deal with it the next day. That was specifically MySQL consulting. So in the MySQL world, it's relatively easy to make sure that when a failure happens, it's not immediately a disaster.
One of the things that I never liked is people asking me, what kind of SLA do you guarantee me? I don't know. Something short of zero downtime. It is almost impossible to reach zero downtime, and if you do reach it, you are incredibly lucky.
SLAs are important in the traditional business world, but they mean very little to engineers. They simply tell us how badly we're going to get scolded
when we break an SLA. Simply put, zero downtime is not a reasonable promise; it is only something you can strive for. If you're lucky, you can maybe reach it over a short period of time, but in the longer run, it's nearly impossible to sustain zero downtime.
In the infrastructure domain, to put it a little bit closer to where we are, to the topics that we're talking about today, we have lots of different examples of failure and how to deal with that, and make sure that we can actually continue and not have a disaster when a failure happens.
Fault tolerance: you can easily deploy a load balancer in front of your servers, which should help improve uptime and make sure that availability is good. High availability: a load balancer in front of a single server
can only do so much. It still doesn't allow for that single server to fail. However, it does allow for a degraded experience. The load balancer will simply return a 503 or something similar to indicate that there are no backends available to serve traffic,
but that's still better than a service that's trying to ping another service and gets zero answers. Resilience, how can we make sure that we adapt to a situation based on load, as I said before?
This is, in the cloud world, very easy with autoscaling policies, but even when I say very easy, very often it's not actually that easy, because it implies a whole bunch of design constraints on the software that you're trying to autoscale.
Not the least of which are readiness and liveness probes, sometimes referred to as health checks. When you have a server that is up, it does not necessarily mean that it is working. Those are two entirely different things.
My laptop is up, it's just not working. For instance, I hope you've never seen it, but I've seen it more times than I care to remember: a server that is technically still running, except some stupid log file decided to run wild. Actually, some stupid administrator decided not to configure the thing properly,
which made a log file fill up the disk. The server is still running, just not serving any traffic. A simple health check within an application that's running on that server can very easily determine, hey, is this server ready to serve traffic?
Is it still able to handle incoming requests? Those are actually two different things, especially when you're talking about autoscaling. When a server comes up initially, the fact that the OS is up and network traffic is there does not necessarily mean that you have a working server
and it's able to serve traffic. So you want to make sure you have a health check that checks if the server is ready to serve traffic. The liveness probe, on the other hand, is more important over time, where you're wondering, is this server still able to handle incoming requests? Maybe, indeed, the disk ran full,
and it's no longer able to handle incoming requests, and the liveness probe can tell you, hey, this server is not healthy anymore, let's replace it with another one.
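As a sketch of the difference between the two probes, here is a toy readiness/liveness endpoint in plain Python; the paths, the port and the 90% disk threshold are illustrative assumptions, not something from the talk.

```python
# Separate readiness ("can I start taking traffic?") and liveness
# ("am I still healthy?") endpoints for a load balancer or orchestrator to poll.
import shutil
from http.server import BaseHTTPRequestHandler, HTTPServer


def disk_has_room(path: str = "/", max_used: float = 0.90) -> bool:
    """Liveness-style check: fail once the disk is almost full."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total < max_used


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/ready":
            ok = True                # e.g. config loaded, database reachable
        elif self.path == "/live":
            ok = disk_has_room()     # e.g. the log disk hasn't filled up
        else:
            self.send_response(404)
            self.end_headers()
            return
        self.send_response(200 if ok else 503)
        self.end_headers()


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```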
Some example technologies: Corosync and Pacemaker are Linux tools that I am quite familiar with. MySQL Galera is quite an interesting tool to make sure that we have multiple MySQL servers and one can easily die without the others being affected.
Cloud-based infrastructure: for instance, and I won't name the specific vendor, we can see that DNS can fail our traffic over to a static site
if the original backend is not working anymore, which means that we have a degraded experience, but at least we can show some kind of context to the end user. Multi-AZ and multi-region setups in the cloud also let you design for failure.
I'll talk a little bit more about it in a minute, but basically, it depends on how much money you want to spend for how much failure you want to account for. So I always tell people not to overdo this. If you're Netflix, it's really cool that you can have a whole region go down and nobody has to wake up. If you're not Netflix, then you're probably okay
with having some kind of simpler setup, because it will cost you a hell of a lot less money. I should be preaching to the choir about these things, but as I said, I work in consulting, so I've seen more than one environment
that makes me not super excited. To give an example of a failure in a software domain, at the top we see a snippet that connects to MySQL, and if the MySQL connection fails,
then we store the fact that the database is not working at the moment, and if it's up, we store that too, so that in the snippet at the bottom we can check whether the database is up or down. The reason you want to store that is that if you just try the connection every time, it might fail slowly
because the connection attempt hangs. So asynchronously, check whether it's possible to connect to the database.
Very simply, designing for failure is sometimes nothing more than an if statement: I still give a response, but I don't actually try to connect to the database, because if you didn't have that check, the request would probably hang, which is a much worse experience than being able to immediately respond: the database is currently not working, so you're getting an error message.
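A minimal sketch of that idea, assuming mysql-connector-python; the connection details are placeholders, and the talk's actual slide snippet is not reproduced here.

```python
# Keep track of whether the database is reachable, so requests can answer
# immediately instead of hanging on a fresh connection attempt.
import mysql.connector
from mysql.connector import Error

db_is_up = True  # last known state, refreshed periodically


def check_database() -> None:
    """Called in the background (e.g. every few seconds) to refresh db_is_up."""
    global db_is_up
    try:
        conn = mysql.connector.connect(
            host="db.internal", user="app", password="secret",
            database="app", connection_timeout=2,  # fail fast, don't hang
        )
        conn.close()
        db_is_up = True
    except Error:
        db_is_up = False


def handle_request() -> str:
    # Designing for failure is sometimes nothing more than an if statement.
    if not db_is_up:
        return "The database is currently unavailable, please try again later."
    return "normal response built from a database query"  # placeholder happy path
```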
In software design, there's also the circuit breaker pattern, and the circuit breaker pattern is relatively easy
for those of you who know something about electronics. The circuit breaker is normally always in a closed state, which means that electricity can flow freely, and in this case, traffic and logic can flow freely. If something happens in the electrical engineering world,
we open the circuit breaker so that the power can no longer flow. In the software world, you open the circuit breaker to make sure that nothing can continue anymore, and then after a while you half-open it with, for instance, a single request or a single query,
depending on the system that you're designing the circuit breaker for. If that works, you close the circuit breaker again, and the application can continue to work.
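A minimal sketch of such a circuit breaker; the failure threshold and the reset timeout are illustrative assumptions.

```python
# Circuit breaker: closed = calls flow; open = fail fast; half-open = let one
# probe call through after a cool-down and close again if it succeeds.
import time


class CircuitBreaker:
    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures    # failures before the breaker opens
        self.reset_after = reset_after      # seconds before a half-open probe
        self.failures = 0
        self.opened_at = None               # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                # Open: fail fast instead of letting every request hang.
                raise RuntimeError("circuit open, not calling the backend")
            # Otherwise: half-open, let exactly this call through as a probe.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures or self.opened_at is not None:
                self.opened_at = time.monotonic()   # (re)open the breaker
            raise
        self.failures = 0
        self.opened_at = None                       # success: close the breaker
        return result
```

Wrapping a backend call in something like breaker.call(query_database) then fails fast while the backend is known to be broken, instead of piling up hanging requests (query_database being whatever function does the real work).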
CI/CD is quite an interesting one. I've seen too many CI/CD pipelines where the CI part is relatively easy: we run a bunch of tests, and everything's fine. The CD part of CI/CD can fail in 723 spectacular ways, and that's rounded down by a lot.
I've seen more pipelines than I care to remember that break in spectacular ways because, for whatever reason, the environment you think you're deploying to is not actually in the state you expect at all.
It's good to design for that in your CI/CD pipelines as well. Specifically, if you want to be able to recover from a pipeline failing: it has a tendency to leave things hanging halfway, and if you don't have an automated way to recover from such a state,
you're still looking at manual work, which is not necessarily something you want to do. On the right side, we're talking a little bit about chaos engineering. If you've never done this before, try to break something that is not your production environment. That's going to get you a lot fewer angry faces
than trying to break your production environment. Try to see how your environment deals with failure. Turn something off and see what happens. Once you're more confident in breaking your environment, you can try to do this on a production environment
and see if that survives. There is a well-known suite of chaos engineering tools from the folks at Netflix, around the Chaos Monkey, that deals with breaking environments in production.
Make sure you do this during work hours, because during work hours, on a Tuesday at 11 o'clock in the morning, you're around. You're ready, you're prepared for the failure, you are able to deal with it, which is a lot better than at 3am on Christmas Eve, when you would rather be asleep.
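If you want to try this in the smallest possible way, something like the sketch below is enough to get started; the hostnames and the service name are made up, and it deliberately targets a staging pool rather than production.

```python
# Tiny chaos experiment in the spirit of Chaos Monkey: stop one service on a
# randomly chosen (non-production!) host and watch how the rest copes.
import random
import subprocess

STAGING_HOSTS = ["web-1.staging", "web-2.staging", "web-3.staging"]
SERVICE = "myapp.service"

victim = random.choice(STAGING_HOSTS)
print(f"Stopping {SERVICE} on {victim} - does anything user-visible break?")
subprocess.run(["ssh", victim, "sudo", "systemctl", "stop", SERVICE], check=True)
```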
At the bottom, an error budget is something that can be quite useful. Instead of always imposing very strict requirements on when you can deploy and how you can deploy
and things you can change in your environment, sometimes it can be good to instead have an error budget, which basically says as long as the SLA stays above 99%, we're free to do whatever we want to the production environment, because it means that clearly we have a handle on properly dealing with the production infrastructure.
The moment we fall below 99%, a bunch of additional checks get put in place. So all of a sudden, when you get below 99% and these levels are obviously up to you and the constraints are also up to you, you can say, okay, when we get below 99% uptime,
we will no longer automatically deploy and all deploys need to be approved by person X. In that way, during normal operation, you won't put too many restrictions on your team.
However, if things don't go well for a while, then you fall back to a more cautious way of working.
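As a quick illustration of the arithmetic behind such an error budget, assuming the 99% threshold from the talk and a 30-day month:

```python
# Error budget: 1% of a 30-day month is 432 minutes (about 7.2 hours) of
# allowed downtime; while budget remains, deploys stay unrestricted.
SLO = 0.99
MINUTES_PER_MONTH = 30 * 24 * 60         # 43,200 minutes

budget = (1 - SLO) * MINUTES_PER_MONTH   # 432 minutes
downtime_so_far = 180                    # minutes of downtime this month (example)

remaining = budget - downtime_so_far
if remaining > 0:
    print(f"{remaining:.0f} minutes of error budget left: keep deploying freely")
else:
    print("Error budget spent: fall back to the cautious deploy process")
```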
If you're looking to start with designing for failure, one of the things you can do is look back at the last X times your environment failed, let's say ten, and ask yourself: could this failure have given your end user a better experience? So instead of the user getting a completely broken system,
if the search engine failed, does that mean we could have just disabled search on the website? If the caching layer fails, does it mean that the user is okay with a website that loads quite a bit slower
instead of getting an error message because the caching service is not available? That can give you a good idea of, okay, where do we need to look at starting to design for failure? Another thing you can do is look at your biggest risks.
What are the things that will give you the biggest disruption, and how can you design the end user experience around them? Because that's what you really do when you're designing for failure: you make sure that the end user experience is as good as possible. Let's say a database is not working.
That would be a problem, depending on what your application is doing, but maybe you can still serve some kind of page that doesn't require a database. Maybe only writes to your database are failing,
so maybe you can still have your application do read-only. If you want to do this, you can deal with increasing levels of detail. Start with small things and iterate from there.
It's basically a never-ending exercise, so don't expect that you can run it as a one-off project and have failure dealt with once and for all. Designing for failure should be part of your workflow, in every thought that you put into an architecture, and from there you should iterate over time.
Walter, you have minus three minutes. Thank you for the notification. I'll skip this slide and leave it there. Thank you very much for your attention.