
Resilient by Design

Formal Metadata

Title: Resilient by Design
Part Number: 35
Number of Parts: 94
License: CC Attribution - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

Abstract
Modern distributed systems have aggressive requirements around uptime and performance, and they need to face harsh realities such as a sudden rush of visitors, network issues, tangled databases, and other unforeseen bugs. With so many moving parts involved in even the simplest of services, it becomes mandatory to adopt defensive patterns that guard against some of these problems, and to identify anti-patterns before they trigger cascading failures across systems. This talk is for all those developers who hate getting an on-call page at 4 AM.
Transcript (English, auto-generated)
So today, I will be talking about Resilient by Design. The previous talk about microservices mentioned the CAP theorem and everything; I think this is like the sequel to that talk, now that you want to build systems that don't go down. My name is Smit. I'm part of the Bundler core team. Until very recently, I used to maintain the dependency resolver. I also occasionally contribute to JRuby. And this is how you can find me on the internet. I work at Flipkart. Many of you might not know it: it's an e-commerce website in India. We have a lot of scaling problems.
And I'm very thankful to them for sponsoring my trip here. So let's just start. Why do we actually care about resilience? Companies have increasingly, over the years, come to depend on software, and at this stage any sort of downtime results in a loss of business. It's bad for customers too, because customers are also relying on the software being up.

To give an example, Flipkart makes more than $1 billion in sales a year, so even a single minute of downtime results in a loss of about $2,000. And the interesting thing is that revenue is never evenly distributed like that; no numbers are that evenly distributed. What happens is that 20% of the time accounts for 80% of the total revenue, and it's during those peak times that your systems are most vulnerable. In those periods, going down for even a single minute could mean losing close to $8,000 in that minute. So companies cannot afford any downtime on their systems. What are they going to do? They are going to rely on developers, support engineers, and everyone else. The famous on-call exists for exactly that reason. It's up to the developer to respond to the page whenever it comes, late at night or otherwise, and it's up to them to make sure the systems stay up.

The second reason is that even the simplest system today depends on other services. At the very least, it will depend on a database which is on another server. And as the previous talk said, the network is not really reliable. So it's very important that thinking about resilience is put at the forefront. Otherwise, I don't think any of us here would like to handle on-call over the weekend.
That's pretty irritating, to be honest. So then the question becomes: how do we actually build a resilient system? I think in the 90s, testing used to be an implicit requirement. The requirement was just that your code should run and it should work; tests might be there or might not be there. It was implicit. Maintainable code, even that was implicit. But those things are not the same anymore. In the Ruby community itself, testing and code maintainability get a lot of focus; they are not an afterthought. The problem with resilience is that today it is still an implicit requirement. Management expects the system to be up all the time. And the developers also think: OK, I wrote this system, I used this data store, it's going to stay up. No real thought is put into resilience when designing the system.

You'll be very lucky to find any of these bugs before production, because most bugs that affect the resilience of a system show up in the production environment, when the systems are at peak load and utilization is up.
That's when you see those bugs. And because you haven't thought about them, they are going to come and bite you; there's just no way around it. The second thing is human bias. Humans inherently think only about the happy paths, where everything is working: your caching servers are up, your database is there, the services you're talking to respond every time you make a request. That's why we fail to see the paths where those things are not actually up, where they are not actually working. So the only way to think differently is to think about resilience from the start. Whenever you are designing your system, you need to ask: if my caching servers go down, have I planned for the capacity? Are they highly available? All those thoughts have to be put in from the start. There are things that can help you with this, and that's what my talk is about: resilient design patterns. However, I would like to put up a disclaimer for this talk
that these are not a silver bullet. It's not that if you use all of these patterns, your system is guaranteed to never go down. Things are never as simple as that. A lot depends on the domain and on the system you're designing. For example, on the Flipkart website, our core job is to serve the product page so the customer can see whether an item is available, and once they click Buy, they should get whatever they ordered. That is our main thing. So if the recommendation system is facing an issue, we can load the page without it. If comments or reviews are not showing up, we can decide not to show them while those systems are down. Obviously, those trade-offs are different for each service. Netflix, for example: if their bookmarking service is down, they will not give you the option of resuming playback; they'll just start from the beginning. They do that because they know the main thing is for you to be able to watch the videos. So the only thing I want to say is that it depends on your domain and on how you have designed your systems. And I think that's a really good thing, because there's no free lunch. If you're designing a system like this,
you need to put thought into it. You need to think about the failure cases. So with that in mind, let's start with the patterns. I think this is the most important pattern in this talk, which is why I'm putting it first: if you don't take anything else out of this talk, take this pattern. The biggest waste of resources is burning CPU cycles and wall-clock time only to get results that you have to throw away. Failing fast is the best thing you can do if the system or the services you're talking to are not responding, or you know they are going to fail.

In fact, the reasoning behind failing fast comes from a mathematical idea called queueing theory. This is Little's law, for those who know it: the average length of a queue equals the arrival rate multiplied by the average time each item spends in the system. Say your system handles incoming messages. The length of your queue depends on the arrival rate of messages and the amount of time they spend in the system, which is the time it takes to process them. If your response times go up, the time in the system goes up, and the size of the queue will increase. Now, say you're talking to a service which is not responding, and you didn't even bother to change the default timeout of Net::HTTP, which is 60 seconds. It's going to take 60 seconds for each call to fail. So your response times will be very, very high, and that will indirectly increase the queue size.
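To make Little's law concrete, here is a tiny back-of-the-envelope calculation. The 50 requests-per-second arrival rate is a made-up figure for illustration; only the time per request changes between the two cases:

```ruby
# Little's law: L = lambda * W
# L = average number of requests in the system (queued plus in progress)
arrival_rate = 50.0                  # requests per second (assumed)

healthy    = arrival_rate * 0.2      # 200 ms responses
timing_out = arrival_rate * 60.0     # every call waiting out a 60 s timeout

puts "healthy: #{healthy} in flight"        # prints "healthy: 10.0 in flight"
puts "timing out: #{timing_out} in flight"  # prints "timing out: 3000.0 in flight"
```

Nothing changed except the time spent per request, but the backlog grew 300 times, which is exactly why a 60-second default timeout is so dangerous.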
The other thing that depends heavily on your response times is the utilization of your system. If you look at this graph, utilization goes up as response time goes up. If each request takes 60 seconds, the utilization of your entire service will be very, very high, and the only thing you can do about it is add more servers and hope for the best. The cool thing is that you can also look at it the other way. Say you optimized your code, you did your best, and you got the response times down as far as they will go. After that, if your utilization is going above 80%, you can easily see that it's going to have a very negative impact on the performance of your system, and you can do capacity planning based on that. And the other thing that's very cool about this: say you're on an agile team and the utilization of your team is around 90%, and now your manager comes in with some ad-hoc task. Just using queueing theory, you can figure out that the turnaround time for that particular task is going to be very, very high. So I think math is pretty cool.
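The hockey-stick shape of that graph falls out of basic queueing math. For a single-server queue with random arrivals (the textbook M/M/1 model, which the talk's graph resembles), average time in the system is roughly the service time divided by (1 minus utilization). The 100 ms service time below is an assumed figure:

```ruby
service_time = 0.1 # seconds of work per request (assumed)

[0.50, 0.80, 0.90, 0.99].each do |utilization|
  # M/M/1 approximation: average time in system = service_time / (1 - utilization)
  time_in_system = service_time / (1 - utilization)
  puts format("utilization %.0f%% -> %.1f s in system", utilization * 100, time_in_system)
end
```

Past 80% utilization, small increases in load buy large increases in latency, which is the math behind the capacity-planning rule of thumb above.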
You cannot run away from the math. The only thing you can do, in this case, is keep your response times as low as possible. Here is an example system I created just to illustrate. Say you have an e-book download service with an SLA of five minutes: within five minutes of buying the e-book, you get an email with the download link. Pretty basic stuff. You have a checkout service which sends messages to the payment service through a message queue, because we don't want to lose those messages. The payment service talks to an external service to verify that the payment is authentic, and then processes further.

Now let's assume the external service we're talking to starts failing; it starts timing out. Intermittent calls to the external service are failing. What happens then is that each payment call to the external service fails, and takes 60 seconds to do so. And the incoming message rate is something you can't really control in this case. So messages are going to start piling up in that message queue. At this stage, even when the external service comes back up, you have a pile of old messages, and you also have new incoming messages from the website, because people are still placing orders. So you fail to meet the SLA for orders placed while the external service was down, and because of that backlog, you also fail to meet the expectation
for newly placed orders. Instead, we embrace that things are going to go bad. We put a circuit breaker in between. It realizes that calls to the external service are going to fail, and in the fallback we store those messages and retry them later. Because of that, our response times stay the same. Your message queue stays empty, because your response times are actually much better: you're not even bothering to call the external service. Then, when the external system comes back up, newly placed orders can meet the SLA; they still get their download links. The messages stored in the separate system or queue get retried later. Those customers will not get their download link on time, so you can send them a special mail or give them some discount. But the main thing is that in this scenario, you are in control. You know exactly which messages failed, and you can design your system around that.
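Here is a minimal sketch of that fallback path. `ExternalGateway` and the in-memory `RETRY_QUEUE` are hypothetical stand-ins for illustration; a real system would use a durable store and a proper circuit breaker rather than a bare rescue:

```ruby
require "timeout"

RETRY_QUEUE = Queue.new   # stand-in for a durable retry queue

# Hypothetical payment step: time-box the external call, and on any
# failure park the message for later instead of blocking the pipeline.
def process_payment(message)
  Timeout.timeout(2) { ExternalGateway.verify(message) }
rescue StandardError
  RETRY_QUEUE << message  # fail fast; a retry worker drains this later
  :deferred
end

puts process_payment({ order_id: 42 })  # prints "deferred" (no gateway defined here)
```

The key property: whether the gateway is slow or entirely gone, `process_payment` returns in bounded time, so the message queue in front of it never grows out of control.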
You're not at the mercy of the external service. So yes, this is the most important thing in this talk. Now, how do we actually make use of it? The first tool for achieving it is bounding. If anywhere in your system you have unbounded access to resources, that is something really, really terrible; you don't want anything like that. Bounding is a huge topic of its own, but I specifically want to cover three things. The first is timeouts. The default timeouts in most libraries are horrible. Net::HTTP, like I mentioned earlier, has a timeout of 60 seconds, so it takes 60 seconds for the read timeout to kick in and tell you that you can't reach the server you're trying to talk to.
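Setting timeouts explicitly is cheap. A sketch, where the `/status` path is a made-up example endpoint:

```ruby
require "net/http"

# Net::HTTP's open_timeout and read_timeout both default to 60 seconds.
# Make them explicit and small, so a dead dependency fails in ~2 s, not 60.
def fetch_status(host)
  Net::HTTP.start(host, 443, use_ssl: true,
                  open_timeout: 2, read_timeout: 2) do |http|
    http.get("/status").code
  end
end
```

With the defaults, every request to a hung server holds a worker for a full minute, which is exactly the queue-growth scenario Little's law describes.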
And I think the scary part is that some things don't have a timeout at all; they never time out. We had a system at Flipkart whose job was to collect messages from the local services and relay them to the main messaging queue. Only through this infra piece could any service talk to the outside world. Now, this infra piece would hang every two or three weeks. It's written in Ruby, and we couldn't figure out what was wrong. When we dug into it, we found that it sent metrics to StatsD over a UDP socket, and there was one socket that nothing was reading from at all; nobody was making use of it. What that caused: the UDP buffer in Linux is 128 KB, and it was getting full. Once it was full, the process would just get stuck in that state. The way we solved it was to use the socket's non-blocking mode, which you can do with write_nonblock in Ruby.
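A sketch of that fix: fire-and-forget metrics that can never hang the process. The address 127.0.0.1:8125 is just StatsD's conventional local port, and the metric name is made up:

```ruby
require "socket"

STATSD = UDPSocket.new
STATSD.connect("127.0.0.1", 8125)

def report(metric)
  # write_nonblock raises instead of blocking when the send buffer is full
  STATSD.write_nonblock(metric)
rescue IO::WaitWritable, Errno::ECONNREFUSED
  # Drop the metric on the floor: losing a stat beats hanging the process.
end

report("payments.success:1|c")
```

The trade-off is explicit: under pressure you lose a metric instead of wedging the relay that every other service depends on.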
So yes, some systems don't even have a timeout. You need to look into your application and check: does it have proper timeouts everywhere? The greatest thing a timeout provides is fault isolation. If another service or another component is not responding, the timeout shields your system. You can use timeouts in conjunction with a circuit breaker, which is the next part I'll talk about, or, if nothing else, with retry logic.
The second thing is limiting memory use. Whenever people use caching, something like Redis, this is something they completely forget about. Or take your application servers: in the case of Unicorn, you can keep a watch on each of your workers and say that up to 85% memory use is okay, and as soon as a worker crosses 85%, you notify the developers, or something like that.

Here's what happens when you have none of that. There was another case at Flipkart itself: we had a system where, again, every two or three weeks, memory usage would grow so much that the host would start using swap, and the performance of that particular host would become really, really terrible. When we looked into it, we found that in one place it was doing a JSON parse with symbolized keys, and unfortunately one of the keys was unique every single time. For those who are new to Ruby: symbols were not garbage collected until quite recently, in Ruby 2.2. Any symbol created in your process stays around until you kill and restart the process. In that case, however, none of us had to get up late at night or early in the morning to fix the system, because we had a worker monitoring system: if a worker went above 90% memory, it would restart that worker, and things would keep working. That helped us out a lot.
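A Linux-only sketch of that kind of watchdog check. The 512 MB limit is a made-up threshold, and the `QUIT` signal is the Unicorn-style convention for a graceful worker shutdown:

```ruby
# Read a process's resident set size from /proc (Linux-specific).
def rss_megabytes(pid = Process.pid)
  File.read("/proc/#{pid}/status")[/VmRSS:\s+(\d+)/, 1].to_i / 1024
end

LIMIT_MB = 512  # made-up threshold; tune it to your hosts

# A hypothetical master process would run this periodically per worker:
def recycle_if_bloated(worker_pid)
  return if rss_megabytes(worker_pid) <= LIMIT_MB
  Process.kill("QUIT", worker_pid)  # graceful restart before swap is hit
end
```

Gems like unicorn-worker-killer package this idea up, but the principle is just: measure, bound, recycle.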
I think that's not an ideal solution, but what it buys you is time to actually debug the issue. Otherwise, the moment a host starts hitting swap, it impacts the business, and that is something you cannot accept. The other point is to limit CPU. A lot of the time there are daemon processes running on your host that do auxiliary things, maybe provide health checks, things like that. Those processes are not the primary thing on that host; it's your service running there that is the most important. But sometimes the code in such a daemon, or a library it uses, goes into some kind of infinite loop or starts consuming more and more of the machine's resources. You can easily limit that daemon using cgroups, and what that gives you is isolation: even if the daemon tries to use everything, it's confined to, say, one core of your system, and it won't take the host down.

And finally: every time you use a mutex lock or a buffer in your system, those are implicit queues, queues you have no control over. It is much better to have an explicit bounded queue, like a messaging queue in front of your service, which can apply backpressure when it's full. That gives you much more control than an implicit queue.
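Ruby's core SizedQueue is the simplest way to see that backpressure in action: producers block once the bound is reached, instead of silently growing an unbounded backlog.

```ruby
queue = SizedQueue.new(100)  # explicit bound; Queue.new would be unbounded

producer = Thread.new do
  50.times { |i| queue << "msg-#{i}" }  # would block once 100 items are waiting
  queue << :done
end

consumer = Thread.new do
  count = 0
  count += 1 until queue.pop == :done
  count
end

producer.join
puts consumer.value  # prints 50
```

Pushing back on the producer is exactly what an implicit queue (a mutex's waiter list, a socket buffer) never lets you decide for yourself.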
I think this is one of the coolest patterns in existence: the circuit breaker. Circuit breakers sit between the client and the server (the supplier). When everything is fine, they don't even come into play. But when you make a request and it starts timing out, when there's some connection problem between the client and the server, then after a certain threshold of errors the breaker concludes that the other service is in trouble and trips the circuit. From that point on, future calls are not even made to the server; they fail right then and there. Later, after a certain amount of time, it will actually make one call to the other service to see whether it's up. If it is, it closes the circuit and everything goes back to normal. But if it's still timing out, the circuit stays open, and you don't even need to make the call. There are really good circuit breaker implementations out there; I think Semian by Shopify is a pretty good one in Ruby.
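To make that state machine concrete, here is a deliberately minimal, not-thread-safe sketch. Real libraries such as Semian handle concurrency, half-open probing, and instrumentation properly; this only shows the closed/open/half-open cycle just described:

```ruby
class CircuitOpenError < StandardError; end

# Minimal circuit breaker: trips open after `threshold` consecutive
# failures; after `cooldown` seconds one trial call is let through.
class CircuitBreaker
  def initialize(threshold: 5, cooldown: 30)
    @threshold = threshold
    @cooldown  = cooldown
    @failures  = 0
    @opened_at = nil
  end

  def call
    raise CircuitOpenError, "failing fast" if open_and_cooling?

    begin
      result = yield
      @failures  = 0        # success closes the circuit again
      @opened_at = nil
      result
    rescue StandardError
      @failures += 1
      @opened_at = Time.now if @failures >= @threshold && @opened_at.nil?
      raise
    end
  end

  private

  # Open and still inside the cooldown window: fail without calling out.
  # Once the cooldown elapses, the next call acts as the half-open probe.
  def open_and_cooling?
    @opened_at && (Time.now - @opened_at) < @cooldown
  end
end
```

Usage looks like `breaker.call { http.get("/status") }`: once the threshold is hit, further calls raise CircuitOpenError immediately instead of burning a full timeout each.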
And if you use JRuby, you can just make use of Hystrix, written by Netflix; it's a very well-written and battle-tested library. Moving on: bulkheads. Bulkheads are a concept that comes from ships. They are watertight compartments, so that even if the hull is partially damaged, the flooding is contained and the ship doesn't sink. The idea is that a single failure doesn't bring down the entire ship, and you can use the same idea in your services. Say your website and your logistics system both need product information.
The website needs product information to show it to the user. Logistics needs it to determine whether an item is dangerous and whether it can be transported by air or by road, depending on the item's category. Now say the website comes under tremendous load: there's a product launch and a lot of people are hitting it. That load on the website is going to hit the product service, and eventually the website will bring the product service down. At that point, logistics can't do anything either; logistics is impacted too. Once the logistics system is down, any systems depending on it go down as well, and this can trigger a cascading failure throughout the system, each dependent piece falling over in turn. However, using the bulkhead pattern, we can give the website and logistics dedicated servers within the product service. Then even if one consumer is causing a lot of problems, the other is shielded; it won't be impacted. And note that bulkheads are very different from adding more capacity. Adding more capacity could still result in the problems
that I mentioned earlier. Here we are separating the servers so the two consumers can't impact each other. You can use bulkheads for several other things too; as a concept it is very powerful. Say you're using circuit breakers and you have a separate thread pool for each server you call, and you notice one of those pools is completely saturated: no free threads. At that point, you can fail the call to that service right there and use its fallback instead. In that sense, one misbehaving system will not forcibly bring down everything else. And finally, the last thing I want to talk about is steady state. So say you use all of these patterns.
Your systems stay up, and nothing can go wrong, right? Actually, that's not true. If you have to fiddle with your systems manually, if human intervention is needed to keep a system running over the weeks, restarting things and so on, that introduces a chance of introducing errors into the system. What you want is as little human effort as possible. There's a lot to this, like setting up automated deployments and all that, but there are two specific points I want to talk about. First, have log rotation in place. The worst thing is having logs that are weeks old, and one day you realize your service is out of disk space. At that point, because there's no way to log any further, it can bring down your entire service on that host.
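If the logs are written from Ruby itself, the stdlib Logger already knows how to rotate; for everything else, the system's logrotate does the same job. The file names and sizes below are illustrative:

```ruby
require "logger"

# Keep at most 10 files of ~50 MB each...
sized = Logger.new("app.log", 10, 50 * 1024 * 1024)

# ...or rotate on a schedule: "daily", "weekly", or "monthly".
daily = Logger.new("cron.log", "daily")

sized.info("payment processed")
```

Either way, the disk usage of the log directory is bounded, which is the whole point.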
So set up log rotation. It takes five minutes, and a lot of folks don't do it. And then there's something that never makes it into the first draft of a system: an archiving strategy. The way archiving usually happens is that somebody has a script, and the DBA or someone like that archives the data for you. That is really terrible, because good archiving depends on your system. In the case of Flipkart, if an order is delivered or customer-cancelled, those are terminal states: nothing else will ever happen to that order. At that point, we can archive any data associated with that particular order, any unit, anything. So your archiving strategy is highly dependent on your domain, and it's something you should think about when you're designing your system, because once you have your schema set, once you have everything set, you can't easily introduce a different kind of archiving strategy later on.
Lastly, I want to end this talk with a quote by Michael Nygard. Michael Nygard wrote a book called Release It!, which is the Bible on building resilient systems. He says that software design as practiced today only talks about what a system should do; it doesn't address what a system should not do. And to build a really resilient system, it's very important that we also think about what the system should not be doing.
Putting it all together: we want to fail fast. If we realize that a call we're making is going to fail, fail fast. We want to bound our resources: use timeouts, and at the very least discover what default timeouts the libraries you use ship with. Use circuit breakers at every integration point in your system; if you're making a call to a different service, use a circuit breaker, so that if that system goes down you can cleanly use a fallback instead, whether that fallback is a cached or stale value or just failing fast. And finally, isolate your failures: use bulkheads, and make sure that if one service misbehaves, the failure is contained to just that service and doesn't affect other systems. So yeah, that's it.
That's enough.