
Resilient by Design


Automated media analysis

Speech transcript
Today I will be talking about Resilient by Design. The previous talk, on services, touched on this, and I think this talk is like a sequel to that: now that you know how you want to build your systems, how do you build systems that don't go down? My name is Smit Shah. I am a member of [unclear], where I help maintain the dependencies, and I also occasionally contribute to [unclear]; this is who I am on the internet. I work at an e-commerce website that many of you would not know. We handle a lot of scale, so we have a lot of scaling problems, and I am very thankful to them for sponsoring my trip here. So let us start.
So why do we actually care about resilience? Over the years, companies have increasingly started depending on software, and at this stage any sort of downtime results in loss of business. It is also bad for customers, because customers rely on this software being up. To give an example, take a company that makes more than 1 billion dollars a year in revenue: even a single minute of downtime results in a loss of about 2 thousand dollars. The interesting fact is that revenue is never evenly distributed. A small percentage of the time accounts for a major percentage of total revenue, and it is during those peak times that your systems are most vulnerable. At those times, going down for a single minute could cost many thousands of dollars.
So companies cannot afford any downtime on their systems. What do they do? They rely on developers, support engineers and so on; the famous on-call exists for exactly that reason. It is up to the developer to respond to the on-call whenever it comes, even when asleep at night, and to make sure the systems are running. The second reason is that even the simplest system depends on other services. At the very least it will depend on a database, which is on another server, and as the previous talk said, the network is unreliable. So it is very important that resilience be put at the forefront when designing. I don't think any of us here would like to handle on-call issues like that. So the question becomes: how do we actually build a resilient system?
I think back in the nineties, testing used to be an implicit requirement. You pushed your code, and unit tests might be there or might not be there; maintainability and all of those things were very implicit. That is no longer the case. In the Ruby community itself, testing and code maintainability are things that get a lot of focus. The problem with resilience is that today it is still an implicit requirement. Management expects that the system should be up all the time, and developers also think: I wrote this system, I used this data store, so it is going to be up. If no thought is given to resilience while designing the system, you are very unlikely to find any of these bugs before production, because most of the bugs that deal with the resilience of the system happen in the production environment, under load, when other systems are failing or utilization is high. That is when you see those bugs, and because you haven't thought about them, they are going to come and bite you. There is no way around it.
The second thing is human bias. Humans are inherently biased towards the happy path, where everything is working: your caching servers are up, your databases are up, and the services you are talking to respond every time you make a request. That is why we need to develop patterns for the cases where those things are not actually up, where they are not actually working. So we need to think in a different way: we need to think about resilience from the start. Whenever you are designing the system, you need to ask: if this system goes down, can I cope? What if my caching servers go down? Are they highly available? All those thoughts have to be put in from the start. There are things that can actually help you, and that is what my talk is about: being resilient by design.
However, I would like to put a disclaimer on this talk: these patterns are not absolutes. It is not the case that if you use all of these patterns in your system, it is guaranteed never to go down. It is never as simple as that. A lot of it depends on the domain and on the system that you are designing. For example, on an e-commerce site the important thing is to be able to serve the product page to the customer, to show whether the item is available, and once the customer clicks on buy, to deliver whatever was ordered. Those are the main things. So if the recommendation system is facing an issue, leave it out; if the comments or reviews are not loading, don't show them while those systems are down. Those kinds of details are very domain dependent. For example, at Netflix, if the bookmarking service is down, they do without the resume option and simply let the user start the video. The reason they do that is that they know the main thing is for the user to be able to watch the videos. Hence the only thing I want to say is that it depends on your domain and on how you have designed your system. I think that is a good thing, because there is really no free lunch: if you are designing a system like this, you need to put thought into it and know what the critical pieces are. So with that in mind, let's start with the patterns.
I think this is the most important pattern in this talk, which is why I am putting it first. If you take anything out of this talk, take this: when resources are scarce, failing fast is the best thing you can do if the services you are talking to are not responding. The reasoning behind failing fast actually comes from a mathematical idea called queuing theory. This is John Little's law, for those who do not know it. Say a system has incoming messages. The length of the queue depends on the arrival rate of the messages and the amount of time they spend in the system, that is, the wait time plus the time it takes to process them. If your response times go up, the time in the system goes up, and the size of the queue increases.
Now say you are talking to a service which is not responding, and you have not bothered to change the default timeout, which in Net::HTTP is 60 seconds. That is 60 seconds for the request to fail and 60 seconds before the resource is freed, so your response times will be very, very high, and that drastically increases the queue size. Another thing that is highly dependent on your response times is the utilization of your system. If you look at this graph, utilization goes up as the response time goes up. So if each request is taking 60 seconds, the utilization of the entire service will be very, very high, and the only thing you can do about it is add more servers and hope for the best.
The good thing is that you can also look at it the other way. Say you optimized your code, you did your best, you got the response times down to a certain extent, and utilization is still going above 80 percent. If you are going above 80 percent, you can easily see that it is going to have a very negative impact on the performance of your system, and at that point you can do capacity planning based on that. The other nice thing about this: say you hire an engineer and the utilization of that engineer is already around 90 percent, and now a manager assigns a new task. Just using queuing theory you can figure out that the turnaround time for that particular task is going to be very, very high. So I think the math is pretty cool, but you cannot run away from the law. The only thing you can do in this case is keep your response times as low as possible.
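To make the arithmetic behind Little's law concrete, here is a minimal Python sketch. The arrival rate of 10 messages per second, the 0.2 second healthy response time and the pool of 4 workers are illustrative assumptions, not figures from the talk; only the 60 second timeout comes from the example above.

```python
def items_in_system(arrival_rate, time_in_system):
    """Little's law: average items in the system = arrival rate * time in system."""
    return arrival_rate * time_in_system


def utilization(arrival_rate, service_time, servers):
    """Fraction of total capacity in use; values above 1.0 mean the system cannot keep up."""
    return arrival_rate * service_time / servers


ARRIVAL_RATE = 10.0  # messages per second (assumed for illustration)
WORKERS = 4          # number of workers (assumed for illustration)

for label, seconds in (("healthy", 0.2), ("timing out", 60.0)):
    queued = items_in_system(ARRIVAL_RATE, seconds)
    load = utilization(ARRIVAL_RATE, seconds, WORKERS)
    print(f"{label}: {queued:.0f} messages in the system, utilization {load:.2f}")
```

With the same arrival rate, the only input that changes between the two cases is the time each message spends in the system, which is why keeping response times low (or failing fast) is the lever the talk focuses on.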
This is one of the example systems that I created just to illustrate this. Say you have a book download service: you buy the book, and within 5 minutes you get an e-mail with the download link. The first step is a checkout service, which sends messages to the payment service through a message queue, because we don't want to lose those messages. The payment service talks to an external service to verify that the payment is authentic, and in the end the order is processed. Now let's assume that the external service we are talking to starts failing; it starts timing out. What happens is that each payment call to the external service is going to fail, and it is going to take 60 seconds to do so. The incoming rate of messages is something you can't really control in this case, so the messages are going to start piling up in the message queue. When the external service comes back up, you have all the queued messages, and you also have incoming messages from the demand side, because people are still placing orders. So you fail to meet the expectation for the orders placed while the external service was down, and because of that backlog you also fail to meet the expectation for newly placed orders. In that case things are going to be bad.
Now say we use a circuit breaker in between. It knows that calls to the external service are going to fail, so what we do is store those messages and retry them later. Because of that, our response times stay the same, so your message queue keeps draining, and your response times are actually much better because you are not even waiting on the external service. When the external service comes back up, newly placed orders are processed right away and those customers get their download links. The messages which were stored in a different system or a different queue you retry later. Those customers will not get their download links on time; you can send out an apology, or you can give them some discount. But the main thing is that in this scenario you are in control: you know which messages are the ones that have been affected, and you can design your system so that you are not dependent on the external service. So this is the overall most important thing about the talk. Let's see now how we actually make use of it.
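The "store those messages and retry them later" step in the example above can be sketched as follows. This is a hypothetical illustration in Python, not the speaker's system: `ExternalServiceDown`, `process_payments`, `retry_parked` and the `verify` callables are names invented here, and an in-memory `deque` stands in for the separate queue or store mentioned in the talk.

```python
from collections import deque


class ExternalServiceDown(Exception):
    """Raised by a verifier when the external service fails fast (hypothetical)."""


def process_payments(messages, verify, parked):
    """Verify each message; park failures for a later retry instead of blocking."""
    processed = []
    for message in messages:
        try:
            verify(message)
        except ExternalServiceDown:
            parked.append(message)  # we know exactly which orders were affected
        else:
            processed.append(message)
    return processed


def retry_parked(parked, verify):
    """Re-run everything that was parked; anything that fails again is parked again."""
    pending = list(parked)
    parked.clear()
    return process_payments(pending, verify, parked)


def failing_verify(message):
    raise ExternalServiceDown(message)


def healthy_verify(message):
    return True


parked_orders = deque()
done_during_outage = process_payments(["order-1", "order-2"], failing_verify, parked_orders)
done_after_recovery = retry_parked(parked_orders, healthy_verify)
```

Because failures return immediately, the main queue keeps draining during the outage, and the parked orders form an explicit list that can be retried or compensated later.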
The first thing you want to do is bounding. Any place in your system where you have unbounded access to resources is something really terrible; you don't want anything like that. Bounding is a huge topic of its own, but I specifically want to cover three things.
The first is timeouts. The default timeout in many libraries is high. Net::HTTP, as I mentioned earlier, has a timeout of 60 seconds, so at peak time it takes 60 seconds to realize that you cannot reach the server you are trying to talk to. And I think the scary part is that some systems don't even have a default timeout; they never time out. We had a system like that. It would collect messages from the local service and send them to the main messaging queue. Its job was to relay those messages, so that any service could talk with the outside world through it. This service would hang every 2 or 3 weeks. It is written in Ruby, and we couldn't figure out what was wrong. Then we dug into it and found that it sends metrics over a socket that nothing was reading from, and the thing causing the hang was the buffer: the UDP buffer size is 128 KB on Linux, and it was getting full. At that point the process would just get stuck in that state. The way we solved it was to use the socket in non-blocking mode, which you can do using write_nonblock. So yes, some systems don't even have a timeout, and you need to look into your application and check whether it has proper timeouts. The biggest thing that timeouts provide is fault isolation: if another service is not responding, it does not take your system down with it. You can use timeouts in conjunction with the circuit breaker, which is the next pattern and which I will talk about later; if nothing else, you can use them with retry logic.
The second thing is limiting memory use. I am sure this room is full of people who use caching, and limiting memory use is something people continually forget about. For application servers, in the case of Unicorn, you can have a watcher on each of your workers that checks memory usage, and as soon as it crosses 85 percent you can notify developers or something like that. There is another case. We had a system where every 2 or 3 weeks the memory usage would increase, the host would start to use swap, and the performance of that particular host would be really terrible. When we looked into it, we found that at one place it was doing a JSON parse and using symbols for the keys, and unfortunately one of the keys was unique every single time. In Ruby, symbols are not garbage collected, so any symbol that you create stays in your system until you restart your process. However, none of us had to get up in the middle of the night or early in the morning to fix the system. We had a monitoring system, and if a worker went above 90 percent it would restart that worker. That helped us out a lot. I think it is not an ideal solution, but what it gives you is time to actually debug the issue; otherwise, any time the host starts using swap, it is going to impact the business, and that is something you can avoid.
The other point is to limit CPU. A lot of times there are processes running on your host that do certain things, maybe health checks or things like that. Those processes are not the primary thing running on that host; it is your service that is the most important. But sometimes one of them, or a library it is using, goes into some kind of infinite loop and starts using more and more of the resources of the system. You can easily limit that using cgroups. What this gives you is isolation: even if that process tries to use all your resources, it is only using one quarter of your system, and because of that your service will not go down.
And finally, avoid having any unbounded queue or buffer in your system that you are using implicitly and have no control over. It is much better to have an explicit bounded queue, like a messaging queue that sends messages to your service. It can be bounded, and it can apply back pressure when it is full. This gives you much more control than just using an implicit queue.
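As a small illustration of an explicit bounded queue with back pressure and a timeout, here is a sketch using Python's standard `queue.Queue`. The capacity of 2, the 0.05 second timeout and the `offer` helper are assumptions chosen for the example; the talk does not prescribe a specific queue or language.

```python
import queue


def offer(bounded_queue, message, timeout_seconds=0.05):
    """Try to enqueue a message; return False if the queue stays full past the timeout."""
    try:
        bounded_queue.put(message, timeout=timeout_seconds)
        return True
    except queue.Full:
        # Back pressure: the producer learns quickly that the consumer is behind
        # and can slow down, shed load, or park the message elsewhere.
        return False


inbox = queue.Queue(maxsize=2)  # explicit bound instead of an implicit, unbounded buffer
accepted = [offer(inbox, number) for number in range(3)]
print(accepted)
```

The producer never blocks for longer than the timeout, and the bound makes the worst-case memory held by the queue known in advance.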
The next pattern, I think, is one of the better-known ones: the circuit breaker. The way circuit breakers work is that the breaker sits between the client and the server, or the supplier. If everything is fine, the circuit breaker doesn't even come into play. But say you make a request and it starts timing out; there is some connection problem between the client and the server. In this case, after a certain threshold of failures, the breaker realizes that the other service is facing some difficulty and is not able to respond, so it opens the circuit. From that point onwards, any future calls are not even sent to the server; they fail right then and there. Later on, after a certain period of time, it will make a call to the other service to see whether it is up or not. If it is up, it closes the circuit and everything goes back to normal. But if it is still down, the circuit stays in the open state and you don't even make the call. There are a lot of good examples of circuit breakers. I think Semian by Shopify is a pretty good Ruby implementation, and if you use JRuby you can just make use of Hystrix, which is written by Netflix and is a very well written and battle-tested library. So that is something you can use going forward.
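The closed, open and trial-call behaviour described above can be sketched in a few lines of Python. This is a deliberately small, single-threaded illustration and not the implementation used by Semian or Hystrix; the class name, the default threshold of 3 failures and the 30 second reset timeout are assumptions made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock       # injectable so the behaviour can be tested without sleeping
        self.failures = 0
        self.opened_at = None    # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Reset timeout elapsed: let this one call through as a trial.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success: close the circuit and forget earlier failures.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each outbound request, for example `breaker.call(fetch_payment_status, order_id)` with a hypothetical `fetch_payment_status` function, and treat `CircuitOpenError` as the signal to use a fallback or park the work for later.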
The next pattern is bulkheads. The concept comes from ships: bulkheads are watertight compartments in a ship, so even if the hull is partially damaged, the damage will not sink the entire ship. The idea is that a single failure doesn't bring down the entire system, and this is something you can use in your services.
Say you have an e-commerce website and a logistics system, and both need a product service for product information. The website needs the product information to show it to the user; logistics needs it to determine how to ship the item, by air or by road, depending on what kind of item it is. Now say the website is facing tremendous load: there is a popular item and a lot of people are looking at it. The load on the website is going to affect the product service, so eventually the website will bring down the product service because of the load it is experiencing. At that point logistics can't do anything, so logistics is affected as well, and once the logistics system is down, any systems which depend on it will also go down. This could trigger cascading failures throughout the system, with each dependent going down in turn. However, using the bulkhead pattern, we can have dedicated product service servers for the website and for logistics. So even if the website is experiencing a lot of load, logistics is shielded from it and will not be impacted. Note that bulkheads are different from adding more capacity. Adding more capacity could still result in the problem I mentioned earlier, whereas here it is about separating the servers so that they don't impact each other.
There are other places where you can use bulkheads as well; the concept is very powerful. Say you are using circuit breakers and you have a thread pool for each of the services you call, each with its own pool. If one of the pools is completely saturated and there are no free threads, at that point you can fail the call to that service fast and use a fallback instead. In that sense, one system will not bring down everything else.
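The per-caller isolation described above can be approximated in-process with one bounded pool of slots per caller. The sketch below uses a semaphore per caller rather than real thread pools or dedicated servers; the `Bulkhead` class, the caller names "website" and "logistics", and the limit of 2 concurrent calls are assumptions made for illustration.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a caller has used up all of its own slots."""


class Bulkhead:
    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Do not wait for a slot: a saturated compartment fails fast.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slot; failing fast")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


# One compartment per caller of the product service, so a flood of website
# traffic can exhaust only the website's slots.
bulkheads = {"website": Bulkhead(2), "logistics": Bulkhead(2)}


def product_info_for(caller, fetch):
    """Run a product lookup inside the caller's own compartment."""
    return bulkheads[caller].call(fetch)
```

When `BulkheadFullError` is raised for the website, logistics lookups still have their own free slots, which is the containment property the pattern is meant to provide.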
And finally, the last thing that I want to talk about is steady state. Say you follow all of this and your systems are staying up; nothing can go wrong, right? Actually, that is not true. If you have to tend to your systems manually, if there has to be human interaction to keep your system going for weeks, like restarting it or something like that, that introduces a chance of introducing errors into the system. So what you want is as little human effort as possible. There are a lot of things around that, like automated deployment and all that, but there are two specific points I wanted to talk about.
The first is to have log rotation in place. The last thing you want is to have logs which are weeks old and then realize that your service is out of disk space because of them, which could also bring down other services on the host. Setting up log rotation takes very little time, so do it.
The other is something that rarely makes it into the design of the system: an archival strategy. The way data archiving usually works is that people have a script, and a DBA or someone like that archives the data for you. But which data can be archived depends on your system. In the case of orders, if the order is delivered or if the customer cancelled it, those are terminal states. At that point we know that nothing else is going to happen to that order, so we can archive any data associated with that particular order. So your archiving strategy is highly dependent on your domain, and it is something you should think of while you are designing the system, because once you have your schema set, once you have everything set, it is hard to introduce a different kind of archiving strategy later on.
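As one way to keep log growth bounded without human intervention, here is a sketch using `RotatingFileHandler` from Python's standard library. The logger name, the 1024 byte limit and the 2 backups are arbitrary values chosen for the demonstration; in practice log rotation is often configured outside the application, and the talk does not name a specific tool.

```python
import logging
import logging.handlers
import os
import tempfile


def build_rotating_logger(path, max_bytes=1024, backup_count=2):
    """Return a logger whose files on disk are capped at (backup_count + 1) * max_bytes."""
    logger = logging.getLogger("resilient_demo")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.handlers.RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=backup_count
    )
    logger.addHandler(handler)
    return logger, handler


log_dir = tempfile.mkdtemp()
logger, handler = build_rotating_logger(os.path.join(log_dir, "service.log"))
for request_number in range(500):
    logger.info("request %d handled", request_number)
handler.close()

log_files = sorted(os.listdir(log_dir))
print(log_files)
```

However many requests are logged, the oldest backup is discarded on each rollover, so disk usage stays within a fixed limit.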
Lastly, I want to end this talk with a quote by Michael Nygard. Michael Nygard wrote a book called Release It!, which is the bible of building resilient systems. He says that software design often only talks about what the system should do; it doesn't address what the system should not do. For resilience, it is very important to think about what the system should not be doing.
Putting it all together: if a call in the system is going to fail, we would rather have it fail fast. We also want to bound resources and use timeouts; at the very least, find out what the defaults are in the libraries you are using. Use circuit breakers at any integration point in your system. If you are making a call to a different service, use a circuit breaker, so that if that system is down you can fail early and use a fallback instead, which could be a cached value or could just be failing fast. And finally, you want to isolate your failures: use bulkheads and make sure that if one service is behaving badly, the failure is contained to just that service and does not affect other systems. So yeah, that's it.

Metadata

Formal metadata

Title Resilient by Design
Series title RailsConf 2015
Part 35
Number of parts 94
Author Shah, Smit
License CC Attribution - ShareAlike 3.0 Unported:
You may use, modify, and reproduce, distribute and make publicly available the work or its content, in unchanged or modified form, for any legal and non-commercial purpose, provided that you credit the author/rights holder in the manner specified by them and that you pass on the work or its content, including in modified form, only under the terms of this license.
DOI 10.5446/30696
Publisher Confreaks, LLC
Year of publication 2015
Language English

Content metadata

Subject area Computer science
Abstract Modern distributed systems have aggressive requirements around uptime and performance; they need to face harsh realities such as a sudden rush of visitors, network issues, tangled databases and other unforeseen bugs. With so many moving parts involved even in the simplest of services, it becomes mandatory to adopt defensive patterns which guard against some of these problems and to identify anti-patterns before they trigger cascading failures across systems. This talk is for all those developers who hate getting an on-call alert at 4 AM.
