Creating chaos in containers

Formal Metadata

Title
Creating chaos in containers
Series Title
Number of Parts
60
Author
License
CC Attribution 3.0 Unported:
You may use, modify, and reproduce, distribute, and make publicly available the work or content in unchanged or modified form for any legal purpose, provided you credit the author/rights holder in the manner specified by them.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Chaos engineering is not a new concept; it has been around since 2011. The benefit of knowing the weak spots of your application before it actually breaks is extremely valuable. But with containers, this becomes a bit more complicated. There are many layers of possible failure running under your application. In this session you will learn more about the different layers you should be releasing your chaos experiments on, the considerations you need to take into account while testing a shared platform, and also learn about the tooling available to accomplish this.

Transcript: English (automatically generated)
Do you remember where you were on the 28th of February, 2017? Probably not. How about the 25th of November, 2020?
I think you probably will if you've been using technology for a certain amount of time, because both of these days were days when we had really, really bad outages in AWS. The first one was the S3 meltdown in US East 1 in Virginia. And the story about this was the fact that the service team, which managed the service,
were trying to actually fix a bug in the billing system, completely nothing to do with the S3 service. And they deployed the change. And by mistake, with the runbook that they ran, I gather a fat-fingered number kind of took down too many servers at the same time, which caused a huge cascading
failure, not only of S3, but of a number of other things which relied on S3: the AWS console, the status board, AWS Lambda, ECS, EBS, a huge number of things which actually fell over because of that outage.
The second one was the Kinesis outage that we had in November 2020, where in this case somebody was actually trying to fix a problem in the service. And by mistake they added, instead of removing, a number of servers to the pool which
was serving the front end. And that addition had a configuration bug, which brought down the whole front end and, again, caused a cascading failure, not only of Amazon Kinesis, but also of a number of other services, such as Cognito, CloudWatch, Auto Scaling, Lambda, EventBridge, and so on, and so on.
Werner Vogels is the CTO of Amazon. And no matter how well you architect your application or how well you think you architect your application, something is going to break, always. And it's OK if it breaks because you
have to also be ready for when things don't actually go as planned. And you should be planning for those kinds of failures. So why did I bring up these two examples? The main reason is to show that everything which we think we do also has underlying dependencies which we rely on.
I guarantee you that the EC2 service team or the CloudWatch service team or the status dashboard team did not wake up on that morning thinking, if S3 goes out, something is going to break. Or they didn't even realize that there were certain implications of the underlying infrastructure
that they needed to take into account. And today I'm going to be talking about the implications of the underlying infrastructure when you run your container workloads. My name is Maish Saidel-Keesing. I'm a developer advocate working with Amazon ECS at AWS, located out of Israel. So thank you for inviting me over here to do the talk.
It was a nice trip, will be a nice trip. In my job as a developer advocate, I speak to customers. Actually, it's more listening to customers, for their feedback, to understand exactly what they want from our product. I'm embedded in the actual service team. I work very closely with the product managers and the systems engineers which actually write the code
in order to understand how I can improve or, I would say, have impact on the roadmap, to provide that feedback back into the product itself. I'm the customer's voice inside the room whenever we have these conversations. So this is a definition of chaos engineering
from the principles of chaos. Essentially, what this says is break things on purpose because it will lead to you gaining confidence in your system and making it better and having less downtime. And this is very simple.
Chaos engineering is not a new concept, but there are small nuances which I would like to address here in this talk about what you have to think about when you do this on a container platform. So let's go over the agenda for today. The first thing is what we're not going to be talking about, and we'll get to that in a second. After that, we'll talk about what health checks are and the underlying dependencies
which you could have inside your container orchestration or your container platform, and a bit of tooling. And we'll see what else we can get into if we have time. So, what we're not talking about: whether Kubernetes is better than another container orchestrator.
You can use ECS, Kubernetes, Nomad, Mesos, whatever is comfortable for you. We're pretty much, I think, at a stage in the industry today where, if you're not following the best practices of deploying across multiple availability zones for high availability, then you're in for a lot of pain in the future. You should be doing that by default. And we're not going to be talking about how
these orchestrators work. Because we're also at, I think, a state in the industry that we can kind of rely on them to do what they say they're going to do out of the box. If they say they're going to scale up an instance or scale it down, it will probably do that because we've got to a stage in technology where it actually works.
So let's talk about health checks. A health check is very simply something which asks an application, or a piece of code, or whatever else it is, a very simple question: is my app healthy, or is it not healthy? It has one simple job, to provide a yes or no answer. The question is, and this is what we're going
to dive into further, what is that question? What do I need to check on my application? Usually we run this kind of health check on a load balancer. Pretty simply, on a load balancer, if you're using AWS, you would say: is my port available, do I get a URL, et cetera. You can get into more detail, as we will further on.
So, in AWS we kind of divide these into a number of different groups. The first one is a liveness check. This is the most simple one of all of them. A liveness check is there to understand the basic functionality of the service. Usually this can be done in Docker:
if I have a port available, if I have a URL providing me a certain response code. These kinds of things are done usually with a monitoring agent or a load balancer which does those checks for you, and there's not very much that the developer actually needs to understand about how to do these. This is usually part of the platform.
Actually, in this case it would be part of Docker, so there's nothing you really, really need to do over here. The second level are what we call local health checks. This is slightly more, I would say, advanced, in the fact that the actual application is aware, in its own bubble, of everything that it needs. Does it have a file system? Does it have access to the necessary components?
Am I running the correct things? It's above and beyond a simple Docker health check. And this is something which you do need your developers to actually have, I would say, knowledge of. This is something you actually have to think about a little bit more. So if we take, for example, an NGINX proxy,
a local health check will define, for example, that I actually have access to the web server. So the previous one, which we said was giving a 200 response code if NGINX was running, would actually pretty much be a good local health check: NGINX is giving me a 200 response, and it means I've actually got access to what I need on my local machine.
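(As an illustrative sketch only, not something shown in the talk: a minimal local health check endpoint along these lines. The /healthz path, the port, and the checked directory are assumptions for the example.)

```python
# Minimal sketch of a "local" health check endpoint (hypothetical names/paths).
# It answers the one question a health check needs to answer: healthy, yes or no.
from http.server import BaseHTTPRequestHandler, HTTPServer
import os

REQUIRED_DIR = "/var/app/data"  # assumed local dependency for the example


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":
            self.send_response(404)
            self.end_headers()
            return
        # Local check: is everything inside my own "bubble" in place?
        healthy = os.path.isdir(REQUIRED_DIR) and os.access(REQUIRED_DIR, os.R_OK)
        self.send_response(200 if healthy else 503)
        self.end_headers()
        self.wfile.write(b"ok" if healthy else b"unhealthy")


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```

A liveness probe or load balancer check can then simply hit this endpoint and treat anything other than a 200 as unhealthy.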
The third one is what we call dependency health checks. This is where each application understands whether the underlying resources that it needs to access are available. These are more what we call sophisticated health checks, where they allow you to check, for example: can I access my database? Do I have the right credentials to access that database?
And if I don't, I will report a failure. The problem with these kinds of health checks is that they can very, very often give you a false positive, because there's absolutely nothing wrong with your container. There's something wrong with the underlying infrastructure in the background: the database is not ready, or the network is down, or I have an outage in S3
which brought down the whole world, which means I can't access anything. Your container's actually running, the process is there. So when you do these kinds of dependency health checks, you have to be very, very careful about what is a false positive, because if, for example, I keep on restarting my container on a regular basis because my database is not available,
that will not solve my problem. So think of these health checks as dependencies, but also remember that you have underlying dependencies as well, and they could provide false positives, so you also need to check those as well to see where the correlation between these different problems is.
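(Again only an illustrative sketch, not from the talk: a dependency check that reports a degraded dependency instead of failing the container outright, to avoid the restart loop described above. The database host and port are made up.)

```python
# Sketch of a dependency health check (hypothetical host/port), careful not to
# turn a downstream outage into an endless restart loop of a healthy container.
import socket

DB_HOST, DB_PORT = "db.internal.example", 5432  # assumed database endpoint


def dependency_status(timeout_s: float = 2.0) -> dict:
    try:
        with socket.create_connection((DB_HOST, DB_PORT), timeout=timeout_s):
            return {"database": "ok"}
    except OSError as err:
        # Report "degraded" rather than failing the liveness check outright:
        # the container itself is fine, the dependency is not.
        return {"database": f"degraded: {err}"}


if __name__ == "__main__":
    print(dependency_status())
```

Whether a failed dependency should actually fail the health check, or just be surfaced for correlation, is exactly the judgment call about false positives the talk warns about.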
The last kind of health check is anomaly detection. When we're talking about microservices, containerized systems, we usually talk about a decent number of pods, not one or two. We can get to hundreds, thousands, hundreds of thousands, hopefully not, but that's okay. I don't think Kubernetes supports hundreds of thousands of pods on one cluster, and neither does ECS, but that's okay.
Not everything will be the same, and you should be testing your underlying services against what your norm is. In other words, I have an application that should be doing X, Y, and Z in a certain amount of time. If I have one which is not throwing errors but is responding slowly, I would like to know about that.
I want to understand, because that can affect my application. I can get erratic behavior by keeping that specific container or pod or availability zone in service when it's not acting or behaving correctly. An example of this could be that something happened and the clock is skewed on the specific host that you are deployed on,
and all your logs are completely out of whack. Or it could be that it's deployed with old code because it didn't get an update, also providing inconsistent results.
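(A rough sketch of that idea, not from the talk: flag the pod or task that deviates from the fleet's norm. The latency numbers and pod names are invented.)

```python
# Sketch: flag pods whose latency is far from the fleet's norm (assumed data).
from statistics import median

latencies_ms = {  # hypothetical p99 latencies per pod/task
    "pod-a": 42, "pod-b": 45, "pod-c": 44, "pod-d": 41,
    "pod-e": 310, "pod-f": 43, "pod-g": 46, "pod-h": 40,
}


def misbehaving(samples: dict, factor: float = 3.0) -> list:
    # Compare each pod against the median of the fleet, which is the "norm".
    norm = median(samples.values())
    return [name for name, value in samples.items() if value > factor * norm]


print(misbehaving(latencies_ms))  # ['pod-e'] stands out from the norm
```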
So, we talked a little bit about the health checks. Let's talk a little bit about the dependencies. I used this graphic because it reminded me of the movie The Usual Suspects, I don't know if you've seen it. Keyser Söze was awesome, and these are, as I'm saying, the usual suspects when we're talking about the underlying dependencies. You can pretty much guess what they are: CPU, memory, network, and that's about it, pretty much.
These are the things which are underlying dependencies when you're running containers. Yes, there are things like security and specific hardware, but in general we're talking about CPU, memory, and network. So, let's see what we can do when we try to test for chaos engineering and these kinds of experiments which we wanna run.
So, for CPU, as I said, I'm assuming that the orchestrator, be it Kubernetes or ECS, and by the way, anything that I'm saying to you today is non-platform specific. I don't really care where you're running your containers today. It will work, or the information at least you can use is exactly the same for any one of the others.
So, I'm assuming that when I ask my orchestrator to give me a one-virtual-CPU pod, it will give me that one virtual CPU. I don't necessarily have to start testing whether I got the full amount of that CPU, and I can also assume that the Linux kernel, or Windows, which I don't really do much anymore, and I hope for your benefit
that you don't do it too much anymore either, will slice that CPU based on the amount that I get. So, in other words, it will ensure that I got my one virtual CPU out of the four on the physical host or virtual machine, whatever it may be, and I don't have to worry about that too much. But what will happen if I start loading the CPU inside my actual application?
Start chewing up CPU cycles which the actual application in my container should be trying to use? How will it behave? Will I be able to continue to write logs? Will I continue to be able to provide service to my customers based on what I'm supposed to do? If you run an experiment which will load test the CPU inside your container,
I assume you will learn new things. Besides the fact, of course, that the orchestrator will probably scale everything up because it will recognize the CPU spike, but still, how will your application behave? Something that you should actually look at.
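(For illustration, not from the talk: a crude CPU-stress experiment along these lines, assuming you have planned the runbook and blast radius first.)

```python
# Sketch of a crude CPU-stress experiment inside a container: burn cycles on
# every core for a fixed duration and watch how the app (and autoscaling) react.
import multiprocessing
import time


def burn(seconds: int) -> None:
    end = time.time() + seconds
    while time.time() < end:
        _ = 123456789 ** 2  # busy work, no sleep


if __name__ == "__main__":
    duration = 60  # seconds; keep experiments short and planned (runbook first)
    workers = [multiprocessing.Process(target=burn, args=(duration,))
               for _ in range(multiprocessing.cpu_count())]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```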
And again, I'm assuming that we can also rely on the orchestrator: when I ask for a certain amount of memory, I'll get a certain amount of memory. It will be separated from the other pods or containers running inside my cluster. But without enough memory, you're pretty much nothing. I do not remember my daughters' cell phone numbers. I have them on speed dial in my phone. I don't know the number, honestly not, but I don't need to, because I have something
to remember it for me. But sometimes my phone starts to lag because I have too many applications, and the thing goes slowly, and then I can't get hold of my daughter, and it takes time, and I get frustrated. So with memory, when you start chewing up memory, and this is starting to annoy me, sorry, one second.
So, sorry about that. When your memory starts running out, either because of a memory leak inside your application or because of some kind of memory issue running inside the container, whatever it may be, things are gonna start going wrong.
What they will be, I'm gonna need you to find out in your specific application: will your logs work? Will your application crash? Will it start responding slowly? Will it cause any kind of issues on my downstream services? Running experiments like this will teach you a lot.
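(Again for illustration only, not from the talk: a simple memory-stress sketch. The chunk size and ceiling are arbitrary, and in a container this may well end with the OOM killer, which is part of what you want to observe.)

```python
# Sketch of a memory-stress experiment: allocate memory step by step and observe
# whether the app logs, slows down, or gets OOM-killed. Numbers are arbitrary.
import time

CHUNK_MB = 50
MAX_MB = 1024          # stay below a limit you chose on purpose in the runbook
hog = []

for allocated in range(0, MAX_MB, CHUNK_MB):
    hog.append(bytearray(CHUNK_MB * 1024 * 1024))  # hold references so nothing is freed
    print(f"allocated roughly {allocated + CHUNK_MB} MB")
    time.sleep(1)  # give monitoring a chance to record the climb
```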
My favorite is the network, and yes, it is always the fault of the network, unless it's DNS, or the network. But, as opposed to CPU and memory, where we talked about the fact that I can rely on my underlying infrastructure to give me what I want, that is not the case with your container's network traffic. There's nothing today in any of the orchestrators
that can guarantee you network bandwidth for a certain pod or a certain container. Which means you can have a noisy neighbor: if something is chewing up too much network on that physical host, then something is gonna go wrong with everything else running on there. Running experiments, for example injecting latency into your applications, will help you to understand what's gonna happen.
Will your applications continue to work? Will they fail? Will they provide errors? Will they start retrying? If retrying, after how long? Can they cache your requests? These are the kinds of things where, when you run these kinds of experiments, you should actually not only run them in the actual containers themselves, but also in the physical infrastructure which is underneath, on the EC2 hosts or the virtual machine hosts, whichever cloud you're running them on.
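(An illustrative sketch, not from the talk: injecting latency with the Linux tc netem queueing discipline. It needs root or NET_ADMIN, the interface name and delay are assumptions, and the same qdisc can also add packet loss.)

```python
# Sketch: add (and later remove) artificial latency on an interface using the
# Linux "tc netem" qdisc. Needs root / NET_ADMIN; "eth0" and 100ms are assumptions.
import subprocess

IFACE = "eth0"


def add_latency(delay_ms: int = 100) -> None:
    subprocess.run(["tc", "qdisc", "add", "dev", IFACE, "root",
                    "netem", "delay", f"{delay_ms}ms"], check=True)


def clear() -> None:
    subprocess.run(["tc", "qdisc", "del", "dev", IFACE, "root", "netem"],
                   check=True)


if __name__ == "__main__":
    add_latency(100)   # netem also supports e.g. "loss 5%" for packet-loss experiments
    try:
        input("Latency active; press Enter to roll back...")
    finally:
        clear()
```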
I used to crimp my own cables once upon a time, until I found out that it was too much of a pain in the neck, and sometimes I didn't do it properly, and I would rather have just bought a ready-made network cable. So can network cables disconnect? There's nothing you can do.
There are physical cables running eventually somewhere. They can get damaged. Somebody can step on one, unfortunately. Bits can flip on a network card somehow, once in a million, 10 million, 100 million packets going through, and when you get to very large scale, that can happen once every 10 minutes.
So what happens when you start having network packet loss, when traffic doesn't go through correctly? Does your application know how to recover properly? Does your application know how to queue up those requests? Do I start causing a cascading failure on the underlying applications beneath me? Things you should actually check.
And the last one is a black hole. In other words, when, for example, something falls off the network because of an outage of infrastructure or of a host, whatever it may be, it still thinks it's running, and it will probably continue running until it comes back, and the orchestrator will hopefully recover it. But what happens when that happens? Am I providing stale information,
injecting stale information into my applications? That could cause errors. You can test that by black-holing a specific pod, or black-holing a specific instance, to understand where this goes.
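(Another illustrative sketch, not from the talk: black-holing a single dependency with iptables. It needs root, the target IP is made up, and you should have the rollback ready before you start.)

```python
# Sketch: "black-hole" a single downstream dependency by dropping traffic to it
# with iptables. Needs root; the target IP is a made-up example.
import subprocess

TARGET_IP = "10.0.0.5"  # hypothetical dependency you want to make unreachable
RULE = ["OUTPUT", "-d", TARGET_IP, "-j", "DROP"]


def start_blackhole() -> None:
    subprocess.run(["iptables", "-A", *RULE], check=True)


def stop_blackhole() -> None:
    subprocess.run(["iptables", "-D", *RULE], check=True)


if __name__ == "__main__":
    start_blackhole()
    try:
        input("Black hole active; press Enter to restore connectivity...")
    finally:
        stop_blackhole()
```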
So, we talked about different kinds of tests, different kinds of things you need to look into. Where do you go from here? In Amazon, we have something which we like to call the flywheel. Every team has their own flywheel for their own service, for their own organization, for their own whatever. And the flywheel for uptime is something
which looks like this, and I'm gonna start up there with the prepare. If there is one thing that you take from this talk: please do not run chaos experiments unless you know what you're doing and what it's gonna cause. The preparation stage is running through some kind of a scenario: I'm going to do X, Y, and Z,
and I expect this to happen. If I do that, what is gonna happen to my application? Should I run it in production? Should I run it in a testing or a development environment? What am I actually gonna do as my experiment? That is the first thing you get: a runbook. I'm gonna run this, this is what I'm gonna do, and that's what I expect to happen.
The next part, of course, is the detect. The detect is: you run that experiment and you validate that what you expected to happen actually happened. Did I get the correct alarms? Did I see the right logs? Did the right people get involved in the call which was spun up at three o'clock in the morning because we had the outage? Did the correct people get paged?
These are things that you have to understand. Did I know what was going on? Could I see what was happening? Did I understand what was happening? The next part is the respond. The respond is: you understand, okay, I know what happened, these were the steps that I took in order to resolve the outage. In my case, it could be, for example,
that Kubernetes or ECS automatically scaled up and handled the outage because I had a CPU spike. That could also be very valid, and that could be fine, as long as you understood beforehand that that's what was supposed to happen, and that's good. But it also could be that I needed to push a new configuration change, deploy it to all my hosts or instances in the cloud, and hopefully everything was all fine and dandy.
The stage after that is the learning stage. Everything is back to normal, everything is back to where it should be, your customers are no longer impacted, and now we do something like a retrospective, or the way we like to call it in AWS, a COE, a correction of errors process, and we'll go into what exactly that process is in a few minutes.
And from that process we learn what we were missing, what we can improve, how we can improve our deployments, our code, our people, our processes, the information we need to understand in order to react better. And the whole benefit of this whole cycle,
and of doing it continuously on a regular basis, is to reduce the number of outages and reduce the wear and tear on people waking up in the middle of the night to fix your applications, because that's actually the worst problem of everything. It could be that, as part of your learning process, you need to decouple services, change your services, rewrite code, or it could be that, for example, you just need to make a configuration change
and push it to Git. It could be as simple as that. It will depend specifically on your use case, your scenario, and what you need to learn. The tooling you can use for this: there's Chaos Monkey from Netflix; Litmus Chaos, which is an open source tool; there's Gremlin, which is a third-party commercial tool, a partner of AWS, if you would like,
that can run this on multiple applications, platforms, different services, different technologies that you would like; and there's also, last but not least, Amazon Fault Injection Simulator, actually it's AWS Fault Injection Simulator, I apologize, the naming is wrong but I'll fix it in the slide afterwards, which is chaos-as-a-service, if you would like,
inside AWS, where you can target specific things safely, controlled in a certain way. You can also use, for example, circuit breakers, which allow you to say: if I trigger a certain alarm, automatically roll back the test and stop it, back to the previous state that it was in. There are a number of configurations that allow you
to do these kinds of things, and it is, for example, as granular as you would like it to be. You can expose these chaos tests to a developer-specific environment inside your account, or to production, whichever you would like, so not everybody gets access to everything, and you provide a secure platform for your applications.
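(For illustration, not from the talk: one way to kick off a pre-defined AWS Fault Injection Simulator experiment from Python with boto3. The template ID is a placeholder; the template itself would hold the targets, actions, and the alarm-based stop conditions mentioned above, and the exact API fields should be checked against the boto3 documentation.)

```python
# Sketch: start a pre-defined AWS Fault Injection Simulator experiment with boto3.
# The template ID is a placeholder; the template holds the targets, actions,
# and CloudWatch-alarm stop conditions (the "circuit breakers").
import uuid
import boto3

fis = boto3.client("fis")

response = fis.start_experiment(
    clientToken=str(uuid.uuid4()),                 # idempotency token
    experimentTemplateId="EXTxxxxxxxxxxxxxxxx",    # hypothetical template ID
    tags={"owner": "chaos-team"},
)
print(response["experiment"]["id"], response["experiment"]["state"]["status"])
```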
So what is a COE? Correction of errors. Every week on Wednesday at nine o'clock Pacific time, we have a two-hour meeting with pretty much all of the senior people within AWS: all the principal engineers,
the VP of infrastructure, and so on and so on. It's a two-hour meeting every single week, and it's, let's say, not quite the holy grail, but it's something which is never, ever canceled, unless it's Christmas week or something else falls on a holiday, but usually never. And in this meeting we go over, pretty much
on a regular basis, any outage which had customer impact across the complete Amazon portfolio. It can be the AWS part, it can also be other things as well. And the idea of a COE is a mechanism. It's something which we have,
a muscle which, in AWS, we have learned to use very, very well on a regular basis, that allows us to essentially learn from our previous mistakes. We have this meeting where everybody sits and reads a document, and that is what the structure of a COE actually looks like,
but the idea is, as I said, outages will and always do happen, no matter how well you plan for them. The question is what you do when they happen, and how you learn to make things better and improve your operational processes and your code and your monitoring and everything else. This is what the COE is for.
It is not a blame game. It's not to point fingers at a certain person or a certain team or a certain engineer that did something, because that's not the purpose of a COE, or a retrospective, however you would like to call it. And it's also not a punishment. Nobody gets, I would say, looked down upon, or gets any kind of bad vibes because they did something,
because that's not the way it works. Specifically in Amazon we have two-pizza teams, if you've heard the concept before, where essentially the size of a team shouldn't be more than you can feed with two pizzas, which is about six or seven people, and every time you get to a team
which is greater than that, they split it up, because it's the smallest size of service you can actually handle in a certain way. And it is a collective responsibility. It's not somebody's fault; it is part of the team, and we learn from it. When would you run through this kind of process for a COE?
So when you had an outage, of course, when it's customer facing and impacting, you learn, because you would like to understand what happened, what kind of impact it had, what it did to the business, what it did to our people working in the background. And above and beyond that, when, for example, in the test case which we ran before with the flywheel, in the prepare stage we expected something completely different to happen,
and we understood when we ran that test that it brought down the whole system because we weren't ready for it. That would be something for which you would run a COE, because you should have understood that. What did we miss? Or the use case where we tried to do something and we completely tested the wrong thing, and an outage occurred
because we didn't test the specific use case. And the last one, of course, is anything that we can improve on. For example, what we call wins, where we actually took an action item in the service team to improve something. For example, replacing a certain version of Java
in all of our components with a newer version meant that we reduced the latency by 40%. And of course it's backed by numbers, and it's something which anybody besides that one service team that did this can also learn from. It will improve things across multiple, multiple services. So those kinds of opportunities are things
which we wanna do and share across the organization. So what does a COE look like? The first part is the supporting information, what we call the what happened. What happened usually has the impact, and the impact is not the fact that, for example, if we take the NGINX example from before, my NGINX service stopped.
The impact is that, for example, 35,000 customers could not push the purchase now button because my web service was not working. That is the impact. So you understand, also from the monetary side of the business, what actually happened. It has, of course, to be backed by a timeline. When did the outage occur?
When did we recognize that it occurred? And the timeline of what actually happened afterwards: we found the logs, we saw the outage, we identified what the problem was. And all of these, of course, should be backed by metrics, which hopefully you got based on your chaos experiments and the flywheel that you continue to run to understand your services.
Metrics you'll be able to see, because a picture is worth a thousand words no matter how much you try to describe it, and if you see a drop from 70,000 requests to zero, you'll understand that there was actually a big problem. And the last one is the incident questions: what did we actually do to find the issue? We dove into logs, we got the logs from somewhere,
we got a page, somebody was called onto the on-call, and we started digging in on those questions of how we defined, how we understood what the issue was and how we handled it. The second part is the corrections; this is the learning and the action part. Here we have the five whys.
Anybody know what the five whys are? For those of you who don't, the five whys is kind of a mechanism for you to understand the actual underlying causes, and I won't say root cause, because there is never really one root cause of a problem, so the underlying reasons for what happened. So if, for example, I take my NGINX service:
why did the NGINX service stop? Because the configuration that was pushed to my kube config or my ECS task was incorrect. And why was an incorrect configuration pushed? Because nobody actually did a two-peer review on the code which was pushed. And why didn't they do a two-peer review on the code which was pushed?
Because they'd already been fighting sev-1 problems for the past week and nobody had had any sleep for the past three days. And why didn't they do that? Because there weren't enough people in my team. Continue asking those why questions, one after another, not necessarily five, it can be more,
it can be less, until you get to a proper answer of what the actual reason was causing this problem underneath. Because it wasn't the fact that the container crashed or your NGINX no longer responded; it was the number of underlying things that you need to fix, relatively soon, because if not it's gonna happen again and again and again, in order for you to understand
what the underlying cause was. Action items: as we said, we have a single-threaded owner of the specific COE, who is the owner of what we need to fix. We need to, for example, make sure, if we go back to my NGINX example, that every piece of code goes through an enforced two-peer code review, that nobody has an on-call rotation
of more than seven hours, that nobody is overworked, whatever. These action items, of course, can also be physical things, or they can be things that you implement as code. For example, I put in a policy which means that nobody can actually do this by mistake anymore and automatically approve it. And all these action items, of course,
have to have a date with a relatively, I would say, reasonable time frame, not next year when we get to it after our backlog and technical debt and everything else is done. Yes, these things have to be improved in order for you to make your processes better. The last thing which you should actually write is the summary. The summary is, after you've got all this information,
an executive summary of approximately one paragraph: what happened, how we understood it, what the impact was, and what we're gonna do to fix it. Okay, I'm gonna leave you with four links. If you wanna capture the QR codes,
I do know that the presentation will be available afterwards on the site so you can download them. But anyway, the first one is an article from the AWS Builders' Library. These are in-depth white papers where our principal engineers share, I would say, the philosophy or the methodology
of how we do things in AWS. This one specifically is how to create proper health checks; most of the information in this presentation today came from there. The second one is a self-paced lab, if you would like to do something on AWS to implement these kinds of things. We have something which is called the Well-Architected Reliability Lab,
where you can go through the process, run these health checks, and understand what these kinds of things are doing. The third one is the product page for AWS Fault Injection Simulator, and as of, what is the date today, the 18th? A week and a half ago, all of these things became available for testing in Fault Injection Simulator, both on EKS and ECS,
if you would like to test these specific things: network black hole, latency, packet loss, CPU, memory, and I/O stressing. All of those things you can actually do today inside all the container platforms in AWS, if you would like. And the last one is a link to the Principles of Chaos website, which I used in the beginning.
The last thing before we all go for a beer and have a nice cold drink, because it's very warm here: the QR code. In AWS and Amazon, we are what we call customer obsessed, and I love to work backwards from our customers, and that means that that small QR code is a two-question survey. So if you would please do me the honor of filling it out, it's 100% anonymous,
just to understand your feedback on the session, if it was useful, if it was beneficial for you, if there's anything you would like to see done differently or that I could change or improve on, I would really, really appreciate your feedback. And I'm on Twitter, DMs are open, and my email is over there on the screen as well.
Before we go for a beer, I'm gonna give anybody a chance to ask if they have any questions. Thanks. You talked about continuous improvement,
about detecting, learning and responding in chaos engineering, and these continuous concepts are already known in DevOps engineering. So my question is: do you have an opinion on how chaos engineers and DevOps engineers
should work together, and how they can integrate those kinds of workflows that we already see in chaos engineering into the workflows of DevOps engineering? I will repeat the question just for the benefit of those who didn't hear: how can you integrate these kinds of concepts
of chaos engineering and testing and continuous improvement into your, I would say, DevOps practice or your continuous development practice? Oh, there's air. I would think that would be part of the flywheel, and as soon as you continuously do this on a regular basis
and not once a month or a quarter or a year, in order to find out where the problems are, and you implement these kinds of practices, for example, up to, I would say, a level where you can do it for every push to Git, for every one of your commits, in other words you can run these tests to see if you've broken anything in your environment, that would be the best, the ideal,
for implementing these kinds of things in your DevOps and development practices. It's not easy, I can guarantee you it's not easy. The ideal, the utopia, the Valhalla which we would like to get to is to have this as part of our development and CI/CD pipeline, so that every single push goes through some kind of chaos testing, to see if we've introduced
any kinds of bugs as well. Yeah. When you mentioned monitoring, you also talked about anomalies. What do you do about false positive monitors, which happen quite regularly, at least on our platform, and we just kind of adapt to it if a monitor pops up
a lot of the time and we know nine out of 10 times it's a false positive. Do you have any policy for those monitors which can't be fine-tuned enough? If I rephrase your question: what do we do with all these alerts which are not really, they're not interesting to us
and they don't really give us enough information? I think the answer is in the question as I rephrased it: try and make them more interesting. If it's something which is giving you information you don't really care about, it's not useful. So, for example, I don't necessarily need to monitor the CPU inside my container, because if it does spike, I'll probably get
that information from the orchestrator, which starts scaling up. Yes, for these outliers, as we said, there are certain things you need to do, monitor, and understand; it's more, I would think, for the performance aspect. If I have 100 tasks or pods running inside the cloud and one of them is really misbehaving,
it's relatively simple to understand where it differs from the norm, based on your metrics and your graphs. But for false positives, the only thing I can tell you is: try to get better information and find out exactly what you're looking for, because if the information is not useful, it's a waste of resources, time, and alerts,
which is going to cause, I would say, fatigue and ambivalence towards whatever's supposed to be happening on the platform. Any more questions? Yeah. This is kind of a slight follow-on to that one. Every single anomaly detector I've used
has been utterly 100% accurate at alerting people on bank holidays, and there have been more of those than actual outages. Have you found any way of mitigating against that? Because every low-traffic alarm, I guarantee, goes off at 9 a.m. when everyone's trying to have a lie-in on a bank holiday morning. Yeah, do I have a good answer
for anomaly detection which actually wakes you up on the wrong side of the bed in the morning? No, I don't. It's writing better tests, writing better alarms, and trying to understand or filter through the noise, which shouldn't necessarily be something which is waking you up.
That might not be a good idea, and you won't hear that as a recommendation from me, but that is one way of doing it. Awesome. Thank you very much for your time. The beer's waiting. Enjoy the rest of the conference.