Incident Command at the Edge

Video in TIB AV-Portal: Incident Command at the Edge

Formal Metadata

CC Attribution - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Release Date

Content Metadata

Subject Area
As a content delivery network, Fastly operates a large internetwork and a global application environment. Fastly developed its Incident Command protocol, which it uses to deal with large-scale events. Lisa will cover in detail the typical struggles a company Fastly's size runs into when building around-the-clock incident operations and the things Fastly has put in place to make dealing with incidents easier and more effective. She will also cover common mistakes and lessons learned as Fastly continuously improves its Incident Management framework.
So I started out doing this kind of thing 20 years ago; this year I'm celebrating 20 years working in tech. I started out doing systems administration, what we used to call, you know, tech ops, database work and whatnot, in leadership as well as individual contributor roles. I started out at a local ISP, as many of us did, and then moved into working for LiveJournal, where I ran the network. Memcached was developed there, so I got involved in the social media scene and in scaling these websites on no money. I eventually worked at Twitter, where my main job was to kill the fail whale. Hopefully I succeeded there, because most people now don't know what I'm talking about. Then I took a year off to travel and ended up at Fastly, where I thought, why not just do it all over again?

Our capacity is growing at a crazy pace. Fastly is what we used to call a CDN, but we're now referring to it as an edge cloud platform: we're basically enabling the faster delivery of real-time data on websites over a distributed cache network. We're in 36 data centers over 5 continents, and our current capacity is about 10 terabits per second. We're quickly increasing that infrastructure, so there's never a dull moment.

So I'll talk to you today about how we handle incidents, the things that we put into place based on our experience, and some of the things that matter the most to us. I'll be talking a little bit about some tools and telling some stories, and at the end we'll have time for Q&A.

Before I get into that, I want to talk about something that happened to me. Over the winter I was on a plane to Cincinnati, and I
woke up to an image like this and someone telling me, "Lisa, wake up, there's smoke." You know, I've been on call a lot, so I think I'm pretty capable of waking up and immediately going into fix-it mode. Here, though, I was just waiting while all this went on, and I thought, well, that gives me some time to think about incident response in the wild. I was noting what was happening right then: the people who were responsible for flying the plane and serving the drinks were suddenly also responsible for my life, for landing the plane, and for keeping everybody calm. As I sat with my head down in the emergency crash position, which I realized is really there to keep us all calm so we don't look at each other and freak out, I was thinking about all the things that were in play. We landed and sped to the gate, all the other planes were redirected, we got to go straight ahead, and a fire engine greeted us there. This was all happening probably 2 or 3 minutes after I woke up; it was extremely fast. So not only were the people on my own flight acting as decentralized incident responders, they were also kicking off other incident response processes with air traffic control and on the ground. I got out not only alive but feeling really good about Delta and my experience there. Now I'm like, "Oh, I love Delta, because I didn't die." They didn't really save my life; the air conditioner was smoking and it might have been fine. But I didn't know that at the time, and that was really awesome.

So in my day-to-day job I am not
responsible for people's individual lives, which I'm really happy about. But what Fastly provides is infrastructure for critical data and information that is used in global emergencies and big events like the election. We host the New York Times, and we have weathered many DDoS attacks against critical information delivery networks. We have Twitter as a customer, along with the New York Times and The Guardian; if you're online, most likely you're hitting Fastly. So while that's not saving an individual life, our network being available constantly, and us knowing what's going on with the network, can make a big difference in your life.

These are the types of things that we deal with, the same things that you see. There's nothing different, because this is the Internet: this stuff happens every day, constantly, and we're not going to fix that. So when we start talking about how we can have an incident response process that makes sense, the first step is acknowledging that it's going to fail. Your systems will fail, you'll fail, the people you work with, your software, your data center, your network provider: these things are all going to break at some point, and that has never ceased to be true in 20 years. What matters is how you respond. It doesn't matter if it's your fault or someone else's fault; what matters is how it's impacting the customer, or the reader at home. They don't care whether it's a DDoS, a capacity problem, or the data center being on fire or flooding, which I have also been through. If anyone wants to share stories afterward, I've probably been through every bad thing that could possibly happen, or I'm about to.

So the Internet fails while you're trying to do your job. You don't need to hire a group of people who are just sitting in the back of the plane waiting for something to happen. You are all suddenly incident responders: you
yourself, the engineers, the support engineers, and the sales engineers. You're the people who need to switch from running the business to responding and making sure customers know what's happening.

When I came into Fastly, I thought about the idea of calling something an "incident". It's really common to use that word to get people's attention. An executive, maybe, is like, "Something's broken, I don't know what it is, I'm scared, and I don't know how to get people on it, so I'm going to call it an incident, and this is really urgent." That's super common. So what I like to do is talk about the impact. Yes, it feels really important to you, but how is it actually impacting customers? Yes, it's not our fault, but how is it impacting the customer? From there we can develop this framework.

We started with: we know things will fail. OK, but if one of the smaller networks or transit providers fails in a region that has 12 of them, that's one thing. If a transit provider fails in South Africa, where there are really only a couple of transit providers, that's a really big deal: now we have traffic going to London that should be going to Johannesburg. That's where we then focus on the severity. That's something that's old school; it's been around forever. People say, OK, we're all getting rid of our NOCs, there's no ops anymore, there are no servers. But stuff is still breaking, and we have to respond to it and fix it. So I kept the severity levels as a way to keep a standard vocabulary for executives, management, and engineers to all use. From there we set our expectations about how we're going to respond, how we follow up, and how we communicate with customers.

So, in specific detail, this is a version of our severities. We keep it pretty simple: 0 through 3. You hope zeros never, ever,
ever happen, and threes actually happen pretty often. We track our severity 3 incidents as well as anything that's more critical, and the reason we do that is to keep us in practice. The Sev 3s are the ones that seem like, "Oh, that didn't impact that many people, it was just this one region." Those are your practice; the Sev 3 is your drill. Maybe a light is out on the plane, and that doesn't seem like a big deal. But there's that one customer sitting there looking at it. I used to be really scared of flying, and if I saw anything broken on a plane I'd think, "That's broken, so the whole thing could be broken," not being super savvy with planes. So even if it's only impacting a few people, that's their experience too, and it's a chance to make sure your process works.

So we still do a review and a postmortem and go through the five whys. We review how we should be improving our monitoring and, of course, how we can prevent it from happening next time, which is what you always want to pursue as an engineer, but also all the other things we found, like whether we should have told our customers about it sooner. Those all get reviewed weekly, no matter what; it doesn't have to be the most important event. The higher the severity (that is, the lower the number), the bigger the impact, and we may spend more time and have a bigger group of people involved in the response, but everything goes through at least some review.

When we started this, it looked like there were a lot of incidents. When we rolled this out it was scary, because we were telling the executives, telling everyone in the company, that we were having a bunch of things happen that impact customers. However, over time we saw that the more Sev 3 events we tracked, the fewer Sev 2s and Sev 1s we experienced.
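The severity scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the talk gives the range (0 through 3), says that a lower number means a bigger impact, and says every incident gets a weekly review; the names `Severity`, `Incident`, `is_more_severe`, and `weekly_review_queue`, and the sample incidents, are hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical labels; the talk only says levels run from 0 to 3."""
    SEV0 = 0  # the one you hope never happens
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3  # small and frequent, treated as a drill


@dataclass
class Incident:
    summary: str
    severity: Severity


def is_more_severe(a: Incident, b: Incident) -> bool:
    """Lower number means bigger impact."""
    return a.severity < b.severity


def weekly_review_queue(incidents: list[Incident]) -> list[Incident]:
    """Every incident is reviewed; the most severe ones come first."""
    return sorted(incidents, key=lambda incident: incident.severity)


# Made-up sample data for illustration only.
incidents = [
    Incident("one region degraded", Severity.SEV3),
    Incident("global outage", Severity.SEV0),
    Incident("transit provider down", Severity.SEV2),
]
queue = weekly_review_queue(incidents)
```

Nothing is filtered out of the queue: severity changes the order and the size of the response, not whether a review happens.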
So what's at the core of this whole process, and of how you should be thinking about incident management at your place? It's that we are people, and we have human needs. If we treat everything that happens on the Internet as a major event, then a couple of things happen. One, nothing really seems like a big event anymore. Two, your engineers don't get sleep. Imagine if we asked the pilot to deal with everything that ever happened in the back of the plane, and also to land the plane safely and take off as well, and said, "You can sleep whenever you fix whatever's happening in the back." That's not how businesses work, and that's not how you give a good customer experience; you're constantly randomizing people. So we think about how we can actually be smarter about how we do these things.

The core of our mission, and of how we make this 24x7 operation happen, is understanding that we're staffed by people. When thinking about the framework, think about how you would handle incident response better by getting the right person at the right time, without broadcasting. Another thing about Fastly: we don't have a traditional NOC. Again, most businesses are kind of moving away from that now. So how do you have no NOC and still have the pilot know what's happening in the back of the plane, so that he or she can make decisions in the front? You empower everybody in the company to escalate and to know what's going on at any given point. So our process for incident management is super transparent. Anybody in the company can escalate through the incident process, and anyone in the company can watch in real time as we troubleshoot and mitigate incidents in our Slack channels. We have a global team of customer support folks as well as a global team of SREs, so there's no point in us having a group of people waiting just to escalate an issue when we can have it go directly to somebody who actually
knows how to fix it.

I should mention that we're also decentralized. Every service at Fastly has a development engineering group on call, in addition to SREs. That's hard to accomplish, I know, in many places. I could give a whole talk about how to accomplish that, and I won't go into it right now, but I'm happy to take questions on it. One of the ways we've made this work as a partnership is that we've given them control of their own destiny. Here's a monitoring platform; I think we probably have 7 different types of monitoring platforms. Choose the one you want to use. Understand the metrics that make your service healthy and how it impacts customers. And then you're going to be involved in the postmortem. When it comes to the incident review meeting, we go through the timelines and ask, "What happened during the 15 minutes it took to escalate?" If the answer is "we didn't have a monitor", that's part of the conversation, and we encourage the developers to improve the monitoring. This is done in a blameless way, but I think having the developers and the SREs all in the same room when we're reviewing is how we encourage this feedback, coordination, and cooperation. So we've always been empowering them to improve.

I should mention as well the main critical point that makes this all work: the position we call incident commander. It's a shared position, an on-call rotation with directors, VPs, and managers across sales engineering, customer support, and development. These are folks who volunteer, on top of their day job, to be the one critical person coordinating in the middle of an incident.

Generally, the process is that something is found to impact customers. This could come from Twitter, from our own internal alerting, or from noticing something on the Internet. We had
the S3 outage; the Internet experienced the S3 outage in October last year. That was an example of an incident that Fastly had nothing to do with. It was not breaking any of our infrastructure, but obviously it had a major impact on our customers, whose origin requests were failing going to S3. We noticed it ourselves from internal network monitoring, we noticed it from customer tickets, and we noticed it in the errors we saw from our customers' origins. We actually updated our status page 40 minutes before Amazon did, letting our customers know that they were down. That's because all of those different people at each of those different levels were empowered to be part of this process.

So this coordinator, what do they do? They generally know how Fastly works from a technical perspective; they know who is on call and how all the services work with each other. They're asking, "OK, what's the impact of this? Have we told customers?" We have a procedure for communicating with customers that gets kicked off by the incident commander. They don't have to do it themselves; they just have to kick it off. The next thing they're doing is calling in the experts through the on-call rotation and doing positive acknowledgment: "Are you the person who knows how to fix this? No? Who should the next person be?" That's actually how we were able to save a lot of time compared with the whole thing where everyone's like, "Maybe it's this, maybe it's that." What we do is time box: maybe 10 minutes to figure out what this is. If you still don't know in 10 minutes, we're escalating to the next point, to the next person. We've been able to shave a lot of time off of our response that way.

The incident commander is also allowed to call off any other deploys. When we do this, it actually puts a lock on our deployment system: no deploys, no changes, period, until this incident is over. And
that's because, as you know, in a large, flattish development organization, it's sometimes hard for developers, or for other people working on the network, to know which changes are going on right now. So: don't make the situation any worse.
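The time box and the deploy freeze described above can be sketched as follows. This is a minimal Python illustration, not Fastly's tooling: `TIME_BOX`, `should_escalate`, and `DeployGate` are hypothetical names, and the 10-minute value is taken from the "maybe 10 minutes" mentioned in the talk.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed value, from the "maybe 10 minutes" in the talk.
TIME_BOX = timedelta(minutes=10)


def should_escalate(started: datetime, now: datetime, diagnosed: bool) -> bool:
    """Escalate once the time box has elapsed without a diagnosis."""
    return not diagnosed and (now - started) >= TIME_BOX


class DeployGate:
    """Hypothetical stand-in for a lock on a deployment system."""

    def __init__(self) -> None:
        self.active_incident: Optional[str] = None

    def freeze(self, incident_id: str) -> None:
        """Called by the incident commander: no deploys, no changes."""
        self.active_incident = incident_id

    def release(self) -> None:
        """Called when the incident is declared over."""
        self.active_incident = None

    def can_deploy(self) -> bool:
        return self.active_incident is None


# Made-up timestamps and incident ID for illustration only.
start = datetime(2017, 1, 1, 12, 0)
gate = DeployGate()
gate.freeze("INC-1")
```

The point of the gate is that nobody has to know which changes are in flight: while an incident is open, the answer to "can I deploy?" is simply no.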
Then here is something that's super critical: we know that this issue is no longer impacting production, we've communicated to customers, and we're moving the dialog about all the engineering things that people love to talk about into another forum. This incident is over. You can stand down: if you need to go to sleep, if you need to eat, if you need to spend time with your family, go do those things, because the rest of the stuff is cleanup. The incident commander is also responsible for making sure the cleanup happens as well. So that's our incident commander. See, that sounds really fun: you're the commander, you're commanding.

I talked about the transparent approach that we have. I did hear quite a bit from our product, marketing, HR, and legal teams that they would like to be involved, and to me it's like, of course you're involved; that's a bunch of stuff I don't want to deal with. But from their perspective, they had never been at the table before in how we write our service advisories and what our decision-making criteria are. So we involve those parts of the organization at the executive level, and the incident commander engages with other executives to get their input on our service advisories. You're not just seeing a root cause from an engineering group; you're bringing in a perspective on things that you maybe haven't thought about before. I think that's a great way, if you're an engineer responsible for the uptime of a service, to engage the rest of the company so that it's not all on you to make sure that we're communicating appropriately. We call this crowdsourcing: we crowdsource our messaging, and it goes through a lot of reviews before we send it to customers. We go through the five whys, and we're letting customers know what I would want to know. If I were working at another company, I'd be my own customer, right? So I'm always looking at it like: I would want to
know more; what does that mean? When in doubt, we choose transparency. I think that's actually expected in the industry as well, and I think it makes customers happy.

Then there's the challenge with wide collaboration. Maybe some of you are thinking that you don't actually want everybody's opinion that often. So there's a natural tension here around whether to go ask people their opinion. The problem is that you then have to listen to their opinions and respond in a way that's not being a jerk, because if you're a jerk about someone's involvement, they're not going to come forward and help later when you need them. So part of the training and the work that we do with incident commanders is to understand, first, that the volunteers, the folks who want to give you their opinion, are coming from a good place. They care about the customer, and maybe they have information that you don't have. But in a delicate way, you have to be able to take that desire and redirect it.

For example, also in October of last year, I think, which was a horrible month, there was the DNS outage with Dyn. That was front page on CNN once it was back online; it was a really big deal. Everyone in the company, 300 people, wanted to be involved somehow, because not only was it impacting some of our internal services, it was impacting our customers. I was incident commander that day, and I had 50 people from all over the company going, "I want to help! Did you see this? They're saying this!" You've got the security side, with all the rumors about where the attack is coming from. You've got "what is the vendor actually saying about what's wrong?" And then you've got "how do we mitigate the pain for us, and how do we mitigate the pain for customers?" So we established a core team that focused on mitigating the impact to Fastly infrastructure, and then I just set up
several other groups and was like, "OK, you: you're figuring out how we'd launch our own DNS service for customers right now. If this provider never came back up again, how would we do that? That's you, that's your group, here's the lead. Go figure it out, come back in an hour, let's have a check-in time. You: you're working on documentation that we're going to be giving to customers. You're working on..." and so on. So basically we were able to get all these parallel steps in motion in case things didn't come back up in a reasonable amount of time, and they all felt happy and engaged. Luckily everything did come back, but if it hadn't, I wouldn't have been working with an already tired group of core engineers to also go tackle the next thing. I think that's a great way to use volunteers: you give them a task with a follow-up time, and then you can ignore them until you're ready to talk with them.

So we talked about the importance of going through the whole cycle every time, using your Sev 3s as your drills. Basically, this process itself is going to help you improve your overall operations. We track everything in Jira, and we do an incident report, a pretty typical timeline-focused postmortem, for every incident. We meet weekly, like I said, to review them all, and hold additional meetings if it's a larger incident. We go through a postmortem process, and this... sorry, I lost my momentum on the slide. This is very important to everything we do here. The biggest point that I brought into Fastly was this: I was watching postmortems, and at the end I'd hear, "Oh, but we got through that, and we did a good job, right?" Wait. That's part of the challenge: the attitude in the postmortem was "it's over and we got through it", instead of: OK, it's over, but when this happens again, what do we do? No blame, but let's work on what we're going to do to help prevent it, or
resolve it faster, next time. So that's why we make ourselves do this
weekly review.

So here is the actual response process in slightly more detail. I think I've covered all of this. On the incident report, one note: the incident responder, the person who is on call and actually hands-on-keyboard, and the other incident responders share the load of the timeline. The incident commander creates the timeline once the incident is over, and then we essentially work together on it. Again, it's on a wiki, and everyone can contribute their own graphs and details. That's expected to happen within 24 hours after the incident. And another
note, on the exercises we do for continuous improvement: once a quarter, or, sorry, once every 6 months, we get together and do a tabletop exercise, which is like the world's finest D&D game. You have this role, you have that role, and we walk through what would happen in such-and-such a situation. We choose a situation that hasn't happened to us yet because, as I said before, every incident is essentially a drill, so this is a drill on top of that, with something even more scary. This helps us identify issues, and we do the same postmortem for it that we would do if it were an actual, real incident.

So, in conclusion. Start with the basics: everything fails. The Internet's really crappy; it's this way all the time. Empower your engineers: don't just give them the responsibility, give them the power to be part of the process. Help them feel like what they're doing is actually helping, and check in to make sure you know when they're being martyrs or unhealthy about their contributions. Be clear about on-call schedules. If you don't have an on-call schedule, or if you have engineers who say they won't be on call, they're implicitly agreeing to be on call 24 hours a day, because the one time the software breaks and they can't be reached, they'll feel crappy about it and somebody will get in trouble somewhere in the organization. So make it clear when you expect someone to be available and when you don't; that's the fairest way to treat engineers. Always partner: realize that there are other teams that can teach you about incidents from a perspective that you don't have. And then let this process keep teaching you. That's it.
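One way to make "be clear about on-call schedules" concrete is to compute who is expected to be reachable on any given day from an explicit roster. The Python sketch below is an assumption-laden illustration, not Fastly's scheduling system: the function name `on_call`, the roster names, the fixed shift length, and the primary/secondary pairing rule are all hypothetical.

```python
from datetime import date


def on_call(roster: list[str], rotation_start: date, day: date,
            shift_days: int = 7) -> tuple[str, str]:
    """Return (primary, secondary) for `day` in a fixed-length rotation.

    Assumption for illustration: the secondary is whoever takes the
    primary role in the following shift.
    """
    if len(roster) < 2:
        raise ValueError("need at least two people for primary and secondary")
    if day < rotation_start:
        raise ValueError("day is before the rotation starts")
    shift = (day - rotation_start).days // shift_days
    primary = roster[shift % len(roster)]
    secondary = roster[(shift + 1) % len(roster)]
    return primary, secondary


# Made-up roster and dates for illustration only.
roster = ["ana", "ben", "chi"]
rotation_start = date(2017, 1, 2)
first_week = on_call(roster, rotation_start, date(2017, 1, 2))
```

Anyone not returned by the lookup for a given day is, explicitly, off call, which is the fairness point made above.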
Question from the audience: The chat tooling that you mentioned, Slack: would you
go into some detail on how you use it?

Slack, you know, works unless it doesn't. There are particular chat rooms, based on whether something is an Incident with a capital I or just a regular production event, and again we refer back to the severity. I don't have a slide for it in this presentation, but just like the severity matrix that I showed, we have a corresponding matrix for communication protocols. For each severity, it says which chat room to go into. For our major incident channel, a lot of folks turn on Slack notifications for anything that happens in there, because they're insane and they just want to hear everything happening all the time. I have it on when I'm on call, but I don't have it on at other times. The expectation is that if you're on call, you're in a few specific channels. The other thing, if you just want to stay aware, is that when we update our status page, that goes into a general announcements area, so everyone in the company can see when there's something that's impacting customers. And if Slack isn't there, we have secondary channels. In the past we've used IRC, which has bot capability, and then there's also Google Hangouts, because we were already there anyway.

Question from the audience: You talked about the incident commanders, and you said that it's on top of their day job. Is it voluntary? Is there a selection process or an interview, or is it related to where you are in the organization?

Very good question. When I started, it was only a couple of engineering VPs, who happened to be the only people who knew how to run everything at the company. So to become incident commander when I started, you had to know how to literally fix anything if it was crashing. That's not scalable. So in switching the system, we take volunteers first, but we do look toward
management, and our managers are engineers as well, so they are fairly technical. I'd say people are not volunteering if they are way on the completely non-technical side, so I don't have to turn them down. I did not deter anyone, but I did suggest to someone that maybe it wasn't a great time, because their day job was so far removed from deployments and customer interactions and stuff like that that the hurdle would have been too high. Technically, if you're really in a coordination role and you really stick to the script, you shouldn't have to know all of that. But we still have a lot of gut feeling, like "didn't that deploy go out at 4 PM?", that kind of stuff. So we typically want managers first, because of their experience level and also because of their sense of accountability. You do have to be really engaged during that period of time, and there might be two or three incidents in a week. The other thing is that if you're way too overstretched, it's not a good position for you, because you need to be available with your entire focus once there's an incident.

We do incident commander training, so we have a lot of documentation about this. We're developing checklists now, and we just hired a director to lead this whole program, because it's been successful enough to be a full-time job. She's going to be doing the documentation of the framework, video training, and in-person training for the commanders. And then the flip side is we have to train all of the incident responders, the engineers, as otherwise they don't know how to interact with the commander. So we did incident responder training for all developers, network engineers, and SREs before they go on call, and then we have to redo it about every three months, because we all forget and need to retrain. We have the links on our wiki, and people can watch them in their own time.

[unclear] usually
work at a time?

You know, right now it's a week long, like seven days, 24 hours a day, primary and secondary. I think we're going to switch to a three-day rotation, because it's too exhausting if, for example, there's an election or something like that where you're getting hit from all sides, as we were during that. [unclear]

Thank you. And what do you mean by [unclear]? [unclear]

It's everything. Depending on the experience and the focus and the specialty of the engineer hired, it's everything from the more operational side (fixing things on the front line, triage, and automation work) to response and remediation of the alerts themselves, and then working closely with the developers. So if you think about it in terms of a timeline, there are SREs embedded with the application groups who are working on how we do deployments, automating deploys, and configuration management, and ensuring that they're working side by side with the developers to understand how to deploy in our infrastructure. And then we have SREs who are more on the front line, the idea being that we're touching things that are breaking and also helping to prevent that from happening in the first place. So it's like a nice full-cycle framework.

[unclear]

Yeah, that's a great question. This process has been able to feed us an understanding of where to fill gaps in hiring. So when I started (oh, OK, I didn't mention, sorry: we categorize all the incidents as well [unclear]), a large percentage of our incidents had a human error, a misconfiguration, something like that, somewhere in them. I just searched through all the logs from the three years before I started, and that led me to know that I
needed to put people working directly on the automation of a lot of our processes, so we focused on that. Then we've gone through periods where we have more software bugs, and then we know we need to focus more on our testing, on how we deploy, and on finding errors before they are in place. So we kind of let this process guide us in where we focus next.

The other thing I should mention is that not everybody is on call at the same time. This process is really very specific: the week you're on call, yes, for that person it's interrupt-driven, but for the next ten weeks you're not. So that's kind of how we're doing it.

So the incident responders [unclear]?

Yes. It starts with the person on call: the first person responding is the subject matter expert, and if they don't know how to fix it and they need to escalate, the next person is the responder.

Yes, there is [unclear] the one in the back.

Thank you. How much do you let your customers into things like the Slack room and stuff? [unclear]

I'll let you know that I gave this talk about five times last year (it was a little different, of course; this one is specialized for you), and three of those times an incident occurred, which is so weird. [unclear] We could have been going for like two weeks with no incident, and then I get up here. I haven't checked right now; I'm assuming that we haven't had one, otherwise pagers might go off. And we've been tempted to bring up, like what happened recently: who was it that showed their channel as they were doing the data restore? Which company was it? We've been tempted to show that. I don't think I'm ready to, but that would be interesting, definitely. [unclear] Yes, I mean, if they're needed. But actually we do
have Slack channels with our customers [unclear], and then we have the IC channel that we're in, and so there's sort of [unclear] where they're relaying information. So I do hear what customers are saying in the middle of an incident; I'm getting that through Slack.

Do you also do postmortems with your customers?

No, we have not done an open postmortem. We deliver our service advisory, which usually has a timeline and our final [unclear], but we've never had them actively engage. We do get a lot of feedback, and we respond to that.

[unclear] in how it impacted customers [unclear].

That's great.

You talked about doing tabletop exercises. Can you talk a little bit about what that looks like, how you do it?

[unclear] security is the other component: we consider business impact as well as infrastructure impact with the response. If there's a security event, our security team has their own incident response, and we work together with them. The security team is actually the one currently leading these, so that's cool, because they're very focused on risk, business risk. One of them will come up with a scenario. This was super sneaky, but the one they did a couple of weeks ago was directly in response to a meeting where I said we weren't doing something yet, and they used that in our exercise. So they're like, "Now, in this scenario..." [unclear]. They'll pick something that they know we have a weakness in, and then there's this very detailed narrative. I'm not kidding when I say it's like role playing: he's the DM, and he's like, "OK, here's the scenario," and then we designate: you're the on-call [unclear], and
you're the on-call [unclear], and then we literally walk through it. "OK, what do you do next?" "Well, I'm going to ask you, as the SME: do you know the answer to this?" "Actually, I don't." "OK." And then they literally go in and look through the documentation. So it's not like answering an interview question, where you say "I would go in and look at the wiki and magically find all the information"; we actually have to do it. We did a data restore recently, and I was dismayed at how long it took us to figure out some things, and that's about how long it realistically would have taken in real time. We just walk through it, and then we write everything up just like a postmortem and file JIRA tickets for all of the gaps that need to be resolved.

[unclear]

So you mentioned that you do a notification, and if nobody acknowledges it within a certain time period it goes to the next tier. What tool do you use for that? [unclear]

You know, I rely on [unclear] the Slack remind, "in 10 minutes." Are you the one that knows how to fix it? You're like, "What do you need to get to the answer?" "I need to run it in my test environment." "How long do you think that will be?" "Ten minutes." "OK, I'll check back in ten minutes." And then you set the Slack reminder, and ten minutes later you're back. It's so important to do that: the number of times we see incidents where someone has asked a question and it's two scrolls behind. And so we use a simple, simple, simple [unclear] tool [unclear], and then if you lose your