I don't like Mondays-what I learned about data engineering after 2 years on call

Automated media analysis

Speech transcript
I am Daniele, and I work for Not On The High Street. Not On The High Street is a marketplace for small creative businesses. We're in Richmond, which is one of the best parts of London: incredible parks and riverside [...], and it doesn't feel like London at all, it is like a country village [...] and some of us care a lot about the [...].

This is a talk about being on call: getting a pager, being available 24/7. If you're a DevOps or a developer you might already know something about this topic. If you're a data engineer, I want to make a case that you should care more about this topic. But I will also talk about a TV show, and there is an embarrassingly high number of pictures of my cats.

It's a Friday afternoon talk, so I want to warm you up by asking you questions, and you raise your hand. Who is on call right now? Who was on call in the last month? Who was actually called in the last month? OK. [...]

What is on the menu for you today? We start with the
definition, then we move on to some advice on what to do during and after the incident, then we have a look at prevention, and then we move a bit further, with a broader view, and we talk about motivations and alerting practices. And in the spirit of transparency I have a couple of stories of failure; luckily enough, one is from last week.

So some of you already know what being on call means: the phone (it used to be a pager in the eighties) rings, and you need to go and fix the system because it's broken. Another way to think about it is that you have something that is with you at all times, and you need to care about
it, but it secretly irritates you, and it will demand your attention at the most inopportune times; and when it wants your attention, you have to give it immediately.

But being on call is also about knowledge, about being the first person who will act on a certain system in case of a problem. So you get to know an entire system, not just the parts you developed. And you will also get some rewards, because you deserve them: rewards like being woken up in the middle of the night.
So, last Thursday my phone rang at about 1 AM. Before you get to your computer you want to make sure you are awake, fully awake, so make some coffee [...]. Make the transition from your sleeping self, or your working self if it is during office hours, to your incident self. You don't want to be the guy who is developing in one window and fixing production in the other window: the incident deserves your full attention.

The first thing you do is read the alert that woke you up. Really read it, at least a couple of times, possibly more. Once you have read it, you probably know where to look: which system broke, where can I find the error logs, where can I find the monitoring data? You want to gather as much information as you can, until you can be sure of why this alert triggered: what did wake me up?

At this point you probably have enough information to assess the impact. What is going to happen because of this problem? Who is not going to be able to do their job? Who is not going to be able to know something they want to know? And be nice: if people are impacted by the problem, inform them. Most websites have status pages; for internal systems you probably have an e-mail or a chat room.

There is a question you might find yourself asking a lot, which is "why?": what is the real, deep cause of this problem? This is not the right time to answer it. If you dive in too deep with your developer mind to try to find the real root cause of the problem, you will spend a possibly very long amount of time, and you cannot quantify it in advance. So don't do it; it is not productive at this time.

As you find out information, and as you start acting on the system, log what you're doing. In this case it is a chat log, but you can also just open a blank editor [...] and start typing out what
is happening. This is valuable information, especially if you're preparing a talk on being on call.

Once you have enough information, you can probably come up with one action, or a few actions, that will limit the impact as much as possible. You probably have to squint a bit because it's barely readable here, but you will find out that I did not follow my own advice: last week I ended up forking a library on GitHub and trying to patch it. It did not solve absolutely anything. That's because instead of focusing on taking the smallest useful action, I started asking myself "why?", and that was not useful at the time. The action I should have taken, which we only took in the morning, was just a small intervention [...]: we wouldn't have known on the day after what the sources of our traffic were, but at least we would have had all the rest of the data in time.

Once you have taken enough steps to limit
the impact, again, be nice: inform the people who are affected. And then get back to sleep, because you want to be fresh the next day.

The next day you wake up, and the first thought in your mind is: this was pretty stressful, I don't want it to happen ever again. At this point you can really ask yourself why this ever occurred, and the best way to do it is an RCA (root cause analysis). There are best practices and an extensive literature on RCAs, so I'm not going to dive too deep; it is just on the slide. You want to put your detective hat on, gather all the information on what actually happened during the incident and before the incident, find the root causes, and be sure to leave enough time at the end to decide on some actions that will remediate those root causes.

It is very easy in this phase to try to blame someone. Don't do it. I'm going to tell you a story. It's about a nurse in a children's hospital. She gave the wrong drug to a little child, and the child almost died. An inquiry was opened, and there was a proposal to fire the nurse. But then the commission of inquiry dug a bit deeper, and they found that the drug she should have administered and
the drug she actually administered were next to each other in the same cabinet, and they had similar labels. They also found out that she had been working 10 hours straight, and there was nobody to double-check what medicines or drugs she was administering. So don't allow yourself to focus on the fault of one person; always look at the context.

From an RCA you usually get some useful lessons for the future. You are probably already familiar with this one: be very careful where your systems hook up to a third party, because communication is more sparse and more easily ignored. And watch out for points of friction in internal communication as well.

The root cause of last week's failure was that the GA (Google Analytics) reporting API has a "time on site" field, and they renamed it to "session duration". The old name was deprecated in 2014, but they actually started enforcing the deprecation, and failing on API calls, last week.

Another insight is to really care about your alert messages: keep them up to date, and make sure they include everything that can help you during an incident, such as checklists, lessons learned and encouragement.

So we took three actions. We fixed the root cause of the problem: we renamed "time on site" to "session duration". We scheduled some time to go through all the GA fields that we're using and check whether any other of them is deprecated. And we also included in the alert messages specific suggestions not to do what I did: just skip over that step, don't try to dive into the causes too much. So let's take a step
back to 2015 and take a broader view. Downton Abbey is a TV show about a wealthy, aristocratic British family in the first half of the 20th century. It is wonderfully acted; the scenery, the costumes and the settings are amazing. And it's really on target for Not On The High Street, as it is very British. It's so much on target that when we placed a TV advert during the first episode of the season, on a Sunday evening in 2015, we saw so much interest and so much additional traffic that the site went down. And then it went down again when the same advert aired on the +1 channel.

On one side I was sort of lucky, because I was not directly involved with the consumer site. But on the other side, the situation of the data infrastructure, which I look after, was a lot worse. We had basically had no data since the Saturday morning, because the replica of the production database we used to read from was failing. We were in the process of migrating between hosting providers, and there hadn't been enough communication on when the replication between environments would stop. And the only person who could really fix all of this mess was our databases and networking expert [...], and he was on a plane back from Russia. When he landed, late on Sunday evening, his phone rang so many times that by the time he got home his battery had run down.

So we had two incidents at the same time: one on the consumer website, and one on the data infrastructure that we would have used to evaluate the impact of the consumer website incident. And it was a Sunday evening. The next day, Monday, is when the data infrastructure is most used, because it's both the busiest trading day and also the day
where we plan for the week ahead. I do not like Mondays, and this was not a normal Monday.

At the time we were a lot more inexperienced than we are today, and we learned a lot from these events. We grew as an organization and as a team, and we changed. So let's talk a bit about the changes we made, but first the answer to this question.
Well, you look at the consequences of an error. Who is affected by the site or the data being unavailable or wrong? Do they depend on your service? How much would it cost? How much time can they wait for the information?

In 2015 we were realizing that our co-workers and colleagues were increasingly dependent on our data infrastructure, especially for decision making. If you have a public-facing service, a public-facing website, you probably want to consider some kind of on-call policy, because you cannot control how much external people depend on your service. You might also have contracts in place; your revenue might depend on external services. In 2016 we started offering our partners, the people who sell on Not On The High Street, access to a rich dashboard with sales figures and product performance. At that point we had no excuse not to have an on-call policy.

But even if you just run an internal service you should consider on-call, because you want your co-workers to spend less time, and worry less, checking and double-checking that the services are available. If they spend less time doing that, they will benefit in their daily work.

If you take a step back to Downton Abbey, and you sort of know the characters, it is not just about keeping [...]; it is about being the best you can be at your job. You might get increased stress, a little less control over your priorities and a little less agility, as you need to react to incidents. But in the end it will be worth it, because you are enabling others to rely on your tools and your stability. You will enable their success as they build more and more on top of your data and your tools, and you will be surprised by the brilliant, creative ways in which they can use the service you provide. Enabling others: nothing else matters. So it's worth it. We decided this was worth it and that we would make it work.

What did we do in the days, in the weeks and the
months after the Downton Abbey debacle to make sure that we fix problems in time? The very first basic is getting an e-mail when a certain process fails. This is the real
basic, and that's what we had at the time.

Then you can build on the same e-mail: you can attach certain tools that will phone you and wake you up, and you can even do it yourself with Twilio if you want. You can also send your lower-priority alerts as messages to your chat, to your internal communications, so that you have a timeline: there was this low-priority alert, then there was this high-priority alert, and this is what happened then. It is all in one place.

In this phase you also want to make sure that the person responding to the incident is able to do it. So make sure your logs are accessible, make sure there is documentation in place, and consider training people with fake emergencies and fake incidents.

The next step up in the chain is moving from gathering information just when bad things happen to gathering information all the time. You can start with very basic information: CPU usage, RAM usage, disk usage. Then you can move up and take a broader view: how many web pages am I serving, how many jobs am I running, how much data am I moving? Then you can move even higher: how many customers do I have on the site, how many orders have been placed?

At this point you can plug your alerting system on top of your monitoring system, so that rather than just getting an alert and getting paged when there is a problem, you can say: OK, I have a CPU at 100% for 10 minutes, maybe something is wrong; or I have only 8 megabytes left on my hard drive, maybe something is wrong.

An even higher step up the chain is looking at your business data and monitoring the data itself. So you are looking at questions like: is 20 thousand customers on the site normal for a Sunday evening? Have we received the data we expected from Google Analytics? Do we have a higher rate of traffic that doesn't have a Google Analytics identified source? This works really well, because it's basically an early warning system for both
your business and your systems. There's a lot of research on data quality. It is sort of tainted by association with some well-known big software vendors, but if you discard the big software vendors, the concepts are actually general: they don't depend on specific technologies.

So we have lots of checks and lots of alerts. Some require immediate attention, some require attention on the next day, some require attention on the next working day, and maybe you start ignoring some of them. Don't get comfortably numb. Review each alert: make sure the team reviews how they responded to each alert, and also examine whether that alert was useful. Can you improve it? Should you silence it? Should you measure something else? And then classify your alerts: by system, by kind of problem, by business area, by priority. Ideally every new feature and every new product has monitoring and alerting attached. Over time you can use the information you gathered this way to guide your decisions: not just technical decisions, but product decisions too.

So, a very opinionated selection of resources: a blog post from Julia Evans; a conversation on Twitter with Charity Majors; the nurse story is stolen from this [...] course on business ethics, and I cannot recommend this course enough, even if it almost doesn't mention on-call; [...] 2015; and a book on data quality which still holds up well enough.

I just want to say thanks to all the DevOps, SRE engineers and developers who have been on call for years and make the internet work.

Thank you for your presentation. Now over to questions and answers.

You mentioned training people with fake emergencies and stuff like that. How do you simulate them? Do you actually break something, on a staging environment or something like that?

We
actually break production on purpose during office hours. I'm not telling you to do it, but this is what we actually do.

Continuing on this: when you break production, do you also have another, like, backup system?

It depends on the breakage. If we are causing the breakage on purpose, we tend not to do something that will actually get to the customers. So maybe we put a wrong setting for the connection to the database, and then we make it fail before it [...]. We tend not to do anything that will actually impact people. I have to look after a lot of data, so you can make an ETL job fail in the middle of the day, and the data is still the same data you gathered at the beginning of the day. That is another way to do it; this is mostly how we do it, actually.

Thank you. Any other questions?

This is kind of unrelated, but are those your cats?

Yes, yes. The grey one is [...] and the white one is [...].

Thanks.

[...] which kind of software do you use for monitoring your systems and your data?

So this layer here is Datadog; they have both [...]. For this layer here we use a data democratization tool called Redash. It is a wonderful tool, and we strongly encourage you to try Redash; it's in Python, by the way. Obviously there are alternatives, as many alternatives as you can think of.

Can we have another question? If not, we can thank our speaker.
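
The two alerting ideas mentioned in the talk (a CPU at 100% for 10 minutes, and asking whether 20 thousand customers is normal for a Sunday evening) can be sketched roughly as follows. This is only an illustration, not the speaker's implementation: the `Sample` type, the function names and the three-standard-deviation tolerance are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean, stdev
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    """One monitoring data point (hypothetical shape, not a real library type)."""
    at: datetime
    value: float


def sustained_breach(samples: Sequence[Sample], threshold: float,
                     duration: timedelta) -> bool:
    """Return True if the metric stayed at or above `threshold` for the whole
    trailing `duration`, e.g. CPU at 100% for 10 minutes."""
    if not samples:
        return False
    ordered = sorted(samples, key=lambda s: s.at)
    window_start = ordered[-1].at - duration
    # Without history covering the full window we cannot claim a sustained breach.
    if ordered[0].at > window_start:
        return False
    window = [s for s in ordered if s.at >= window_start]
    return all(s.value >= threshold for s in window)


def is_unusual(current: float, history: Sequence[float],
               tolerance: float = 3.0) -> bool:
    """Return True if `current` is more than `tolerance` standard deviations
    away from the mean of comparable past values (e.g. earlier Sunday evenings).
    The tolerance of 3.0 is an assumed default, not a recommendation from the talk."""
    if len(history) < 2:
        return False  # not enough history to say what "normal" is
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) > tolerance * sigma


if __name__ == "__main__":
    start = datetime(2015, 10, 4, 21, 0)
    # Made-up readings: one sample per minute, CPU pinned at 100%.
    cpu = [Sample(start + timedelta(minutes=m), 100.0) for m in range(11)]
    print(sustained_breach(cpu, threshold=100.0, duration=timedelta(minutes=10)))

    # Made-up customer counts for previous Sunday evenings.
    past_sunday_evenings = [19000, 20500, 21000, 19500, 20000]
    print(is_unusual(20000, past_sunday_evenings))
    print(is_unusual(60000, past_sunday_evenings))
```

Requiring the breach to last for the whole window is what keeps a single spike from paging anyone, and comparing against like-for-like history (Sunday evenings against Sunday evenings) is one simple way to define "normal" for business data.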

Metadata

Formal metadata

Title I don't like Mondays-what I learned about data engineering after 2 years on call
Series title EuroPython 2017
Author Rapati, Daniele
License CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You may use, modify, and reproduce, distribute and make publicly available the work or its content, in unchanged or modified form, for any legal and non-commercial purpose, provided that you credit the author/rights holder in the manner specified by them and that you pass on the work or this content, including in modified form, only under the terms of this license
DOI 10.5446/33697
Publisher EuroPython
Release year 2017
Language English

Content metadata

Subject area Computer science
Abstract I don't like Mondays-what I learned about data engineering after 2 years on call [EuroPython 2017 - Talk - 2017-07-14 - PythonAnywhere Room] [Rimini, Italy] The first weekend of October 2015 my company bought an advert during the first episode of "Downton Abbey" on Sunday evening. It was so successful that the website went down for half an hour. We wanted to look at the analytics and the data to estimate the impact. But they were having a very hard weekend too: the replica of the production database we used was unreachable and the only person who knew how to fix it was on a plane. Monday really was a memorable day. This session is about sharing some life experience and best practices around Data Engineering. Attendees should have some previous understanding of data and tech in business. Attendees should leave with an understanding of on-call practices and with some quick win actions to take. What does it mean to be on call? How do you make sure that the phone rings as little as possible? Fixing versus Root Cause Analysis. Systems break at junctures, especially if the juncture is with a third party. Why and when is it worth reacting to errors as soon as they happen? External Services. Increasing Business Trust. Allowing others to build on solid ground. How do you make sure the phone rings when it should? Alerting tools: emails, chat, specialised applications like PagerDuty, OpsGenie and Twilio. Monitoring systems. Monitoring data (Data Quality) as a continuous early warning system.