
I don't like Mondays-what I learned about data engineering after 2 years on call


Formal Metadata

Title
I don't like Mondays-what I learned about data engineering after 2 years on call
Title of Series
Number of Parts
160
Author
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
I don't like Mondays-what I learned about data engineering after 2 years on call [EuroPython 2017 - Talk - 2017-07-14 - PythonAnywhere Room] [Rimini, Italy]
The first weekend of October 2015 my company bought an advert during the first episode of "Downton Abbey" on Sunday evening. It was so successful that the website went down for half an hour. We wanted to look at the analytics and the data to estimate the impact. But they were having a very hard weekend too: the replica of the production database we used was unreachable and the only person who knew how to fix it was on a plane. Monday really was a memorable day.
This session is about sharing some life experience and best practices around data engineering. Attendees should have some previous understanding of data and tech in business. Attendees should leave with an understanding of on-call practices and with some quick-win actions to take. What does it mean to be on call? How do you make sure that the phone rings as little as possible? Fixing versus Root Cause Analysis. Systems break at junctures, especially if the juncture is with a third party. Why and when is it worth reacting to errors as soon as they happen? External Services. Increasing Business Trust. Allowing others to build on solid ground. How do you make sure the phone rings when it should? Alerting tools: email, chat, specialised applications like PagerDuty, OpsGenie and Twilio. Monitoring systems. Monitoring data (Data Quality) as a continuous early warning system.
Transcript: English (auto-generated)
Hi, I'm Daniele. I work for Not on the High Street in London. Not on the High Street is a marketplace for small creative businesses.
We are in Richmond. Richmond is one of the best parts of London. There are incredible parks and riverside pubs. And it doesn't feel like London at all. It feels like a country village inhabited by ambassadors and rich bankers' wives and tech companies.
Also, Richmond is where the Rolling Stones formed. And some of us care a lot about the Rolling Stones. This is a talk about being on call, carrying a pager, being available 24/7. If you are a DevOps engineer or a developer, you might already know something about this topic.
If you are a data engineer, I'm going to make a case that you should care more about this topic. But I will also talk about a TV show. And there is an embarrassingly high number of pictures of my cats. And it's a Friday afternoon talk. So I'm going to warm you up by asking you questions and you're going to raise your hands to answer.
Who is on call right now? Good. Who was on call in the last month? Who was actually called in the last month? Okay, if you did raise your hand the second time and you didn't raise your hand the third time,
well, you're living the dream. Money for nothing and chips for free. On the menu for you today: we start with a light definition, then we move on to some mixed advice on what to do during an actual incident.
Then we have our prevention special. And then we take a bit of a broader view and talk about motivations and alerting practices. And in the spirit of total transparency,
I have a couple of stories to tell. Luckily enough, one is from last week. So some of you already know, but being on call means you have a phone (it used to be a pager in the 80s),
and when it rings, you need to go and fix the system because it's broken. Another way to think about it is basically you have something that is with you at all times and you need to care about it. But secretly it hates you
and it will demand your attention at the most important times. And when it wants your attention, you have to really give it immediately. But being on call is also about knowledge,
about being the first person who will act on a certain system in case of a problem. So you get to know an entire system, not just the parts you developed yourself. You will also get some rewards, because you deserve them.
Rewards like being woken up in the middle of the night. So last Thursday my phone rang at about 1 a.m., and before you get to your computer, you want to make sure you are awake.
Fully awake. So make some tea, make some coffee, walk off the sleep a bit. Make a transition from your sleeping self, or your working self
if the incident is during office hours, to your incident self. You don't want to be the guy who is developing in one window and fixing production in the other window. The incident deserves your full attention. The first thing you do is read the alert that woke you up.
Really read it. At least a couple of times, possibly more. Once you have read it, you probably know where to look. Which system broke? Where can I find the error logs? Where can I find the monitoring data?
And you want to gather as much information as you can until you can basically be sure of why this alert triggered. What woke me up? And at this point, you probably have enough information to assess the impact.
What is going to happen because of this problem? Who is not going to be able to do their job? Who is not going to be able to know something they want to know? And be nice. If people are impacted by the problem, inform them.
Most websites have status pages. For internal systems, you probably have an email or a chat tool. And there is a question you might find yourself asking a lot, which is why?
What is the real, deep cause of this problem? And this is not the right time to answer it. If you dive in with your developer mind and try to find out the real root cause of the problem,
you are going to spend a possibly very long amount of time. And you cannot quantify it. So don't do it. It's not productive at this time. And as you find out information, as you start acting on the system, log what you are doing.
So this is on a chat app, it's Slack, but you can also just open a blanket email to your team and start typing out what is happening. This is invaluable information, especially if you are preparing your team at work for being on call.
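If you want that log to be timestamped and in one place automatically, a tiny helper like the sketch below can post your notes to the incident channel. This is only an illustration, assuming a Slack-style incoming webhook; the webhook URL and the example note are placeholders.

```python
# Tiny sketch of logging incident notes to a chat channel with timestamps,
# assuming a Slack-style incoming webhook. The webhook URL is a placeholder.
from datetime import datetime, timezone
import requests

INCIDENT_WEBHOOK = "https://hooks.example.com/incident-channel"  # placeholder

def log_incident_note(note: str) -> None:
    # Prefix each note with a UTC timestamp so the channel becomes a timeline.
    stamp = datetime.now(timezone.utc).strftime("%H:%M:%S UTC")
    requests.post(INCIDENT_WEBHOOK, json={"text": f"[{stamp}] {note}"}, timeout=10)

# Example (hypothetical): log_incident_note("nightly import failing since 01:04, investigating")
```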
And once you have enough information, you probably can come up with one action or a few actions that will limit the impact as much as possible and are safe.
If we scroll back, and you probably have to squint a bit because I'm not sure if it's readable, you are going to find out that I did not follow my own advice: last week I ended up forking a library at 4 a.m. and trying to patch it.
It did not solve anything at all, and that's because instead of focusing on taking the smallest piece of action, I actually started asking myself why. And that was not useful at the time. The action I should have taken, which we only took in the morning, was just to skip that data integration step.
True, we wouldn't have known on the day after what the sources of our traffic were, but at least we would have had all the rest of the data in time. Once you have taken enough steps to limit the impact,
again, be nice, inform the people who are impacted, and then get back to sleep, because you want to be fresh the next day. Usually, you wake up the day after, and the first thought in your mind is:
this is all pretty stressful, I don't want it to happen ever again. And at this point you can really ask yourself why this error occurred, and the best way to do it is an RCA, a root cause analysis. There are best practices and extensive literature on RCAs,
so I'm not going to dive too deep, just one slide. You want to put your detective hat on, gather all the information on what actually happened during the incident and before the incident,
find the root causes, and be sure to leave enough time at the end to decide on some actions that will mitigate those root causes. And it's very easy, in this case, to try to blame someone. Don't do it. So I'm going to tell you a story.
It's about a nurse in a children's hospital. She gave the wrong drug to a little child, and the child almost died. So an inquiry was opened, and there was a proposal to fire the nurse,
but then the inquiry commission dug a bit deeper, and they found that the drug she should have administered and the drug she actually administered were right next to each other in the same cabinet and had similar labels, and they also found out that the nurse had been working 10 hours straight,
and there was nobody to double-check what medicines, what drugs she was administering. So don't allow yourself to focus on the fault of one person. Always look at the context.
And from an RCA, you usually get some useful lessons for the future. You are probably already familiar with this: please be really careful where your systems talk to a third party,
because communication is more scarce and more easily ignored, and watch out for points of friction in internal communication as well. So the root cause of last week's failure was that the GA Reporting API has a time-on-site field,
and they renamed it to session duration, and the old name was deprecated in 2014, but they actually started enforcing the deprecation and failing on API calls last week. Another insight is to really care about your error messages.
Keep them up to date. Make sure they include everything that can help you during an incident: checklists, lessons learned, encouragement. So we took three actions. We fixed the root cause of the problem: we renamed time-on-site to session duration. We scheduled some time to go through all the GA fields that we are using
and check if any others are deprecated. But we also included in the alert message a specific suggestion not to do what I did: just skip the step, don't try to dive into causes too much.
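To make the fix concrete, here is a minimal sketch of the kind of Reporting API v4 request involved, using the current ga:sessionDuration metric instead of the deprecated ga:timeOnSite. This is not our production code: the view ID, the credentials file and the date range are placeholders.

```python
# Minimal sketch of a Google Analytics Reporting API v4 request using the
# current metric name (ga:sessionDuration) instead of the deprecated
# ga:timeOnSite. The view ID and credentials path are placeholders.
from google.oauth2 import service_account
from googleapiclient.discovery import build

SCOPES = ["https://www.googleapis.com/auth/analytics.readonly"]
credentials = service_account.Credentials.from_service_account_file(
    "service-account.json", scopes=SCOPES)
analytics = build("analyticsreporting", "v4", credentials=credentials)

response = analytics.reports().batchGet(body={
    "reportRequests": [{
        "viewId": "XXXXXXXX",  # placeholder GA view ID
        "dateRanges": [{"startDate": "yesterday", "endDate": "yesterday"}],
        "metrics": [{"expression": "ga:sessionDuration"}],
        "dimensions": [{"name": "ga:source"}],  # traffic sources
    }]
}).execute()

# Print total session duration per traffic source.
for row in response["reports"][0].get("data", {}).get("rows", []):
    print(row["dimensions"][0], row["metrics"][0]["values"][0])
```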
So let's take a step back to 2015 and take a broader view. Downton Abbey is a TV show about a wealthy aristocratic British family in the first half of the 20th century.
It is wonderfully acted. The scenery, the costumes, the settings are amazing, and it's really on target for Not on the High Street. It's very British. It's so much on target that when we placed a TV advert during the first episode of the season, on a Sunday evening in October 2015,
we had so many sales and so much additional traffic that the site went down. It recovered, and then it went down again when the same advert aired on the plus-one channel. On one side I was sort of lucky, because I was not directly involved
with the consumer website, but on the other side, the situation on the data infrastructure I looked after was a lot worse. We had had basically no data since Saturday morning, because the replica of the production database we used to read from was offline.
We were in the process of migrating between hosting providers, and there hadn't been enough communication about when the replication between environments would stop, and the only person who could really fix all of this mess was our database and networking expert, our DevOps engineer,
and he was on a plane back from Russia. As soon as he landed late Sunday evening, his phone rang so many times that by the time he got home, his battery had drained. Why did we care?
Two incidents at the same time: one on the consumer website, one on the data infrastructure that we would have used to evaluate the impact of the consumer website incident. And it was a Sunday evening, so the next day was a Monday.
On Mondays, our data infrastructure sees the most usage, because it's both the busiest trading day and also the day when we plan for the week ahead. I do not like Mondays. And this was not a normal Monday.
At the time, we were a lot more inexperienced than we are today, and we learned a lot from these events. We learned as an organization, we learned as a team, and we changed. So let's talk a bit about the changes we made.
So, do we actually need to be on call? How do you answer this question? Well, you look at the consequences of an error. Who is affected by the service or the data being unavailable or wrong? Do they depend on your service? How much will it cost?
How long can they wait for the information? In 2015, we were realizing that our coworkers and colleagues were increasingly dependent on our data infrastructure, especially for decision-making.
So if you have a public-facing service or a public-facing website, you probably want to consider some kind of on-call policy, because you cannot control how much external people depend on your service, and you might also have a contract in place,
or your revenue might depend on the service. In 2016, we started offering our partners, the people who sell on Not on the High Street, access to a rich dashboard with sales figures and product performance, and at that point, we didn't have any room
to roll back the on-call policy. But even if you just have an internal service, you should consider on-call because you want your coworkers to spend less time and worry less about checking and double-checking
that the services are available, and if they spend less time doing that, they will benefit in their daily work. If you take a step back to Downton Abbey and you sort of know the characters,
it's not just about keeping them happy. You also want Daisy, the assistant cook, to be the best she can be at her job. And you might pay for this increased interest with a little less control over your priorities
and a little less agility as you need to react to incidents, but in the end, it will be worth it because you are enabling others to rely on your tool, and your stability will enable their success
as they build more and more on top of your data and your tools, and you will be surprised by the brilliant creative ways in which they can use the service you provide, enabling others. Nothing else matters.
So it's worth it. We decided it's worth it. How do we make it work? What did we do in the days, in the weeks, in the months after the Downton Abbey debacle to make sure that we could fix problems in time?
Usually the very first, most basic thing is getting an email when a certain program fails. The real basics, and that's what we had at the time.
Then you can build on this email. You can attach tools such as PagerDuty or OpsGenie that will phone you and wake you up, and you can even do it yourself with Twilio if you want, and then you can also send lower-priority alerts and messages to your chat or your internal communications,
so that you have a timeline: there was this low-priority alert, then there was this high-priority alert, and this is what happened, and it's all in one place.
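As a rough illustration of that layering, here is a minimal sketch of an alert sender that pages someone through the PagerDuty Events API for critical problems and posts everything to a chat webhook so the timeline stays in one place. The routing key and webhook URL are placeholders, and the exact payloads depend on the tools you actually use.

```python
# Minimal sketch of routing alerts by severity: critical alerts page someone
# via the PagerDuty Events API, and every alert also goes to a chat webhook.
# The routing key and webhook URL are placeholders.
import requests

PAGERDUTY_ROUTING_KEY = "YOUR_ROUTING_KEY"             # placeholder
CHAT_WEBHOOK_URL = "https://hooks.example.com/alerts"  # placeholder

def send_alert(summary: str, source: str, severity: str = "warning") -> None:
    if severity == "critical":
        # Wake someone up: trigger a PagerDuty incident.
        requests.post(
            "https://events.pagerduty.com/v2/enqueue",
            json={
                "routing_key": PAGERDUTY_ROUTING_KEY,
                "event_action": "trigger",
                "payload": {"summary": summary, "source": source,
                            "severity": severity},
            },
            timeout=10,
        )
    # Always post to chat as well, so the whole timeline lives in one place.
    requests.post(CHAT_WEBHOOK_URL,
                  json={"text": f"[{severity}] {source}: {summary}"},
                  timeout=10)

# Example (hypothetical job name):
# send_alert("Nightly GA import failed", "etl.ga_import", severity="critical")
```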
In this phase, you also want to make sure that the person responding to the incident is able to do it. So make sure your logs are accessible. Make sure there is documentation in place. Consider training people with fake emergencies and fake incidents.
The next step up the chain is moving from gathering information just when bad things happen to gathering information all the time. So you can start with very basic information: CPU usage, RAM usage, disk usage. And then you can move up and take a broader view.
How many web pages are we serving? How many jobs are running? How much data are we moving? And then you can move even higher. How many customers are we serving? How many orders have been placed? And at this point, you can plug your alerting system
on top of your monitoring system. Rather than just getting an alert and getting paged when there is a problem, you can say: okay, all my CPUs have been at 100% for 10 minutes,
maybe it's time for an alert; or I have only 8 MB left on my hard drive, maybe it's time for an alert.
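As a toy example of that kind of check, here is a sketch that samples CPU and free disk space with psutil and raises an alert when a threshold is crossed. The thresholds are illustrative, alert() just prints, and a real setup would look at values sustained over a window in a monitoring tool rather than a single sample.

```python
# Toy threshold checks on basic host metrics using psutil.
# Thresholds are illustrative; alert() just prints here, but in practice it
# would call whatever paging or chat tool you use.
import psutil

def alert(message: str) -> None:
    print("ALERT:", message)

def check_host(min_free_bytes: int = 500 * 1024 * 1024,
               max_cpu_percent: float = 95.0) -> None:
    cpu = psutil.cpu_percent(interval=1)   # CPU usage sampled over one second
    free = psutil.disk_usage("/").free     # bytes free on the root volume

    if cpu >= max_cpu_percent:
        alert(f"CPU at {cpu:.0f}% in this sample")
    if free <= min_free_bytes:
        alert(f"Only {free // (1024 * 1024)} MB left on disk")

if __name__ == "__main__":
    check_host()
```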
An even higher step up the chain is looking at your monitoring data and your business data and monitoring that data itself. So you're looking at questions like: is 20,000 customers on the site normal for a Sunday evening? Have we received the data we expected from Google Analytics? Do we have a high rate of traffic
that doesn't have a Google Analytics identified source? And this works really well because it's basically an alerting system for both your business and your systems.
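Here is a toy sketch of what such a data-quality check can look like: compare today's figure against the same weekday over previous weeks and flag it when it falls outside a tolerance band. The numbers, the tolerance and the function name are all made up for illustration.

```python
# Toy data-quality check: compare today's value against the same weekday in
# previous weeks and flag it when it deviates too much. Numbers are made up.
from statistics import median

def looks_normal(today_value: float, same_weekday_history: list[float],
                 tolerance: float = 0.4) -> bool:
    """True if today's value is within +/- tolerance of the historical median."""
    if not same_weekday_history:
        return True  # no baseline yet, nothing to compare against
    baseline = median(same_weekday_history)
    return abs(today_value - baseline) <= tolerance * baseline

# e.g. Sunday-evening visitor counts from the last few weeks (made-up figures)
history = [18_400, 21_200, 19_800, 20_500]
if not looks_normal(1_200, history):
    print("ALERT: unusual number of customers for a Sunday evening")
```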
There is a lot of literature on data quality, sort of tainted by association with some well-known big software vendors. But if you discard the big software vendors, these concepts are actually general and they don't depend on specific technologies.
So now we have lots of checks and alerts. Some require immediate attention. Some require attention on the next day. Some require attention on the next working day. And you maybe start ignoring some of them.
Don't get comfortably numb. Read each alert. Make sure the team reads each alert. Respond to each alert. And also examine if that alert was useful. Can you improve it? Should you silence it? Should you measure something else?
And then classify alerts. Classify them by system, by kind of problem, by business area, by priority. Ideally, every new feature and every new bug fix has monitoring and alerting attached.
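One lightweight way to do that classification, sketched below, is simply to tag every alert with a few dimensions and count them over time. The field values are illustrative, not an actual taxonomy.

```python
# Sketch of tagging alerts with a few classification dimensions so they can
# be reviewed over time. Field values are illustrative.
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Alert:
    system: str          # e.g. "etl", "website", "warehouse"
    problem: str         # e.g. "third-party API", "disk", "data quality"
    business_area: str   # e.g. "partner dashboard", "finance reporting"
    priority: str        # e.g. "page", "next day", "next working day"

log = [
    Alert("etl", "third-party API", "marketing reporting", "page"),
    Alert("etl", "data quality", "partner dashboard", "next day"),
    Alert("warehouse", "disk", "finance reporting", "page"),
]

# Which systems and problem kinds page us most? Use this to guide decisions.
print(Counter((a.system, a.problem) for a in log if a.priority == "page"))
```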
Over time you can use the information you gather this way to guide your decisions. Guide your technical decisions. Guide your product decisions. So I have a very opinionated selection of resources.
A blog post from Julia Evans. A conversation on Twitter with Charity Majors. The nurse story is stolen from a course on business ethics. I cannot recommend this course enough.
Even if it barely covers on call, it's a really good course. And the last one is a 2013 book on data quality, which still holds up well enough. I just want to say thanks to all the developers and engineers
who have been on call for years and make the internet work. Thank you for your presentation.
And now for questions and answers. You've mentioned training people with fake emergencies and stuff like that. How do you simulate that? Do you actually break something on a staging environment?
Or what do you do? I actually break production. On purpose during office hours. That's how I do it. I'm not telling you to do it, but that's actually how we do it in my team.
To continue on this: when you break production, do you also have some kind of backup system in place? It depends on the breakage. If we are causing a breakage on purpose,
we tend not to do something that will actually go out to the customers. So maybe we put in a wrong connection string for the database, and then we make it fail before it deploys. We tend not to do anything that will actually impact people.
I also look after a lot of ETLs. So you can make an ETL fail in the middle of the day, and the data will still be the same data you gather at the beginning of the day. So that's another way to do it. That's mostly how we do it, actually.
Thank you. Any other questions? It's kind of unrelated, but are those your cats?
Yes, yes. The grey one is Estia, and the white one is Filo. Thanks, and I know how you feel when they call you. Which kind of software do you use for monitoring your systems or your data?
So this layer is Datadog. They have a booth just right outside.
For this layer, we use a data democratization tool called Redash. It's a wonderful tool, and I strongly encourage you to try Redash. It's in Python, by the way.
Obviously, there are alternatives. There are as many alternatives as you can think of. OK, do we have another question? If not, we can thank our speaker.