
Validating Big Data Jobs


Formal Metadata

Title
Validating Big Data Jobs
Subtitle
An exploration with Spark & Airflow (+ friends)
Title of Series
Number of Parts
561
Author
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
If you, like close to half the industry, are deploying the results of your big data jobs into production automatically, then existing unit and integration tests may not be enough to prevent serious failures. Even if you aren't automatically deploying the results to production, having a more reliable deploy-to-production pipeline with automatic validation is well worth the time. Validating big data jobs sounds expensive and hard, but with a variety of techniques it can be done relatively easily, with only minimal additional instrumentation overhead. We'll explore the kinds of instrumentation to add to your pipeline to make it easier to validate. For jobs with hard-to-meet SLAs, we'll also explore what can be done with existing metrics and parallel data validation jobs. After exploring common industry practices for data validation, we'll explore how to integrate these into an Airflow pipeline while keeping it recoverable if manual validation over-rules the automatic safeguards.
Transcript: English (auto-generated)
That's part of why I change jobs all the time. So, yeah, I'm going to talk about validating big data and machine learning pipelines. I think it's a super important topic, and I wish more people validated their pipelines. Oh, I need to speak up? Okay, cool, I'm going to try yelling.
This is going to be a long day, so if my voice starts dipping and you can't hear me, let me know and I will start yelling some more. Okay. My name is Holden, my preferred pronouns are she or her, and I'm a developer advocate at Google. I'm on the Spark PMC, which is why most of my examples involve Spark, even when it is a really bad idea.
But that's okay, because you can use the same techniques with other tools. I'm a co-author of two books on Spark; neither of them talks about anything in this talk, but that should not stop you from buying my books. That is the more important part.
One of the things I've started doing that I think is kind of cool, if anyone is particularly interested in how projects like Spark do code reviews, is code review live streams, where you can watch me review open source pull requests live and try not to swear. I fail at the second part, but I still think it's fun to sort of watch and join in.
Okay, cool. In addition to what I am professionally, I'm trans, queer, a Canadian in America on an H-1B work visa (which they're debating whether or not they want to keep; it's a really great feeling not to know if you can renew your work visa), and part of the leather community. This is not directly related to big data or machine learning. However, I think for those of us who are building machine learning pipelines, or even just tools with data, it's really important that we try to build diverse teams, and this includes us in the open source community. If we don't talk about where we're all from, and if we don't build diverse teams, we're just going to recreate yesterday's problems more efficiently, and that is not really what I want in life. I want us to find new solutions to new problems, and I think diverse teams will help us do that. Okay. That being said, I'm not going to talk about that. We're going to talk about how to avoid having everything catch on fire, and why you should do it, though given that you're here, you might be fairly convinced. And I promised at least one cat picture and at least one picture of my scooter club, which is only tangentially related, but I do wish to try and expense my gas, so I'm working on making it more related to computers.
So hopefully you're nice. Is anyone here familiar with Scala? Yeah? Friends, thank you. Okay, cool. And if you're not familiar with Scala, that's totally fine. How many people are familiar with something like Spark, Beam, or Flink? Okay, that's a good number of people. If you're not, it's okay: these same techniques apply to other systems. Generally they're a lot easier to do in non-distributed systems, so if you happen to be working locally, your validation tasks become way, way easier. And hopefully you're here because you want to make better software.
And if not, I can't convince you. Okay, so: validation is really important. Driver's license tests can be very similar to how we test our software, which is to say they're better than nothing. I would not want my friends to start riding a motorcycle without a license, as my friend is doing here. But at the same time, even if someone has passed their motorcycle license test, they're probably still not the safest driver, and it's just like your code: your tests cover the basics, but they're not going to catch everything. There's always going to be that strange SUV operated by some random drunk person, or you're going to have some null records in your CSV files. One of the two is going to happen, and your pipeline is going to fail, or you're going to crash. Ideally you want to know when something has gone wrong, so you don't make it worse, right?
So our tests are not perfect; we are eventually going to get on the fail boat, and at some point you want to minimize the impact of that. Does anyone here have to carry a pager? That is very, very few people. You are all very lucky; I am kind of jealous. Someone else in your organization may be woken up at three o'clock in the morning to do a rollback from your data pipelines. You probably don't want them being very angry with you, because you're going to need things from your operations staff. So it is still worth saving other people from being woken up at three o'clock in the morning, even if it is not saving yourself.
I did a survey on how many people have had Spark jobs cause a serious production outage: 15% said yes, 50% said no, and 30% were like "I didn't have to update my resume, but we did lose a few million dollars", also known as "depends on what you mean by serious". You really don't want to be in the 15%, and you don't want to end up in the 30% either. You don't want your pipelines to cause really bad failures.
And I have a survey if other people want to give feedback on this. As more people deploy their pipelines automatically into production, and as more people start doing streaming data, you don't have the same time to do manual verification and validation that you're used to. With scooters, the equivalent could be going home after an accident rather than checking on your bones to see if they're all put together. Bit of a stretch: you'd probably notice if your arm was broken. But with computers, we might not notice that we broke a feature and cost our employer a few million dollars. I've done that; it was a really awkward one-on-one the next day.
Like, trust me, I updated my resume, I did everything right, but I was really stressed about that, and you don't want to have that experience. There was another time, much less stressful: I just assumed that everything was a coffee shop, and my only test query was the word "coffee". And then my boss got upset that we were returning Starbucks when he was trying to find a steakhouse. That was an awkward call, but not that awkward, because we were a start-up and we didn't make any money, so it was just, yeah, whatever, jokes. And then there are other ones: some words can have multiple meanings, and those meanings can be really awkward to have to explain to people. And if you have tools which do a good job of keeping minors from seeing inappropriate content, you really don't want those tools to break. And it's really easy for that to happen when your data changes.
With money things: in America, the Veterans Affairs agency couldn't pay a whole bunch of people because of data validation pipeline problems. Because of not validating data, Bank of America foreclosed on a whole bunch of people's mortgages, and other generally terrible things have happened to specific people from us not doing a good job of catching data quality issues. So hopefully this is enough that you care and you'll pay attention. And if none of these problems are things that you care about, it's okay; the Internet is working, somewhat. So let's do some validation. Yeah.
Okay, I'm going to take the unicorn horn off now. Another thing that we might want to validate is that our Slurpee is drinkable. Do you have the concept of Slurpees here? It's an ice drink with more or less sugar, some food coloring, and something resembling flavor. Occasionally these machines are at gas stations, and maybe they're not the most well-maintained machines, and you might not want to drink something from a machine which is fundamentally just a giant petri dish when it's not operating correctly. So a good validation rule that we might have for consuming Slurpees is that our Slurpee should have the food coloring in it; if it doesn't, something has gone wrong, and I probably don't want to get sick. But the new Slurpee ghost white gummy flavor is going to break our validation rule, because now there is a Slurpee flavor without food coloring. And this is okay, right? I think it's totally fine to have validation rules that break occasionally, and this is perhaps different from tests. We really don't want tests to break occasionally; we want our tests to be deterministic. But with validation rules, since they're being used as sort of a last catch, it's okay if they're sometimes wrong. They can't be wrong so often that people just start to ignore them, but if they're occasionally wrong, like once a month, and people don't start tuning out their alerts, it's okay.
So how do we make validation rules? Hopefully at some point you've had software that worked. If you do not have any software that works at all, validation is not your problem; it is time to go and fix your software. But once you've had software that works, maybe we can collect some metrics about how our software is working, and then we can look at future iterations of our software. If an iteration is not looking like the previous ones, maybe that's a thing we can do something about. And we can do similar things for our inputs: we can look at whether today's inputs look similar to yesterday's inputs, or whether the rate of change between days of inputs is within reason.
Right. Does anyone in the audience have something equivalent to this, where you load some data and try to parse it, but maybe the schema doesn't apply, maybe it's missing a field, maybe something is set to null, and, whatever, we just throw away the bad data? It's okay; we just keep the good data in our pipeline. Does anyone do this? Am I just the one bad person here? Okay, so a lot of people are raising their hands, and they're only half raising them, which makes me think more of you do it but you don't want to be caught. And this is okay. Especially for anyone who has to work with JSON data: if you required that all of your inputs were completely correct, you would never produce a result. That's just not going to happen; JSON data is garbage. So we're going to have something like this, and that's okay. But the problem is, we might throw away 99% of our data when historically we've only been throwing away 1% of our data, and then that 1% is no longer a representative sample of my users, and if I'm training a recommendation model, I might make some really bad recommendations, or other kinds of decisions, based on it. So we could write a check, "is it valid", and then count how many records are valid, and say that if we have less valid data than bad data, we'll put some special business handling logic in here. If you're in Java, you know, throw an exception, because everyone likes exceptions. Just do something. This is technically a validation rule. It's not a very complex one, but it's a good start, and it's better than nothing. And at this point, your special validation rule can trigger a page. Apparently all of you are very lucky and have operations people, so their pagers will go off, they'll come and take a look at it, and then yell at you the following morning.
It's less fun, though; this code is less nice to write, and that's the thing I care about. In terms of working in Spark, there are some performance problems with doing these two counts, which is a little sad, and it's similar in other systems: triggering two actions can be kind of not great. Beam is different, because it has a sort of whole-program optimizer, to a degree, so if you're in Beam, this last part doesn't apply. Okay, but we could use counters. Yay, counters.
The other thing is that we don't have to define all of the counters ourselves. All of these systems already keep track of some metrics themselves, and those metrics can actually have a lot of really useful information that can tell us whether or not our job is operating normally. We can look at the number of bytes being shuffled around; we can look at the number of records being read. Execution time is super basic, but if you've got a machine learning pipeline that takes three hours to train and today it takes 15 minutes, you do not want to push that model to production. There is something going on. I'm sure you just sped up your code and you're amazing, but maybe it's time to spot check this one. And we can add counters for things that we ourselves have either had catch on fire, or suspect might catch on fire. And we can still pretend it's nice functional code: we just hide all this mutation in counters inside of the underlying systems and pretend that we're writing functional programming code, for those of you who care about writing functional programming.
And, right, yes: counters solve our problem in the same way that regular expressions can solve our problems, in that they produce a new problem for us to solve. But it's a different problem, so it's like they solved the first one. Okay, so we have a happy counter and a sad counter, and we see how sad we are at the end of our job. If we're really sad, we won't do anything; we'll just go to (you don't have Taco Bell... Pizza Hut?) Pizza Hut, we'll go to Pizza Hut and get some pizza, and we won't push our model to production. We'll take the rest of the day off, and, you know, that's better than nothing. Okay. This is still really tightly coupled to our code, though, and that is less than great.
All right, and there are a bunch of problems with counters. Some of the problems are that Beam's counters are implementation dependent: you can change the runners for Beam, but doing that will change the behavior of your counters. So if you've got something that's working locally and you start using it on a cluster, it's totally possible that the validation rules you've been working on locally will just stop working. That'll be a great experience. Or if you're trying to change runners and you want to use your validation rules to validate that your new runner is performing reasonably, it might just not be, and you won't know. And Spark's counters have their own problems too, with data properties. But where do we put our counters? Fundamentally, we have to understand our problem domain, and this is very similar to the problem domain of running a queer scooter club, which is:
Everyone loves glitter and bubbles and scooters, but it turns out that putting a bunch of soap on the road was a bad idea. Who could have foreseen that having a bubble machine would cause accidents during a parade? No one, obviously. But if we spend time thinking about our problem domain, we can add some tests (sorry, validation rules) like: the road should not be more slippery than it used to be, and if it is, maybe let's turn off our bubble machine and stop riding for a little bit and think about what's going on, right?
So what do people do in practice for making their validation rules? Really depressingly, I ran a survey, and it turns out that the only things most people validate are execution time and the number of records they read in. They're like: if it took three hours today and it took three hours yesterday, and it read six gigs yesterday and seven gigs today, that's fine; everything in between, whatever, it's probably the same. How many people think that's enough? What? Oh, please tell me you don't work at a bank. Oh dear, oh dear; at least not my bank. Okay, well, whatever. It's not like I have a lot of money, so it's not a big problem for me. Anyways, maybe you'll lose my mortgage; that'd be great. Okay, so spark-validator. I did a proof of concept, and I actually have a second proof of concept as well. This one is integrated into your Spark job; I have a second one which integrates into your Airflow pipeline. You can just go ahead and put in all of these counters, and then you can define historic rules on your counters and say how they should be related from today to yesterday, over time. And then it's nice, and you can have sort of decoupled stuff. Input schema validation is really cool: we can write our own input schema validation, and we can look at the percentage of data that's changed. And, okay, validation rules can be separate stages, right? This is important, and it's the difference between spark-validator and the second proof of concept that I made:
To an extent, we can do data validation in parallel in a separate process. That is to say, not all of our validation rules have to be about how our program is behaving; they can be about whether the summary statistics of our data look similar today to yesterday, because a change there can be a really good sign that something has gone wrong. And we can run that in parallel without slowing down our main job, which is really cool too. In fact, there's a tool to do that which runs on top of Beam. It's called TFDV; despite being called TensorFlow Data Validation, you can use it for things other than TensorFlow. It's also open source, and you can use it to compute some basic statistics about your data, compare them to the previous statistics, and find anomalies based on your schema.
And this is a really useful tool. There are some limitations with the places where you can deploy it right now, but it's open source, and you can take the ideas and apply them to your own system. Software changes, too: when your software changes and your data hasn't changed, you should run your old software against your new software and see if the outputs look the same. It is really simple and can save you a lot of time, even when you think that only unrelated changes have happened, because your unrelated changes might change your Fortran libraries, which, oh dear God, are somehow related to Spark. Okay.
And we can put it all together in Airflow, yay. So if you use the second proof of concept that I have, you would put in your data validation task as a spark-submit operator, and you can define it as dependent on the important business logic task. If you were doing the parallel stuff, you wouldn't define the dependency graph that way; you would instead just have it run in parallel, and then you can use the Bash operator to call the TFDV stuff that we showed here, and that's all kinds of fun. Some ending notes: you don't have to be perfect. Just do something.
Something is better than nothing, even if all you validate is execution time. That's okay; it's better than not doing anything. Just start somewhere. Here are some related links, and here are some books. Here are some more books; they're unrelated, but that should not stop you from buying them. I also have a project to teach distributed computing to children. Probably not the children that you like, but the children who don't have your cell phone number, whom you still want to convince to join us in this wonderful adventure that we call programming. And this is not a joke; it sometimes gets confusing. Okay. So this is it. Thanks, bye. I think you said there were five minutes at the end for questions, right? Cool. So I've got time for a few questions, if anyone has them. Yeah, a question?
Yeah, so the question is: I've been talking a lot about machine learning being more and more integrated into distributed systems; what about algorithms which are not easily parallelizable? Do I see distributed systems fitting into that? And I think that yes, even for algorithms which are not easily parallelizable. My friends at a company which will remain nameless (because they know where I live) use distributed systems to do all of their feature prep, and they also do downsampling on their large data, construct representative samples, and do their training locally with non-parallelizable algorithms. And honestly, in real life, I tend to spend more time doing feature prep than cool machine learning stuff. So you'll still have all of these big distributed systems problems, but then at the end you're still going to want to rent a really giant node for, like, the six hours it takes to train your fancy model, because there is no magic wand of parallelization. If there was, I would make a lot more money. Or less; actually, I don't know, people wouldn't need me. One of the two. Okay, question number two. Oh, yeah. So, sorry, I was rushing because I got the you're-out-of-time signal,
but I'll come back here. So, this data validation task. Oh, right, sorry. Okay, the question was: I showed an example in Airflow; why was I doing the validation after the important business logic rather than in parallel? The spark-validator example that I have looks at the counters that are output by your Spark job to do the validation on it, so it doesn't look at the data independently. TFDV looks at the data independently; the spark-validator tool that I made looks at the metrics produced by your Spark job, and that's why I have the dependency chain the way that I do here. Awesome. Cool, I think I'm probably out of time. Thank you all.