Scale-up your job satisfaction, not your software
Formal Metadata

Title: Scale-up your job satisfaction, not your software
Title of Series: Berlin Buzzwords 2021 (talk 59 of 69)
Number of Parts: 69
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/67307 (DOI)
Language: English
Transcript: English (auto-generated)
00:08
When was the last time you spent 4 hours debugging a data pipeline? It happened to me one day. To be precise, it happened to me one night. I woke up at 2 am for no apparent reason, and I had a feeling that I should check out the Slack channel that delivers alerts when something is wrong.
00:27
In retrospect, that was my first mistake. In the Slack channel, I saw that one of the data pipelines was failing, and it was the most important one. If that one doesn't work, everything else fails too, and we would spend the whole working day fixing things manually and restarting the pipelines one by one.
00:45
But it was still 2 am, so I could still fix it. Sure, there would be a one-hour delay, but at least we would not waste a whole working day fixing the problems. I didn't have to do it, but I felt that if I was the first person to see the problem and I could still mitigate it, I had to do it.
01:04
In retrospect, that was my second mistake. While looking around for the cause of the bugs, I saw that a code change had been merged before all of the tests were executed. Apparently, the author wanted to stop working at 5 pm and didn't wait for the tests.
01:22
I couldn't be mad at this person. It took a lot of patience to wait for the tests in this project. Unfortunately, a test failed 20 minutes later and nobody noticed the failure. I could revert the change, assume that the previous version still worked, because it had worked yesterday, and go back to sleep.
01:42
Or I could write a fix, wait until the tests passed, deploy the fix, and run the fixed version in production. Of course, reverting the change meant that I would have to fix the problem twice, so I decided to save some time. At least, that was my idea. In retrospect, that was my third mistake.
02:03
Some moments later, I heard my alarm clock ringing, and it turned out it was already 6 am. The fix took 4 hours, and I had to rework it anyway, because what seemed like a brilliant idea at night wasn't even close to an optimal solution.
02:22
Have you been there? Are you listening to this story wondering whether I work at your company? Before switching to data engineering, I was a backend developer for 8 years. I have seen projects with no tests and projects with 100% test coverage, but with useless tests that didn't validate anything.
02:43
In those projects, those tests existed only to increase the test coverage metric. On the other hand, there were projects with useful tests, with decent coverage, but people complained about test execution time. In those projects, it was unacceptable to wait for a test longer than 2 minutes.
03:02
People complained if it took more than 30 seconds. Data engineering is somewhere in the middle between no tests and tests running for hours. If I have to wait 3 minutes for Apache Spark to start in the test environment, I must be happy if my tests run in less than 5 minutes.
03:21
At some point, you need half an hour to run all of the tests, and allowing merging before all of the tests pass seems like an obvious and reasonable practice. At some point, we have to accept the reality of data engineering. This is how things work, right? We cannot change that. Or can we?
03:42
It might be a shock, but data engineering is just a branch of software engineering. So let's start at the beginning. What makes software useful? It is all about trust. If we trust the output, we use the software. If we don't trust it, nobody wants to use it. So how do we build trust in our data pipelines?
04:04
Let me show you something. This is the universal software architecture chart. Of course, it is kind of a joke, but today it's going to be our data pipeline. Yeah, that's an oversimplified example. In reality, we have multiple inputs, multiple outputs, and the processing part branches out into multiple subprocesses.
04:25
But let's focus on the simplest thing, and think about what part of the pipeline can break. Well, every part of it can. We might get invalid input data, we can fail to load the data correctly, we might make a mistake in the processing code,
04:43
we might write the data to the wrong location, the output format may no longer be correct, and we might break something downstream. Finally, the scheduling mechanism may break, and the entire pipeline will not even run. Christopher Bergh, a co-author of the DataOps Manifesto and the DataOps Cookbook, says that we have two kinds of moving parts in data pipelines.
05:08
It is either a value pipeline, which runs in production: the code stays the same, but the input data changes. Or it is an innovation pipeline: the code is still in development, the test data is constant, but we modify the implementation.
05:25
Because of that, we will focus on two things: making sure that the code works as expected, and validating the data. However, before we even start, we have to simplify the pipeline and get the code under control. Functional programming. Do you remember when that was the buzzword a few years ago?
05:44
Every conference was full of talks about monads, monoids, Scala, and Haskell. We can ignore almost all of that. All we want is deterministic output, so we will focus on pure functions and immutability. How do we achieve immutability? After all, we can change everything in a database or an S3 bucket.
06:05
First of all, let's assume that whatever enters the data pipeline is immutable, and we should store it separately. I prefer to write the input data to a raw data bucket as fast as possible. This gives us two benefits.
06:21
First, we can audit the data and check what came from the external systems, and second, we can rerun the pipeline. This is the first part. What about pure functions? What is a pure function? A pure function is a function whose output depends only on the input, and the function does not access or modify global state.
06:43
This definition does not resemble anything we have in data engineering. What is a pure function in data engineering? Let's imagine that we have a Spark data pipeline. What are the building blocks? We have the part loading the data, the processing part, and the code that writes the data to the output.
07:03
To get a pure function, we can extract the processing part to a separate function, and assume that this function is pure, because we don't access any external state. What are the benefits of pure functions? It is trivial to test the code. If the output of the function depends only on the input, we can easily write a lot of very, very simple tests.
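To make this concrete, here is a minimal PySpark sketch of that structure. The dataset names, paths, and columns are invented for illustration; only the middle function is pure, and it can be tested with small in-memory DataFrames without touching any storage.

```python
from pyspark.sql import DataFrame, SparkSession
import pyspark.sql.functions as F


def load_orders(spark: SparkSession, path: str) -> DataFrame:
    # Impure part: reads from external storage.
    return spark.read.parquet(path)


def daily_revenue(orders: DataFrame) -> DataFrame:
    # Pure part: the output depends only on the input DataFrame,
    # with no access to storage, configuration, or global state.
    return (
        orders
        .withColumn("order_date", F.to_date("order_timestamp"))
        .groupBy("order_date")
        .agg(F.sum("amount").alias("revenue"))
    )


def write_report(report: DataFrame, path: str) -> None:
    # Impure part: writes to external storage.
    report.write.mode("overwrite").parquet(path)


if __name__ == "__main__":
    spark = SparkSession.builder.appName("daily-revenue").getOrCreate()
    orders = load_orders(spark, "s3a://raw-data-bucket/orders/")
    write_report(daily_revenue(orders), "s3a://reports-bucket/daily_revenue/")
```

A test for daily_revenue only needs to build a tiny input DataFrame and assert on the returned rows; the loading and writing code can be covered separately by integration tests.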
07:28
Speaking of tests, please take a look at the test visible on the screen right now. Is it readable? What does it do? We would need to spend some time reading the code to figure it out. Let me point out a few problems.
07:41
We pass a lot of arguments to the object. I bet that if we look at the implementation of this function, we will see that many of these arguments are not even used. But the constructor requires them, so we have to pass them anyway. Also, let's look at the verification part. What do we expect? We might need to search for the Jira ticket and read the description to figure it out.
08:06
It is not obvious. There is no meaning in this test. Also, I am sure that this test was written after the implementation. The author wrote the code, tested it manually, and was told to write a test.
08:21
So, a test has been written, or rather copy-pasted from another test and modified. Because if the author had started with a test, there would be no needless parameters. After all, who wants to write four variable names when you only need one? The design of this code would be much simpler.
08:41
I think it is quite easy to make a similar mess while testing a data pipeline. After all, we need extensive setup. We create multiple data frames containing at least a few rows when we run a test for Apache Spark. The verification part is never that simple. The tests are long, and when a test fails, it is hard to tell what caused the failure.
09:05
So, how do we improve it? The setup and the settings are technical details. If we keep them like in this example, we hide the business meaning in tons of noise. So, let's find a way to make these business rules more visible.
09:22
There is a testing method that separates a description of the expected behaviour from the technical details of the test implementation. It is called behaviour-driven development. We separate the human-readable description of the specification from the test code and from the production code.
09:42
The specification is easy to understand, at least it should be easy to understand, and it should describe the business rules. Also, it should be quite easy to spot a mistake in the specification. So, if the specification is easy to understand and we can spot a mistake, we can also spot a mistake in the test implementation.
10:02
That also should be easy. So, we no longer have to wonder whether a strange-looking line of code was intentional or not. But what's the most important thing? The scenario describes a business rule, not an implementation detail. So, let's take a look at two examples. The code on the top of the slide is correct.
10:22
The specification describes the intent, not technical details. The example on the bottom is a terrible idea. I bet that such a specification looks exactly the same as the test implementation. It is just a waste of time. We do double the work but get no benefit from it at all.
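The slides are not reproduced in this transcript, so the scenarios below are hypothetical examples of that contrast, written in Gherkin but shown as plain Python strings to keep one code language throughout. In a real project they would live in a .feature file wired to step definitions with a tool such as behave or pytest-bdd.

```python
# A specification that describes intent: it stays valid even if the
# implementation (columns, function names, discount mechanics) changes.
GOOD_SPECIFICATION = """
Feature: Order discounts
  Scenario: Returning customers get a loyalty discount
    Given a customer who has placed at least 5 orders
    When the customer places a new order worth 100 EUR
    Then the order total is 90 EUR
"""

# A specification that restates the test implementation line by line:
# double the work, no additional value.
BAD_SPECIFICATION = """
Feature: Order discounts
  Scenario: discount calculation
    Given a DataFrame with columns customer_id, order_count, order_value
    When I call apply_discount(df, discount_rate=0.1)
    Then the column discounted_value equals order_value * 0.9
"""
```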
10:42
This is a test written by people who claim that automated testing doesn't work. Testing the code is the easier part. Software engineers have been doing it for decades, at least they should be doing it. We have functional tests, integration tests, end-to-end tests, unit tests,
11:00
and long discussions about differences between those kinds of tests and proper naming. We can find a way to test the code. But what about testing the data? In production, the data changes all the time. Other teams release new features and start sending different kinds of events. We should be prepared for that.
11:20
And there are three ways to do it. We can write validation rules that reject everything unusual. But in this case, the data engineering team becomes a huge bottleneck preventing everyone else from achieving their goals. So, we should not do it. We can accept almost everything and hope that nobody makes a mistake. Good luck with that. Or, we can separate the obviously correct data from unexpected values, but keep both.
11:47
So, when we update the pipeline, we can still reprocess what we assumed was wrong and get the correct results. Why do we do it? What is the point? First of all, we can never let incorrect data into the production pipeline.
12:04
It is too difficult and too time-consuming to fix the problem if that happens. That's why we have the error bucket at the beginning of the ingestion pipeline. If something is not correct, we put it in a separate bucket and let a person review the data.
12:21
Of course, it is going to be a manual process, but at least we detect the problem as early as possible. But that is not enough. We have tested our data and the code, but that is not bulletproof. We should test the data once again before we write it to the output location, just to be sure that we don't propagate any problems downstream.
12:44
Does it sound like contract testing? Well, perhaps. But please don't expect that other teams will write contract tests for your data. They will not do it. How do we write tests for data? There are tools like Great Expectations or AWS Deequ that let us define validation rules for the data.
13:06
In the case of Deequ, we can even run anomaly detection. But even the simplest implementation, even something like adding an additional column with a validation result and using that column to split data into correct and incorrect buckets, will help improve the pipeline.
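As a rough sketch of that simplest implementation, the PySpark snippet below adds an is_valid column and splits rows into a clean bucket and an error bucket. The column names, rules, and bucket paths are made up; the same check can run at ingestion and again before writing the output.

```python
import pyspark.sql.functions as F
from pyspark.sql import DataFrame


def with_validation_flag(events: DataFrame) -> DataFrame:
    # Example rules only: adjust to the actual schema and business constraints.
    is_valid = (
        F.col("user_id").isNotNull()
        & F.col("amount").isNotNull()
        & (F.col("amount") >= 0)
    )
    return events.withColumn("is_valid", is_valid)


def split_and_store(events: DataFrame) -> None:
    flagged = with_validation_flag(events).cache()

    # Correct rows continue through the pipeline...
    (flagged.filter("is_valid").drop("is_valid")
        .write.mode("append").parquet("s3a://clean-bucket/events/"))

    # ...while unexpected rows are kept for manual review and later reprocessing.
    (flagged.filter(~F.col("is_valid")).drop("is_valid")
        .write.mode("append").parquet("s3a://error-bucket/events/"))
```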
13:24
It is a good starting point if you cannot decide which tool to use. But there is one more thing we can do. What if we didn't allow incorrect data in the data lake at all? What if we had a version control system for our data with branches and pull requests?
13:44
And what if we rejected branches that don't pass validation? Well, I cannot promise you pull requests for data because that would be terrible to review. But we can have branches. If we use tools like LakeFS, we can create a new branch for every data pipeline run and merge the changes only if all of the tests pass.
14:05
So, let's imagine a slightly different ETL pipeline. We create a new branch, ingest the data into that separate branch, run all of the tests there, do the processing, run some more tests, and merge the data into the main branch when everything is fine.
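Here is a rough sketch of such a branch-per-run pipeline. The create_branch and merge_branch helpers, as well as ingest, transform, and run_data_tests, are hypothetical stand-ins for calls to the lakeFS API or CLI and to your own pipeline and validation code; the sketch only relies on lakeFS exposing branches as object-store path prefixes.

```python
import uuid

# Hypothetical helper modules standing in for lakeFS API/CLI calls
# and for your own pipeline code.
from my_lakefs_helpers import create_branch, merge_branch
from my_pipeline import ingest, transform, run_data_tests


def run_pipeline(run_id: str = "") -> None:
    run_id = run_id or uuid.uuid4().hex[:8]
    branch = f"etl-run-{run_id}"
    create_branch(repository="data-lake", branch=branch, source="main")

    # lakeFS exposes branches as path prefixes, e.g. s3a://<repo>/<branch>/...
    branch_root = f"s3a://data-lake/{branch}"

    ingest(destination=f"{branch_root}/raw/")
    run_data_tests(f"{branch_root}/raw/")        # fail fast on bad input

    transform(source=f"{branch_root}/raw/",
              destination=f"{branch_root}/curated/")
    run_data_tests(f"{branch_root}/curated/")    # check the output too

    # The merge is atomic: main only ever contains complete, validated runs.
    merge_branch(repository="data-lake", source=branch, destination="main")
```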
14:23
What is the benefit of that? The production branch contains only complete and valid data. What is even more important, the data versioning tools handle merges as an atomic action. Either all of the files get merged or none of them is merged. It means that we no longer have a situation in which one process writes files to S3 or any other storage
14:46
and another process starts reading them before the write finishes. We no longer need marker files, status databases or any other ugly hack that we use to indicate which files are ready to use.
15:03
Now we can assume that if a file exists in the main branch, we can use it. Still, all of those tests are not enough. In the end, there is the infrastructure and scripts we use occasionally. Do you know what happened to me recently? I was working on deploying a TensorFlow machine learning model.
15:23
I had a deployment script that I had used a few times in the past, so I knew it worked. I had the new model and I wanted to deploy it as a SageMaker endpoint. Of course, the deployment failed. I had not changed the script, so I assumed it must still work.
15:40
I needed to test it. The only difference was the new model version, so I tried to deploy the old version that was already running in production, and the script failed once again. But the script was the same, the model was the same, the deployment pipeline was defined in CloudFormation and nobody made any manual changes.
16:02
Here is the error message that I saw. It seems that the version was missing. I started googling to find the SageMaker code. Fortunately, there is a GitHub repository with some of the code used by SageMaker endpoints. Sometime later, I found the line of code. See, it removes leading zeros from the model version.
16:24
All of them. Even if zero is the only thing in the version ID. Do you know what the problem was? I have one model version per file, so I used zero as the model version in all of the files. I had to repackage the model and change version 0 to version 1.
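The transcript does not show the offending line, but its effect was presumably something like this illustrative snippet:

```python
# Illustrative only: stripping ALL leading zeros makes version "0" vanish.
model_version = "0"
normalised = model_version.lstrip("0")
print(repr(normalised))  # '' -- the version disappears entirely
```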
16:42
And then I tried running the script again. It worked. But I wasn't happy. You see, it was a machine learning deployment script. The data scientists need a few weeks or months to create an improved model that we might need to deploy. I will run this script several times per year.
17:02
Do you really want to find out that the script doesn't work anymore on the day when I'm supposed to deploy a new model? Of course not. So, there is one more thing I need to test: the scripts. I run my scripts in an AWS CodePipeline scheduled to run Monday to Friday in the morning.
17:22
It deploys the SageMaker endpoint, sends a few requests, compares the results with expected values, and removes the endpoint. If anything breaks, I will get an early-morning email with a sad message. That is not the best email you can see in the morning, but at least it is better than figuring out that the deployment pipeline doesn't work anymore
17:43
on the day when you urgently need to deploy the new model. But this does not apply only to the scripts. Do you have a pipeline that runs only once a month because it generates the monthly report? Please schedule some additional runs at least once a week. If anything breaks, you would still have some time to fix it before anybody notices a missing report.
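A scheduled smoke test of that kind could look roughly like the boto3 sketch below. The endpoint name, request payload, and expected response shape are placeholders, and the deployment step itself (the CloudFormation stack or deployment script) is assumed to have already run earlier in the pipeline.

```python
import json
import boto3

ENDPOINT_NAME = "smoke-test-endpoint"  # placeholder name

runtime = boto3.client("sagemaker-runtime")
sagemaker = boto3.client("sagemaker")


def smoke_test() -> None:
    # Send a known request to the freshly deployed endpoint...
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps({"instances": [[1.0, 2.0, 3.0]]}),  # placeholder payload
    )
    prediction = json.loads(response["Body"].read())

    # ...compare the result with an expected value...
    assert prediction.get("predictions"), "endpoint returned no predictions"

    # ...and remove the test endpoint so it does not generate costs.
    sagemaker.delete_endpoint(EndpointName=ENDPOINT_NAME)


if __name__ == "__main__":
    smoke_test()
```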
18:06
I have some homework for you. Here are two software engineering books that offer tons of useful ideas for data engineering. The first one is The Effective Engineer by Edmond Lau. It is a book about focusing on high-leverage tasks, optimizing the feedback loop,
18:24
shortening cycle time, and quickly validating your ideas to reduce wasted time. It is also a book about pragmatic automation and choosing the right metrics. Speaking of being pragmatic, the second book is The Pragmatic Programmer by Andrew Hunt and David Thomas.
18:43
This is a classic book about writing good-enough software, building prototypes, defining useful tests without over-engineering them, and making a minimal but still working version of your application as soon as possible. When you finish reading the software books, there is one more thing.
19:01
Some additional homework. This last book does not contain any code. It doesn't even mention code. It is a book about writing. A book about writing well. On Writing Well by William Zinsser. I'm recommending it for two reasons. First of all, because I would like to see more well-written documentation.
19:21
And second, because On Writing Well helps you to clarify your ideas, express them more precisely, and avoid needlessly complicated language. My name is Bartosz Mikulski, and this was a short list of time-consuming tasks which you have to do if you want to build trustworthy data pipelines. For more information, visit www.thepragmatic.org