Don't fix bad data, do this instead
Formal Metadata
Title: Don't fix bad data, do this instead
Title of Series: EuroPython 2024
Part: 107 / 131
Number of Parts: 131
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers: 10.5446/69427 (DOI)
Transcript: English (auto-generated)
00:04
Hi, you all. I'm happy to be here among you and see all these faces. How do you feel? Hungry before lunch? To be honest, I feel nervous. I feel nervous because I assume you all came here to get a high quality talk
00:26
about how to make the quality of data great, and what I've got for you is just bad news. No, you can't fix the quality of data.
00:43
It's just not possible. And the question is: does this even matter in the era of Gen AI? And I think it matters more than ever, because all the companies have access to this new technology.
01:01
And the only differentiator which we can affect is what we feed into this new technology. What data we feed into the models and how good these data are. So let's get started.
01:23
My name is Martina Ivayichova. I lead a data engineering tribe at kiwi.com. Kiwi.com is an e-commerce travel company. We help travelers from all around the globe to find cheap flights.
01:42
I have been with Kiwi for a bit more than three years, and when I joined Kiwi as a data platform team lead, one of the very first things which I did was to understand what are the main challenges our data
02:02
users, our data consumers within the company, face. And by data users I mean data analysts, data scientists, product managers, UX designers, anybody who needs data to make important decisions which affect customers, users, and the business. So we distributed a survey and
02:21
surprisingly, guess what the outcome of the survey was? The most prevalent answer was that the biggest issue is data quality. So what did we do? Guess what happens if you task a team of data engineers to fix data quality.
02:40
They will come up with all the great ideas: we need to set up data quality checks, we need to have the right tooling in place which is able to execute these data quality checks, and we also need to have a data cleansing pipeline which removes duplicates and
03:04
imputes missing values. And also, the data quality tooling which we have should be robust enough to test for all the standard dimensions of data quality, such as timeliness, freshness, accuracy, consistency and so on.
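To make this concrete, here is a minimal, illustrative sketch of what such checks can look like, written as plain pandas functions; this is not the actual in-house tooling, and the column names are made up:

import pandas as pd

# Illustrative only: simple pass/fail checks for a few standard dimensions.
def check_freshness(df: pd.DataFrame, ts_col: str, max_lag_hours: int = 24) -> bool:
    """Timeliness/freshness: the newest row must be recent enough (assumes ts_col is tz-aware UTC)."""
    lag = pd.Timestamp.now(tz="UTC") - df[ts_col].max()
    return lag <= pd.Timedelta(hours=max_lag_hours)

def check_completeness(df: pd.DataFrame, required: list[str]) -> bool:
    """Completeness: the required columns must not contain any nulls."""
    return not df[required].isna().any().any()

def check_consistency(df: pd.DataFrame, key: list[str]) -> bool:
    """Consistency: the business key must be unique, i.e. no duplicate rows."""
    return not df.duplicated(subset=key).any()

A real setup would run checks like these on a schedule against warehouse tables and alert when any of them returns False.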
03:21
So we did all of this and the situation was even worse. Why so? Because now after a couple of months we had unhappy data consumers because the situation did not change significantly and
03:43
unhappy data engineers, because they had tens of pipelines to maintain, they had many, many alarms and notifications on Slack about failed pipelines and failed data quality checks. So
04:01
I have been talking about data quality for something like five or seven minutes without even getting to the point, or getting to the question, of what data quality is. And I had some informal conversations, maybe over coffee in the kitchen. I remember I had a
04:23
discussion with a pricing business analyst and I asked him like, okay, you complain about data quality. What exactly do you mean by that? And he told me yeah, of course Martina, I will explain this to you. So we have this risk-based product
04:41
and in order to price it properly we need to understand, on a per-transaction level, what are the exact costs associated with each transaction, each order. And today we have this data, but only as of the time of the purchase. So they are just expected costs,
05:04
projected costs, and since this is a risk-based product, some of the costs are incurred only later. So we are missing this information. And this is preventing us from having a proper pricing strategy. And I was like, well, I
05:23
understand this is bad, but how does this relate to data quality? We either don't log the data or we don't calculate them. The data are just not there. How can I complain about data quality? It's the same as if I went to a grocery store wanting to buy bananas and they had just apples. Would I complain about the product quality of that grocery store?
05:47
Of course not. So I thought to myself this is just an isolated case until I talked to another colleague, the product manager responsible for third-party ancillary products.
06:04
So as I said, we are selling flight tickets, but at some point of the user journey our users might decide to not only book flights but also to book accommodation. So they decide to click on our offer where we
06:25
redirect them to a third-party partner site where they can also book the accommodation. And the thing was this product manager responsible for the ancillary products complained to me that Martina, look,
06:41
data quality is bad because in 30% of cases we don't know what was the exact touch point when our customer decided to leave to that partner website. And I was like, okay, this is bad, but like let's go and fix it. Let's go to web developers and
07:01
ask them to find a bug where this parameter in a redirect URL is not passed along and let's get things done. So the lesson learned here is that whenever the expectations of data consumers differ from reality, it is
07:22
manifested as issues with data quality. And the very worst thing about this is that these expectations are very often implicit, not manifested, not contracted, not promised. For example, I
07:41
saw this many times. I saw, let's say, data analysts going directly to the source database, finding a table named booking orders, finding a column called amount, and assuming that this column is the total amount in euros with
08:05
two-decimal precision, representing the amount of that order. And this is a totally implicit expectation, and the owner of that table, the software engineer who put the model in place, is not even aware of this assumption.
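The theme of the talk is turning assumptions like this from implicit to explicit, so as a purely hypothetical illustration, this is what that analyst's assumption could look like once it is written down as a testable check (table and field names are invented):

from decimal import Decimal

def check_amount_is_eur_total(rows: list[dict]) -> None:
    """Assert that `amount` is a non-negative EUR total with at most two decimal places."""
    for row in rows:
        amount = Decimal(str(row["amount"]))
        assert amount >= 0, f"negative amount for order {row.get('order_id')}"
        # an exponent of -2 or higher means at most two decimal places
        assert -amount.as_tuple().exponent <= 2, f"too many decimal places: {amount}"

Once such a check exists, the owner of the table at least knows that the assumption is being made.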
08:22
Well, so we understood that this is a complex topic, and as such it has to be tackled from multiple points of view. So when we want to address data quality, we need to have the right technology in place.
08:44
And I call it here hard measures. But equally important are so-called soft measures, and soft measures are the hardest to implement. By soft measures, I mean cultural measures, like bringing awareness, bringing culture.
09:03
I will talk about this in a minute. So what do I mean, what precisely do I mean by technology measures? First of all, data quality checks. And as I mentioned in the beginning, that was obviously one of the very first things which we did. By that time, there was no Great
09:23
Expectations package, as was mentioned in a previous talk. We actually built our own in-house tooling, very similar to Great Expectations. Then we abandoned it and got something from the market. We're quite happy with that. And then
09:41
there is something which we piloted, which we call data integration tests. There is a significant difference between these two, because data quality checks are an ex-post measure. They might help you to understand that
10:03
data are broken, so you can be fast enough to react before they reach your business. But it does not change the fact that the data are already bad. In contrast, data integration tests try to prevent this,
10:23
not at runtime, but already at development time. Let me continue now with the soft measures and elaborate on integration tests later on. Bear with me. As for the soft measures, we
10:40
introduced, or tried to introduce at least with some of the teams who had the biggest pain, a so-called data collaboration process. What does it mean? It means that there are multiple parties involved in every new initiative or product feature.
11:03
Product managers. Their job is to think about how we will measure success of the new feature which we are going to implement, and how can data help us to understand that result, and how can we use data to improve the product for our users.
11:23
Data analysts or analytics engineers are here to suggest and help product managers understand what can be done, what insights we can get, and to prepare the aggregated models, the clean data, which can be used later. And software engineers, who possess the specific domain knowledge,
11:44
they're responsible for keeping the data contract, that promise that the data will be there in such a shape, and whenever we are doing a new development, and we need to start logging new data, they will make sure this will happen. And
12:03
just in a previous talk in this room, I'm not sure how many of you were here, there was one sentence said by a previous speaker, that we can't trust what is on the input. And this is exactly what we are trying to fix here: trying to make sure that there is someone
12:21
guaranteeing what the source data look like. And then, another pillar of soft measures is product thinking. Not all of us drive Teslas. Not all data matter the same, and there is a significant cost
12:40
associated with maintaining good data. And let me explain this again on that example with ancillary products. When that product manager complained about the 30% of cases where this information is missing,
13:00
my answer should be: so what? Why does it matter? Does it matter because it looks bad on some of your dashboards? Then let's drop the dashboard. Or does it matter because we need this data to craft our merchandising strategy? Then let's get it fixed today.
13:23
And all of this will not happen without ownership. None of these tools and approaches can work without having someone who is accountable, who is accountable for that data, who is waking up
13:41
in the night when the SLOs are not met, when the data quality checks fail and trigger monitors; without really having ownership. And now the question is: who should own the data? I truly believe that data ownership should be as close to the source as possible,
14:03
because only those who can impact the data can really hold the full responsibility. You can have whatever cleaning pipelines and data tests you like; if data are not there, are missing or of bad quality, you as a data analyst will not
14:24
fix it. Well, so let me, as I promised, come back to the integration test. We run on a GCP stack and that's why I will be maybe a bit more technology specific.
14:44
Our interface between the operational plane and the analytical plane is a Pub/Sub topic, where events representing some business occurrence which happened in production are published. And with this topic there is an associated schema, a protobuf schema
15:06
typically, and from there the data are either stored to the warehouse, to BigQuery tables, or used for some streaming analytics, like a near-real-time view of the revenue, or stored in some GCS buckets and used for
15:26
feature calculation and ML model training. So this is the interface between the operational world and the analytical world. So this order-processing application,
15:42
besides doing the order fulfillment, publishes with each order an order event to this Pub/Sub topic. Each time we deploy a new version of that code, we first run a CI/CD integration test, which checks whether the
16:01
event matches our expectations. We are not testing only whether it matches the schema; we are trying to test the content of that data, whether the business assumptions we put there hold. So whenever a developer creates a merge request and the integration tests are
16:20
executed and an integration test fails, which means that the integration test needs to be adjusted, this is an indication that the data contract is probably going to be broken, and the downstream data consumers need to be aware of that, maybe even review and approve the merge request.
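For orientation, here is a hedged sketch of what the publishing side of such a setup might look like with the google-cloud-pubsub client; the project name, topic name, and helper function are placeholders, not the real service:

from google.cloud import pubsub_v1

# Placeholder names; in reality the payload is a protobuf-serialized order event
# that must match the schema attached to the topic.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-gcp-project", "order-events")

def publish_order_event(serialized_event: bytes) -> None:
    """Publish one serialized order event and block until the broker accepts it."""
    future = publisher.publish(topic_path, data=serialized_event)
    future.result()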
16:47
Maybe a quick example; you can check the code, give it a second.
17:02
So this is really a simple example. What I tried to demonstrate here is that we are trying not to test only that the data comply with the schema, but also that some business assumptions are turned from implicit to explicit. So in this case you see that we expect the amount to be
17:25
the sum of the base fare and the service fee. And for example, whenever someone adds a new type of fee, like a payment fee, it either has to reflect the reality that the amount has to be bigger, or there has to be a new field introduced, like a total amount including the payment fee.
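The code from the slide is not reproduced in the transcript, but a hedged pytest sketch of that kind of business assertion could look like this; the event is shown as a plain dict and the field names are illustrative, not the real protobuf message:

import pytest

def build_order_event() -> dict:
    # Stand-in for the code path in the order-processing service that builds
    # the Pub/Sub payload for a finished order.
    return {"order_id": "o-123", "base_fare": 100.00, "service_fee": 5.00, "amount": 105.00}

def test_amount_is_sum_of_base_fare_and_service_fee():
    event = build_order_event()
    # The business assumption made explicit: amount == base_fare + service_fee.
    # Adding a new fee type would break this test and force the contract discussion.
    assert event["amount"] == pytest.approx(event["base_fare"] + event["service_fee"])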
17:41
Well, so with all this being said, this still holds true: no, we really cannot fix data. But what you can do is make sure that your most crucial data are
18:05
protected as best as you can by integration tests, that you will know soon enough that the production data are either missing or not matching the business assumption and you can maybe act soon enough
18:20
before they reach your business. And third, you can make sure that at least for crucial initiatives before you even start implementing, you know how you will evaluate your success and what new data you need to start logging to understand if this was a success or not.
18:47
This is all from my side, but if you are interested in more tech content from Kiwi, check out our page. We have also a booth here. We have also some open roles. So feel free to
19:02
come by if you pass, and also we can talk now, or you can reach out to me on LinkedIn. That's all from my side.
19:22
Thank you very much for your talk. And now we have some time for questions. So if anybody has a question, they can come here where I'm standing right now and ask their questions. Thanks for a great talk. You mentioned
19:43
initially you implemented all the pipelines and alerts, and the data quality was then maybe not fixed, but you still run these data quality pipelines, right? Yeah, okay. I didn't want to go into the details in the presentation, but we have many of these pipelines.
20:03
They are not only for data quality, but also for data modeling. So we really adopted the Kimball methodology. We are organizing data in a consumable way for end users. So yes, we still have these pipelines, and they are checking the
20:21
assumptions on the source data, but they are also modeling data in a consumable way. Okay, so the question was how big a portion of problems you are able to capture by the integration tests compared to the, let's say, more traditional quality pipelines.
20:45
I would not put it that way; these are complementary things. So, you know, I think you always need both. And we have it on one project, so it is really hard to measure. I know that we need KPIs.
21:01
It's hard to give an answer of how much, but as a general rule of thumb, the closer to the source and the earlier you discover an issue, the lower your data downtime. And this is the crucial point. So it's not about how many issues you discover, but how soon you discover them.
21:21
Or that it will not even happen, you know. Okay, thank you. Hello, thank you for your talk. I have a question: you mentioned how you need to protect your most important data points, and that that's basically the only thing you can really focus on and do.
21:40
Who determines those most important data points? Because I believe that each product manager is going to say, yeah, my data is the most important. So what we did is that we categorized data into three tiers based on the following criteria. First of all, is this data
22:01
important for OKRs, like company objectives and key results? Like, is it feeding into strategic decisions? Then second, are these data going into financial reporting and financial statements, which have to be really, really solid? And third, does this data source, this data set,
22:24
really have many, many downstream dependencies? So out of these three questions, if two are yes, we categorize it as tier one. Those which have one yes we categorize as tier two, and the rest is tier three.
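As a rough sketch of that rule (the criteria names are invented for the example):

def classify_tier(feeds_okrs: bool, in_financial_reporting: bool,
                  has_many_downstream_dependencies: bool) -> int:
    """Tier 1: at least two criteria hold; tier 2: exactly one; tier 3: none."""
    score = sum([feeds_okrs, in_financial_reporting, has_many_downstream_dependencies])
    if score >= 2:
        return 1
    return 2 if score == 1 else 3

# Example: a dataset used in financial reporting with many downstream consumers is tier 1.
assert classify_tier(False, True, True) == 1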
22:43
Okay, thank you. Thank you very much for your talk. I agree that we need to improve data quality from the source, but how do you secure buy-in from stakeholders? As an analytics engineer or data analyst, you can happily gather accurate data from multiple sources in the data warehouse, but you know that the
23:04
best data should come from this team, but that team has to deliver their objectives for the quarter and does not have time to do refactoring and collect more data. So how do you secure buy-in from your stakeholders? This is a very, very good question. And this is the hardest part. And it took us, let's say,
23:24
two and a half years to go team by team, persuade them, explain to them, demonstrate the data quality issues which happened in production, demonstrate what happened because we as data analysts were not aware of these, you know,
23:42
changes, and then the reporting was wrong, or maybe some assumption made when building ML models was wrong. So we did go one by one, face-to-face conversations, explaining how bad it is and that this can help. And I'm not claiming that we fixed it everywhere,
24:02
but I'm very happy now seeing a slight change in the conversation, where software engineers from this or that team come and say, hey, we are implementing these new features, how should we track it? Would you prefer to capture it in this attribute or in that attribute? So yeah, it takes time, but it brings value, and it brings value where it is most
24:24
manifested, where the issues are biggest because there you can get the buy-in earliest. Nice. Thank you for taking questions. We actually have some more time for questions if anybody has
24:41
one. Otherwise, just thank you so much for your talk. It was really good. I think we all learned a lot and please, another round of applause.