The perils of building a democratic data platform
Formal Metadata
Title: The perils of building a democratic data platform
Series: Berlin Buzzwords 2022
Number of Parts: 56
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/67180 (DOI)
Berlin Buzzwords 2022, Part 24 of 56
Transcript: English (auto-generated)
00:07
So, thanks for coming. Hope you had a nice break. We're going to talk a little bit about our experience and share some guidelines on how to build a democratic data platform, and the barriers that we found along the way.
00:22
And hopefully, if you're building something similar, you can avoid those. I'm Andrey Aziskis, a lead engineer in Streaming Infra at Nubank. Hi everyone, I am Joaquin Torres, also a lead engineer in Data Infra at Nubank.
00:41
And we're going to give you a bit of context about Nubank and the data platform built there, the evolution of the data platform and its perils, and hopefully come to a nice conclusion if we have time. So, starting off with Nubank — I don't know if you're familiar with it, but Nubank is a bank,
01:02
a financial institution in Latin America. It started nine years ago and operates mostly in Brazil, plus Colombia and Mexico. And we have this idea of fighting complexity to empower people. Brazil has more than 100 million people that were not in the banking system,
01:25
and Nubank gave a lot of those people a way to be introduced into this part of society. And why are we talking here? Because we also have an engineering office in Berlin. I want to give you a little bit of context very quickly about Nubank,
01:43
the organizational structure and a little bit of the stack, so we can contextualize the platform we've built. Nubank works in vertical units, which are self-contained teams that have all the roles needed to fulfill a certain function, and also horizontal units — for example, data — which provide the base layers that make certain processes of the company easier.
02:08
So, we have Data Infra, which provides the platform for people to build data products on top of. And at Nubank, since the beginning, we built a microservices-based architecture
02:21
because we knew — well, it's hard to know, but we had the idea — that we would need a lot of scale. We started very early with Kafka and distributed microservices, each microservice with its own database, which already creates a lot of complexity for data analysis. That's why the data platform started very early in the timeline of Nubank.
02:44
And not only that, but there are additional complexities in the architecture: we have many shards — the architecture is sharded by customer — so to get all the customers of Nubank, you need to query all the shards, for example. And now we have multiple countries, Colombia and Mexico.
03:02
All of those are complexities that we try to eliminate in the data platform. Here's an overview of a specific team. We have the mantra of "you build it, you run it" — a very DevOps approach. If a team owns a microservice, it maintains that microservice: everything related to reliability, guaranteeing SLAs and SLOs,
03:26
and also maintaining the database and everything it needs to operate — the team is responsible for that. Now, talking a little bit about the data platform, it has several layers. Because of those complexities of having many databases, many shards and countries,
03:44
the data platform team centralizes a lot of things into one unit to make it easier across the whole company to ingest data and build data products. So, we have an ingestion layer that automatically discovers all newly created databases and extracts the transaction log in a CDC kind of approach.
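The auto-discovery described here can be pictured as a periodic reconciliation between a service catalog and the set of databases already being ingested. This is a toy sketch under that assumption — the function names and database names are invented, and `start_cdc` stands in for whatever actually attaches a change-data-capture reader to a transaction log:

```python
def discover_new_databases(catalog: set[str], known: set[str]) -> set[str]:
    """Databases present in the service catalog but not yet being ingested."""
    return catalog - known

def start_cdc(database: str) -> str:
    """Placeholder for attaching a CDC reader to the database's transaction
    log; here it just records the intent as a stream identifier."""
    return f"cdc-stream:{database}"

# Illustrative state: two databases already ingested, one newly created.
known = {"payments-db", "accounts-db"}
catalog = {"payments-db", "accounts-db", "loans-db"}

streams = [start_cdc(db) for db in sorted(discover_new_databases(catalog, known))]
print(streams)  # ['cdc-stream:loans-db']
```

Run periodically, a loop like this means producing teams never have to register their databases by hand — new services are picked up automatically.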
04:03
We have a centralized processing layer where people are incentivized to add new data sets, contribute and collaborate. That's why we call it democratizing the data: everyone at the company has the power to add new data sets, create the right views,
04:24
and fulfill business needs using data, making data-driven decisions. And we also have a serving layer, which serves the data generated or calculated in the batch platform back to the microservices and transactional environments,
04:43
closing the circle completely. One very interesting thing that happened at Nubank was that growth was quite quick, and the platform grew with Nubank. One number that is quite interesting: we have more than 8,000 different databases,
05:05
so extracting data from all of that — if every team had to handle it by themselves, they would each do it in different ways. The platform provides a unified way, and teams basically don't need to think about it. We also read more than several petabytes every day,
05:24
and we write hundreds of terabytes every day. The graph on the right shows a very interesting pattern: the team stays effective as the number of data sets grows, so we don't need to hire as many people —
05:42
the number of people we hire does not scale linearly with the number of data sets included. Nowadays we have 70,000 data sets, with more than a thousand individual contributors adding data sets and creating different views. That is basically the context. We gave a presentation with much more technical details —
06:04
the choices we made, how we built the scheduler, how we do CDC, and everything else — at the Spark + AI Summit. For the rest of this presentation we are going to take a step back
06:22
and talk to you about how we evolved the platform, the phases it went through along the way, and the perils that we found — the main point of the presentation. In the very beginning, when the company was still small,
06:42
all we needed was a simplified query layer that allowed us to access the data from the different databases across the company. Again, we have a microservice architecture and this was complex from the beginning. But this does not scale very well and soon enough we needed something,
07:01
we needed an actual analytical environment, but we didn't have sufficient business use cases to build something that was general purpose just yet. So, the second phase of a platform of this kind tends to be we have an analytical environment but it only solves some concrete business use cases.
07:23
And you might think about an important business report that you have to generate. Here we gave the example of a customer financial report that we needed to generate. And the next stage of the platform is actually as the number of use cases grows
07:41
and as more parts of the company need this kind of functionality, making all of that general and available to more and more people. And this is where the democratic aspect comes along. We generalized that analytical environment, created solutions to ingest things in a consistent way across the company,
08:06
and also made it so that different teams were able to create datasets and have the data available in an analytics environment as well.
08:20
And finally, the last stage we'll discuss in the context of this presentation is getting to make automated decisions based on that data. We have a lot of people contributing and all parts of the company using that data. Again, we're talking specifically about us, but we believe this generally applies to companies on a journey similar to ours.
08:46
And the final stage is actually represented as a single arrow on this graph but there's a lot that goes into that purple arrow which is actually bringing the analytical data back to the transactional systems
09:00
and making automated decisions based on it. The perils that we want to describe follow this timeline, and the great majority of companies go the same way: first business-specific needs, then toward a general-purpose platform.
09:24
There are many traps we fell into and some we avoided, and here we try to reflect back and see what worked and what didn't. The first peril is basically a dichotomy: when to adopt complex technology, and when to decommission it.
09:41
Sometimes you build something that you thought was a great idea — very simple and really efficient — but you stick with that solution for too long, and you end up with a team of many people just pushing the solution uphill. The dichotomy is twofold.
10:03
On one side, it starts very early, when you're trying to fulfill the business need of letting people query the data. You can see signs that you're falling into the trap: adoption of your solution is too slow, people are not able to query — maybe someone created a crazy DSL instead of just using SQL,
10:20
and no one can understand it. Or you're trying to implement a solution that is much more advanced than what you need — maybe you have a couple of gigabytes of data and you're trying to use a big data system for queries — and you find yourself cutting too many corners, because you don't have the team
10:42
or the support needed for such a tool. The other side is when you hold on too long to the custom solutions you've built. You start to see things like: okay, this problem is slowing my team down, but I cannot solve it by throwing more money at it, the way you would scale up an instance —
11:02
you cannot, because the solution is bounded by a limiting technological factor. Or the team spends much more time on maintenance and operations than on releasing features. An example of applying these principles in practice: providing data for your business users.
11:24
One can start by thinking: now I will have a lot of data, so let's build something complex and complicated. But sometimes you can take a step back and ask: how can I do this as quickly as possible and provide the most value? Which is totally okay.
11:42
Maybe the company you're working at is not going to grow significantly in the next couple of years, so such a solution will just keep working fine. One example of that is going with managed tools like Stitch and a managed data warehouse, so you don't build the ETL yourself,
12:01
because loading data and doing CDC is kind of complicated — you just delegate it to providers, and that seems to be a good idea. And if you want to go a step further and say, okay, maybe I'm going to need more complex processing earlier, you can put a managed stream in between and land your raw data in a staging environment
12:23
where you can process it later on. Something else — OpenLineage, which was spoken about earlier today. Every company faces this issue a bit more in the general-purpose ETL part of things,
12:40
when they need to collect lineage and metadata about data sets, the transformations that happen, and versioning. What happened before was that many companies, including ourselves, built complicated — or maybe not even complicated, just unsophisticated — protocols and standards to collect metadata.
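Standards such as OpenLineage already define the shape of this metadata. As a minimal sketch of what adopting the standard looks like instead of inventing a custom protocol, here is a run event assembled in the OpenLineage JSON shape — the namespace, job name, dataset names, and producer URL are all invented placeholders, not real identifiers:

```python
import json
import uuid
from datetime import datetime, timezone

def openlineage_run_event(job_name, inputs, outputs, event_type="COMPLETE"):
    """Build a minimal OpenLineage-style run event as a plain dict.

    The top-level fields follow the OpenLineage spec (eventType, eventTime,
    run, job, inputs, outputs, producer); all names below are illustrative.
    """
    ns = "example-platform"  # hypothetical namespace for this sketch
    return {
        "eventType": event_type,  # START / COMPLETE / FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": ns, "name": job_name},
        "inputs": [{"namespace": ns, "name": n} for n in inputs],
        "outputs": [{"namespace": ns, "name": n} for n in outputs],
        "producer": "https://example.com/my-etl/1.0",  # identifies the emitter
    }

event = openlineage_run_event(
    "daily_customer_report",
    inputs=["raw.transactions", "raw.customers"],
    outputs=["reports.customer_financials"],
)
print(json.dumps(event["job"]))
```

Even a homegrown collector that emits events in this shape keeps the door open to swapping in standard OpenLineage tooling later.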
13:03
One important thing we see over and over is to always look at what the industry is doing, and try to adapt what we have toward something adopted in the industry. One example is OpenLineage — a standard that, if you weren't in the talk
13:21
earlier today, I really recommend looking at. Just as you have OpenTelemetry and other standards, it's the same idea. Instead of reimplementing a protocol that already exists in the industry, maybe adopt it —
13:40
even if you need a very specific implementation of that protocol, at least you're basing your implementation on an existing protocol, and if later on you need to change, that change is going to be easier. The general advice is to start from first principles
14:00
and really think about where to pick complexity. If you're building a system from scratch, what is the core, the thing that is going to change the least? Really invest the time into that, and avoid as much as possible — mostly in the beginning — writing a lot of custom code. Plus, as engineers we tend to really like incremental improvements,
14:22
and sometimes you really need to rethink the whole implementation. If you don't have code to be attached to, it's sometimes easier to redevelop and rethink your solution. The next peril that we wanted to talk about
14:43
is mistaking consolidation for centralization. When building a data platform, you're solving a bunch of common problems for the company, and there might be a tendency to believe that creating centralized systems to solve those problems
15:01
is actually a requirement. But the real requirement is having consistent ways to solve the problem — well-defined interfaces that can be leveraged transparently by people. We tend to believe this first starts showing up
15:23
when the number of use cases is picking up in the company: there are already a few use cases implemented, the platform is approaching the moment of generality, and more and more teams need this. A few of the ways you can see that you're falling into this peril
15:43
are that you start feeling the need to implement cross-cutting concerns like ingestion and scheduling, or a need to load data into an analytical environment that people can query using a familiar language like SQL, or the need to streamline some of the processing logic
16:00
that already exists in whatever tool you're using under common abstractions that hide away some of the complexity for the users of the platform. Two other things that normally indicate you're already suffering from this are that the operational burden on the platform team
16:21
starts to increase, and the number of systems the team owns gets bigger and bigger. I'm going to give an example to make that a little more concrete. Ingestion is a common concern: you can build a centralized system for a certain kind of database,
16:43
one of the common sources in a company. The process of ingestion tends to be consistent and similar for that source and building a centralized system is something that you can do and that system goes through all of the systems
17:02
across the organization, extracts the data and consolidates all of it there. But another way to look at exactly the same problem is building a distributed solution, where the ingestion component becomes a smaller component within the realm of each of the teams
17:21
that actually produce the data. The consequence of this, especially in an environment of fast growth like ours, is that the scope of the platform team becomes much more stable — no matter how much the number of teams grows,
17:41
that scope is not going to change significantly. Of course, this is more complex in terms of infrastructure — it is a distributed solution — but the centralized solution has to scale vertically a lot, and this creates a quite different incentive.
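One way to read "consistent interfaces, not centralized systems" concretely: the platform team owns a small contract, and each producing team ships its own ingestion component implementing it. A hypothetical sketch — the class names, method signatures, and data are all invented for illustration:

```python
from abc import ABC, abstractmethod
from typing import Iterable

class IngestionSource(ABC):
    """Contract owned by the platform team. Each data-producing team
    implements it for its own databases, so ownership stays distributed
    while ingestion behavior stays consistent across the company."""

    @abstractmethod
    def discover_tables(self) -> Iterable[str]:
        """List the tables this team exposes for ingestion."""

    @abstractmethod
    def read_changes(self, table: str, since: int) -> list:
        """Return change records (CDC-style) for a table after a checkpoint."""

class PaymentsTeamSource(IngestionSource):
    """A team-owned implementation; the returned data is faked."""

    def discover_tables(self):
        return ["payments.transactions"]

    def read_changes(self, table, since):
        return [{"op": "insert", "table": table, "offset": since + 1}]

def run_ingestion(sources: list) -> int:
    """Platform-side driver: iterates whatever sources teams registered
    and counts the change records pulled in."""
    total = 0
    for src in sources:
        for table in src.discover_tables():
            total += len(src.read_changes(table, since=0))
    return total

print(run_ingestion([PaymentsTeamSource()]))  # 1 change record
```

The platform's scope is then the interface and the driver; each new team brings its own `IngestionSource`, so the platform team's burden stays flat as the organization grows.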
18:00
So the main takeaway here is to ensure that ownership is distributed in a scalable way — and here we mean both organizationally and technically. It's not only about the solution scaling for the problem at hand; it's also about whether your organization is walking toward something that will put too much burden on a single team in the future,
18:23
giving it too much responsibility. Another recommendation is to course-correct early. People tend not to like owning things; if you keep a wrong ownership model for too long, people will tend to resist taking ownership of something
18:41
that they didn't own before — that's a cultural change, and those are hard and painful to address. Another thing is to look proactively for signs of the platform team being overburdened and having too many things on its plate already.
19:03
The next peril: every system has its limits. When designing a system you have two choices. One is to make the limits explicit and let the user know when they are going to be reached. The other is to leave them implicit,
19:20
and someone is going to find out — and it's going to hurt a little bit. It usually happens in the transition to a general-purpose platform, when you're opening up your platform to many different types of users: roles like business analysts, customer support, or data scientists.
19:40
All of them have different intuitions about technical systems, and they might not know exactly how things can go wrong. You start to see signs that you're falling into this trap when every day is a new surprise — every daily run has a new data set that is failing —
20:04
or when you have one data set depending on multiple data sets from other teams: implicit coupling. You have one data set that is shared by multiple teams,
20:22
and no one is actually accountable for making it work properly, which can lead to unintended downstream impacts. One example: imagine this simple DAG with 17 data sets all depending on each other. If you look at the leaf nodes, one has 14 transitive dependencies,
20:43
and that's a lot. But if you consider a DAG like this one, with many data sets and a lot of dependencies, and you have implicit dependencies across teams and no ownership of those data sets, it's not going to scale that well.
21:04
One example of a very simple limit that could at least spark the discussion, even if the limit is arbitrary at the beginning, is to limit the number of transitive dependencies a data set can have. Say your user — maybe a data scientist —
21:22
is creating a new data set: they found two data sets that look like they have the columns they need, and that's it. But there is a transitive lineage of things that need to be computed for those data sets to be right,
21:42
and if you have a limit in place — say your data set can only have 10 transitive dependencies — maybe that's going to spark a discussion, and the teams are going to redesign those four or eight data sets into two data sets.
22:00
That can happen constantly, and even if you increase the limits, at least you're generating the discussion of why the limits exist and whether they can be increased. That's a simple thing that can be done. The core principle is to identify the fragility of a system and introduce preemptive limits. In the context of a platform, maybe you can have a maximum running time a data set can take, a maximum allowed data size,
22:23
a maximum number of direct dependencies; if you're providing clusters, a number of nodes you can safely provide for a cluster; and a maximum data set running cost. Those are all examples of declaring limits very explicitly.
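The transitive-dependency limit described above can be sketched as a simple walk over the dataset DAG. The dependency map and the default limit of 10 are illustrative, matching the example in the talk:

```python
def transitive_deps(deps: dict, dataset: str) -> set:
    """All upstream datasets reachable from `dataset` in the dependency DAG."""
    seen, stack = set(), list(deps.get(dataset, ()))
    while stack:
        d = stack.pop()
        if d not in seen:
            seen.add(d)
            stack.extend(deps.get(d, ()))
    return seen

def check_dependency_limit(deps: dict, dataset: str, limit: int = 10) -> int:
    """Reject a new dataset whose transitive lineage exceeds the limit,
    forcing a design discussion instead of silent cross-team coupling."""
    n = len(transitive_deps(deps, dataset))
    if n > limit:
        raise ValueError(f"{dataset} has {n} transitive dependencies (limit {limit})")
    return n

# Illustrative DAG: new_report depends on two datasets that share deep lineage.
deps = {
    "new_report": {"a", "b"},
    "a": {"c", "d"},
    "b": {"d", "e"},
    "d": {"f"},
}
print(check_dependency_limit(deps, "new_report"))  # 6 transitive dependencies
```

Running the same check with `limit=3` would raise, which is exactly the point: the user who only saw two direct dependencies is confronted with the six datasets that actually have to be computed for theirs to be right.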
22:41
The next one is adopting a platform-centric perspective. By this we mean that as the platform grows, more and more use cases are implemented on it, and the distance between the platform team and all the teams it serves might get so big
23:02
that the fact that the platform is supposed to serve important business processes — and that those processes, especially the critical ones, need to be at the forefront of the platform's priorities — might be lost.
23:21
The way we see it, this surfaces when the platform is already general purpose and scaled significantly across the organization, with a lot of people using it and quite a high scale of adoption.
23:41
Some warning signs that tell you you're getting to this point are that there is no way of understanding the use cases, or of making sure that they work well independently and that they don't impact one another.
24:02
Another sign: since the distance between the platform team and the user teams necessarily keeps growing, communication with users becomes less and less frequent and less consistent. And finally, problems start to get reported by users
24:22
rather than found proactively by the platform team. The ways the platform is being used are not fully known, so patterns emerge and teams find problems before the platform team does. To make this concrete: here again is a small graph of data sets,
24:41
divided into layers. From the perspective of a platform team, these layers might make total sense. For example, the bronze layer is typically all the data sets that come from ingestion. The platform team might want to keep track of: are our ingestion systems doing well? Are we computing the ingestion data sets fast enough?
25:01
How is that doing? But if we take a different perspective on this exact same graph of dependencies, the picture is completely different. What we have are two business processes: one happens to be essential for the company, and the other is just not that important — a nice-to-have.
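One hedged way to make this business-process perspective operational: tag each use case with its criticality and delivery deadline, and alert on the use case rather than on the bronze/silver layers. This is a sketch, not anyone's actual monitoring system — the process names, datasets, and deadlines are all invented:

```python
from dataclasses import dataclass

@dataclass
class BusinessProcess:
    name: str
    critical: bool
    deadline_hour: int        # hour of day by which outputs must exist
    output_datasets: list

def overdue_processes(processes, ready_datasets, current_hour):
    """Processes whose outputs are not all ready past their deadline.

    Critical ones are listed first, so the central-bank-report kind of
    use case is triaged before the nice-to-haves."""
    late = [
        p for p in processes
        if current_hour >= p.deadline_hour
        and not all(d in ready_datasets for d in p.output_datasets)
    ]
    return sorted(late, key=lambda p: not p.critical)

procs = [
    BusinessProcess("central_bank_report", True, 8, ["reports.central_bank"]),
    BusinessProcess("marketing_dashboard", False, 6, ["dash.marketing"]),
]
late = overdue_processes(procs, ready_datasets={"dash.marketing"}, current_hour=9)
print([p.name for p in late])  # ['central_bank_report']
```

Note that a purely layer-based monitor could look green here — ingestion may be fine — while the one report the company actually owes a regulator is late.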
25:22
This needs to be at the forefront of the decisions and the architecture of the platform, because beyond understanding whether ingestion is working well, the central bank reports really need to be delivered — and if these cases are not two,
25:41
but dozens or hundreds, the platform team might lose track of this. The question becomes: how does the platform team ensure that the layers do not become more important than the use cases? The main takeaway
26:01
is to design for the critical business processes. It's not that the non-critical ones are not important and shouldn't be healthy as well; it just tends to be that critical business processes have more demanding constraints, and serving them well
26:22
tends to ensure the success of the others as well. While you can monitor SLOs for the layers — the concepts that are important for the platform — the platform team has a role at least in facilitating
26:42
the monitoring of the business processes and ensuring that those are at the forefront and well understood in the organization. Finally, nurture the user relationships. There is no secret sauce for this — it's hard, especially when companies scale a lot — but understanding people's needs, maybe by involving more people from product,
27:03
and having a good understanding of those use cases, is going to be critically important for the future of the platform. The next one: you have your general-purpose ETL, everyone's collaborating, you have a nice environment and so on —
27:20
but if you have democracy without accountability in a company, well, you don't want your coworkers, like those cats, just smashing the keyboard and creating a lot of datasets; things wouldn't be as nice after that. This usually happens in the middle of the general-purpose ETL phase, and you see a lot of questions from users.
27:43
"Oh, my dataset didn't run today." Or a business person comes to the platform team and says that the cost of the platform is increasing significantly from one week to the next; the business starts to get worried, and the platform team has to investigate why. Or there's no clarity on the lineage of a dataset.
28:02
In the example I mentioned before, someone is just trying to find the data they need. They don't care who owns the dataset or anything like that; they just want to create a dataset that depends on it. This leads to pipelines broken for longer periods of time, because you have datasets that no one actually knows,
28:21
or datasets with shared ownership where no one proactively goes and fixes them. So imagine this DAG: one team created some of the datasets, another team created others, and some datasets contain changes from both teams.
28:41
Who is going to be responsible if this dataset breaks? Who is going to be responsible for going there and fixing it? Usually, if you don't have clearly mapped-out ownership, figuring out who owns a dataset falls to the platform team: debugging the problem, then going to the team and asking, can you please fix your dataset? Or even if the platform team is on call,
29:03
it's going to have to deal with that itself. A better way is to treat datasets as data products: have a very clear set of public datasets and make datasets internal by default, so teams can create datasets as they want,
29:20
but those stay internal. If a team wants to expose a dataset, it becomes responsible for maintaining that dataset's quality and reliability. Then things like a dataset scoreboard become possible: you list the datasets you have, the owning team and the schema, using a little bit of OpenLineage, and then the SLA and the resulting score.
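The scoreboard idea can be sketched as a small record per dataset. This is a minimal sketch under stated assumptions: the fields (owner team, visibility, SLA, run counts) and the scoring formula are illustrative inventions, not the actual schema or scoring used in the talk.

```python
from dataclasses import dataclass


@dataclass
class DatasetScorecard:
    """One row of a hypothetical dataset scoreboard."""
    name: str
    owner_team: str
    visibility: str = "internal"       # internal by default; "public" is opt-in
    sla_hours: float = 24.0            # promised maximum staleness
    last_success_hours_ago: float = 0.0
    runs_attempted: int = 0
    runs_succeeded: int = 0

    def score(self) -> float:
        """Blend run reliability and SLA freshness into a 0-100 score."""
        if self.runs_attempted == 0:
            return 0.0
        reliability = self.runs_succeeded / self.runs_attempted
        if self.last_success_hours_ago <= self.sla_hours:
            freshness = 1.0
        else:
            freshness = self.sla_hours / self.last_success_hours_ago
        return round(100 * (0.5 * reliability + 0.5 * freshness), 1)


# A public dataset meeting its SLA scores high...
healthy = DatasetScorecard("orders_daily", "payments", "public",
                           sla_hours=24, last_success_hours_ago=3,
                           runs_attempted=30, runs_succeeded=30)
# ...while a stale, flaky dataset with no clear owner scores low.
stale = DatasetScorecard("legacy_report", "unknown",
                         sla_hours=24, last_success_hours_ago=96,
                         runs_attempted=30, runs_succeeded=18)
print(healthy.score(), stale.score())  # → 100.0 42.5
```

A consumer browsing the scoreboard can then weigh a 42.5-score dataset differently from a 100-score one before building on top of it, which is exactly the informed decision-making described next.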
29:41
So people can make informed decisions based on the health of a dataset. And one really important aspect: when you talk about money, things get real, so attributing the cost of running a dataset to its team is really, really important. The final one, and I'm going to gloss over it
30:01
because we're out of time, is overlooking the changes required for actually automating decision making. In our diagram, in our timeline, that's the moment where we reach the final stage. What this brings
30:21
is a need for automated validation mechanisms and automated recovery mechanisms. At this moment, all the problems we talked about that you didn't fix properly come back to bite you, because this is when the criticality of the platform rises a lot.
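A minimal sketch of what such automated validation plus recovery could look like: a freshly built dataset is only published if it passes checks, and otherwise the pipeline falls back to the last known-good version so downstream consumers keep working. The check names and the fallback strategy are illustrative assumptions, not a description of any specific platform.

```python
def validate(rows: list[dict]) -> list[str]:
    """Return the list of validation failures for a freshly built dataset."""
    failures = []
    if not rows:
        failures.append("dataset is empty")
    elif any(r.get("amount") is None for r in rows):
        failures.append("null amount found")
    return failures


def publish(rows: list[dict], previous: list[dict]) -> tuple[list[dict], str]:
    """Publish new data only if it validates; otherwise recover by keeping
    the last known-good version instead of breaking consumers."""
    failures = validate(rows)
    if failures:
        return previous, f"recovered: kept previous version ({failures})"
    return rows, "published"


good = [{"amount": 10}, {"amount": 20}]
bad = [{"amount": 10}, {"amount": None}]

data, status = publish(bad, previous=good)
print(status)  # the bad build is rejected and the old version is served
```

The important property is that validation and recovery run without a human in the loop; once decisions are automated, a broken dataset that silently propagates is far more costly than a stale one.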
30:42
When adoption of the platform is at full speed, the organization wants to use it for more and more things, so everything converges at this point. I'm going to gloss over the example, but we can talk about it offline. There's only one thing that I'm going to mention to finish,
31:00
which is that if there's one thing you can take away from this presentation, it's that you have to make sure your platform grows with both the business and the organization. That means it's not only the technical decisions that matter; it's creating the right incentives for teams, creating the right ownership model,
31:22
creating all the things you need for your growth. There's no correct answer, no perfect formula. But depending on the growth of the business and of the organization, you have to make sure those things scale side by side.
31:43
Thank you. Thank you, Andrej. Thank you, Joakim. Now you have a chance to ask some questions. Give me a sign and I'll come by with the microphone.
32:03
You're the first. Thank you for a fantastic talk. I think you summarized a lot of the mistakes I have made building platforms. I wish I had seen this talk before. One question I have is how do you get feedback
32:21
from your users, from the different developers, and how do you make sure that your platform engineers have both the context and empathy for the rest of the engineers in the company to build for them what they need and not what they think they need?
32:42
I'll start answering, and Mide can complete if he has something to add. I'm going to say that that's very hard, and we made the same mistake. We're still struggling with that, and we built this talk exactly so that people can identify those problems earlier. There are a few things that helped for us.
33:01
We talked about involving product earlier. The data platform team stayed a mostly engineering organization for way too long at Nubank, and having people whose role is actually understanding users, and making sure that's represented on the roadmaps,
33:21
that it's represented in the priorities of the platform, is very important. The other thing I would say is that anything that allows your organization to start shifting the ownership model in a way that, again, scales with the organization
33:40
and with the business is very important, because it's not only the platform engineers who need more empathy. In an organization as big as Nubank, just like we did for operations a few years back, more teams own their pipelines,
34:03
teams own their infrastructure. Of course, there are tools like Kubernetes that make all of this easier, but teams own these things these days, and it's no different for their data products, at least at the scale we're operating at. That cultural change is very hard, and it requires a conversation from both sides.
34:34
Hi, thanks again for that talk.
34:42
So we've been looking at OpenLineage for a project that I'm also working on, and I was wondering if you could speak to the use cases it has helped you solve. Can we get the microphone over so they can answer that?
35:04
Of course. So first, about OpenLineage: I thought the previous talk was really, really interesting.
35:22
We came up with this example exactly because we fell into that trap. At the time, nothing like OpenLineage had been built yet, so we built our own metadata system, and I have to say, we're still suffering from the issues it brought up.
35:42
But the way I would approach it, and the way we're thinking of approaching it in our next systems, is to use it as you would use OpenTelemetry: depend on protocols instead of concretions when building your systems, so that later on it's easier to switch out the actual components,
36:03
and, as you would when writing code, implement towards interfaces. The idea of using a standard protocol rather than building something custom applies here. So yeah, I think that's the general advice.
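The "protocols instead of concretions" advice can be sketched like this: pipeline code depends only on an interface, so the concrete lineage backend (an OpenLineage client, a home-grown metadata system) can be swapped without touching the pipeline. The `LineageEmitter` interface and the dataset names below are illustrative assumptions, not the real OpenLineage API.

```python
from typing import Protocol


class LineageEmitter(Protocol):
    """The only thing pipeline code is allowed to know about lineage."""
    def emit(self, job: str, inputs: list[str], outputs: list[str]) -> None: ...


class InMemoryEmitter:
    """A stand-in backend, e.g. for tests; another class could wrap a real
    lineage client behind the same interface."""
    def __init__(self) -> None:
        self.events: list[tuple[str, list[str], list[str]]] = []

    def emit(self, job: str, inputs: list[str], outputs: list[str]) -> None:
        self.events.append((job, inputs, outputs))


def run_job(emitter: LineageEmitter) -> None:
    # ... the actual transformation would happen here ...
    emitter.emit("orders_daily",
                 inputs=["raw.orders"],
                 outputs=["analytics.orders_daily"])


backend = InMemoryEmitter()
run_job(backend)  # swapping backends requires no change to run_job
print(backend.events)
```

Because `run_job` is written against the protocol, replacing the custom metadata system described above with a standard one later becomes a one-line change at the call site rather than a rewrite of every pipeline.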
36:22
First, let me check if there is a question from the online audience. No? So, thank you again. And one piece of information: the wardrobe closes at 18:00,
36:43
6 p.m., just a reminder if you have some stuff over there. Thank you very much.