Writing and Scaling Collaborative Data Pipelines with Kedro
Formal Metadata

Title: Writing and Scaling Collaborative Data Pipelines with Kedro
Title of Series: EuroPython 2020
Number of Parts: 130 (this talk is part 55)
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers: 10.5446/49948 (DOI)
Language: English
Transcript: English (auto-generated)
00:06
Hello, everyone. My name is Tom. I'm a data engineer and architect at Quantum Black, and this talk is writing and scaling collaborative data pipelines with Kedro. So let's go ahead and get started. A little bit about myself. I've been doing
00:24
data engineering for actually quite some time. I started out at Palantir back in 2013, which was actually before data engineering really became a thing. So I've been doing it for as long as data engineering has existed in the mainstream, I suppose. I also have a couple of hobbies: doing yoga, as well as teaching meditation.
00:47
So if anyone is curious about yoga or meditation, very happy to talk about those kinds of things here. So here I am doing a Sushasana 2 pose on two fingers. And I also run a YouTube channel
01:02
that talks about Kedro, and a little bit more information about that later. So the talk itself, Writing and Scaling Collaborative Data Pipelines with Kedro, could instead be called Architecting Your Pipelines with the Kedro Framework. As for the way that pipelines have been written up until now,
01:28
well, as I mentioned, it's only in recent years that data engineering has really become a thing. Unfortunately, the truth is that it's a relatively new discipline. And since
01:42
it's a relatively new discipline, there's not that much around it in terms of best practices or established methodologies. And as a result, the data pipelines that we write and the things that we do kind of go all over the place. And so I think this is something that Quantum Black
02:01
kind of observed and something that we wanted to address with Kedro. Really quickly, I also want to mention that my colleagues are in the Discord chat room, in #talk-kedro. One of them, Yetu, is actually the product manager of Kedro.
02:21
So if you guys have any questions, feel free to talk in the Discord itself. And that way, they can get access to your questions and perhaps answer them better than I can. Okay, so let me go ahead and go to the next one. I have a little bit of a demo right away
02:40
for us to play with. So let's go ahead and talk about pipelines. With pipelines themselves, it's a little bit easier to visualize some of these things. And we can begin to describe why data pipelines grow out of control, what the contributing factors are, and how we can reel that in. So let me stop this share really quickly and then re-share
03:01
one of my other screens here. Okay. So you should be able to see a Chrome window. Is that true? Yes. Here I have. Yes, thank you. I think it should be popping up in just a second. Here it
03:36
is. So what I have here is a modification of a data pipeline visualization tool that we use.
03:43
This is called Kedro Visualization. It's a very powerful tool when coupled with Kedro. And it can demonstrate how data pipelines grow and change over time. So here is your typical data pipeline. It kind of already is going all over the place, right? The real question is,
04:05
how did we get here? Why is it so all over the place? And what can we do about it? So data pipelines, before they get to this kind of state, they start out always very small. You have a data source, and then you have a cleaning function or some kind of transformation
04:21
function on top of that data source. This is how they always begin. Once you run your transforming function, you get this output. And so in this case, we have this clean Iris, which cleans this Iris data and then outputs this cleaned Iris data. And then once you have this formatted Iris data, you want to start analyzing it. You want to begin your analysis. And so you have
04:44
another function that does your analysis, which then outputs your Iris analysis, obviously here. This is usually how data pipelines start out. They're just very simple, very typical. But as soon as you want to expand any of the work involved, of course things need to start offshooting.
05:03
And so you might want to take your Iris data and then start to split it out into training sets and example sets so you can try to predict some attributes of the data itself. You feed it through your modeling engine, you feed it through your accuracy reporting mechanisms,
05:21
and you can see how things start to expand out. And this is just for one data source. What happens when you have several others come into the picture? You have your companies, your shuttles. You also want to do this very similar cleaning, processing, and saving of that data. And then finally, you're going to want to connect everything together,
05:41
and this is where the mess starts to happen here. The truth is that this process itself is very similar to how software also kind of sprawls and grows. But because of how new data pipelines are to the software world,
06:04
we don't really have these established methodologies for controlling and constraining the way that pipelines grow. And as a result, we've become a little bit arrested in our development. There's a plateau on how big and how powerful our data pipelines can
06:21
become. And as a result, they never grow past the stage that you see here. But thanks to Kedro as a framework for growing, scaling, and writing these data pipelines, you can go from something like this to something even larger. And so here is actually another typical pipeline that you would see on a project. And I actually myself have worked on projects where we have pipelines that
06:46
are quite literally 10 times the size, and we have a developer count that is in the dozens. And because of the way that Kedro works, we're still able to collaborate effectively and properly maintain that data pipeline. So Kedro is a very powerful framework for how we architect
07:04
it. And so we're going to go in a little bit into those details now here. Let me go ahead and share my slides yet again. I'll share like this. Okay. So now we talked about the pipelines
07:22
themselves. How do pipelines grow? Well, pipelines grow because they're centered around our teams. We have teams of data engineers as well as data scientists. And the truth is that the typical data engineer and the typical data scientist, they're not necessarily
07:41
super compatible, right? They almost are opposites in many ways. So you'll see that data scientists may not be engineers, for example. They have a lot of amazing knowledge of these particular modeling topics and the way that we can transform data, but they might not necessarily have those engineering practices in their tool belt.
08:06
Furthermore, data science as a discipline tends towards more experimentation, and this requires having faster data pipelines, or rather being closer to the data. So you want to be as close to the data as possible when you're a data scientist in order to manipulate, experiment,
08:24
and kind of play with the data. Now on the other side of the spectrum, the data engineers have the opposite problems. They might not be scientists necessarily. They might have more engineering backgrounds. And so when you're a data engineer, you might not know a lot of the ways that someone can model pipelines or how you transform data. And furthermore, the things that
08:46
you're most interested in is order. You want to keep things kind of neat and tidy. You want to abstract away things as much as possible in order to sustain that robust stability. And there is one problem shared by both of them, which is that they both still must clean the
09:06
data. And so data cleaning itself is just like a whole topic by itself, but it's something that is inescapable. And it's quite interesting because the way that you clean the data changes the way that you analyze the data too. And so this is a collaborative problem
09:22
that both of these team members must tackle together. Because if the data engineer doesn't clean the data in the correct way, and the data scientist instead gets the data in an incorrectly cleaned format, then it's hard for them to do their analysis. And for the data engineer,
09:41
it's hard for them to take any analysis out from the data scientist. So cleaning, cleansing is definitely an important portion of this. And then finally, what you'll see is that data pipelines are almost always asked to be in production right now. The reason why you write these data pipelines and do these data analyses is because the business wants some facts. They
10:05
want some insights. They want you to take things out as soon as possible and give it to them immediately. So these are kind of the problems centered around data pipelines. Can we find a balance with these two guys? So how can we find a balance with collaborating with data engineers
10:22
and data scientists in a way that doesn't necessarily introduce so much friction, so much process, but still maintains that fluidity that the data scientists need, as well as that the data engineers need, in order to make their pipeline stable and still agile? And that's actually just an immediate need. The final need is, is it ready for the
10:46
handoff? One day, those data engineers and those data scientists may not necessarily be working on that data pipeline anymore. Is that data pipeline ready to be handed off to another team to pick up the baton? And I think this one is actually the most fascinating to solve,
11:03
because when you think about it, when you hand off a data pipeline and you're handing it off to another person, right? That's a person down the line, someone who has no knowledge of your data pipeline. The truth is that that person can actually still be you when you think about it, right? Because in the future, two months, three months, four months in
11:24
the future, you'll find that you might not remember anything about your data pipeline. And when you come back to the data pipeline, you look at it and you have no idea what's going on. So effectively, you yourself can be that person in the future that you're handing the baton off
11:41
to, right? This is your past self and then your future self. And so is your past self really creating a pipeline that your future self can handle? I think that's something that we definitely need to keep track of. And so these problems are things that Quantum Black discovered in their work. And so Quantum Black, if you're unfamiliar,
12:03
they're a startup that came out of London, very famous for some of the cases that they worked on in terms of data analysis for F1 cars. And so McKinsey picked them up and brought them in as their data science, data engineering arm of the firm. And so Quantum Black, they do hundreds
12:21
of these projects all over the world for data pipelines and data science and data analysis. And they found that they kept running into these similar patterns across all of these different projects. And so instead, what they did was they tried to codify these processes and bring them
12:41
back into Kedro. This is in order to maintain and make it easier, not only for the data scientists and data engineers to work together, but also to allow handoff to the clients that we would work with. And so QB, QuantumBlack, open sourced Kedro last year, and it's been growing ever since.
13:03
And so, yep, why does Kedro exist? It really is that collective learning, trying to deliver those applications. And our product mission, I think, is really fantastic. It's this empathetic intention. How can we tweak our workflows so that our coding practices are the same?
13:24
And so I like this word empathy here because it really is important to think about code in a way that allows other people to help you with the code and you to help other people with code. And so I kind of think of this as like almost altruistic programming in a way,
13:41
where the way that you write the code is not for your own selfish intention to get things done right now, but really for the benefit of people who are going to be reading that code later on. And what I found is that that return on investment of making the code readable, maintainable and understandable is really, really beneficial. And this comes and pays in dividends
14:03
later down the line. Okay. So how does Kedro solve these problems? So let's think about how data pipelines really are set up and how we can break them down. So let's take an example here. Let's imagine audio as data. So literally just like your
14:24
audio signals, let's just imagine those as data. We can think of those as things that we would push through data pipelines, et cetera. And so for audio, you have standard inputs and outputs. You have standard mechanisms that can take input from the
14:42
environment and then output things into somewhere else. And so we have microphones, we have amplifiers, we have compressors. There's a lot of technology that goes into audio engineering and we want those inputs and outputs to be standard. Next we also want to have these
15:03
kind of transforming mechanisms here. And so actually the one that I mentioned earlier, these kind of like microphone compressors, for example, or these kind of mixing boards where you can modify your equalization on the audio itself. And so these guys will transform the audio in a way that can affect it for the needs that you have with your low pass filters, your high pass
15:26
filters, et cetera. And this is something here that is often overlooked. And I would argue that this is one of the most important parts, if not the most important part here: you want to be able to redirect your output. This is actually very similar to abstracting your API.
15:44
We want to make sure that each of the components in our audio system can easily talk with other components in the audio system. So when you have a microphone input, you want to be able to put that microphone's output either into your mixing board, into those compressors, into those filters, et cetera, et cetera. So having this ability to plug and
16:05
play different portions is really vital there. And then finally you want to have this convention for organization. How do we actually structure our audio engineering setup in order to
16:21
most effectively do our work? And so here we have a setup. You got your microphones here. You got your audio mixers here. You got your computer over here. You know where everything is. And because you know where everything is, you know where to find things, you know where to adjust things, and you know how to tweak things as you desire.
16:42
And so let's bring these concepts back into pipeline building. And so we have here a quick demo also regarding Kedro. And so let me go ahead and pull up my, where is my mouse? My mouse has
17:04
disappeared. There it is. Okay. We're just going to stop the share really quickly. Make sure that I have this guy here in place. And then let me share this screen once more. Okay. Great. So here we have just our CLI here. And inside of the CLI, it's very simple to
17:32
go ahead and get started with Kedro. Actually, before we go ahead and do this one, let me share a different screen, which shows an example of what data pipelines can look like,
17:42
just like as a raw example. So here we go like this. This is the one here. Okay. I think that that should be showing up. Do you see a Jupyter notebook? I think that's available now. Okay. So here we have an example like pipeline, which we're breaking down
18:09
Iris data. And so the truth is that this is what you see in a typical data pipeline, right? You have like a single Jupyter notebook. Oh, this is the wrong one. This is the one here. Okay.
18:27
You have a single Jupyter notebook, and you're suddenly inundated with a whole bunch of programming. There's a whole bunch of code here. There's a lot of things going on, and you don't really know how to approach this guy, right? And if you're lucky, you're going to
18:42
find that there's going to be functions, things are going to be broken down in ways that you can understand. But more often than not, you're going to find notebooks that just have code splatted on there, all doing these kinds of different things, and it's
19:00
difficult to trace and understand. So what happens usually is that you just go through, you look at the notebook, you begin to like run the notebook, and then you hope that it works, right? And of course what will happen is you're going to be missing out on some data, and you don't know where things are, you don't know how things are tweaked, you don't know where parameters sit, and that's an unfortunate thing that happens here.
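To make that concrete, here is a rough, made-up sketch of the kind of notebook cell being described, where paths and parameters are baked straight into the code. The file path, column names, and the 0.2 split ratio are illustrative assumptions, not taken from the actual notebook shown in the demo.

```python
# Illustrative reconstruction of a hard-coded notebook cell (not the talk's
# actual notebook). The path, column names, and split ratio are assumptions.
import pandas as pd

df = pd.read_csv("/Users/someone/Downloads/iris.csv")  # hard-coded path

test_data_ratio = 0.2  # hard-coded parameter, buried in the middle of a cell
df = df.sample(frac=1, random_state=42)  # shuffle the rows
n_test = int(len(df) * test_data_ratio)

test = df.iloc[:n_test]
train = df.iloc[n_test:]

train_x, train_y = train.drop(columns=["species"]), train["species"]
test_x, test_y = test.drop(columns=["species"]), test["species"]
# ...and dozens more cells like this, each depending on variables defined far above
```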
19:23
But we can start to break this down into those four components that we kind of mentioned earlier. And those four components are, right, the first one is the standardized inputs and outputs. So right here we have an input and we have these outputs here. Pandas data frames are, of course, our well-beloved mechanism for making our pipelines,
19:45
and they come with different reading and writing mechanisms. And so this is like an example of that standardized input and output. You can use this to interface with your system, to read our CSVs, and then output the CSVs as you please. So that's the standardization there.
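As a hedged sketch of what that standardized input and output becomes in Kedro terms: a dataset is registered under a name and then loaded and saved through the data catalog. In a real project this usually lives in conf/base/catalog.yml rather than in Python, and the exact import path for the CSV dataset depends on your Kedro version.

```python
# Minimal sketch of Kedro's standardized I/O via the Python API.
# Note: the dataset import path varies by Kedro version, e.g.
#   from kedro.extras.datasets.pandas import CSVDataSet   # older releases
#   from kedro_datasets.pandas import CSVDataset          # newer releases
from kedro.io import DataCatalog
from kedro.extras.datasets.pandas import CSVDataSet

catalog = DataCatalog(
    {
        # "example_iris_data" becomes a named dataset any node can read or write
        "example_iris_data": CSVDataSet(filepath="data/01_raw/iris.csv"),
    }
)

iris = catalog.load("example_iris_data")   # returns a pandas DataFrame
catalog.save("example_iris_data", iris)    # writes it back through the same interface
```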
20:04
Then we have our transformations, right? This kind of split data. So this split data here will take a data frame and then will split it out into our train X, train Y, test X, test Y. So in modeling, of course, this is just to actually train the model and then test the
20:21
model on the data that we're using. And so this right here can be considered that kind of transformation. These are actually pure functions, which will take an input, modify the input, and then give you an output. And there's going to be no actual side effects. So this is kind
20:40
of a transformation here. Next, we have these guys here. And so you see these three in a row. And this is where notebooks as pipelines kind of break down. It's like, how do we start to string all of these different data frames together, right? We're getting the inputs or we're getting
21:00
the outputs, and it becomes hard to figure out where things are coming from. You have to pray that someone has named your variables correctly, and you have to manually trace things in this manner. And so this is where things get broken down, because obviously there
21:21
is literally no convention for when you set up your notebooks and your normal data pipelines. And so this is where things start to break down completely. And now let me show you what this pipeline actually looks like inside of Kedro. So Kedro actually comes with a lot of really great command line functions. Let me share this screen.
21:48
And so in order to start a Kedro project, first you have to install Kedro. Very easy. You can get it with pip install Kedro. That will go to the PyPI package index, grab it, download it, install it. It's very simple, straightforward.
22:03
Then what you can do is you can actually use Kedro new in order to create a new pipeline. And this allows you to name your pipeline and then create the project. So we're just going to really quickly say EuroPython 2020 as the Kedro project, and then EP as the package
22:21
name. We will generate the example of the pipeline, and there we go. We've already created our pipeline. Let's open this guy up here, and I'll open this up in PyCharm. And now something that's really cool also, and I will show you guys that a little bit later if we have time,
22:40
is that Kedro now also comes with these kinds of starter templates, which allow you to modify how Kedro creates those initial conventions. But here, I'm just going to show you what Kedro comes with. And so you can see here
23:04
our configuration folder, a source folder, and then these data folders, logs, docs, and notebooks. And so all of these guys are actually related to that original template that we were talking about, right? How do we organize our pipeline? And so right from the get-go,
23:21
Kedro gives you an example of how you can start to organize things. And it helps us with our separation of concerns, right? So in the previous example, we didn't really know where data was going, where data was coming from, how we were reading the data. And not only that, but everything inside of that previous pipeline is actually
23:42
hard-coded, right? Everything inside of that pipeline was hard-coded. Let me share this desktop again, and I can show you that example. And so right here, for example, the read CSV path, this is hard-coded. This train one, train two, test one, test two, this is also hard-coded. And then there's also parameters inside of our functions.
24:02
We have our test data ratio, which is hard-coded here. All these things are hard-coded. And so as a result, you find that pipelines themselves are very difficult to maintain because you don't know where anything is. But thanks to Kedro, we make that easier. We put our configurations here inside of this configuration folder, and then we can instead parameterize how we start to
24:26
find our data sets, right? And so here we have the location of the data set and how we want to read that data set. So we're just using this pandas data frame to read it in, read in that CSV file. And then we use Kedro itself to kind of bring things together. And so in that other
24:47
portion here that we were talking about, we don't really know what the relationships are or how these are built. We instead have these pipelines. And so inside of our pipelines, we can start to see the relationships. And so we see those same examples here
25:02
of this train, model, predict flow taking these inputs and these outputs, and then putting them through the pipeline itself. And so actually we can even visualize this guy. Because everything is standardized, we can programmatically extract the pipeline itself and present it to the user.
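For reference, this is roughly what such a generated example pipeline looks like in code, in the style of the Kedro 0.16-era iris starter. The node functions come from the project's nodes.py, and the exact names and signatures vary a little between Kedro versions, so treat this as a sketch rather than the definitive template.

```python
# Sketch of the example pipeline definition (names follow the iris starter;
# details vary by Kedro version).
from kedro.pipeline import Pipeline, node

# split_data, train_model, predict and report_accuracy are the pure functions
# from nodes.py; "example_iris_data" and "params:example_test_data_ratio" are
# entries in the data catalog and the parameters configuration.
from .nodes import split_data, train_model, predict, report_accuracy


def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                split_data,
                inputs=["example_iris_data", "params:example_test_data_ratio"],
                outputs=dict(
                    train_x="example_train_x",
                    train_y="example_train_y",
                    test_x="example_test_x",
                    test_y="example_test_y",
                ),
            ),
            node(
                train_model,
                inputs=["example_train_x", "example_train_y"],
                outputs="example_model",
            ),
            node(
                predict,
                inputs=dict(model="example_model", test_x="example_test_x"),
                outputs="example_predictions",
            ),
            node(
                report_accuracy,
                inputs=["example_predictions", "example_test_y"],
                outputs=None,
            ),
        ]
    )
```

Because every input and output is referred to by name, the relationships between nodes can be traced, and visualized, without running anything.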
25:22
And so if we go here to this webpage, we can see that Kedro visualization. And so suddenly you begin to understand how your pipeline is built. It becomes easier to explore and easier to understand because we are separating things into
25:43
these different concerns here. And so, as that is the case, I only have a few minutes left, so let me just rush through these guys really quickly. We have that catalog of those standardized inputs. We support a lot of different inputs and outputs. We have the
26:01
nodes and pipelines. So the nodes are those transformations themselves. The pipelines pull things together. The configuration, this is where we have our variables that we would otherwise be hard coding kept inside of one folder so you know where they are, as well as the project template. And so this is the standardization of everything together. And once we employed
26:23
Kedro, we found this consistent time to production, reusable analytic code storage, increased collaboration, and even upskilling of our developers who otherwise would not have exposure to these kinds of software practices. And so of course, pip install Kedro, we can visualize. And then we actually have some deployment mechanisms built in. So
26:42
Kedro Docker, Kedro Airflow. So you can immediately deploy these pipelines as you are, I mean, as they are. And we have a great support team. So we have our own Slack channel internally. And then we have Stack Overflow, Read the Docs, and then our GitHub, which come back into our Slack channel. And we also have a budding community. I run a YouTube channel
27:06
called Data Engineer One, where I mainly talk about Kedro. So if you'd like to learn a little bit more, I mean, I've got a ton of videos there talking about Kedro there. And I think that we have about two more minutes left. So why don't we go ahead and open the floor
27:24
for maybe like a few questions. And I think I probably rushed through that last bit. But I'm sure you guys have a lot of questions. And I think there's a lot of great stuff to talk about on Kedro. And so you can find us there inside of the Talk-Kedro discord channel,
27:45
where again, the product manager, as well as one of our tech evangelists is there available to talk about Kedro a little bit more. But going through this final example here, normally the hardest part about a pipeline is figuring out how to run it.
28:00
And thanks to Kedro, we do have this ability to just simply move into the directory and then type in Kedro run. And this will run our Kedro pipeline. And so this standardization allows anybody who is familiar with the Kedro framework to enter into any other Kedro project, run the Kedro project as they wish to, and break it down and understand it as necessary.
28:25
And so I think this is why some of the benefits of Kedro there become evident as you begin to use it and you begin to expand your data pipelines and collaborate with the rest of your team members. And I think that's time for me. Thank you very much for having me.
28:41
And hope I can come back again and speak with you guys soon. All right. Thank you so much. And- Thank you. Actually, yeah, technically we're kind of towards the, we're kind of right up against the clock here, but as the closing session is not until, well, it's 20 minutes from now. So I think,
29:06
I don't think, and there's no one else behind you, so I don't think there's a problem in asking a couple of questions now. Cool. So one is: is it possible to just use the Kedro viz feature by itself?
29:22
Yes, absolutely. And so Kedro viz by itself is an open source library that you can use with any kinds of node or graph structures. It just takes a simple JSON file, which shows the links between different nodes, plus a list of nodes, and then you can actually use it.
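As a purely illustrative sketch of that idea, the JSON is essentially a list of nodes plus the links between them. The field names below are a guess at the shape, not the actual Kedro-Viz schema, so check the Kedro-Viz documentation for the real format.

```python
# Illustrative only: a node list plus the edges between nodes, written out as
# JSON for a standalone visualization front end. Field names are assumptions.
import json

pipeline_graph = {
    "nodes": [
        {"id": "example_iris_data", "name": "example_iris_data", "type": "data"},
        {"id": "split_data", "name": "split_data", "type": "task"},
    ],
    "edges": [
        {"source": "example_iris_data", "target": "split_data"},
    ],
}

with open("pipeline.json", "w") as f:
    json.dump(pipeline_graph, f, indent=2)
```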
29:43
And in fact, we have a React component, which means you can embed that visualization into any kind of front end that is using React. So it's actually really quite cool. All right, cool. And then the other question I have here is, Kedro configuration, is it difficult or no? So for the configuration itself,
30:05
it's actually quite straightforward to set up the pipeline. And then for the configuration of your nodes, for example, inside of data science, in that split_data function, we had actually a hard-coded value for what the split ratio was. And so that's
30:24
here inside of this example, Jupyter notebook. We have a 0.2. In Kedro, the way that it works is that pipelines, you give to the pipeline the name of the data asset that you wish to use. And so here we have the iris data as a data asset, as well as the parameters. And so your
30:44
parameters actually become data assets by themselves, which means that you can actually keep all of your parameters inside of a parameter folder or a parameter configuration. So here we have our example test data ratio is written right here. And so that means that we don't
31:00
need to change any hard coded values if we wish to update our parameters and then rerun our pipeline. So in this example, I can easily change the ratio from 0.2 to 0.6 and then immediately rerun the pipeline and then get a different output there. So it's very, very easy to handle the configuration.
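To spell out the mechanism just described, here is a hedged sketch: the ratio lives in configuration (conf/base/parameters.yml would hold a line like example_test_data_ratio: 0.2), and the node receives it by name, so changing 0.2 to 0.6 means editing the config file, not the code. The function body below is illustrative, not the actual starter code.

```python
# Sketch of a parameterized node: the ratio arrives as an argument rather than
# being hard-coded, and "params:example_test_data_ratio" points at the value
# defined in conf/base/parameters.yml.
from kedro.pipeline import node


def split_data(data, example_test_data_ratio):
    """Split a DataFrame into test and train; no hard-coded 0.2 inside."""
    n_test = int(len(data) * example_test_data_ratio)
    return data.iloc[:n_test], data.iloc[n_test:]


split_node = node(
    split_data,
    inputs=["example_iris_data", "params:example_test_data_ratio"],
    outputs=["example_test", "example_train"],
)
```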
31:27
Excellent. Thank you so much. So anyone who wants to chat with Tom or any of the rest of the Kedro crew, Quantum Black crew, check out the talk Kedro chat room on Discord and they'll all be hanging out there. And if I remember correctly,
31:42
you guys also have a sprint tomorrow, don't you? Yeah, that's correct. And I think we've opened it up to work with people to contribute and work on the Kedro project. And I myself have a lot of pull requests into Kedro. It's very easy. It's a great way to get started with open source projects because the community is very dedicated and we're backed by a lot of great engineers in
32:04
our London office. Awesome. So yeah, we are a little bit over time, everyone. So these other questions, please do repost those. So Diego and Steven, I do see you guys. Please do repost those over in the Kedro chat room. I really appreciate that. And everyone, thanks for joining us. So
32:24
stay tuned. In about 15 minutes, we're going to have the closing session. It is sad to see the end of another EuroPython, but hey, all good things must come to an end.
32:42
And I hope you've really enjoyed your time. Also, technically, we're only past the talks section, so we still have sprints over the next two days. So please do hang out for that. Also, after the closing session, we have some fun in the after party here in the Microsoft room. So that is Word Peril, the Python word game where you win absolutely nothing. So we'll be
33:02
taking volunteers from the audience. Maybe I'll see you there, Tom. Yeah, I don't need to go then. So anyway, anyone who's interested in joining in on that, that'll be starting at 2130 in this channel. So see you all, hopefully, in the next 15 minutes for the closing session.