
Reproducible & Deployable Data Science with Open-Source Python


Formal Metadata

Title
Reproducible & Deployable Data Science with Open-Source Python
Alternative Title
Reproducible and Deployable Data Science with Open-Source Python
Title of Series
Number of Parts
115
Author
Contributors
License
CC Attribution - NonCommercial - ShareAlike 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor, and the work or content is shared, also in adapted form, only under the conditions of this license.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Data scientists, data engineers and machine-learning engineers often have to team together to create data science code that scales. Data scientists typically prefer rapid iteration, which can cause friction if their engineering colleagues prefer observability and reliability.  In this talk, we'll show you how to achieve consensus using three open-source industry heavyweights: Kedro, Apache Airflow and Great Expectations. We will explain how to combine rapid iteration while creating reproducible, maintainable and modular data science code with Kedro, orchestrate it using Apache Airflow with Astronomer, and ensure consistent data quality with Great Expectations.  Kedro is a Python framework for creating reproducible, maintainable and modular data science code. Apache Airflow is an extremely popular open-source workflow management platform. Workflows in Airflow are modelled and organised as DAGs, making it a suitable engine to orchestrate and execute a pipeline authored with Kedro. And Great Expectations helps data teams eliminate pipeline debt, through data testing, documentation, and profiling.
Duration
43:26
Transcript: English (auto-generated)
Well, hello, people. We continue with the afternoon talks. Now we are going to have Lim talking about reproducible and deployable data science with open-source Python. How are you, Lim? I'm good, I'm good. Can you hear me okay? Yeah.
Perfect. Where are you streaming from? I'm streaming from the office, actually; I went to the office so that I have some private space to do this. Nice. Is this your first ever EuroPython? Yes, this is my first EuroPython. That's great, it will go well.
Well, if you are ready, we're going to share your screen. Okay, we are going to start. Good luck. Awesome, cheers. Can you see the slides? All right, hi everyone.
Very excited to be here today to talk to you about reproducible and deployable data science with open-source Python. Before I continue, I'm just going to check the chat to see if everything is okay. Yes, okay, everyone can see my slides and my screen, that's great. My name is Lim, I'm a software engineer. My background is in full-stack software engineering, and I mostly worked for startups before.
Some example companies I have worked with include Deliveroo and Memrise. Right now I'm working in the data domain at QuantumBlack Labs. QuantumBlack is the advanced analytics consulting arm of McKinsey, and I'm building data products for data scientists, data engineers and machine learning engineers to accelerate their projects.
I love open-source Python. Currently I'm a core contributor on Kedro, one of the open-source tools explored in this talk today. I'm limdauto everywhere on the internet; it's a nickname in Vietnamese. So, the agenda today: I will be setting the scene in which you need to deploy a realistic Jupyter notebook into production.
I will explore some of the challenges that you might face on this journey and how you can move from a notebook environment into a standard Python project with Kedro. Then I will explain how we can use Kedro's extensibility to integrate with other tools in the MLOps ecosystem, a few deployment strategies we can use to deploy the project, and how we go from one project to hundreds of projects in the future.
So, with that in mind, let's imagine the following scenario: a media organization wants to provide video recommendations to their users. As we have seen in real life, if done correctly this can have a material impact on the company's bottom line.
This is the organization's first attempt at adopting machine learning and data science in their workflow. One of their very talented data scientists has conducted extensive research into different algorithms and architectures for building recommendation systems, and she has organized her findings as a series of Jupyter notebooks.
A disclaimer: all of the notebooks and data science code in this presentation are adapted from Microsoft's Recommenders repository, where they detail best practices in building recommendation systems. I cannot recommend it enough. That's a bad joke, sorry.
When it comes to notebooks, there are a couple of different perspectives. On the one hand, the interactivity, fast feedback loop and convenience of a Jupyter notebook are almost unbeatable, especially for exploratory data analysis and rapid experimentation. On the other hand, notebooks pose some challenges for reproducibility, maintainability and operationalization. Specifically, notebooks make it hard to collaborate within a team. I know there are some really cool recent projects to address this issue, but out of the box a Jupyter notebook is essentially a single-player game. It's also quite hard to conduct code review in a notebook format, and some of the other code quality controls that we usually enjoy in software engineering are hard to implement, such as writing unit tests, documentation generation, linting and so on. And notebooks sometimes give you a false sense of security, because they cache their results.
When you look at the output cell of your notebook and see that it has the correct result, it might make you think that your code runs without errors, even though the logic has changed. All of these problems combined cause, I think, the biggest issue of them all, which is consistency and reproducibility in your project. In a very interesting study in 2019, New York University executed 860,000 notebooks they found on GitHub: only 24% of them ran without error, and only 4% of them actually produced the same results.
So, with that in mind, I'm going to explore some of the strategies we can use to turn a notebook environment into a standard Python project using Kedro. Disclaimer: I'm a core contributor to Kedro, so this talk will be biased. What is Kedro? It is a framework for building reproducible, maintainable and modular data science code by applying software engineering best practices, such as separation of concerns and versioning, to your code. It was created by QuantumBlack from our own battle scars of delivering data science projects to our clients. It is used at startups, major enterprises and in academia, and it is fully open source. But instead of selling you this tool from the top down, I would like to explore some of the principles that motivated us to build Kedro in the first place, and some of the problems it is trying to solve. The first problem is data management.
Data management in Jupyter notebooks has a few challenges. Whenever I come to a new notebook, my first instinct is to ask a few questions: what are the datasets used in this notebook? Where are they stored? How are the data loaded? What are the formats? Can I reuse my data-loading procedure for other similar datasets? How can I incorporate new datasets if necessary? Our example notebook is actually miles ahead of the curve, in that it factors out all of the data-loading logic into a reusable library for the MovieLens dataset. It's still not quite ideal, because the library is programmed directly against the specifics of this dataset and is hard to reuse later on; if you have more datasets, you will have to build more libraries. The way we solve this in Kedro is that we provide a declarative data management interface through a YAML configuration API.
It separates the what, the where and the how of data loading. When you stack all of these declarative dataset definitions together in a centralized data catalog, it gives you instant clarity on which important datasets are used and persisted in your project, even for non-technical team members. It supports a number of features, such as interpolation for dynamic values like environment variables. It also supports changing dataset definitions between different environments: local development, staging, production and so on. A common use case is to use a smaller dataset in local development for rapid iteration and bigger ones in staging and production. It also supports data versioning, partitioning and incremental loading through different dataset types and configuration, and it promotes security best practices by accessing data without leaking credentials into your code.
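To make the declarative idea concrete, here is a minimal sketch of what such dataset definitions could look like, expressed through Kedro's Python API rather than the usual conf/base/catalog.yml file; the dataset names, paths and dataset types are hypothetical and assume the relevant Kedro dataset packages are installed.

```python
# A minimal sketch of declarative dataset definitions using Kedro's DataCatalog.
# The same structure normally lives in conf/base/catalog.yml; names and paths
# here are made up for illustration.
from kedro.io import DataCatalog

catalog_config = {
    "movies": {
        "type": "pandas.CSVDataSet",              # how to load: pandas CSV connector
        "filepath": "data/01_raw/movies.csv",     # where the data lives
    },
    "model_input_table": {
        "type": "pandas.ParquetDataSet",
        "filepath": "data/03_primary/model_input_table.parquet",
        "versioned": True,                        # keep a timestamped copy per run
    },
}

catalog = DataCatalog.from_config(catalog_config)
movies = catalog.load("movies")  # business logic never touches the IO details
```

Switching an entry's `type` (for example to a Spark-backed dataset) then becomes a configuration change rather than a code change, which is the environment-swapping behaviour described above.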
It's also extensible through custom datasets: if you find a particular use case that we don't support out of the box, it's very easy to write a custom data connector and use it in your project. I think the biggest benefit of declarative data management is that it provides a consistent interface between business logic and the different IO implementations. It abstracts away the differences between data sources, processing engines and data formats (out of the box we support many of them), and it promotes reusability of the data connectors, because they are separated from your business logic, so you can swap them in and out without changing the data science code.
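As a rough illustration of such a custom data connector, here is a sketch built on Kedro's AbstractDataSet interface; the newline-delimited JSON format is just an example, not something from the talk.

```python
# A sketch of a custom data connector, assuming Kedro's AbstractDataSet base
# class and its _load/_save/_describe contract. The JSON-lines format is only
# an example.
import json
from pathlib import Path
from typing import Any, Dict, List

from kedro.io import AbstractDataSet


class JSONLinesDataSet(AbstractDataSet):
    """Loads and saves a list of dicts as newline-delimited JSON."""

    def __init__(self, filepath: str):
        self._filepath = Path(filepath)

    def _load(self) -> List[Dict[str, Any]]:
        with self._filepath.open() as f:
            return [json.loads(line) for line in f if line.strip()]

    def _save(self, data: List[Dict[str, Any]]) -> None:
        with self._filepath.open("w") as f:
            f.writelines(json.dumps(row) + "\n" for row in data)

    def _describe(self) -> Dict[str, Any]:
        return {"filepath": str(self._filepath)}
```

Once it is importable, the catalog can reference it by its dotted path, just like a built-in dataset type.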
In data science code, parameters and configuration are also important: the project parameters as well as the hyperparameters of your model. In the screenshot here you can see that I have configuration for the different tools I integrate into my project, such as Great Expectations, which we will see in a few minutes, as well as MLflow and Spark, all in the same place, and my parameters are managed the same way in a YAML file. The trade-off here is that while a YAML domain-specific language works very well for small and medium-sized projects, it becomes unruly for massive ones with hundreds of datasets, even with good IDE support. To mitigate these problems, Kedro supports splitting your data catalog into multiple YAML files.
It also supports templating to avoid repetition, at the expense of some readability, and you can use other YAML-native features such as reusable blocks to improve your configuration. I'm just going to show you a very quick demo of how this looks in real life in our project. This is our data catalog, located under conf/base/catalog.yml in my project. As I mentioned earlier, this is the base environment, but you can also create more environments, such as staging and production, down here to override your catalog definitions in different environments. I really like a feature in VS Code, the outline view, where you can see the outline of your catalog, so it's very clear which datasets are used in this project. To demonstrate the idea that we can swap different datasets in and out, I'm just going to change this one into a spark.SparkDataSet with file_format csv, and it will work the same way with a different processing engine, without touching the project code. Okay, so this is the data catalog, and I hope it gives you some idea of how Kedro can help you with the data management problems of Jupyter notebooks.
The next bit is about how you manage the code in your project. The challenges with managing code in a Jupyter notebook are that cells need to be run in a specific order, there are global-scope variables that may or may not have been initialized, it's hard to unit test specific cells in isolation, and it's still necessary to factor common logic out into Python utilities outside the notebook to prevent the notebook from becoming polluted. Out of the box, Kedro gives you a few simple but powerful coding patterns, as well as abstractions, to help you manage the code better in a project. The first thing is that business logic in Kedro is written as pure Python functions. There are a lot of benefits to this, but one of the biggest is that you can unit test them in isolation, and you can use other tools in the Python ecosystem that work on functions, such as decorators and composition, to help you write more modular code.
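For example, a test for such a function is just an ordinary pytest test; the sketch below uses a hypothetical cleaning function with made-up column names, not code from the talk.

```python
# A tiny illustration of testing a pure function in isolation with pytest.
# The function and column names are hypothetical.
import pandas as pd


def clean_ratings(ratings: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing ratings and duplicate user/item pairs."""
    return ratings.dropna(subset=["rating"]).drop_duplicates(["user_id", "item_id"])


def test_clean_ratings_drops_missing_and_duplicates():
    raw = pd.DataFrame(
        {
            "user_id": [1, 1, 2],
            "item_id": [10, 10, 20],
            "rating": [4.0, 4.0, None],
        }
    )
    cleaned = clean_ratings(raw)
    assert len(cleaned) == 1
    assert cleaned["rating"].notna().all()
```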
These pure Python functions can then be connected together into a bigger pipeline through a concept we call a node. A node is just a thin wrapper around a function, with inputs and outputs that are dynamically injected at runtime from the declarative datasets and catalog you have seen earlier. The pipeline shape is always a DAG, a directed acyclic graph, by design, so there are no cycles in your data flow. And algebraically speaking, a pipeline is just a set of nodes, so pipelines can be concatenated together to form bigger pipelines. In this example you can see that I have three nodes in my data processing pipeline: two that clean my ratings data and my movies data, and a third, create model input table, that uses the outputs of those two nodes as its inputs. That's my data processing pipeline; I can create my data science pipeline the same way, and in the end I just concatenate them together, because at the end of the day a pipeline is a set of nodes. It has these algebraic properties, and you can build pretty big pipelines this way, iteratively and modularly.
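Here is a minimal sketch of that node-and-pipeline pattern using Kedro's node() and Pipeline APIs; the function bodies, dataset names and node names are hypothetical stand-ins for the ones shown on the slides.

```python
# A minimal sketch of pure functions wired into a Kedro pipeline. Dataset
# names ("ratings", "movies", ...) refer to catalog entries; everything here
# is illustrative rather than the speaker's actual code.
from kedro.pipeline import Pipeline, node


def clean_ratings(ratings):
    return ratings.dropna()


def clean_movies(movies):
    return movies.drop_duplicates()


def create_model_input_table(ratings, movies):
    return ratings.merge(movies, on="item_id")


data_processing = Pipeline(
    [
        node(clean_ratings, inputs="ratings", outputs="clean_ratings",
             name="clean_ratings_node"),
        node(clean_movies, inputs="movies", outputs="clean_movies",
             name="clean_movies_node"),
        node(create_model_input_table,
             inputs=["clean_ratings", "clean_movies"],
             outputs="model_input_table",
             name="create_model_input_table_node"),
    ]
)

# Pipelines behave like sets of nodes, so they can simply be added together:
# full_pipeline = data_processing + data_science
```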
This is a demo we have online, but I'd also like to show you an example article that one of our users wrote on Medium, just to show a screenshot of their pipeline, which I think is quite massive. There it is at the end. This is a pipeline that the company runs in production; it's one of the biggest telecom companies in Indonesia, I think. One thing worth pointing out about the coding patterns in Kedro is that the topology of your pipeline is dictated by the data flow: as we saw before, you connect nodes using their inputs and outputs. So a Kedro pipeline is inherently data-centric, a bag of data, whereas if you are familiar with other workflow engines such as Airflow or Prefect, their pipelines are a bag of tasks and data artifacts. So in a way, with Kedro you already get a table-level lineage of your data for free, together with the transformation logic that produces it. It's not column-level lineage, but it's a good start.
So those are the coding patterns that help you extract the code from a Jupyter notebook and put it into a Python project in a maintainable way. The next big thing I want to talk about is the development experience in Kedro, because, as we know, the development experience in a Jupyter notebook is amazing, especially when it comes to exploratory data analysis and rapid experimentation. It also has really great communicative utility: for example, I can come to this recommendation system algorithm notebook and understand very easily what's going on. I think this is a great strength of notebooks that we try to match with the tooling we provide in Kedro. The first thing is that we provide interoperability with Jupyter notebooks for exploratory data analysis by allowing you to use Kedro constructs inside the notebook. For example, you can use the data catalog to load and save data within the notebook itself before you move on to writing code in an IDE. We also allow you to embed a pipeline visualization within a notebook, so you can visualize the shape of your data flow while you explore your data.
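In a notebook session started through Kedro's Jupyter integration, a `catalog` variable is typically injected for you, so that kind of exploration might look roughly like this; the dataset names are hypothetical and the exact variables available depend on the Kedro version.

```python
# Rough sketch of using the injected `catalog` object inside a Kedro-managed
# notebook session. Dataset names are hypothetical.
movies = catalog.load("movies")          # load a dataset defined in catalog.yml
movies["genre"].value_counts().head()    # explore it as usual with pandas

catalog.save("clean_movies", movies.drop_duplicates())  # persist results back
```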
Beyond the notebook environment, we provide a very powerful CLI to help you run your project iteratively. There's a run command that supports running the whole pipeline, running the pipeline in different environments, running sub-pipelines within the main pipeline, running a single node, and overriding parameters at runtime, so that, for example, I can try different hyperparameters for my model and run my pipeline many times with different parameters just to see how it behaves. There's a long list of run options, and most of them are powered by the fact that our pipeline is just a set of nodes, so you can filter it in any way you like.
We also allow people to add their own CLI commands beyond the provided ones, either through plugins (for example, the visualization tool you saw earlier is built as a plugin and provided as a command in Kedro, and you will also see another plugin later, for Airflow, where we create an Airflow DAG from the Kedro pipeline) or by creating your own commands within your project, as you can see down here where I create my own run command in the cli.py of my project. So it's a very extensible way to add more to your development experience in Kedro. Here are some examples of extensions built by our community: there is a kedro-mlflow plugin that provides commands to interact with MLflow, and there's a kedro-diff plugin, which is really cool, that shows you the diff of your pipeline between different git branches. Another tool that comes out of the box with Kedro is a powerful pipeline visualization, the Kedro-Viz plugin. It helps you develop and communicate your pipeline with a fast feedback loop: in my example here, when I change my pipeline definition, the visualization tool listens for changes in the file and automatically refreshes itself, so you can see that your pipeline shape has changed. It's also being actively worked on to turn it into an interactive data science IDE.
My product manager might kill me for saying this, but stay tuned. The last thing is that Kedro lets you scaffold new projects with a standardized project template. It's originally based on Cookiecutter Data Science, and it comes with a few tools out of the box, such as linting with flake8 and isort and code formatting with black. It supports more advanced setups, such as Spark initialization, through a concept of starters, and it supports custom starters tailored to your project's needs, such as specific CI/CD configuration. All right, that's it about Kedro. I would now like to talk a little bit about how you can use Kedro to integrate with other tools in the MLOps ecosystem. If you think about MLOps, there are many different ways people describe it these days, but one of my favourites comes from NVIDIA, where they model MLOps as a lifecycle.
As you can see, Kedro helps you with the collaborative development workflow in the middle, and it provides some tools out of the box to help you with data collection, data ingestion and data analysis. But what about all of the other responsibilities in the MLOps lifecycle? Instead of trying to become a do-it-all kind of tool, Kedro provides a very extensible integration mechanism for you to hook into these different needs in the pipeline lifecycle. For example: how do I run data quality checks after my raw data is loaded? How do I emit runtime metrics after a node runs, so that I can set up a monitoring system? And so on. Kedro provides this extensibility through a concept of lifecycle hooks that map to the MLOps lifecycle you saw before, and we have seen it used to integrate Kedro with different tools, like Grafana and Prometheus for monitoring, or MLflow for experiment tracking.
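As a rough sketch of the runtime-metrics idea, here is what a timing hook might look like, assuming Kedro's before_node_run/after_node_run hook specs; where the numbers are shipped (here just a logger) is up to you, and hook registration details vary by Kedro version.

```python
# Sketch of a lifecycle hook emitting simple per-node runtime metrics.
# It assumes Kedro's before_node_run/after_node_run hook specs; swap the
# logger for Prometheus, MLflow, etc. as needed.
import logging
import time

from kedro.framework.hooks import hook_impl

logger = logging.getLogger(__name__)


class NodeTimingHooks:
    def __init__(self):
        self._start_times = {}

    @hook_impl
    def before_node_run(self, node):
        self._start_times[node.name] = time.perf_counter()

    @hook_impl
    def after_node_run(self, node):
        started = self._start_times.pop(node.name, None)
        if started is not None:
            logger.info("Node %s finished in %.2fs", node.name, time.perf_counter() - started)
```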
In our example today, we will look at how we can automate data quality checking with Great Expectations using Kedro hooks. Great Expectations is a Python-based open-source library for validating, documenting and profiling your data. The way we do it is that we write a hook that runs before we save a dataset, to validate the data, so that if there are changes in the data with bad quality, we stop them from propagating down the pipeline.
I'm going to show you a live demo of this very quickly. This is the code editor, and this is my data catalog, as we saw earlier. In Kedro you can provide these hooks in a file in your project, hooks.py, so this is my hooks.py file, and this is my data validation hook using Great Expectations. When I initialize it, I initialize the Great Expectations data context using the configuration located in conf/base. This is the hook implementation; it uses the same hook mechanism as pytest, and in fact it is powered by pluggy, the library the pytest people made for building plugin architectures. This hook is called before_dataset_saved, and the idea is very simple: before you save a dataset, you try to get an expectation suite, which is a set of validations to run against the dataset, and if there is a suite that matches the dataset name, you simply run it with Great Expectations.
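A condensed sketch of that kind of hook is shown below; it is not the speaker's exact code, and it assumes Kedro's before_dataset_saved hook spec plus Great Expectations' classic pandas API, whose details vary between versions. The data-context path is also an assumption.

```python
# Sketch of a data-validation hook: look up an expectation suite named after
# the dataset and validate the data before it is saved. Not the speaker's
# exact code; Great Expectations APIs differ between versions.
import great_expectations as ge
from great_expectations.exceptions import DataContextError
from kedro.framework.hooks import hook_impl


class DataValidationHooks:
    def __init__(self, context_root_dir="conf/base/great_expectations"):
        self._ge_context = ge.data_context.DataContext(context_root_dir)

    @hook_impl
    def before_dataset_saved(self, dataset_name, data):
        try:
            suite = self._ge_context.get_expectation_suite(dataset_name)
        except DataContextError:
            return  # no suite for this dataset, nothing to validate
        result = ge.from_pandas(data).validate(expectation_suite=suite)
        if not result.success:
            raise ValueError(f"Data validation failed for dataset '{dataset_name}'")
```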
I have configured my project with one expectation suite, which matches my clean movies dataset. The suite is a JSON file, but Great Expectations also provides a Jupyter notebook interface for you to interact with it. When I run my pipeline, let me just do this quickly, this hook is called automatically, and after the validation runs we can open what's called the data docs to view the validation results. Actually, because I changed my data catalog earlier and left a typo in it, I'm going to change it back to a pandas dataset and run this again; I think that should work. Yes. The data docs are located under uncommitted/data_docs, and there's an index.html here that I'm going to open in my browser, just a second. This is the Great Expectations data docs, where you can see all of the previous runs of my pipeline and all of the validations from those runs. If you click on one of them, you will see which expectations were run, and if one failed, it will tell you why it failed.
So that is how you can use Kedro's extensibility to add automated validation checks to your pipeline quite easily, with just a few lines of code. The last bit I want to talk about is deployment: after all of this development effort, and after putting controls in place to ensure that your code quality and your data quality are pristine, we need to think about how to deploy this pipeline into production. To this end, Kedro supports a few deployment strategies. If your pipeline can run on a single machine, we support a single-machine deployment mode where you can containerize it using Docker, or package it as a wheel file, install it in your Python environment in production and run it like any Python package. It also supports a distributed deployment mode: if your pipeline cannot run on a single machine, you will need to split it up and run it on different nodes in a cluster. I will demonstrate how that looks today using Apache Airflow. The idea is very simple: we convert every node in your pipeline into an Airflow task, and the whole pipeline into an Airflow DAG.
As you can see in the screenshot here, my task flow in Kedro looks exactly the same as the task flow in my Airflow DAG. If there's time, I will show you a live demo of this in a bit. A very good question to ask is: why start with Kedro and then convert it into Airflow later, if they look exactly the same? As we saw earlier, starting with Kedro gives you the benefit of rapid development, much closer to that of a Jupyter notebook, and it focuses on data flow rather than task flow. It gives you the flexibility to stay simple: if single-machine deployment works for you, do that before you have to go distributed. It also gives you the flexibility to choose between different distributed orchestrators: if you don't have Airflow, you can go with Argo, Kubeflow, Prefect, whatever; the principle is the same. You convert parts of your pipeline into the primitive constructs of the orchestrator environment, and off you go. And there's a very powerful concept here that I would like to promote: your deployed pipeline doesn't need to have the same granularity as your development pipeline. In development we want as much detail as we can get, for all sorts of purposes, but in production we are often constrained by computing resources and by the production environment, so we might want to slice the pipeline differently and deploy it based on those constraints. In exactly this example, I have split my pipeline into just two tasks in my Airflow DAG: one does the data processing and one does the model training. Theoretically speaking, with Airflow you can run these two tasks on two different workers, where one might support Spark and the other might use a GPU if you use deep learning models.
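A rough sketch of that coarse-grained deployment might look like the DAG below, with each task invoking a named Kedro pipeline through the CLI; the DAG id, schedule and pipeline names are assumptions, and the kedro-airflow plugin can instead generate a node-per-task DAG automatically.

```python
# Sketch of a two-task Airflow DAG wrapping Kedro sub-pipelines via the CLI.
# dag_id, schedule and pipeline names are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="recommender",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    data_processing = BashOperator(
        task_id="data_processing",
        bash_command="kedro run --pipeline=data_processing",
    )
    model_training = BashOperator(
        task_id="model_training",
        # could be routed to a GPU-enabled worker via Airflow queues/pools
        bash_command="kedro run --pipeline=data_science",
    )

    data_processing >> model_training
```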
The last thing I want to talk about is how to go beyond a single project: how do these tools help you scale from one project to hundreds of projects? Basically, they do so by promoting reusability. You can reuse pipelines between projects using an abstraction we call modular pipelines; I didn't have time to cover this today, but it's in our documentation. Kedro helps you build reusable data connectors and reusable extension hooks, so beyond just Great Expectations you can build other hooks, as I mentioned before, for performance monitoring or experiment tracking. It helps you build reusable CLI commands and create scaffolding templates with starters, and you can publish all of these as open-source libraries for the community to use too. And that's it for my presentation today; all of the code for this project is hosted in this repository. Thank you for listening.
Thank you so much, it was really nice and well done. Now we have two minutes left; if you want, we can do a couple of the questions we have. Are you ready? Yeah. So the third question is: does Kedro support pipeline versioning, and if yes, how do you track dependencies? Okay, so, as mentioned before, a pipeline is constructed as pure code, so you can check it into version control. You would version-control your pipeline exactly as you version-control Python code, maybe with git, or with another version control system like Mercurial. As for dependencies, I suppose you mean project dependencies rather than dependencies between pipelines: in terms of project dependencies, they are tracked as in any other Python project with your standard tooling, such as requirements files. Well, we use requirements.txt and pip generally, but I think you can also use Poetry if you're more advanced, or more trendy.
And the next one is: what do you think are the main pros and cons of Kedro with respect to DVC? That's a great question. I'm not that familiar with DVC, so I can't say too much, but what I think Kedro helps with is the development workflow: it helps with collaboration, and it helps standardize your practices across your organization, so every team uses the same standards, which makes it easy to transfer between projects or to reuse code later on. Whereas DVC, in my limited understanding, is specifically concerned with version control of your data and your code, so I think these two tools are complementary; they're orthogonal, in my opinion.
Perfect, thank you so much, Lim. Really good for a first time. Thank you. The time is limited, so if people have more questions, please go to the breakout room, and Lim will be there answering the questions they have. Thank you. Thank you so much.