
From Pandas to production: ELT with dlt


Formal Metadata

Title: From Pandas to production: ELT with dlt
Number of Parts: 131
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

Abstract: We created the “data load tool” (dlt), an open-source Python library, to bridge the gap between data engineers and data scientists. In this talk you will learn about how dlt can help you overcome typical roadblocks in your data science workflows, and how it streamlines the transition from data exploration to production. We will also discuss the pains of maintaining data pipelines and how dlt can help you to avoid common engineering headaches. Join us to learn best practices around data handling and managing failures with real-life examples!
Transcript: English (auto-generated)
So today we are going to talk about two different perspectives on the ETL — or ELT, if you want — process: the machine learning perspective and the data engineering perspective. I will be talking about the machine learning perspective. A little bit about me: my name is Violeta. I started as a data scientist in Nokia R&D, where I worked for four years, mostly training models, both classic ML and neural networks. Then I moved to MLOps engineering and worked on ML tooling, and finally, half a year ago, I joined dltHub, which is a company working on data engineering.
What was surprising for me is that both machine learning people and data engineers work with data and work on data pipelines and ETL, yet they talk about data really differently, and I had to learn a couple of new terms.
It was really funny at the beginning when our users, mostly data engineers, were asking me: "Hey, does your library work with data contracts?" And I'm like: "Yeah, it works with data. What do you mean?" But I learned these terms — schema evolution, semantic layer and so on — and what I now think is that ML specialists could learn a thing or two from data engineers.
As an ML specialist, you probably rely heavily on tooling: you don't want to implement backpropagation from scratch, you just use PyTorch, right?
And if you work with data — especially structured data, time series, tabular data — you probably use pandas a lot. It is the default tool for transformation, right? I also used it a lot, and I used pandas for all data-related things: for loading data, for memory management (at some point you discover that pandas can load data in chunks), for un-nesting data, cleaning data, transforming time series — pandas is especially good with time series — verifying data quality, checking the data against a schema, incremental loading and so on, and it works really, really well.
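As a small illustration of the chunked loading just mentioned, here is a minimal pandas sketch; the file path and column name are made up for the example:

```python
import pandas as pd

# Read a large CSV in chunks instead of loading it all into memory at once
# (file name and column name are illustrative).
chunks = pd.read_csv("events.csv", chunksize=100_000, parse_dates=["created_at"])

total_rows = 0
for chunk in chunks:
    # Basic per-chunk cleaning / verification
    chunk = chunk.dropna(subset=["created_at"])
    total_rows += len(chunk)

print(f"processed {total_rows} rows")
```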
It is especially good when you're prototyping, when you need to test a hypothesis really quickly, get test results, and you work in a local environment: you have all your data on your local machine, your Jupyter notebook runs on your local machine, and it works pretty fine. But when we go to production, it becomes a little bit more challenging.
Let me give you a couple of examples of such challenges. If you work with semi-structured data — usually data from APIs, or JSON which you receive from a Kafka topic or something like that — you need to manually understand the schema of the data, un-nest it, and write that code yourself. It takes a lot of time, and when your data changes — and it will change — you will need to debug it for hours. Pandas does not have support for automatic un-nesting.
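To show what that manual un-nesting looks like, here is a minimal pandas sketch; the record structure and field names are invented for illustration:

```python
import pandas as pd

# A nested API response (illustrative structure)
records = [
    {"id": 1, "user": {"name": "Alice", "country": "DE"},
     "orders": [{"sku": "A1", "qty": 2}, {"sku": "B2", "qty": 1}]},
]

# Flatten the nested dict fields
users = pd.json_normalize(records, sep="_")

# The nested list needs a separate, manually written normalization
orders = pd.json_normalize(records, record_path="orders", meta=["id"])

# Any change in the API schema (new fields, renamed keys, missing lists)
# has to be handled by editing this code by hand.
print(users.columns.tolist())
print(orders)
```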
Speaking of data changing: when your model is working in production, your data will eventually change at some point. It could be a changing distribution or a changing schema, and you need to know this instantly, because you want to retrain your model or do different preprocessing. And if you want to do that, you need to either implement it manually or use a different tool than pandas.
Another thing: when you move into production, you usually move data around a lot — for your CI scripts, for your model training, for model testing — and you usually use Python scripts. You need to connect to all these systems, and it can become quite messy.
Memory management — my favourite. Probably at some point all of you have had to explicitly call the garbage collector. Pandas uses a lot of RAM, and sometimes it's not enough to load data chunk by chunk; you need more control over memory. And the last thing I'm going to talk about is incremental state.
Incremental state, or incremental loading, is another fancy data engineering term which basically means that I want to load only new data, with new timestamps — I don't want to reload everything — so I need to keep track of the incremental state. Pandas does not support this out of the box, and it's especially useful for your data pipeline's resilience, because if the connection breaks in the middle of your load, you don't want to start it from scratch, right?
So, all things said, we discovered that maybe pandas does not cover all our needs. It's still a really great tool for a lot of things, but probably not all, and this is just the tip of the iceberg. And now my colleague Adrian will tell you about the data engineering perspective.
So, I'm Adrian. About me: I've been working in data for 12 years now — five years in startups building data warehouses end-to-end, five years freelancing and consulting doing what we call "build and hire", so building a data platform and then hiring the team for it — and for the last couple of years I've been working on building dlt. My guiding values are simplicity, transparency and learning, which I also apply in my work.
From my perspective, part of the friction between ML people using pandas and data engineers building their kinds of pipelines comes from the different roles these people have. A data scientist will experiment: they will discover, explore, try things, validate. Data engineers, on the other hand, scale: they make things efficient, they DRY the code, they reimplement things to be resilient, they defend them with tests or data contracts, and they maintain pipelines.
The second issue: there's actually a lot of technical complexity to data loading that is not apparent when you're a junior. Usually, once you've been doing this for a few years, you've encountered all kinds of issues.
For example, the pipeline might break on a schema change, or on an out-of-memory error, non-deterministic loading, network issues — it will break on the last request of a long chain. You might be loading JSON as a string, you might be loading everything as a string, you might have 40,000 lines of JSON-extraction spaghetti in SQL. You might have the same thing in Python, and then you have mystery JSON parsing in Python: you don't know what it is, but you have to maintain it. Of course, if a type changes or the schema changes, your pipeline will break; you get race conditions when migrating schemas, when the data loading pipeline and the schema in the database are managed separately. All kinds of funny things, like 300-line methods called `main` when they could be reusing some code. And of course all the data pipelines are just a little bit different, so they get you when you try to fix them. And then you have the silent killer: pipelines that fail successfully, and you can't even tell — testing in production, right?
So, how do you close the gap? As Violeta was saying, there are lots of dev tools for ML, but for data engineering you don't really have much. Wouldn't it be nice if we had a dev tool for data engineers — a Pythonic tool that Python people can just use without having to learn it up front, so they don't have to keep rewriting all this boilerplate code that keeps breaking?
That would be nice, but the requirements are not simple. First of all, we want to be able to manage schema: we need to know what data we're loading, and we shouldn't have surprises — we don't like surprises. Data should be auto-normalized and typed, so when we go from complex JSON to flat tables, that should be clean and easy. Loading should be atomic and idempotent, so when things break they're easy to fix. The incremental state should be persisted. An extractor concept should exist, so when I'm doing all these API calls I can do them in the same way and stop reinventing the flat tire. Loaders should be able to talk to the incremental state and the extractor, so everything works together. We need memory management, we need parallelism, and it should of course be easy to develop locally, because that's where we develop — and when we go to production, things should stay pretty much the same. And of course we want resilience and retries, which is something basic that we all do.
So why don't we start with pandas? Very good question. You could think: okay, why don't I start with pandas' DataFrame-to-SQL and go from there? Well, the first reason is that it might be a little inefficient. The second reason is that you run into all kinds of complexities. So let's look at what it would take to build a complete data engineering pipeline from a pandas loader. Let's say we start with nested JSON. If we want to load it with pandas, we'll have to take that JSON, normalize it, un-nest it, flatten it and load it. That's fine.
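As a rough sketch of that naive starting point — a flatten-and-insert "pipeline" with no atomicity, retries, schema tracking or incremental state — assuming a local SQLite file as a stand-in destination:

```python
import sqlite3
import pandas as pd

# Illustrative nested record
records = [{"id": 1, "name": "Alice", "address": {"city": "Prague", "zip": "11000"}}]

# Flatten the nested JSON and append it to a table — and that's the whole loader.
df = pd.json_normalize(records, sep="_")

conn = sqlite3.connect("warehouse.db")  # stand-in local destination
df.to_sql("customers", conn, if_exists="append", index=False)
conn.close()
```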
Now, if we want our loading to be atomic — so that when the loading breaks we don't end up with half the data loaded and half not, which is a pain in the behind to clean up — we need atomicity: we need to tag the data and we need a rollback strategy. We want idempotency, so if something breaks we can just rerun our pipeline. We want memory management, so we stop running out of memory — because when you run out of memory, you might actually be killing the workers that are also loading other pipelines. We want incremental extraction and loading; for that we need incremental state storage, and we need to capture that state and inject it into the extractor. It's starting to get complicated. We want to add incremental loading with different merge strategies — for example, we want to be able to upsert the data, or historize it through a slowly changing dimension.
For that we would need to add that materialization logic. Now, let's say we want to standardize the way we extract data: we need to create some kind of extraction helpers. We want to make it efficient and scalable: well, we need async, we need parallelism, and we should be able to leverage them easily. Of course we want schemas — you know what they say: "I don't always use a schema." Oh yes, you do; whether it's schema-on-read or schema-on-write, you're using one. We need a normalization engine, because sometimes we have strange column names, and we don't want to manually type data and all of that — we want to recursively unpack the data, flatten it into tables and create join keys. Then we want retries: pipelines will break, they will break all the time, so you want them to be resilient. Then, finally, schema evolution — this is probably the reason why most pipelines break, whenever you have schema changes.
Finally, sometimes the data is not coming from a transactional environment: it might be events, maybe people are sending it randomly from the internet. You probably want a data contract, so you'd need to build a data contract engine. You want governance: you want to know where the data is coming from and where it's ending up, so you need some kind of starting point for lineage — for example, for debugging data quality issues, or simply for keeping track of personally identifiable information so you can stay compliant, with documentation.
So finally, when you've built out this whole pipeline, as you can see it gets quite complex — why the hell are we still using pandas at this point? Unless it's really the piece that fits together well with all the other pieces, you probably don't want to use pandas; it's actually quite easy to just insert the data directly. So, how would we load this data with dlt? Well, it would be a lot easier.
dlt — the data load tool — can take a generator, a JSON document, a data frame or a dlt source and just load it, basically. As you probably guessed, we both work at dltHub and we both work on dlt, the data load tool. It is an open-source Python library — emphasis on open source. You can use it, and what it does is load data from any given source (you can see them on the right) to any given destination (on the left), or vice versa, and along the way it will automatically extract the schema, normalize the data and integrate with the destination. And this is my favourite part: I don't need to know how to connect to S3 or how to load data to Snowflake — dlt will do this for me.
So what do I mean when I say that dlt normalizes nested data? Let's assume we have a JSON document: inside we have a parent, Alice, and a list of children. dlt will understand this is nested data and will produce two different tables, one for parents and one for children. Moreover, it will connect them through IDs — basically join keys, which you can use if you want to join the tables back together.
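A minimal sketch of that example; the pipeline, dataset and table names are illustrative, and it assumes the DuckDB extra is installed (e.g. `pip install "dlt[duckdb]"`):

```python
import dlt

data = [
    {"name": "Alice", "children": [{"name": "Bob"}, {"name": "Charlie"}]},
]

pipeline = dlt.pipeline(
    pipeline_name="family_demo",
    destination="duckdb",   # any supported destination would work here
    dataset_name="family",
)

load_info = pipeline.run(data, table_name="parents")
print(load_info)
# dlt creates a "parents" table plus a child table ("parents__children")
# linked back to the parent rows via generated join keys.
```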
So, as you can see, it's really easy to install: `pip install dlt` and go. It's easy to use because the learning curve is shallow and it's Pythonic — you don't need to learn any new framework or any new language. The other thing which I really like about dlt, as an ML engineer, is that it was built by really experienced data engineers.
So if you just work with Python and you don't have a lot of experience in data engineering, you can still build pipelines at a senior level. What are those best practices? First of all, schema evolution: if your schema is changing, dlt detects this automatically and just updates the schema. Moreover, you can use data contracts, so you can tell dlt what to do with changing data. If we say "evolve", there are no constraints: data changes in the source, data changes in the destination. We can say "freeze": if the new data doesn't fit the previous schema, dlt will raise an exception. You can say "discard row": in this case dlt will just drop the rows which don't fit the schema. And you can say "discard value".
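Roughly how those contract modes map onto code — the `schema_contract` argument and its values follow dlt's documented names as I understand them, and the resource itself is made up:

```python
import dlt

@dlt.resource(
    table_name="events",
    # Modes: "evolve", "freeze", "discard_row" or "discard_value";
    # they can also be set per entity, as in the dict below.
    schema_contract={"tables": "evolve", "columns": "discard_value", "data_type": "freeze"},
)
def events():
    yield [{"id": 1, "payload": "ok"}]

pipeline = dlt.pipeline(pipeline_name="contracts_demo", destination="duckdb", dataset_name="raw")
pipeline.run(events())
```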
Incremental loading — we briefly talked about this. It's a crucial concept, because you don't want to reload the entire dataset: you only want to load new or changed data. This is implemented in dlt; you don't need to implement it manually, you just use it. And dlt has a lot of different things for performance management: parallel execution, async execution, you can control memory, and you can scale your pipelines.
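A minimal sketch of incremental loading with a cursor field; the API endpoint, field names and table are invented for illustration:

```python
import dlt
import requests

@dlt.resource(table_name="issues", write_disposition="merge", primary_key="id")
def issues(updated_at=dlt.sources.incremental("updated_at", initial_value="2024-01-01T00:00:00Z")):
    # dlt persists updated_at.last_value between runs, so only newer rows are fetched.
    resp = requests.get(
        "https://api.example.com/issues",          # hypothetical endpoint
        params={"since": updated_at.last_value},
    )
    yield resp.json()

pipeline = dlt.pipeline(pipeline_name="issues_demo", destination="duckdb", dataset_name="raw")
pipeline.run(issues())
```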
Apart from being an open-source Python library, dlt also provides you with an ecosystem. We have verified sources, which are basically integrations with the most-used sources: any SQL database, a lot of REST APIs, things which are highly usable — Google Sheets, Notion, HubSpot, GitHub, Slack, a lot of them. We have 16 different destinations: DuckDB, Snowflake, BigQuery, Postgres, etc. And if that's not enough, you can build your own destination with reverse ETL. dlt is also heavily integrated into the modern data stack: it's integrated with dbt, with Airflow, with Dagster, different things. And if this is still not enough for you, you can build your own verified source. To illustrate this, I have a plot here showing how many custom sources were created by the community — we track it through telemetry, so these are rough numbers, because a lot of people turn it off; the real figure is probably five times higher or so. So, enough of me trying to advertise dlt — let's look at some code and dive deeper.
If you are still alive, guys: let us assume that we have an API with crypto data, and it's not supported by a verified source in the ecosystem, so you need to build the source yourself. Let's see how easy that is.
First, we're going to create a dlt resource — it's just a function which calls the API and yields a list of tokens. Then we're going to build a dlt transformer, which is another type of resource: for each token in the list, it will call another API to get additional data. Then we combine them together into one — sorry, this is a dlt source — and we use `pipeline.run` to load the data to the destination. Both the resource and the transformer will generate at least one table each.
The source will generate a dataset consisting of several tables, and the pipeline run loads that dataset to the database. One important note here is that dlt decorators use `yield` to produce data on the fly, so you won't run out of RAM.
This is how the code looks. You have a decorator, `dlt.resource`, on a function which yields the list of tokens in the dataset; then `dlt.transformer`, which takes data from the coin-list resource and calls the API for each record in that list; then we just combine them together into one source, `crypto_data`; we call it and run the pipeline. The whole code looks like this — of course, instead of the three dots there is the requests library calling the API — but it's quite small and convenient: we have the resource, the transformer, the source which combines the two resources, and then the pipeline. We define the pipeline, we give it a name, the destination here is Postgres — but you can change it to Snowflake, BigQuery, anything you want — and then we just run the pipeline.
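A hedged reconstruction of roughly what that code looks like; the API endpoints and field names are invented for illustration:

```python
import dlt
import requests

@dlt.resource(write_disposition="replace")
def coin_list():
    # First API call: yield the tokens one by one
    yield from requests.get("https://api.example.com/coins").json()   # hypothetical endpoint

@dlt.transformer(data_from=coin_list)
def coin_info(coin):
    # Second API call: enrich each token from the list
    yield requests.get(f"https://api.example.com/coins/{coin['id']}").json()

@dlt.source
def crypto_data():
    return [coin_list, coin_info]

pipeline = dlt.pipeline(
    pipeline_name="crypto_pipeline",
    destination="postgres",     # could be "snowflake", "bigquery", ...
    dataset_name="crypto",
)
print(pipeline.run(crypto_data()))
```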
As a result, we get a number of tables in our Postgres. The first two tables are pipeline metadata: this is where dlt keeps information about the incremental state which I mentioned earlier, about loads, about the version of the load. The other tables are relational tables from the nested data. You can see `coin_list` — this is from the coin-list resource — and `coin_information`, from the coin-information transformer; all the other tables are extracted and un-nested automatically, they are the child tables.
The cherry on top: first of all, dlt is a Python library, so it runs anywhere Python runs. If you have your infrastructure on Airflow or Dagster, you can run dlt there. It's open source, so go to GitHub and give us a star, please. You don't need a license: you can use it at work, in your pet projects, in your product if you want. It has quite extensive documentation, which we constantly work on, and we have an amazing community in Slack — more than 1,700 users there — and you're welcome to join and ask any questions regarding dlt, data engineering, or anything you have in mind. It is also really good at lowering and cutting costs: for example, one of our users replaced a famous cloud provider with dlt and went from 6,000 euros a month to 45 euros a month, and only because they need to pay for AWS Lambda —
because, yeah, dlt is free. So that's probably all: check out our website, join our community and give us a star on GitHub.
Thank you. If you have any questions, you're welcome to ask them right now. Sorry — it's your line. Yeah, so they are open for questions now; you can walk up to the microphone and ask them. Thank you.
While you're thinking of questions: we also have a booth there, so you're welcome to come and ask us directly. Yeah, I see a question — can we use it with images? We have a source called filesystem which works with a lot of different data, images as well, so you can use it for that. And for the metadata, for example, you can use dlt directly to extract it and store it in SQL.
Can you extract the information pixel by pixel?
No, in this case the images would just be loaded to your destination — for example, if you need them to be on S3, it will just take the image and load it there. But with the metadata it will do the un-nesting and the other things.
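For orientation, a sketch of that file-metadata use case; it assumes a dlt version where the filesystem source ships with the library (in older versions it is pulled in with `dlt init filesystem <destination>`), and the paths are illustrative:

```python
import dlt
from dlt.sources.filesystem import filesystem

# List image files on local disk or object storage and load their metadata
# (file name, size, modification date, ...) into the destination.
images = filesystem(bucket_url="file:///data/images", file_glob="**/*.jpg")

pipeline = dlt.pipeline(pipeline_name="image_metadata", destination="duckdb", dataset_name="files")
print(pipeline.run(images.with_name("image_files")))
```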
Okay. You mentioned that the metadata tables are stored in the same place where all of the other data is stored, right? — Yes. — So is it possible to configure the metadata tables to be stored somewhere else, and how configurable is that?
So, actually, that one is not configurable, but we recommend a loading pattern where you load your data into a raw layer, you use schema evolution, you notify the analysts, and then the analysts promote this data to the staging or production layer. So this shouldn't really interfere with anything.
So basically you're not recommending this as finished data for analysts — they should work on it further?
You can use it as a finished dataset, but usually you need to model it, because usually you have business requirements and business analysts who need their pivot tables. To do that you need to create a fact-and-dimension model, like a star schema, so you need to work with it.
Thank you. Actually, my question was: would it be possible to customize to what extent the data will be unravelled? For example, if I don't want to create these additional relations and would rather store them in a JSONB field or an array, to save some space.
Yeah, we actually have several ways to do this. At the level of a resource you can create column hints: you can say, okay, please don't un-nest this column, I need it as a complex type, as JSON, for example. You can also import your schema, if you have it from somewhere, or define it for some tables or for some columns — dlt can work with that too.
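A small sketch of both options; the resource and column names are made up, and the "complex" type name follows the wording above (newer dlt versions call this type "json"):

```python
import dlt

# Keep the "payload" column as a single nested value instead of
# un-nesting it into child tables.
@dlt.resource(columns={"payload": {"data_type": "complex"}})
def events():
    yield [{"id": 1, "payload": {"a": 1, "b": [1, 2, 3]}}]

# Alternatively, limit how deep dlt un-nests at the source level.
@dlt.source(max_table_nesting=1)
def my_source():
    return [events]

pipeline = dlt.pipeline(pipeline_name="hints_demo", destination="duckdb", dataset_name="raw")
pipeline.run(my_source())
```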
Thank you.
My question is about integrations with other tools. You said that it can integrate with things like Airflow — where would dlt end and Airflow begin?
I mean, dlt wouldn't handle any kind of scheduling or anything like that, it just does the pipeline — so does it have any kind of monitoring built in, or would you have to rely on something third-party to monitor it?
That's a really good question. The idea behind dlt is that you can drop it in anywhere and it should be able to function without any special orchestrator features. So I would say, where dlt ends and Airflow begins: you can just run dlt locally, off Airflow, so you have a nice development flow, and when you go to Airflow — as Violeta was explaining, you have resources and transformers, so you have some dependencies — you can actually unpack these into a DAG. We have a deployment helper, so you can choose a parallel unpack or a serial unpack, things like that.
So your dlt pipeline is a task in your DAG?
Exactly, yeah.
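For orientation, a rough sketch of that deployment helper based on dlt's documented Airflow integration; the DAG name, schedule and placeholder source are illustrative, and the exact helper arguments are worth double-checking against the docs:

```python
from datetime import datetime

import dlt
from airflow.decorators import dag
from dlt.helpers.airflow_helper import PipelineTasksGroup

@dlt.source
def crypto_data():
    @dlt.resource
    def coin_list():
        yield [{"id": "btc"}, {"id": "eth"}]  # placeholder data
    return [coin_list]

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def crypto_dag():
    # Wrap the dlt pipeline in an Airflow task group; decompose="serialize"
    # runs the source's resources one after another ("parallel" unpacks them).
    tasks = PipelineTasksGroup("crypto_pipeline", use_data_folder=False, wipe_local_data=True)
    pipeline = dlt.pipeline(pipeline_name="crypto_pipeline",
                            destination="postgres", dataset_name="crypto")
    tasks.add_run(pipeline, crypto_data(), decompose="serialize", trigger_rule="all_done")

crypto_dag()
```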
You mentioned data lineage as well in the presentation — how much of that is supported out of the box?
So, lineage is available for the loaded data. You basically get some information: when did this column appear, in which table, and so on. Then it's a little bit up to you what you do with it. For example, we have a user that is handing this over to SQLMesh, and they have complete lineage there. We have another collaboration with SDF, Semantic Data Fabric: there we tag PII columns and then they handle the documentation.
Sorry — how is that lineage actually accessible? In what format is it stored?
The lineage information can be obtained — basically, there is something called load info that you get back when you run a pipeline. You can load this load info back, and then you'd have that information either in tables or you can use it directly in Python. And for the semantic tags: dlt infers a YAML schema of the data, and then you can do things on that schema — you can put PII tags on it, and these can be read by something else.
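A small sketch of what that looks like in code; the data, dataset name and the idea of a custom PII annotation are illustrative:

```python
import dlt

pipeline = dlt.pipeline(pipeline_name="demo", destination="duckdb", dataset_name="raw")

load_info = pipeline.run([{"id": 1, "email": "a@example.com"}], table_name="users")
print(load_info)  # human-readable summary of the load

# Persist the load info itself as tables in the destination
pipeline.run([load_info], table_name="_load_info")

# Inspect the inferred schema as YAML; column-level hints (e.g. a custom
# PII annotation) could be added here and read by downstream tools.
print(pipeline.default_schema.to_pretty_yaml())
```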
Okay. Thanks. Thanks a lot.
If there are no more questions, then thank you. As I said, go to GitHub, give us a star, check out dlt and join our community. Also, we have an afterparty on Thursday, after the official event — if you want to come, you're welcome.