Mastering a data pipeline with Python: 6 years of learned lessons from mistakes
Formal Metadata
Title: Mastering a data pipeline with Python: 6 years of learned lessons from mistakes
Number of Parts: 130
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported
Identifier: 10.5446/50083 (DOI)
Transcript: English (auto-generated)
00:06
Our next speaker is Robson Jr., who is joining us online. Perfect. He works at Microsoft, he is a developer there, and he is involved with the software community, of course, especially the Python community.
00:23
And he will talk about data pipelines with Python: six years of lessons learned from mistakes. What is better than learning from mistakes? So please, Robson, start sharing your screen. OK, it's time to start. Thank you so much.
00:43
Good morning. I'm speaking from Berlin. My name is Robson. Thank you to the organization, thank you to everybody who made this possible, especially online. It's hard at times, it's quite a challenge. I know how hard it is to organize a conference, especially online:
01:02
keeping everything on track, on time. So congratulations to the whole team. It seems to be a nice conference. It's my first time speaking at EuroPython, so I'm quite happy to be here.
01:20
Actually, I work at GitHub. GitHub was bought by Microsoft in 2018, so basically I'm still at Microsoft, but at GitHub right now. Feel free to drop me a message anytime, on Discord or on Zoom.
01:44
Let me introduce myself a bit. I'm originally from Brazil, and I moved to Europe four years ago. I have been a developer for more than 16 years, and I learned programming with Python.
02:00
Python has followed me through my whole career, and I'm quite happy about that. Even when I transitioned to other technologies along the way, Python always followed me. Feel free to reach out to me on Telegram; my Telegram is public, so if you have a question or want to talk, feel free to add me there.
02:21
It's bsao. Or you can ping me on Twitter or, of course, on GitHub; it's the same username, bsao. Feel free; all feedback is welcome, and I love to talk about everything.
02:41
What will we talk about today? It's very important to set expectations. Even though this is a fully Python conference, and of course we will talk about Python, a good half of this presentation is about data products.
03:02
Python is a keystone of data products, and that's why we are talking about Python here. But first of all, let's be clear: this is a beginner talk. I'm presenting my own mistakes because, as mentioned,
03:23
I learned a lot from my mistakes; the ones I committed helped me a lot. And now I think it's only fair to share that knowledge as well and give back to the conferences. Right.
03:41
It's not about code, it's not about how to optimize your code. It's about understanding the anatomy of a data product. A data product is not just data pipelines, but if you have data pipelines, all of these principles apply just the same.
04:04
We will cover the concepts of the Lambda and Kappa architectures, which are very important to understand, and the main qualities of a data pipeline. Like any other software in the world, software should have certain qualities, and a pipeline, being software, has those qualities as well.
04:25
And of course, we will talk about Python. Python is a keystone: you glue everything together with Python, and we will talk about where Python matters. My goal in this presentation is to help you start planning
04:44
a great data product. So think as a data engineer: how to help your team of software engineers produce, collect, store, process, and serve data at a large scale.
05:09
The first thing is to understand the anatomy of the product.
05:21
It's very important. When you talk about a data pipeline or a data product, you need to understand which kind of variables you are dealing with. Nowadays, in modern software, you are collecting a lot of data.
05:43
You can imagine that any kind of product today is collecting user behavior, telemetry, logs, databases, interactions with APIs, and you need to collect all of these traces. It's very important. The concept of big data,
06:01
the old concept of big data that follows us from the books and the theories, is represented by four or five V's, depending on the author, and this is how we will split things up. One of these V's is volume. Volume means the quantity,
06:21
the amount of data that you are collecting and storing, right? Then variety: the type of data. JSON files, Parquet files, API calls, unstructured files, unstructured data. Imagine that you have images, you have tweets,
06:43
you have web pages, logs, databases, and everything needs to be stored. In general, this kind of unstructured data is stored in something called a data lake. And a data lake is basically where you store everything.
07:02
You put everything there. Of course, there are techniques for how to store the data in the data lake, but in general you put everything there and organize it there. That place is called the source of truth. From it, you are able to collect your data and extract what we call data sets.
07:22
For example, you are collecting a bunch of tweets and you need just a slice of those tweets, by date or by username or something like that. So you extract that data set, right?
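As an illustration of that extraction step, here is a minimal sketch using pandas over Parquet files; the lake path, column names, and username are invented for the example:

```python
import pandas as pd

# Hypothetical lake layout; reading s3:// paths also needs the s3fs package.
LAKE_PATH = "s3://my-data-lake/raw/tweets/2020-07-23/"

# Read the raw dump from the source of truth...
tweets = pd.read_parquet(LAKE_PATH)

# ...and extract a data set: one user's tweets, only the columns we need.
dataset = tweets.loc[
    tweets["username"] == "bsao",
    ["created_at", "username", "text"],
]

# The curated data set goes back into its own area of the lake.
dataset.to_parquet("s3://my-data-lake/datasets/tweets_bsao.parquet")
```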
07:42
In the middle, you have the processing. The processing is basically software as well; that's where the magic happens, right? Then you have veracity: can you trust your data? This is the most important question to ask, right? Can you rely on your data?
08:01
Does your data represent something meaningful for the final user, for whoever needs to use this data, right? And then, how fast is your software able to process all the data that you need? This is velocity: how frequently you need to process this data
08:24
and deliver this data, right? Imagine a bank that needs to process payments and do fraud detection. You have machine learning models or some kind of analysis, and you need to do it as a near real-time process,
08:43
or you need to make this kind of decision in a short time. That is the V of velocity that you need to attack. And lastly, as another point, you have veracity. It means your data represents exactly what you collected,
09:04
and that after you process it and analyze it, your data represents exactly what you were expecting. Not the result itself, not whether the outcome is good or bad, but that the calculation,
09:20
the analysis, is right. You need to make sure of that, and it's based on the code as well, right? As you can imagine, a data architecture is software,
09:41
like any computer program, right? You have memory, you have files, you have inputs, you have the functions and variables that the software deals with, right? They are processed; that happens inside the process. And in the end, you need to deliver it, you need to spread the information that you got out of your software.
10:02
Sometimes you have APIs, you have files, you have UIs, or you deliver it some other way, but it works exactly the same; the idea is the same, right? Another important aspect is to understand the different architectural styles.
10:22
We will talk about the most accepted architectural styles in the market nowadays. They are not opposites; they complement each other, right? Lambda came first, so you need to understand it a little bit, and then you need to understand Kappa afterwards,
10:41
in order to determine which is simpler and which gives you a better route for your challenge, right? There is no silver bullet, so you need to understand what you actually need. For example, even though I work for a big company (actually, I haven't been working with data
11:01
for the last two months), before that I was working with just batch processing. For a long time, batch processing alone solved my problems. You see, I didn't need real-time processing, but you might need it, and that's why you need to understand the architectural styles.
11:24
The first one is what we call Lambda. Lambda is an architectural style that tries to deal with a huge amount of data in an efficient way. And here you have two principles, right?
11:43
You need to reduce the latency between processing and serving the data, because you will have a high throughput; you need low latency to process this data and you need to deliver this data. For that, you need to keep in mind
12:01
that any change in the data state needs to generate a new event. So every time you have an ingress, you are collecting events: some systems are sending events to you, like logs, API calls, whatever,
12:21
and you need to decide whether you process it in real time or in the batch layer. It's up to you to decide. But you need to understand what happens in the concept called event sourcing. Event sourcing is a concept where you use the events to make predictions and to start chains of actions in a system on a real-time basis.
12:42
You will have some kind of data storage where you can keep historical data, right, to ensure that all the changes in your data, meaning all the events happening to your data, are stored in a sequence.
13:01
For example, you are trying to buy something with your credit card on an e-commerce site. You type in your credit card to pay for something, then something goes wrong, and you have a transaction. This transaction gets rolled back,
13:21
and then you try again, and this creates a sequence of events that you need to analyze. This kind of situation that I mentioned can result, for example, in blocking a card for suspected fraud, because you need to understand what happened
13:41
and how it can be handled in the speed or the batch layer. Imagine that you are entering your credit card on an e-commerce site, right? We identify your IP, and your IP is not from Berlin, it's not from Dublin; your IP is from some place in the world that looks suspicious.
14:01
You need to get all this information and cross-reference it in a very short time, with low latency, to determine whether this card is involved in fraud or not, right? This is called event sourcing because you are generating event after event, and you are able to analyze this timeline quickly.
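As a toy sketch of that event-sourcing idea: every state change is appended as a new, immutable event, and the fraud check only reads the event timeline. The event fields and the rule are made up for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class PaymentEvent:
    """One immutable fact about a card; state changes only append new events."""
    card_id: str
    kind: str          # e.g. "attempt", "rollback", "success"
    country: str       # country derived from the request IP
    at: datetime


@dataclass
class CardTimeline:
    events: List[PaymentEvent] = field(default_factory=list)

    def append(self, event: PaymentEvent) -> None:
        # Never update in place, only append to the historical sequence.
        self.events.append(event)

    def looks_fraudulent(self) -> bool:
        # Toy rule: several rolled-back attempts, or attempts from many countries.
        rollbacks = sum(1 for e in self.events if e.kind == "rollback")
        countries = {e.country for e in self.events if e.kind == "attempt"}
        return rollbacks >= 3 or len(countries) > 2
```

In a real system the timeline would live in Kafka or a database rather than in memory, but the append-only idea is the same.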
14:22
But you have two splits here. First you have the batch layer. The batch layer is the most common nowadays because it's the cheapest and the most well-established technique. Basically everything that you consume from the ingress is stored in a data lake, right?
14:40
And from this data lake you have periodic jobs, jobs running hour by hour, day by day, or on whatever schedule, that take this data from the data lake and create new data sets, right? They extract the data sets, they clean the data, they structure the data, they can generate machine learning models,
15:01
they can do all kinds of jobs. And then, at the end of the batch layer, you have something called batch views. That is the representation of your data, of your data lake, right? So any manager in your company, any person who wants to consume this data, can go there and use the curated data.
15:22
On the other hand, you have the speed layer. The speed layer consumes the ingress as a stream: as soon as the data comes into your system, it starts to be processed. As I mentioned, it can be used for financial purposes, to identify fraud or not, right? And in the end, all this data,
15:41
whether it went through the speed layer or the batch layer, ends up prepared and stored in a database that allows you to consume or serve this data, right? The main differences are on the next slide,
16:03
which we'll talk about now. So you have some pros and cons with this Lambda architecture. For Lambda, you need to keep your data stored permanently, right? Because of that, you might have problems with GDPR in Europe,
16:25
because you are retaining user data and you need to be compliant with your privacy laws, or you may simply spend a lot of money to store a huge amount of data to be processed. So you need to plan for it, right?
16:41
The second point: all the queries are based on immutable data. It means that nobody can change the data. If the data changes, if the state of the data changes, a new event is created, and then you can follow everything that happened before, right?
17:06
Since this has been used for a long time, it's reliable and safe, and it's fault tolerant. Why is it fault tolerant? Because you have all the data stored on a permanent basis. If you find any bug, anything that is wrong in your code base,
17:20
you can correct it, and then you can run something called a backfill. It means that you can go to your data lake, read all that data, and reprocess all the steps of your pipeline. And it's a very scalable architecture, because if you need to process
17:42
a huge amount of data, you can put more machines in your cluster to process it, right? And you can manage your historical data in a distributed file system. The de facto distributed file system nowadays is basically Hadoop, or the cloud-based offerings
18:01
from Amazon, Microsoft, or Google, for example. But there are some cons that you need to take care of. One is premature modeling. It means that when you are talking with your team and preparing your data ingestion,
18:21
when you are modeling your data and how you want to receive your data, you need to think a little bit more. You need to avoid this kind of premature modeling: evaluate your schemas, evaluate your tools, and think about how you validate your schema step by step,
18:40
in order not to break your pipeline. As I said, it can be expensive, because you are storing a huge amount of data, and depending on the batch cycle that you decide to use, you may spend a lot of money to process a huge, complex pipeline, right?
19:06
Now we will talk about Kappa, right? As I mentioned, it is not a replacement for the Lambda architecture; it's an alternative, or you can even consider it an update of the serving layer
19:23
of the Lambda architecture, right? The Kappa architecture is used specifically for stream processing: things where you need near real-time processing, especially analytics, segmentation,
19:44
fraud detection, or anything where you need a faster response. Another nice thing about the Kappa architecture is that you can keep just a small code base for it.
20:00
That's different from the Lambda architecture, where you have different components and you can have different pipelines, sub-pipelines inside the one pipeline. In general, with the Kappa architecture you can share the same code base, because the data ingress tends to be restricted to a few data sources, right?
20:24
So basically, and this is the most important point: if you don't need real-time answers, keep your life simple with batch processing and apply everything you know about
20:40
software engineering to keep your code healthy, and you'll probably be successful that way. Applications of a Kappa architecture: you always have a well-defined event order, as you have in a Lambda architecture most of the time, so you can extract any data set at any time, right?
21:05
It's more often used for social networks and platforms that do detection. And it's very important because you need to answer questions fast. So if you need to change and deploy your code quickly to fix a bug
21:22
or to implement a new feature, you can do it in a few minutes or a few hours. You don't need to run a huge bunch of tests or create a huge amount of fixtures, as long as you have kept your code clean enough, right? Another thing about stream processing is that
21:40
it uses fewer resources than the Lambda architecture, because you can apply the idea of vertical scalability: you can just grow your machine instead of putting in more machines, and then you can apply better ideas of how to scale on your machines. Another thing about the Kappa architecture
22:01
is that it leverages machine learning on a real-time basis. Okay, Javier asked a question here, let's see. About streams. Okay, I'll answer it afterwards, right?
22:29
But there are cons, because if you introduce a new bug into your code, you will probably need to reprocess part of your data if you have some data loss, right?
22:42
And sometimes you need to stop your pipeline. That means that if you are running fraud detection and you need to stop your pipeline for a few minutes to deploy or to fix something in the infrastructure, you need to have a solid plan for that.
23:02
Otherwise you can cause problems for your business. So you have pros and cons here. Now let's talk about the qualities of a pipeline, of a data engineering product, a data product. Since it's like any computer program,
23:21
the problems are almost the same. Of course, I'll mention where I see differences, right? If you see something go wrong in software development, you'll probably see the same thing go wrong in a data product as well, right? The first thing that is very, very, very important
23:43
in data products is access levels for the data. As a software developer you are probably dealing with very sensitive data: user data, financial data,
24:03
and as a data engineer you must implement access levels across all layers of the data. It means you need to implement access levels for your tables, for your data lakes, for your data sets, and for how your code interacts with each of these levels.
24:23
For example, how your code can read data from the data lake, how your code can read data from different data sets. This is more about ethics and how you implement compliance in your company, following the GDPR in Europe or the rules in another country,
24:44
rather than about technical things, right? Another point about security is to try to use a common format for most of your data. For example, use JSON for text files,
25:04
PNG or SVG for images, and Parquet or Avro files as a columnar format. There are many different kinds of formats
25:20
that you need to understand and learn how to use; there are a bunch of formats and a bunch of applications for them. Separation of concerns matters as well: who needs to access the data, and why, right?
25:40
And in the code, of course, separation of concerns too, but avoid any kind of hard-coded values and duplication in your code. Especially when you write SQL inside your code, name all your columns in your statements, because otherwise your code breaks easily when something needs to change, and with explicit columns it's easy to fix.
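A tiny illustration of that point about embedded SQL; the table and columns are invented:

```python
# Fragile: silently changes meaning when columns are added, dropped, or reordered.
QUERY_BAD = "SELECT * FROM payments"

# Explicit: if the schema changes, this fails loudly and is easy to fix.
QUERY_GOOD = """
    SELECT payment_id, card_id, amount, created_at
    FROM payments
    WHERE created_at >= %(since)s
"""
```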
26:02
Automation is the most important thing, I guess; I think that's a consensus. Data products are code, so versioning your code with version control is usually the place to start.
26:22
Use the power of tech platforms: try to distribute the automation across different platforms. You have your server for continuous deployment and continuous integration, you have your tools for code review and linting, and you can use different tools for monitoring and logs,
26:42
which we'll talk about in a moment. Automation is the key to keeping a data product easy to fix and easy to improve, right? Don't waste your time building your own monitoring; use existing monitoring. It's something that I have learned at every company,
27:02
whether small or the big company I'm working for now. Delegate the problem of monitoring and logging to the cloud whenever you can, but try to avoid vendor lock-in. There are some nice wrappers on the market
27:20
that let you use different vendors in the same code base with the same analysis. That's very important, but try not to waste your time deploying monitoring infrastructure yourself. Even if it's easy to deploy, it's quite hard to maintain and tends to be very, very expensive,
27:43
because the more data you start to collect, the more expensive your infrastructure becomes, right? And here is the challenge: test and trace your code. Regression tests are a must in data engineering.
28:01
It's slightly different here. If you ever change your schema, change your input, or change your code, you need to make sure that your old data still works, that you can still read the old data from your data lake. Remember that the data stored in the data lake,
28:21
for example, is immutable, while everything around it changes over time. So you need regression tests to make sure you can still read the data from the data lake, for example, or from a stream, because everything else changes. Remember what I said about inputs:
28:41
inputs can change and can bring problems. Try to keep your inputs deterministic enough, and define your inputs and your schema in your tests, so that when a test breaks you can correct it quickly and identify and trace the problem through your code, right?
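One possible shape for such a regression test with pytest; `read_events` and the fixture file stand in for whatever reader and stored sample your pipeline actually has:

```python
# test_old_data_regression.py
import json
from pathlib import Path

from mypipeline.readers import read_events  # hypothetical reader under test

# A frozen sample, stored exactly as it sits in the data lake (immutable).
OLD_SAMPLE = Path(__file__).parent / "fixtures" / "events_2019.json"

EXPECTED_FIELDS = {"event_id", "user_id", "kind", "created_at"}


def test_old_lake_data_still_parses():
    events = read_events(json.loads(OLD_SAMPLE.read_text()))
    assert events, "the old sample should still produce events"
    for event in events:
        # Schema changes must stay backwards compatible with stored data.
        assert EXPECTED_FIELDS <= set(event)
```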
29:01
And of course, as with any other software, try to focus on unit tests, especially for internal components. And just two tips here that are very important for data products. One is to containerize
29:21
all your third-party components. Imagine that you have Kafka, you have a message queue, you have Spark; you have different software from different platforms that interacts with your Python code, with your Python platform, and you need all of these pieces
29:43
deployed onto your machine. You can use containers for them and integrate that with your continuous integration and continuous deployment service, right? And of course, when you reach that point,
30:00
it's easy to get to end-to-end tests. So once a week, or for each new feature that you develop or deliver for your pipeline, you can run an end-to-end test to make sure that everything is running, right? That was a lot of concepts.
30:22
Now let's talk about Python, our beloved technology that brings us to this nice conference. I will talk about six main areas of data products. The most common is what we call ETL:
30:41
extract, transform and load. These kinds of tools, these kinds of APIs, are responsible for reading the data, processing the data, and sending the data somewhere else, right? The most common and famous tool is called Apache Spark.
31:03
Apache Spark was developed in Java and Scala on the JVM, but it has a native wrapper for Python, so you can write Python and interact directly with the JVM, right?
31:21
There's a lot of magic behind the scenes, but don't worry about managing Spark clusters; you can just use PySpark for that. It's my recommendation for anyone who is starting to deal with data, starting with data engineering stuff.
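A minimal PySpark sketch of the kind of ETL job described here; the paths and columns are invented, and it assumes a Spark installation is available:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("tweets-per-user").getOrCreate()

# Read raw JSON from the lake, aggregate, and write a curated data set back.
tweets = spark.read.json("s3a://my-data-lake/raw/tweets/")

per_user = (
    tweets
    .where(F.col("lang") == "en")
    .groupBy("username")
    .count()
)

per_user.write.mode("overwrite").parquet(
    "s3a://my-data-lake/datasets/tweets_per_user/"
)

spark.stop()
```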
31:43
Apache Spark is a must nowadays, right? Okay, but say I don't want to touch Java at all; I want something more Pythonic. Then I offer you Dask. Dask is a Python project where the low-level code is written for you,
32:04
and it has a Python wrapper and works basically the same as Spark. It is a parallel computing library that you can integrate with pandas: you can have a bunch of machines in a cluster
32:22
and distribute your processing across different machines, right? It's a very useful product. It's not widely used in most of the market, I guess, as far as I know, correct me if I'm wrong, but it's a great choice if you want to
32:42
start out with pandas and Jupyter and don't want to spend so much time learning Java and Spark. You can go directly to Dask.
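A small Dask sketch of the same pandas-like flow; the path is invented, and nothing runs until `.compute()` is called:

```python
import dask.dataframe as dd

# Looks like pandas, but the work is split into partitions that can run in
# parallel on one machine or on a distributed cluster.
tweets = dd.read_parquet("s3://my-data-lake/raw/tweets/")

tweets_per_user = (
    tweets[tweets["lang"] == "en"]
    .groupby("username")["text"]
    .count()
)

# Lazy until now; compute() triggers the parallel execution.
print(tweets_per_user.compute())
```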
33:05
Luigi is an open-source project that ties different tools together, but it works as a library in Python: it is written in Python, it is quite easy, and it handles complex pipelines. Basically, you have a class in which you define methods, and in those methods you put your logic to consume, process, and deliver your data.
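A minimal Luigi task in that style; the file paths and the filter are invented for the example:

```python
import json

import luigi


class ExtractEnglishTweets(luigi.Task):
    """Consume a raw dump and deliver a cleaned data set as a new file."""

    date = luigi.DateParameter()

    def requires(self):
        return []  # or other Task objects this one depends on

    def output(self):
        return luigi.LocalTarget(f"data/clean/tweets_{self.date}.json")

    def run(self):
        with open(f"data/raw/tweets_{self.date}.json") as raw:
            tweets = [t for t in json.load(raw) if t.get("lang") == "en"]
        with self.output().open("w") as out:
            json.dump(tweets, out)


if __name__ == "__main__":
    # e.g. python extract_tweets.py ExtractEnglishTweets --date 2020-07-23 --local-scheduler
    luigi.run()
```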
33:21
It's really useful. I love Luigi. Luigi was created at Spotify, and I still use it today. mrjob is a quite old project. It's written in Python as well, right? It runs MapReduce jobs over Hadoop.
33:42
It's a more old-school style of creating distributed jobs, but it is still useful. To be honest, I haven't used mrjob in a long time, and Ray I have only read the documentation for.
34:02
So I'm not able to recommend it or not, but it seems promising. When you need to deal with streams, we don't really have a de facto Pythonic tool for streams, but you have Kafka. Kafka is basically the de facto tool in the market nowadays.
34:25
Basically everyone is using Kafka. For that, Python has a library named Faust. With Faust, you can just plug into a topic and start consuming it. As I said before, Python is a keystone here: you can do everything with Python,
34:42
even integrate different platforms and technologies. You can use different platforms, extract the best of each one, and bring everything together to work properly with Python. It's awesome.
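A minimal Faust sketch of plugging into one Kafka topic; the broker address, topic name, and the rule are assumptions for the example:

```python
import faust

app = faust.App("fraud-checker", broker="kafka://localhost:9092")


class Payment(faust.Record):
    card_id: str
    country: str
    amount: float


payments_topic = app.topic("payments", value_type=Payment)


@app.agent(payments_topic)
async def check(payments):
    # Plug into one topic and consume it, one event at a time.
    async for payment in payments:
        if payment.amount > 10_000:
            print(f"flagging card {payment.card_id} for review")


if __name__ == "__main__":
    app.main()  # run with: python fraud_checker.py worker
```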
35:00
Storm is another streaming platform where Python is accepted as well, through streamparse, right? And for those who use Amazon, Storm is supported on Amazon Web Services. Okay, a brief word on analysis,
35:20
for when you need to do it. pandas is the de facto analysis tool for that. There are different add-ons and plugins that you can use with pandas to provide high-performance analysis. You have data structures and a data analysis tool
35:42
integrated with NumPy, and you can get a lot out of pandas. I won't talk about these other three or four libraries because my time is about to finish, right? But keep in mind that pandas,
36:02
if you are going down the analysis path, is the de facto tool, and it can interact with Dask or Spark. Okay, this is very important: pandas works with Spark and with Dask, and you can use either if you want. Management: Airflow is another de facto Python tool
36:21
for managing and scheduling your pipelines. Please, if you want to work with data, learn Airflow, right? Airflow is a Python-based tool that acts like a cron job, and, as with Luigi, you can create very complex pipelines with Airflow, okay?
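A minimal Airflow DAG in that cron-like spirit, assuming Airflow 2.x; the task bodies are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("read from the lake")


def transform():
    print("clean and aggregate")


with DAG(
    dag_id="daily_tweets_pipeline",
    start_date=datetime(2020, 7, 1),
    schedule_interval="@daily",  # cron-like scheduling
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    extract_task >> transform_task  # declare the dependency between the steps
```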
36:43
Testing is very important. When you are testing your data code, you need to create fake databases, fake messages, and seed data for your tests. There are a bunch of libraries for this, especially spark-testing-base if you are using Spark,
37:02
because it is a Python library for testing your Spark code, or pytest, which is an awesome Python tool that integrates well with data things. And to finish, the most important note: it's very important that you validate your data
37:24
all the time. Remember what I said about validating, about having deterministic inputs and so on. There are a bunch of tools that help you validate the schema of the data coming from an external data source.
37:44
You can check whether the types of the columns and the information are correct enough, right? Cerberus and Voluptuous are really awesome tools that I really recommend, right? I use the schema library for small tasks, but I really recommend all three of these tools.
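A small Cerberus sketch of that kind of validation; the schema and the record are invented:

```python
from cerberus import Validator

# Expected shape of an incoming record from the external data source.
schema = {
    "username": {"type": "string", "required": True},
    "created_at": {"type": "string", "required": True},
    "amount": {"type": "float", "min": 0},
}

validator = Validator(schema)
record = {"username": "bsao", "created_at": "2020-07-23", "amount": -3.0}

if not validator.validate(record):
    # e.g. {'amount': ['min value is 0']}
    print("bad record, quarantining it:", validator.errors)
```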
38:01
And I would like to say a big obrigado, thank you, danke, and the same in every other European language. If you have any questions, we already have two questions here in the Q&A that we'll try to answer.
38:22
And if you have further questions, please drop me a message by email, Discord, Telegram, Twitter, whatever you want. I'll be more than happy to interact with you. Thank you so much for having me; it was a pleasure.
38:44
Thank you very much, Robson. We indeed have questions. One very interesting question is related to the Kappa architecture: how to deal with streams that need a strict order, so that one event cannot be processed if the previous one failed.
39:04
Okay, great question. You can assume that in the Kappa architecture all your events will be on a temporal timeline, on a historical basis, right? What can happen if you lose some events
39:21
is that your streaming software failed and you couldn't retry, but you can assume that your data will be in a time series. As I told you, the state of your data is always changing. So if you have an update, a deletion,
39:41
an upgrade, whatever, you can follow it in your time series, and then you can reprocess and extract your data sets from there, okay? But if your code is not able to process the event, you must make sure that at least
40:00
you have some retention policy in your streaming software. In Kafka, for example, you can retain your messages, and the same goes if you are using other software on Azure or Amazon, okay? Thanks. So we have other questions. Oh, they're popping up very quickly now.
40:22
The next question is: what's a good way to link data and the version of the code that processed it? For example, a version column on every row doesn't seem like the most elegant solution. It's quite a complex question. I would need to understand better what the person asking means,
40:41
but I'll try to be generic and give you some insights. Does it make that much sense to relate the version of the code with the version of the data? Because what matters is that you version your schema. Let's separate the two:
41:01
versioning your code with a tool is fine, but versioning your schema, as a technique, as a good proxy, is even better than relating your code to the version of your data. For example, in general, when you use JSON or some tool for creating schemas for your machines,
41:25
for your data, you generally specify the version of that schema. And based on that version, you can have your regression tests, you can manipulate the data. For me, this is the most elegant solution, because you can focus on your schema
41:40
and then go to your code to fix it. So we have another question, thanks. Any recommendation for end-to-end tests in the streaming world? Can a simple Python script handle it? In my case, most of the time,
42:02
I create a small Python script, or some simple script that spins up Docker containers with all the third-party components that I need, and then I run it end to end.
42:21
And this Python script that I mentioned checks the inputs, the processing, and the output to make sure that everything is running smoothly. Because there are three kinds of tests: you have the regression tests and my unit tests, but for the end-to-end test I need to make sure that the previous tests are passing fine.
42:41
You can use this technique for any kind of end-to-end test. For example, if you need to run a streaming process with Kafka and Spark, you can have two different Docker containers running Spark and Kafka, then deploy your Python code, and then create a small script
43:01
that validates the results for you.
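One possible shape for such a small end-to-end check, using kafka-python; the topic names, addresses, and expected output are assumptions, and it presumes the Dockerized pipeline is already up:

```python
"""Feed a known input, let the pipeline run, then compare the output."""
import json

from kafka import KafkaConsumer, KafkaProducer

EXPECTED = {"card_id": "42", "status": "flagged"}

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode(),
)
producer.send("payments", {"card_id": "42", "country": "XX", "amount": 99999})
producer.flush()

consumer = KafkaConsumer(
    "fraud-alerts",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode()),
    auto_offset_reset="earliest",
    consumer_timeout_ms=30_000,  # give the pipeline up to 30 seconds to react
)

for message in consumer:
    assert message.value == EXPECTED, f"unexpected output: {message.value}"
    print("end-to-end check passed")
    break
else:
    raise SystemExit("no output produced by the pipeline")
```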
43:22
Thanks. So we have one more question: when is it a good idea to replace, let's say, a SQL-based pipeline with Python code? Would the benefit be just readability, or can one also count on a speed-up? I think the answer is: when you are no longer able to keep your queries simple and you start to create a lot of procedures, functions, or anything else that helps you automate your pipeline, that is probably a signal that you need
43:41
to improve your architecture, because functions and procedures inside the database are quite hard to test, quite hard to maintain, and quite hard to version. At that point you are probably bringing more complexity into your architecture than it would cost to migrate to code,
44:01
and then you'll have a good data pipeline that can work better than a simple SQL-based pipeline. Okay, thanks. Just one more thing for Philipp: I'm still using a lot of SQL-based pipelines,
44:22
because for most problems that's not a problem; but when your product gets more users or gets more complex, you need to change the strategy. If your strategy is still working, that's very good. So thanks again, Robson.
44:42
Thank you so much.