Cross-Platform Data Lineage with OpenLineage
Formal Metadata
Title of Series: Berlin Buzzwords 2022
Number of Parts: 56
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifier: 10.5446/67200 (DOI)
Transcript: English (auto-generated)
00:08
So for the agenda today, I'll start by talking about the need for lineage metadata. And then I'll talk about the solution with the OpenLineage and Marquez open source projects that help with this.
00:22
I'll talk about data observability across your ecosystem, how it integrates with Spark, with Airflow, with DBT, and all the things you might be using. And then I'll end with a demo of how this works. So first, data lineage is the key to scaling your usage of data
00:47
from an organization standpoint. And so the lineage is really about metadata, about what the transformation is, what are the producers, consumers, what data sets are being consumed, produced,
01:01
and all the related metadata. And it helps connect lineage not just for each transformation, but connecting it all together and building that graph. And so the reason we need lineage is because when you try to build a data ecosystem, when an organization grows, whether it's a company or any sort of organization with teams
01:24
that consume and produce data, typically inside each team's bubble, people have a very good understanding of how they consume data, what their practices are, how they publish it, how everything works.
01:40
But as soon as we go across teams and outside of their bubble, that's where a lot of friction happens, because people have very little visibility on where the data is coming from, who's consuming it, and so on. There's very limited context outside each team's boundaries.
02:01
So limited metadata: what's the data source, what is the schema supposed to be, who's the owner of this data set, how often is it updated, where does it come from, who's using it, what has changed recently. If you don't explicitly collect this metadata, you don't really have any of this information about the data.
02:25
And so there's this Maslow's hierarchy of needs, if you know it. It talks about how to reach happiness. You have to start with basic human needs, which is being fed, being in a safe state, and so on. And once you have those basic needs,
02:42
then you can look at being happy or reaching the best definition of yourself and so on, like self-realization. And so we apply this same idea to data. First, the data needs to be available. It needs to be showing up every day. The new information needs to be fresh. It should not take a lot of time for the data to show up.
03:03
And then it needs to be correct. And once you have these three basic needs, that's where you can either optimize your processes, learn, make decisions, find new opportunities, and so on. So it's kind of like this waterline when you need to get your head above water before you can do anything with your data.
03:23
So the solution to this is really OpenLineage. OpenLineage is this open standard for collection of lineage metadata from pipelines as they are running, not trying to infer lineage from other sources, but really inspecting and instrumenting data transformations
03:40
as they're running. And it was started by reaching out to a large community of data projects and making sure we can define these standards on how we collect metadata and lineage together, because it's been a historically difficult problem.
04:00
And so one of the design decisions about OpenLineage is to really collect this information at the time things are running. We've used this comparison with the information that your camera collects about the picture you're taking. You could either try to infer where and when the picture was taken
04:21
by looking at it after the fact and so on, or you could have your camera actually collect this information when the picture is taken, which modern cameras do. They collect the coordinates when it was taken, a lot of information. And that's basically what we're doing with OpenLineage
04:41
is really capturing all the metadata that is available as things are running. So the way we started this standard is really by reaching out to this ecosystem, because everybody needs lineage. And we reached the conclusion that the only way we're going to solve lineage is not by trying to collect from the outside,
05:02
but really to get together as an ecosystem and agree on that standard. So there's two big categories, right? There's the producers of lineage, and maybe that's your tools like Pandas or Spark or other transformation tools, schedulers like Airflow, warehouses like Snowflake and BigQuery, SQL engines,
05:24
all those things. And on the other side, you have all the consumers of this lineage. You might have used some of those here. I'm showing the open source projects in this ecosystem, but there's also a bunch of proprietary or vendor-proposed solutions.
05:41
And originally the problem was very complex, with everybody trying to collect lineage by understanding the inner workings of each of the systems. And by defining OpenLineage, we simplify this problem a lot. First, it removes a lot of duplicate work of collecting lineage, and it means that,
06:01
instead of understanding the inner workings of each of those transformation engines, we can actually ask them to publish their lineage in a standard way. And so that's where this ecosystem exists. You have a bunch of producers of lineage,
06:20
you define a standard, and that enables a bunch of consumers to consume lineage in a standard way. So that's what OpenLineage is. It's very similar to the OpenTelemetry and OpenTracing libraries in the services world, being a standard to instrument and expose lineage. And really the idea is that, like in any open source project
06:41
defining a standard like this, there's a relatively small number of actors at the beginning that start implementing the standard and collecting lineage, and then slowly, and this is really the strength of open source, it propagates in the ecosystem and gains enough momentum
07:01
that it takes a life of its own and becomes the accepted standard. So to give you a quick overview of how this works, lineage is reported as a series of asynchronous run events. So whenever a job or a SQL query
07:21
or a transformation happens, we send a start event that may contain information like what the version of the source code was, what the run parameters or hyperparameters were, and then when it's finished, there's a complete event that is sent that tells you what the inputs and outputs were, or what the schema of the dataset was
07:41
at the time it was read, or this kind of information. And there's the run ID that helps correlate the start and complete events and make sure we collect information about what's being transformed. So there's a little bit of a life cycle, like most of the time things work, but we also, in addition to the complete event,
08:02
have fail or abort events to keep track of things that are not working, which, when you deal with data reliability, is also very important to understand. And the data model is fairly generic and extensible.
08:21
So for each run state update, there's going to be a consistent job name, which is the identifier of the job. That's the thing that runs regularly. For example, it could be an hourly or a daily transformation. Maybe there's a report to generate every day or every hour, so it's going to have a consistent name
08:40
so we can identify that this is the same job running again and again. We have inputs and outputs pointing to datasets. Same thing: datasets have consistent IDs that are name-based and that make sure we identify consistently what the inputs and outputs of that job are, and there's a run ID that identifies
09:02
every time this job runs, right, and knowing that we have different instances of the same transformation, possibly with different parameters. And for each of those elements in the model, it's extensible with this notion of facet. So in the data world, it's pretty diverse.
09:22
There's lots of different ways to define a dataset or a job, and so those facets enable providing more specialized metadata about each of those things. And so the way Lineage is built from those events is really by correlating this information, right?
09:41
So we observe every transformation of data, and they always have input and output datasets. And the way we stitch the lineage back together is by making sure we have consistent naming and identification of those datasets. And so the OpenLineage spec defines a bunch of standards
10:03
on how to uniquely identify those datasets, whether they're files in distributed file systems or object stores, or tables in databases or cloud warehouses, and similarly for the jobs and so on.
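To make the naming idea concrete, here is a small sketch of what consistently identified datasets might look like; the authoritative rules live in the OpenLineage naming guidelines, and the hosts, buckets, and table names below are made up.

```python
# Illustrative identifiers only; see the OpenLineage naming guidelines for the
# real rules. The point is that every producer touching the same physical
# dataset derives the same (namespace, name) pair for it.
postgres_table = {
    "namespace": "postgres://db.example.com:5432",  # where the data physically lives
    "name": "analytics.public.orders",              # database.schema.table
}

s3_object = {
    "namespace": "s3://my-bucket",                  # the bucket
    "name": "/warehouse/orders/2022-06-01",         # the object path
}
```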
10:22
So the first step is to have a consistent way of identifying everything. And to give you a more concrete example, this is what the simplest version of an event looks like. So if you have a start event, it's going to have a type start,
10:41
the time it happened, a run ID, the job's unique name, and the inputs' and outputs' unique names. The simplest version doesn't have any other metadata, but of course you can add facets with their own pieces of information for each of those entities.
11:00
So sending an event is as easy as sending an HTTP POST to the OpenLineage endpoint with a payload that describes what's happening. So this is a start event, and this is a complete event describing the output of the job when it's finished.
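Since the slides aren't visible here, the following is a minimal sketch of what those two POSTs could look like in Python, assuming a Marquez instance listening on localhost:5000 (its OpenLineage endpoint is /api/v1/lineage); the job, dataset, and producer names are all illustrative.

```python
# Minimal sketch: emit a run as a START and a COMPLETE event over HTTP.
import uuid
from datetime import datetime, timezone

import requests

OL_ENDPOINT = "http://localhost:5000/api/v1/lineage"   # assumed local Marquez
PRODUCER = "https://example.com/my-pipeline-producer"  # identifies what emitted the event
SCHEMA_URL = "https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent"  # spec version path may differ

run_id = str(uuid.uuid4())  # the same runId correlates START and COMPLETE

start_event = {
    "eventType": "START",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": run_id},
    "job": {"namespace": "my_pipeline", "name": "daily_report"},
    "inputs": [{"namespace": "postgres://db.example.com:5432",
                "name": "analytics.public.orders"}],
    "outputs": [],
    "producer": PRODUCER,
    "schemaURL": SCHEMA_URL,
}

complete_event = {
    **start_event,
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "outputs": [{"namespace": "s3://my-bucket", "name": "/reports/daily"}],
}

for event in (start_event, complete_event):
    requests.post(OL_ENDPOINT, json=event).raise_for_status()
```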
11:21
And so to give you a little bit more information about this notion of facets: each facet has its own JSON schema, so it's an atomic piece of metadata that can get attached to those core entities, and it's self-documenting. It has a schema, and each facet has a name that's unique to it,
11:41
and they're very flexible and scalable, and you can use them for various reasons. One, it can be for adding metadata that is optional. It could be to specialize information, for example, on a dataset. A dataset may or may not have a schema,
12:02
depending on whether it's some random files or a table in a database, so you can have a schema facet on a dataset. Maybe if it's a Kafka topic, the schema would be Avro, and if it's in a database, you have more of a SQL schema.
12:23
If it's a job, you may capture information like which Git SHA of the code it was built from, and you might be storing what the source code actually was, and a lot of different types of metadata that are captured as individual facets.
12:42
On datasets, you have lots of information like data quality metrics and so on. So here are some examples of dataset statistics: data quality statistics like the number of null values, the distribution of columns, things like that can be collected, the schema,
13:01
the version of the dataset if the underlying storage is versioned, like, for example, Iceberg or Delta Lake. On the job, there could be information about the dependency versions, what the version in source control is, what the query plan for the executed query was.
13:20
On the run, you may have other types of metadata, like a query profile, for example, or parameters that are specific to that run. And there are lots of possibilities to extend this model, right? You can use it for dependency tracing.
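As a rough illustration of how such facets attach to the entities: the facet keys and fields below follow my reading of the standard facet definitions, so treat them as assumptions, and real facets also carry _producer and _schemaURL entries pointing at their own JSON schema, omitted here for brevity.

```python
# Illustrative only: a schema facet and output statistics on a dataset, and a
# source-code-location facet on a job. All names and values are made up.
output_dataset = {
    "namespace": "s3://my-bucket",
    "name": "/reports/daily",
    "facets": {
        "schema": {  # column names and types of the dataset as written
            "fields": [
                {"name": "order_date", "type": "DATE"},
                {"name": "orders", "type": "BIGINT"},
            ]
        },
    },
    "outputFacets": {
        "outputStatistics": {"rowCount": 14230, "size": 1048576},  # rows and bytes written
    },
}

job = {
    "namespace": "my_pipeline",
    "name": "daily_report",
    "facets": {
        "sourceCodeLocation": {   # where the job's code lives
            "type": "git",
            "url": "https://github.com/example/pipelines",
            "version": "0f3a9c1",  # the commit SHA this run was built from
        },
    },
}
```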
13:40
We provide, as a service, root cause identification of problems, figuring out why things are broken. You can use it for detecting what the impact of a change would be, or doing automatic backfills when something went wrong and you need to regenerate data in the past. You can use this information to very precisely
14:01
know what you need to regenerate. You could use it for anomaly detection, finding the cause of problems, historical analysis and compliance, knowing where your data is going, for example, for GDPR compliance and so on. So now I've talked about OpenLineage,
14:23
which is a standard for collecting this metadata and instrumenting things. And now I'm going to talk about Marquez. Marquez is this other open source project, which is an open source implementation of OpenLineage that keeps track of all the changes in your environment. So what Marquez does is
14:42
it collects all this OpenLineage information, collecting from your ETL systems, batch or streaming engines, and saves all of this in a centralized database, versions every change in your environment, and allows you to navigate the lineage
15:03
and search for information and just navigate this metadata. And so that's what the internal Marquez model looks like. You find the same core entities
15:22
that you find in OpenLineage. And the big difference here is that we're versioning each piece of information. So every time we receive a run of a job, remember, we're collecting what the current version of the code is, what the schema of the data set is at the time of reading, what the schema of the outputs was at the time of writing. So Marquez receives those events
15:42
and keeps track of: is this the same job version as the last time it ran? Is the data set schema the same as last time? It versions the metadata and keeps track of all the changes. So a job may have multiple job versions, and a run is connected to the particular version of the job as it was at the time it ran,
16:02
a job consumes and produces data sets, and similarly, for a given run, you know what version of the data set was consumed and produced. And this model specializes data set sources; those may be coming from databases or file systems or various places.
16:22
And so they're going to be specialized depending on what type of storage you have. And that's what facets are for. And similarly for the job: a job could be a streaming job, a batch job, different types of transformation. And so the benefits of this versioned model,
16:41
it's very useful for debugging and finding the root cause of problems and understanding what changed. And so this is keeping track of data set changes and job changes and enabling this kind of analysis of, I don't just have to dig through what was the version of the code
17:01
at the time this transformation happened; it's all collected, and I can correlate changing data with changes in lineage and upstream transformations. And so now I'm going to go over how we're instrumenting this across the ecosystem.
17:23
So to give you some examples: remember, on the left you have the OpenLineage integrations that go into your Airflow or Spark or DBT or others and that instrument what the lineage looks like. And these get sent to Marquez, which is going to version every change in the model.
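As a hedged sketch of reading that versioned metadata back out, something like the following might work against the Marquez REST API; the endpoint paths are my recollection of that API rather than gospel, so check the Marquez docs, and the namespace, dataset, and job names are invented.

```python
# Sketch: query Marquez for the versions it has recorded (paths are assumptions).
import requests

MARQUEZ = "http://localhost:5000/api/v1"  # assumed local Marquez

# Every schema or metadata change to a dataset shows up as a new dataset version.
dataset_versions = requests.get(
    f"{MARQUEZ}/namespaces/my_pipeline/datasets/public.daily_report/versions"
).json()

# Each run of a job is linked to the job version (the code) and the dataset
# versions it consumed and produced at that time.
job_runs = requests.get(
    f"{MARQUEZ}/namespaces/my_pipeline/jobs/daily_report/runs"
).json()

print(len(dataset_versions.get("versions", [])), "dataset versions recorded")
print(len(job_runs.get("runs", [])), "runs recorded")
```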
17:46
So the support in Airflow will map every DAG and task into an OpenLineage job, right? So every task lifecycle is collected: collecting the task parameters, linking the runs to the version of the code,
18:02
capturing inputs and outputs. And depending on the type of transformation, it's going to use a SQL parser or instrument the task in a different way, link to GitHub, and extract metadata. So of course, Airflow is this very generic way
18:22
of scheduling and orchestrating your jobs, and you have different operators to do this in the Airflow model. And so the way we work is by providing a series of extractors that know how to inspect each of the operators and extract the information, right?
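For example, a hypothetical Airflow DAG like the one below is the kind of thing an extractor can inspect: the SQL operator's query gets parsed to find the tables involved. The DAG id, connection id, and SQL are made up, and the exact job-naming convention depends on the integration version.

```python
# Hypothetical Airflow DAG whose task an OpenLineage extractor could inspect.
from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="daily_report",
    start_date=datetime(2022, 6, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    build_report = PostgresOperator(
        task_id="build_report",            # contributes to the OpenLineage job name
        postgres_conn_id="analytics_db",   # tells the extractor which database this runs against
        sql="""
            INSERT INTO public.daily_report
            SELECT order_date, count(*) AS orders
            FROM public.orders
            GROUP BY order_date;
        """,
    )
```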
18:42
So when you're transferring data from a file system to a warehouse, or running a SQL operation on the warehouse, or using Great Expectations, for example, to assert data quality rules, the OpenLineage extraction can
19:02
understand each of those operators, extract the metadata, and send it to the OpenLineage endpoint. And so when you define a task in your Airflow job, we'll extract a few pieces of information. So first we know what source the data is coming from,
19:25
looking here, for example, at the analytics DB. In this case, we're going to parse the SQL to extract the table you're reading from, so you know what dataset this is about. And then the task itself has a name that helps build the unique job name
19:42
to identify recurring runs of the same job. Similarly, in the Spark world, we have a different type of instrumentation, but mapping to the same concept to extract lineage from a Spark job. So Spark has this notion of query execution listener
20:03
that lets you inspect the logical plan and understand, in particular, the lineage, the inputs and outputs, and some more information. And so in the current version of the Spark integration, we even extract column-level lineage, extracting it from the logical plan,
20:22
and row counts, how many rows were written, which get exported as a data quality metric so that you can keep track of whether there's a change over time in the number of rows generated by this particular transformation. So this is a rough overview of how Spark SQL query execution works,
20:43
where you use Spark SQL, you start from a dataset, and then the logical plan gets analyzed. We can inspect the logical plan that gets turned into a physical plan to get executed with RDDs. So that's the part we can look into
21:00
using the query execution listener. So the query execution listener will send a bunch of events as the query gets executed. You have the SQL execution start event that carries the logical plan, and then you have a Spark listener job start event when the job actually gets executed, and a lot of information. That's where we can actually look at the task metrics as well
21:22
to figure out how many rows were written, and then the job gets finished. And so we can send the start and complete events, knowing when the execution is finished. So as part of this, we can extract information like the logical plan, metrics around the inputs and outputs
21:42
that were consumed from the query profile, and other dataset metadata like its location, its schema, its version if you're using a versioned data source like Iceberg or Delta Lake, and some other environment metadata can get collected. And the way this happens,
22:02
and I'm going to show it in the demo at the end of the talk, is that all you have to do when you start your Spark session is add the OpenLineage Spark jar, configure the listener, and provide the endpoint where your
22:20
OpenLineage backend is. It's just given here as an example; for the demo, we use Marquez. You may need to authenticate and configure your namespace to build those unique job names. And that's how you collect lineage,
22:40
and you can see the various transformations. And the way this works in the Spark world is that whenever you have an action which writes, that's where you have an actual job being executed. And so for each Spark application you're running, you will have multiple transformations showing up, one for every write operation that you do in Spark.
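Concretely, the wiring might look roughly like this in PySpark; the package version and the exact spark.openlineage.* configuration keys have changed between openlineage-spark releases, so treat them as placeholders and check the integration docs for the version you use.

```python
# A sketch of wiring the OpenLineage Spark listener into a PySpark session and
# pointing it at a local Marquez. Keys and version are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("openlineage_demo")
    # Pull in the OpenLineage Spark integration (substitute a released version).
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:<version>")
    # Register the listener that inspects each query execution.
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    # Where to send run events, and the namespace used to build unique job names.
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "http://localhost:5000")
    .config("spark.openlineage.namespace", "spark_demo")
    .getOrCreate()
)
```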
23:05
And in terms of current coverage, this is a list of everything that's supported: typically all file-system-like inputs and outputs, JDBC, Hive, BigQuery, Snowflake, Iceberg, Kafka-level integrations. If you use things that are more exotic than those,
23:22
those may need to be added to the supported set. And then, to go quickly over the DBT support: DBT is another very popular tool used by analysts to transform data using SQL. So we provide a DBT OpenLineage wrapper (the dbt-ol command)
23:42
that helps collect metadata by turning each DBT model into a job in the OpenLineage model, capturing schemas, capturing lineage from DBT, and sending it to OpenLineage. And so just to get back to this correlation of information
24:05
that we were talking about before. So here, giving an example with Airflow or with DBT: all of those tools have some notion of showing a graph. In Airflow, for each DAG you can see its own graph; for each job in DBT
24:21
you can see the graph of dependencies. You get lineage within each project, but they don't connect all those things together. And so here, through this notion of understanding the data sets and this consistent naming, that's where we connect those graphs together and enable building a graph of everything.
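To make that correlation concrete, here is a sketch, with made-up names and most required event fields omitted, of two events emitted by different tools that the lineage graph would join at the shared dataset node.

```python
# An Airflow task writes a table...
airflow_complete_event = {
    "eventType": "COMPLETE",
    "job": {"namespace": "airflow_demo", "name": "daily_report.build_report"},
    "outputs": [{"namespace": "postgres://db.example.com:5432",
                 "name": "analytics.public.daily_report"}],
    # run, eventTime, producer, schemaURL omitted for brevity
}

# ...and a Spark job later reads the same table. Because both events use the
# exact same (namespace, name) pair for the dataset, the two otherwise separate
# graphs join at that node.
spark_start_event = {
    "eventType": "START",
    "job": {"namespace": "spark_demo", "name": "report_wordcount"},
    "inputs": [{"namespace": "postgres://db.example.com:5432",
                "name": "analytics.public.daily_report"}],
}
```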
24:43
So now I'm going to do a quick demo of this. First, I'm running an Airflow instance on my machine, and I'm going to just start a Marquez instance. I'm just using the Docker command; you can get that from the examples on the website
25:05
for the Marquez project. So I just started a Marquez instance. I also have an Airflow instance running that I configured using the Astro CLI tool. I'll show you: one thing I did is I configured it
25:21
to add the OpenLineage integration to my Airflow instance. And I also added an OpenLineage configuration to say where my OpenLineage instance is. So here is just my Docker instance that has just started.
25:42
And if I go to my Marquez instance, it's currently empty because I just started it. You can see it says no jobs found. Now I'm going to my Airflow instance and I'm going to trigger this job. Right, sorry,
26:01
let me just go back; I had the page open for a long time. So I'm just triggering the job right now. Now it's running, and you're going to see a job start appearing in Marquez, just one for now. And as the DAG is running, you actually have tasks
26:22
showing up in the lineage graph. And now the Airflow DAG has finished running, and so I have lineage for this. You can see the actual transformations, the circles are jobs, and you can see the SQL when it's a SQL transformation. For a data set,
26:43
you can see its schema. So we collect the schema information for each data set. If you want more detail, you have the raw metadata lower down, and the same for all the jobs. So this corresponds to this Airflow graph here
27:02
that I just triggered. Now I can also run my PySpark command line. So here I'm running a local PySpark. So as I showed in the slides, I'm configuring here the Open Lineage Listener
27:25
to connect to my local Marquez instance right here. For the purpose of this, I'm doing just a very simple Spark job. So I'm reading from these text files I have locally.
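Roughly, the demo job could look like the sketch below, reusing the Spark session configured earlier; the paths are made up. Each write action is what triggers the listener to emit a run.

```python
# Read local text files (made-up path), count occurrences of each distinct
# line, and write the result. The write is the action that makes Spark execute
# the job, and that execution is what the OpenLineage listener reports.
lines = spark.read.text("/tmp/demo/input/*.txt")

line_counts = lines.groupBy("value").count()  # spark.read.text puts each line in a "value" column

line_counts.write.mode("overwrite").csv("/tmp/demo/output/line_counts")
```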
27:41
I'm grouping the lines, and for each distinct line, I'm counting how many times it appears and saving that to a file. And you can see here on my Marquez instance that it's receiving events from this instrumentation. If I go back to Marquez, now in the list of jobs,
28:03
so I have the three tasks from my Airflow DAG running and now I have my PySpark job that I can see here. If you look at the raw metadata, you can see that we're collecting the logical plan here so that there's opportunity here to kind of do deeper analysis on whether the logical plan
28:23
is changing over time. And the last example of lineage I wanted to show is enabling custom integrations, right? So typically you might be using these out-of-the-box things, but you may be writing custom jobs that are not using standard tools.
28:43
And the way you can do that is by just generating your own JSON events and sending them. So here, I was giving an example of a start and a complete event earlier in my slides.
29:05
So here I'm showing you a start event I prepared as an example of what you could do if you were to build your own custom integration, and you can provide information like when this happened, a unique run ID,
29:24
the name of my custom job, and some inputs and outputs. And so now I'm just going to send those to the OpenLineage endpoint on my local Marquez instance. So I'm sending a start and a complete event.
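For a custom integration you can also use the OpenLineage Python client instead of hand-rolling HTTP calls; the sketch below follows the older openlineage.client.run module and made-up names, so adjust the imports and identifiers for the client version you actually use.

```python
# Sketch of a custom job emitting START and COMPLETE via the Python client.
import uuid
from datetime import datetime, timezone

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")  # assumed local Marquez

run = Run(runId=str(uuid.uuid4()))
job = Job(namespace="my_custom_jobs", name="dump_orders_to_file")
producer = "https://example.com/my-custom-producer"

client.emit(RunEvent(
    eventType=RunState.START,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=run, job=job, producer=producer,
    inputs=[Dataset(namespace="postgres://db.example.com:5432",
                    name="analytics.public.orders")],
    outputs=[],
))

client.emit(RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=run, job=job, producer=producer,
    inputs=[],
    outputs=[Dataset(namespace="file://localhost",
                     name="/tmp/demo/orders_dump.csv")],
))
```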
29:45
And now, when I go back to my Marquez instance, the lineage is all connected together, because I had this Airflow job that was writing to a dataset
30:01
and the Spark job that's reading from my local file. But now I've added the custom job that takes the data from the database and dumps it as a file on my file system, connecting all the lineage together, from the Airflow transformations to my custom job to my Spark transformation,
30:23
and connecting everything together. So Marquez is this repository and UI that lets you browse lineage and the metadata connected to it. It's relatively generic, right? And you can search for things.
30:40
If I look for my PySpark job, I'm going to find it here and I can focus on it. You can navigate datasets that could be in the database, for example, or in the file system through here and navigate your metadata and find all the things.
31:09
So this is the demo showing integration with multiple tools and of course, I didn't show DBT and there's also an ongoing Flink integration
31:22
to collect lineage. And this is the Marquez implementation of OpenLineage. Of course, OpenLineage is designed to be consumed by many different consumers. For example, Astronomer provides observability features
31:41
that, on top of this, enable troubleshooting: finding the root cause of why jobs are broken, why datasets are late across your ecosystem, or why data is wrong, and where you can correlate a data quality problem downstream with a change in the logic or the consumption upstream between those dependencies.
32:03
And of course there are other things: for example, Azure Purview, now Microsoft Purview, also consumes OpenLineage metadata to enable lineage. And on that, I'm going to give you some
32:21
information here if you want to join the conversation and learn more about OpenLineage. OpenLineage and Marquez are part of the LF AI & Data Foundation, which provides a framework for those independent projects and standards. The main conversation is on Slack, either on the OpenLineage or the Marquez project
32:44
Slack. You can follow those projects on Twitter, which often post about new releases. There's typically a monthly release cycle on those projects. And on that, I'm going to thank you and open it up for questions.
33:09
Questions, anyone? Okay, thank you, Julien. Oh, there's one hand over there. Otherwise, I don't want to stand in between you and, ah,
33:20
actually there's one more talk after this. Hello, so this is my first introduction to this topic and it's very interesting. One question that it comes to my mind is, how does it scale? For example, if I have terabytes of data, I have hundreds of consumers, millions of updates happening every day,
33:42
and it seems like, yeah, so how does it scale? Is there any concerns around that or how does that work? Yeah, so this is a metadata store, right? So typically the size of the data doesn't matter, right? It doesn't impact the size of the metadata
34:00
and it's more going to be having to scale with the organization size, right? How many different jobs are running? How many different runs a day and so on. So typically it scales relatively well. The volume of data is not that important. Like the largest organization we've seen,
34:22
they may have something like 100,000 jobs; that's the kind of largest, big-company scale of how many things are happening every single day. And we capture information about each transformation, right? So with that many jobs,
34:40
there may be several tens of thousands of runs a day or something, which scales relatively well. So Marquez is a fairly simple system. It stores the data in a Postgres database, and Postgres scales vertically relatively well. You can get a big RDS instance and store a lot of data. Typically the value of this metadata becomes less.
35:03
With time, you keep one month's worth of metadata or something like that, and after that you expire it. You may not need to troubleshoot further back than that. And yeah, typically it's not that huge, right? When we collect data statistics, for example, it's going to be a few values for each column
35:23
every time it runs, which might be hourly or something like that. Hi, question. What's the best use cases of using lineage?
35:41
Maybe if you can give a few examples, because when I think about it, it makes a lot of sense. We constantly run into issues, but what are the sweet spots where you could say, you know, this is where it solves a huge pain? Yeah, so the interesting thing with lineage is there's multiple definition of it. I mean, there's like one definition of lineage,
36:01
but people use it for very different problems. So the one key problem we focused on is data reliability, like from the operations standpoint, making sure the data showed up and shows up on time and it's correct, right? So it's understanding that graph and really understanding dependencies and understanding how a change here impacts data there,
36:21
right, and it really, sorry. Yeah. Sorry, and then setting alerts on that or something like that? Setting alerts, but also troubleshooting, understanding what, you know, walking back this graph and understanding what the root cause of the problem is, right?
36:42
Because typically you know what the data looks like now; you don't know what it looked like yesterday. So we capture this metadata, and capture precisely that this version of the job consumed that version of the data set, that was the schema at the time, those were the quality metrics and the statistics on the data set at that time.
37:01
And understanding, correlating the change in the data downstream, like maybe you have a different number of nulls in this column now, and you can correlate exactly what transformation impacted this. So that's one use case. Another strong use case is all the privacy laws, like compliance with GDPR or CCPA in California,
37:22
this kind of regulation, because that's where you actually need to understand. But that's a very different use cases. So typically, you know, when you talk about lineage, you really need to understand people, are people talking about reliability? Are they talking about compliance? And in compliance, you have GDPR. You also have lots of banking regulation
37:41
that force banks to prove, for the reporting they give to the government on how much money they process, the lineage of where this data is coming from. Or GDPR, making sure that my users' private information is going only to the places it's supposed to be going. Or, when someone asks for their data to be deleted,
38:01
to know all the places that their data might be stored and might need to be deleted. So that's a very different use case. That's also a very, like, we see a lot of interest in collecting lineage information. And there's just like a bunch of other like side use cases, you know, like optimization, you know,
38:24
understanding the cost of data processing, because you may have like a data set that's very expensive, but the value you get from it is going from where people were consuming it, right? So you don't just want to look at the cost of each data set independently. You also want to tie the value you get from them.
38:41
And that you understand through lineage, through the consumers. So I would not say that's the primary use case, but it's another one. I think the two main ones were the data quality and data reliability aspect, and the compliance aspect; those are the two big use cases. Okay, thank you.
39:05
Yeah, hi, I'm from a data management consultancy called Magilo. And I was wondering, how would you position Marquez in the data governance space? I mean, compared to other kinds of enterprise software products like Microsoft Purview or similar.
39:24
Yes, so the goal is to build an ecosystem, right? So that's where there's a distinction between Marquez, which is an open source implementation and repository of metadata, and OpenLineage. OpenLineage is the instrumentation. OpenLineage is about making this metadata available
39:40
to enable that ecosystem of multiple consumers, right? So you might use Marquez and Purview together. Purview is going to focus more on the governance use case, you know, enforcing that data is behaving in a certain way and understanding things. And Marquez is like an open source foundation to build your own thing. For example, there are some tech companies
40:02
who use it to build their own tooling: for example, Northwestern Mutual is an insurance company which uses Marquez internally so that they can build their own tooling around that lineage information. And so with Marquez, it's not an either-or; it's really about building this ecosystem so you can, at the same time, use proprietary tools
40:23
like the tool we're offering around data reliability or Purview or other things. Or, and at the same time, collect the metadata yourself so you can build your own analysis on top of it or your own automation if you want to, right?
40:41
So the goal is really to enable this ecosystem, right? And to enable those things to work together. Because the key is everybody needs that lineage, right? But Marquez is not going to be an open source solution that implements everything, right? And it doesn't want to be, but it's there to be the open source part of this ecosystem
41:02
for people who want to build their own solutions on top. Yeah, hi. So I can totally see how Marquez is going to shine when you have static data sets, because then it's very easy to go backward, I mean to backtrack the information.
41:23
So if the data set is immutable, it's going to work perfectly. But one of the problems I currently have, for instance, is that sometimes I have data sets that are built from a store, from a set that is evolving constantly. So I think in your example,
41:42
you were doing an insert from a database. So if that database is updated constantly, it's very difficult to start tracking or backtracking problems. Is there something I've missed to address this problem in Marquez? So the first use case we focused on
42:02
is batch processing, right? So in that way, it's easier to do the analysis in that case, right? Because you can track this particular run depending on this particular update from that other particular run, right? So the lineage graph is really captured at the run level, right? You see the graph, that's the current state,
42:21
but we actually capture each particular run of something consuming something. So in the context of continuous updates, we've been working on the streaming integration with Flink, for example. And the current thinking around Flink is to capture snapshots. So having some kind of granularity, not every update in there,
42:41
but some kind of granularity of what the state is at a checkpoint, and capturing that information. So right now, if you do updates on single rows every second, that's not the type of granularity we're capturing. Maybe not that often, but, yeah. Something like, yeah, updates happening multiple times, maybe per hour.
43:01
Yeah, so. Yeah, I think that would be reasonable, right? If it's a few times per hour, you can capture each of those as a run of something that modify things. So that way you'd be able to connect a particular read with what the version of the data set at that time and connect it to that particular update that came from it, right?
43:21
If something suddenly starts failing, you could connect it to that particular instance of the data that was there, right, and that started breaking things. So you could connect it to someone making a change, for example, in the code at that time. Because we would capture what the Git SHA of the code was
43:40
that was writing to it, like every time it was modified. Okay. Yeah. Thanks. Yeah, like the first use case right now is capturing on batch processing. And that's where you get most value at the moment. Maybe one last, very short question, maybe because we have to change on the stage, okay? Yeah, sure. Thank you.
44:01
I just wanted to ask how well adopted is OpenLineage as a framework? Because for instance, I don't think Apache Atlas uses it as an interface for building Lineage. Yeah, so there's Microsoft Purview, which is based on Apache Atlas as a connector, but I don't think it works
44:21
with Atlas directly at the moment. So right now there's integration with Egeria. So Egeria is this other metadata project that is used a lot in banking for tracking Lineage and all of this. So there's an OpenLineage to Egeria integration. There's a Microsoft Purview integration,
44:42
which is proprietary to Microsoft ecosystem. So the slide that was showing what does it fit in the ecosystem was kind of like 90% implemented. The Atlas part is not there yet. There's a experimental integration with Amundsen
45:01
that they showed in one of their community talks. On the producer side, there's the Spark integration. There's Snowflake; we've been working with Snowflake on having a built-in integration in Snowflake. There's Airflow, DBT, Flink in progress, a few others.
45:26
So we're still relatively early in the life of the project, but it really has a lot of momentum at the moment. And it's going to keep growing, and contributions come from the community;
45:42
the Dagster integration, for example, was community contributed. And on the consumer side, there are a few other data catalogs that have started consuming OpenLineage.
46:01
Okay, I'm going to free up the stage for the next speaker. Thank you very much. Thank you.