The rise of the YAML engineer
Formal Metadata
Title: The rise of the YAML engineer
Number of Parts: 131
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared, also in adapted form, only under the conditions of this license.
Identifiers: 10.5446/69512 (DOI)
Transcript: English (auto-generated)
00:04
Thank you, Jake. Good morning. Thank you for attending my presentation. I will introduce myself in a minute, but before that I would like to share an interesting observation I made about my work, which seems to be trending in the larger data industry.
00:21
So, someone is writing five times more YAML in their daily software life than Python, which is pretty surprising in a data world dominated by Python. Maybe that rings a bell to you too. And I know this is EuroPython, but the first
00:40
top answer to this was, well, no comment on that one. This is an example of the type of work I do, and I will talk a lot about pizzas in this presentation. I know it's almost lunchtime, sorry about this. No need to dive into the details, I will do that later, but this is a YAML file, and that's my work.
01:02
That's the kind of thing I do. No Python logic; I will explain why. My name is Matthieu Caneill. I'm a YAML engineer. I'm half kidding. I'm a software developer working in analytics: lots of data, data platforms, and software development.
01:21
If you did not catch this yet, I'm French. If you don't understand me, just scream, that's all right. Picnic is my employer. It's an online supermarket based in Amsterdam, in the Netherlands, and it manages the entire supply chain for our groceries, between suppliers and clients who want to order their weekly groceries. That generates lots of data.
01:47
My job at Picnic is to make sense of this data: build a data platform for data analysts to use it, extract some insights, improve the supply chain, and make groceries cheaper for everyone in the end.
02:01
Last but not least, this is my little coding assistant. We've been writing lots of YAML files together. So let's get started: the rise of the YAML engineer. In the spirit of free and open source software, I will give you the takeaways of this presentation for free, up front, so you know what to look forward to.
02:22
The first one is that you should describe the desired state of your infrastructure instead of computing it manually. The second one is that YAML is ubiquitous in data systems. Maybe you like it, maybe you don't, but it's there. We do have to deal with it.
02:40
Finally, track all your metadata in Git and use it as the source of truth to build everything on top of. Well, this is pretty abstract, but I will go over this outline in order to illustrate these takeaways. First, let's agree on what I mean by data, logic, and state; I will dive into some of my own definitions for that.
03:06
Then I propose to survey the current state of the declarative data stack. Finally, we'll talk about Git operations and share other good practices. So let's get started with data, logic, and state.
03:20
This is pure, raw data. Maybe it lives in a database, in a data warehouse, in a file system, in some kind of bucket in the cloud. And it's pretty useless by itself. We do need a bunch of different systems surrounding this data to make good use of it.
03:41
There are probably some data pipelines, orchestrating data transformations, some reports and dashboards to extract some meaningful insights from it, notebooks for machine learning experiments or any other experiment, some data apps, maybe a catalog meant for humans to understand the data and know how to use it.
04:03
And next to the data and surrounding systems, we do have software. SQL is pretty popular for manipulating data. Python, of course, as well. I don't know, the Pandas library, for example, is used a lot to transform data.
04:20
And another way to look at this is to separate it into the code and the cloud. The code is what's on your laptop, and the cloud is what's not on your laptop. The code is in a given state at any point in time. The state of the cloud is whatever data is there, whatever systems exist, and their values in memory.
04:44
And every time we execute the code, we do mutate the state. That's how the state is computed. We do want this state, these data systems to be correct in order to make the best sense of our data.
05:03
So let's focus a bit on the code for now. My boss came to me: hey, Matthieu, you're a data engineer. I need a new data insight. We have this whole pizza platform, that's great, but I want to know how many pizzas we have with pepperoni. Which is a fair question to ask a data engineer.
05:22
One way to do it is to write a Python script, which is imperative in nature. So that's a simple algorithm (you can do it in a one-liner, it doesn't matter): look at all the pizzas, count the ones whose extra ingredient is pepperoni, and return the counter.
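A minimal sketch of that imperative approach (the in-memory `pizzas` dataset and its field names are hypothetical, just to make the snippet self-contained):

```python
# Imperative: lay out the steps that compute the desired state.
pizzas = [
    {"id": 1, "extra_ingredient": "pepperoni"},
    {"id": 2, "extra_ingredient": "mushroom"},
    {"id": 3, "extra_ingredient": "pepperoni"},
]

count = 0
for pizza in pizzas:                      # look at all the pizzas
    if pizza["extra_ingredient"] == "pepperoni":
        count += 1                        # count the matching ones

print(count)                              # -> 2
```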
05:41
So Python being imperative means we are writing a set of computer instructions which are executed in order to reach the state that we deem correct. In this case, the state is the real number of pepperoni pizzas in our pizza dataset. That's Python; that's imperative.
06:02
An alternative way is to use SQL. As I mentioned, it's pretty popular in data systems, and this is a simple SQL query to count the pizzas with pepperoni. And it's pretty different from Python, in the sense that we are not laying out computer instructions: we are describing the shape of the data that we want to extract.
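The equivalent declarative query might look like this (table and column names are hypothetical):

```sql
-- Declarative: describe the shape of the result, not the steps.
SELECT count(*)
FROM pizzas
WHERE extra_ingredient = 'pepperoni';
```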
06:25
Describing the shape as a select-star with a filter to only keep what I want. So here are the two versions on top of each other. I'm not saying here that one is better than the other; that's absolutely not the point of my presentation, and it's not an interesting debate.
06:41
What I want to illustrate is the difference in abstraction between declarative, on top (describe the shape of the desired result), and imperative, below (lay out a set of computer instructions to reach this correct state). I will show you a different example to better illustrate it.
07:01
My boss came back, and this time she doesn't want a new data insight. This time she wants me to build a pipeline: capture new pizza data from some pizza supplier which is giving us data in JSON and CSV, and get this data into our database. So one way to solve this problem is to build a data pipeline.
07:24
And I could do it in YAML. In this case I'm describing where the data lives that I want to extract, in some CSV and JSON files, and I'm describing some properties of this data, for example how to uniquely identify a row with a given key, e.g. the ID of a pizza.
07:48
I'm also describing, at the bottom, where I want this data to be loaded: a Postgres database, a pretty popular solution for this. And I'm not saying to the computer, hey, go fetch this data and transform it and put it in the database.
08:04
I'm describing the shape of the ideal pipeline that I want to exist, and I can rely on some data pipeline system to materialize it automatically for me. Which makes things declarative and pretty easy: it's a simple YAML file, and my pipeline will exist.
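A sketch of the idea (this is not any specific tool's schema; all keys and names are illustrative):

```yaml
# Describe the pipeline you want to exist; the system materializes it.
source:
  format: [csv, json]          # files delivered by the pizza supplier
  path: ./incoming/pizzas/
  key: pizza_id                # how to uniquely identify a row
target:
  type: postgres               # where the data should be loaded
  database: pizzas
```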
08:29
So let's build an entire data platform for our pizza data. Declaratively with YAML. I call it the declarative pizza stack.
08:41
As we showed just before, we can use a tool such as Meltano (there are many others). Meltano is a data integration system which allows us to extract data from some place and push it to some other, and those places are described in YAML. That's the same as the previous example.
09:02
I describe what I want; Meltano takes care of pulling it and pushing it correctly. So, good: at this point we are extracting raw data from a pizza data supplier and we have it in our PostgreSQL database. Pretty good. The next step to build our pizza platform is to have a data contract.
09:24
A data contract is a piece of software for data producers and data consumers to agree on what they mean when they exchange data. Which is pretty convenient to automatically handle errors and prevent those mistakes from trickling further down the data pipeline.
09:45
One way to do a data contract is to use the Data Contract Specification, written in YAML. So in that case I'm using YAML to describe some expectations about the data that the data producer must abide by and that the data consumer can rely on.
10:04
And since this is software, this is machine-readable, both producers and consumers can automatically implement checks on this data by looking at those definitions. For example, I want my pizza ID to always be unique; otherwise the data is incorrect and should not be ingested.
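A hedged sketch, loosely following the Data Contract Specification (exact field names may differ between versions of the spec; model and field names are illustrative):

```yaml
dataContractSpecification: 1.1.0
id: pizza-supplier-contract
info:
  title: Pizza supplier data
models:
  pizzas:
    type: table
    fields:
      pizza_id:
        type: string
        required: true
        unique: true           # violating rows should not be ingested
      extra_ingredient:
        type: string
```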
10:23
YAML, YAML, YAML. So we have captured data, we have error handling with the data contract, but it's raw data. We do want to build on top of it, to extract further insights, to aggregate it, to make it more usable.
10:40
So we need data pipelines, data transformations. There are many tools for this. Apache Airflow is a very popular one. Argo is another one. As you might guess at this point, it works in YAML. So with Argo, I can describe in YAML a data workflow, which is a DAG, so DAG means directed acyclic graph.
11:04
The details are not important. The point is, it's a bunch of different tasks that depend on each other, because they need to happen in order. When it comes to pizzas, I want to first extract ingredients, then cook them, then serve them. I cannot mix up the order, so I do have a graph describing this.
11:24
It's all laid out in YAML, and Argo will take care of deploying these graphs and executing them. Everything is done for me. So: we are capturing data, we are ensuring it's proper data, we are transforming it in a set of data pipelines.
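A minimal sketch of such a DAG as an Argo Workflow (images and step contents are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: pizza-dag-
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: extract-ingredients
            template: step
          - name: cook
            template: step
            dependencies: [extract-ingredients]   # order matters
          - name: serve
            template: step
            dependencies: [cook]
    - name: step
      container:
        image: alpine:3.20      # placeholder for the real task image
        command: [sh, -c, "echo running step"]
```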
11:44
Another important tool in the stack is dbt. You might have heard of it; it means Data Build Tool. It's a framework written in Python which allows us to reason about data models. A data model is nothing more than a table in a data warehouse or a database, with a bunch of extra properties.
12:07
And with dbt, these properties are described in YAML. So, this is the example I showed at the beginning of the presentation: I have a pizza data model and I give it some additional properties.
12:21
For example, I might want to describe who has access to this data, so I give it a set of grants or credentials. I can define an owner, I can dive into the different columns of my data, give it some human readable descriptions, give it some constraints, even give it some tests.
12:41
I want to ensure that my ID is unique and not null again. And once it's described in YAML, DBT will do the whole job of applying those grants, applying those descriptions, running tests to be sure that the data is correct.
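A sketch of such a dbt properties file (model, column, and group names are illustrative):

```yaml
version: 2
models:
  - name: pizzas
    description: "One row per pizza in the catalog."
    config:
      grants:
        select: ["analysts"]    # who may read this model
    meta:
      owner: "data-platform-team"
    columns:
      - name: pizza_id
        description: "Unique identifier of a pizza."
        tests:
          - unique              # dbt runs these checks for us
          - not_null
      - name: extra_ingredient
        description: "Optional extra topping."
```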
13:00
Everything laid out in YAML. So, we start to have a pretty complete data platform. Something that is missing is dashboards. We do want to build plots and charts so data can be easily understood. There are many tools for this. LightDash is one of them and it's pretty interesting because LightDash directly integrates with DBT.
13:26
The example before described my pizza model. So, in my pizza model I can add some additional properties for LightDash to interpret. For example, my extra ingredient can be a dimension in a plot, in a chart; it can have these colors; it can be represented that way.
13:46
I just describe what I want as a chart, LightDash will take care of building it for me. Interestingly enough, in this case, it's building a pizza chart for my pizza ingredients, which is pretty meta.
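A sketch of the kind of LightDash metadata that can ride along in the dbt YAML (exact keys depend on the LightDash version; names are illustrative):

```yaml
version: 2
models:
  - name: pizzas
    columns:
      - name: extra_ingredient
        meta:
          dimension:
            label: "Extra ingredient"   # how the chart labels it
          metrics:
            pizza_count:
              type: count               # aggregate to plot
```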
14:00
Point is, I didn't build the plot myself. I asked LightDash to do it by simply describing what I wanted. The last piece in our data stack is to orchestrate everything together. So, Docker is a pretty popular tool for building containers.
14:22
Docker Compose uses Docker to orchestrate containers together, and it uses YAML. So, with YAML I can have a list of different containers, different components, whose properties I describe very easily here.
14:41
And if I say there must be a database, a transform component using dbt, a data visualization platform, et cetera, et cetera, it will do the right thing of building a network on top of it so they can interact with each other.
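A minimal Docker Compose sketch of that wiring (images, build paths, and credentials are placeholders):

```yaml
services:
  warehouse:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: local-dev-only   # placeholder, not for production
  transform:
    build: ./dbt                          # hypothetical dbt project image
    depends_on:
      - warehouse
  dashboards:
    image: lightdash/lightdash
    depends_on:
      - warehouse
```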
15:07
Again, YAML, so it's a pretty complete data platform. Finally, for the last section of the presentation, I do want to talk about GitOps and other good practices.
15:20
However, while I was working on it, I started to ask myself existential questions. So instead, I propose to look at these questions and answer them, and hopefully that will tell us everything we need to know about GitOps. The first question was asked by my boss (again; she's asking many questions).
15:40
Is this a good idea? All these YAML files, for real, is this a good idea? And obviously the answer is yes. But why? Let me illustrate what happens with YAML files. So, I'm a software developer and my primary computer output happens in Git repositories.
16:00
And that's the Git repo of my pizza data platform: it's a bunch of YAML files, in this case the dbt data model for pizzas and the description of a dashboard. And recently I made a change. I wanted to add an extra ingredient to pizzas. Always a good idea. It's a simple line, a simple description of this ingredient: let's have an extra two.
16:28
I also noticed the dashboard was incorrect; it was not grouping data by the right thing, whatever. So I made this merge request, and once it reached the main branch in Git, everything else was automatic on top of it.
16:43
A CI/CD process kicks in and computes the delta. What I mean by computing the delta is figuring out the difference between what is described in Git as the ideal shape of the entire platform and the actual state of the platform.
17:00
For example, the pizza table is supposed to have an extra-two column, but it doesn't exist yet in the actual database table. That is the delta. Once the delta is computed, an automatic system can apply the right commands to reconcile this state and ensure it doesn't drift.
17:23
In the case of the database table, we're adding a column by running some SQL ALTER TABLE statements. In the case of a dashboarding system, maybe it uses an HTTP API, so we can do a POST request to override a previously incorrect dashboard. Anything else is possible.
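For the database case, the reconciliation command could be as simple as this (table and column names are hypothetical):

```sql
-- Git says the pizzas table has a second extra-ingredient column;
-- the live table does not. CI/CD closes the delta:
ALTER TABLE pizzas ADD COLUMN extra_ingredient_2 text;
```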
17:44
The end result is a data platform which exactly matches what I have described in Git. I have my extra ingredient for pizza. Some pizzas are not exactly popular and I have the right dashboard.
18:03
It doesn't mean anything to group pizzas by cheese, but hopefully you get the idea. Another obvious advantage of working like this is, well, first of all, the complexity is abstracted away and automated in CI/CD. Secondly, for data democratization purposes, this allows anyone with a computer science background to play with data and to build their own insights.
18:30
By simply modifying YAML files, it makes everything much simpler. And all the complexity of data platforms and distributed systems and whatnot is abstracted away.
18:43
So yes, YAML, or declarative systems, is a good idea. The second question my boss asked is: why am I paying you to write YAML if anyone can do it? Which is a fair question again, and I wish I had a slide to answer it. I don't.
19:03
However, I can tell her that she's not paying me to write YAML. She's paying me to build a data platform that is more democratized, that anyone can use, that is automating the complex stuff. By automating the complex stuff, I free up some time for myself to work on other problems and answer other questions.
19:27
So she's not paying me to write YAML; she's paying me to build the simplest data platform I can think of. That's all I got. I do want to credit Maxime Beauchemin for writing The
19:41
Rise of the Data Engineer, which has obviously inspired the title of my presentation. My employer Picnic is hiring YAML and non-YAML engineers; you can have a look at the jobs listing. We also have a booth, come chat with us. My team at Picnic recently open-sourced a linter for dbt called dbt-score.
20:04
So if you're using dbt, or you're curious about it, well, that's a way to lint all your declarative properties. Finally, you can look at my website for my contact details and for the PDF of these slides. Thank you very much.
20:25
Thank you, Matthieu. Do we have any questions? If you have any questions, please get behind the mic so we can hear you. In the middle. In the meantime, I have to tell you, you had a bug in your presentation: there was pineapple listed as an ingredient for pizza.
20:42
I cannot share my opinion on this one, but I know the world is pretty divided about it. Okay, please let's go with the question. First of all, thank you very much for the insightful presentation. My question is, why do you think YAML emerged as almost a standard for the configuration in the first place?
21:04
I mean, there are a bunch of other configuration options. That's a pretty good question. I do have strong opinions about YAML. It has some issues: for example, it's not part of the Python standard library, and then you have multiple third-party implementations. So why not JSON, or TOML, or other things that are part of the Python standard library?
21:24
I think the reason is its simplicity and human readability. It uses whitespace for semantics, as Python does. So the same set of people who are used to using whitespace to properly describe structure can do the same with YAML.
21:43
Which might be a reason. That's my own interpretation, of course. I wish YAML was better; it has its issues. It's there, it works, it's human-readable. That might be the most important point. Can I do a follow-up question? What are the practical disadvantages of YAML, I mean, in your work,
22:05
as a tool for configuration? Are you familiar with the Norway problem? So that's an interesting one. In YAML, if you are using a list of countries in a YAML list somewhere, and you are using the ISO country codes,
22:22
CZ for the Czech Republic, FR for France, whatever, and NO for Norway. Well, no in YAML means false. So you end up with a list of countries with false in between. That's one example of a problem with YAML; I think it's the funniest one. But yeah, there are many more examples. If you want to store the version of a system, Python 3.11, you will end up with the decimal 3.11 and not the string.
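A small illustration of both gotchas under a YAML 1.1 loader such as PyYAML:

```yaml
countries:
  - CZ
  - FR
  - NO                 # parsed as boolean false, not the string "NO"
python_version: 3.11   # parsed as the float 3.11, not the string "3.11"
quoted_version: "3.11" # quoting forces a string
```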
22:53
So you need to sometimes use quotes, sometimes not. There's no properly defined canonical version of YAML, contrary to JSON.
23:02
Indentation can be a mess; a space can lead to hours of debugging. I would say those are the main issues I have personally with YAML. Thank you very much. Thank you. We had an online question that disappeared, because you answered it with this one. So we have one more question: how hard is it to debug all these processes when you have YAML only, and not, like, Python code?
23:28
Yeah, it can be pretty hard. So the point of declarativeness is to be a layer of abstraction on top of imperative or object oriented programming.
23:40
Meaning when something goes wrong, there's one more layer to dig into to figure out what is wrong, one more abstraction I need to go through, and it might be leaky. So it can be pretty hard. My own experience tells me that it solves more problems than it creates, so it frees up more time than it costs in debugging.
24:05
Brilliant, thank you. And we do actually have an additional question: do you use, like, YAML schema validation? So, tools for validating the YAML? Yes, we do. YAML being a superset of JSON, any JSON Schema tool is usually able to solve such problems.
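A minimal sketch of that approach in Python (file name and schema are illustrative; uses the third-party PyYAML and jsonschema packages):

```python
import yaml                      # PyYAML
from jsonschema import validate  # jsonschema package

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}},
    "required": ["name"],
}

with open("model.yml") as f:
    doc = yaml.safe_load(f)      # YAML parses to plain dicts/lists

validate(instance=doc, schema=schema)  # raises ValidationError on mismatch
```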
24:26
We do build integrations in IDEs in order to easily write correct YAML. When it comes to the pure semantics of a given YAML configuration, we built dbt-score for exactly this purpose.
24:41
For dbt, of course, but maybe the same ideas can apply to different systems. Okay, so the last question is going to be, probably, from me: do you miss writing Python a lot? Of course. I also do get to do it: by saving time, having my CI/CD system take care of things, I can go back to programming.
25:02
Brilliant. Thank you very much. So, a big round of applause for Matthieu. Thank you.