Nix for data pipeline configuration


Formal Metadata

Title
Nix for data pipeline configuration
Title of Series
Author
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
2018
Language
English

Content Metadata

Subject Area
Abstract
My team develops a data pipeline to generate music recommendations. It consists of many batch jobs that read data from somewhere and write their output somewhere else, with complex dependencies and parameter tuning. Historically, we have configured these batch jobs with hand-written bash configuration, or with dedicated Python-based tools such as Airflow. However, both lack flexibility, often forcing the developer to bypass them and run jobs manually during development. The tasks of data pipeline configuration and package definition share some requirements: both imply running many programs in a specific order and with specific parameters. Since Nix is a language dedicated to package definition, which allows expressing packages in a succinct and highly flexible way, we decided to try using it for data pipeline configuration. Nix-the-tool is too centered around package management for our use case, so we built our own tool around Nix-the-language. In this talk, we'll explore how to apply Nix to data pipeline configuration. This will give us the opportunity to look at Nix as a language, abstracted from its current ecosystem. We'll also explore how to structure a Nix codebase, encountering the same questions nixpkgs encountered a long time ago, but in a much smaller environment. The main goal of this talk is to share the different point of view on Nix that comes from applying it to a different problem and starting from scratch. We also hope to serve as an inspiration to explore other Nix-based DSLs. --- Bio: Georges is a Software Engineer at SoundCloud, in Berlin. He is part of the team that generates music recommendations. He loves exploring new ways to solve engineering problems, which led him to look into exciting technologies such as Haskell and NixOS. Some of his favorite hobbies are playing board games and learning German.
All right everyone, let's continue. I hope you enjoyed lunch and there was enough for everyone. Our next speaker is Georges, and if you have ever used the Travis Nix integration and enjoyed it, give Georges a hand afterwards, because he's the one responsible for it. But that is not the topic for today: today Georges is going to talk to us about Nix for data pipeline configuration. Enjoy, and give him a hand.

Hello there, can you hear me in the back? Okay. I'd like to start with a short show of hands: is there anybody here working with something around data science, machine learning, big data? Some of you. And are any of you using tools like Luigi or Airflow to schedule and run your batch jobs? A few of you. For those few, the short version is: I replaced that with Nix. Let's go for the long version. Hi, I'm Georges, and I
work at SoundCloud. It's a music streaming platform where artists upload their music themselves, and what I do there is recommendations. Because anybody can upload anything, it can be quite hard to actually find content you like, so my team and I build the tools that generate recommended playlists containing tracks you might like. It's a very interesting topic, come talk to me afterwards if you want to know more about it, but for the purposes of this
talk you only need to know that it's mostly batch jobs: long-running jobs that run on all of our users' data, take hours, read data from somewhere and write it somewhere else. And the reason
I'm here today is that I tried to use Nix for my batch jobs. So why did I do that? Let me give you a little bit more context about what that means.
As I said earlier, we have batch jobs. At the lowest level they are commands we run; we write them in various languages, some in Scala, some in C, but in the end it's just a command we run, giving it some input and some output. We've got a bunch of them, with dependencies between them: some of them expect some other one to have run before. Part of the job of having these batch jobs is to make sure they run in the right order, with the right inputs and outputs. I don't want to accidentally run today's run with yesterday's data, because I would get weird results, so I need some fine control over how this runs. Here is an example, a simplification of what we run in production: a bunch of different jobs, some sharing dependencies, some not, and it can be kind of a mess to configure, maintain and monitor. Just a bit more context: all the data we have is stored on something called HDFS, which stands for Hadoop Distributed File System. It's a distributed, clustered file system: we have a bunch of machines and every machine holds some of the data; the data is not centralized anywhere. That gives us an interesting property when we develop: we don't care about the individual machines' file systems, we only care about that one big central thing, the HDFS. There is only one file system for us when we work with it, which is a nice simplification. We run these batch jobs daily, because we compute new recommendations for users daily, but we also run them multiple times a day while we are developing. So we've got one production run a day, and then when I'm trying some new logic I'm going to run a job, and run it again, iterating multiple times. It's this tension
between the production run and the development run that is the source of the initial annoyance that led me to pursue this. For production, we want the batch jobs to be nice to express; we want them to be reliable, because they run every night and we don't want to come in the morning and realize they failed; and we want them to be maintainable, so I can come back in a few months and still figure out what's happening. For development, on the other hand, I care about flexibility: trying something new as fast as possible in order to get to the next insight quickly. These two things don't really go together, at least with the tools we currently have; they either have the property of being stable and good for production, or of being flexible and good for development, and I wanted to try something that would have both. To give you a better idea of what I mean by flexibility and tweaking, here is an example of one of our pipelines. It's a bunch of batch jobs that compute recommended playlists. We start with a set of candidates (candidates is what we call tracks that might end up in the user's playlist, but we are not sure yet), and multiple batch jobs enrich them, filter them, score them and finally build the final playlists. In production this runs every night. Now say I want to write some new filtering code, because we realized the old one is rubbish and we can do better. I've written the code, compiled it, it's on my laptop, and I'm ready to run a new job. What I want is to run the new filtering code and then re-run the stuff that depends on it, because what I'm actually interested in is the final result: what impact does my change to this job have on the final result of the whole pipeline? Ideally I also want to do that without re-running candidates and enrich candidates, because I know they will be the same; I did not change anything there. Sadly, with the current tooling, the only way to do that is to go to whatever tool I have in production, look at how the filter-candidates job was run, copy that command, edit the parts I want to change (the code I'm using, the output path on HDFS), then go to the next job, score candidates, do the same thing, copy the command, change the input and output paths, run it, and then the same for the last one. Which, first of all, is annoying, but more than annoying, it is unnecessarily hard. And the reason
I really find it a problem is that it is a bad incentive. We want the right thing to be easy. The right thing here is to test my job completely, to check the final result, but if it's hard I will be a bit less likely to do it, at some point I will just stop doing it, and the quality will go down. So it's really a matter of aligning incentives: the right thing should be the easy thing, and that's what I want to reach here. I want testing the whole thing to be easy, not hard and annoying. Another example, to show that this is not only a development-versus-production problem: now I've written my new filtering logic, but I'm not quite sure it's actually better than the original one. To make sure, we use a technique we rely on quite a lot, A/B testing. The idea is that I take my old logic that generates the old recommendations and the new one that generates the new recommendations, serve each to a different set of users, and then compare how they perform. For example, if I'm interested in the listening time in these playlists, to see if users like them, I compare the users of the old playlists with the users of the new playlists. But to be able to do that I need to compute both sets of recommendations, so I need to compute both branches of the pipeline. With the tools I currently use to define my pipelines, I have no choice but to duplicate the parts I want to run twice: I duplicate the filtered candidates, the scored candidates and the final playlists, copy-paste them, and then tweak the paths to make sure they write to two different places, otherwise it's just going to become one big mess that stops working, or worse. I find this annoying and hard, and it's going to be even worse to maintain, because if I have to change one of those jobs in the future I'll have to do it twice, or forget to, which is a maintenance nightmare. What I want to reach is that the code to express my pipeline should be as simple as the idea I want to express. If the idea is "run the whole thing, but with a different filtering logic", it should not be "copy-paste it and then tweak things here and there until it works"; it should be expressing the whole thing, with a different filtering logic. That's what I wanted to reach, and with that in mind I turned to Nix. First of all because Nix is a pretty nice language for package definitions: it is very pleasant to contribute to nixpkgs and change a package, because package definitions are actually very nice in Nix. But more than that,
the thing that really made me look into Nix for this is that it is a language that actually allows you to manipulate definitions. If you want to change a package in Nix, you don't have to copy-paste the definition and then change it; you can, in the language itself, make tweaks, make overrides. In the previous example I had the package definition for the less package; I can take that definition and say: actually, the ncurses dependency is going to be a different one, a different version. This looks very similar to the thing I actually want to reach, and this is why I wanted to try Nix to solve this problem of mine.
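As a rough sketch, that kind of tweak in nixpkgs might look like the following; the exact inputs of the less package are assumed here rather than taken from the talk:

```nix
# Illustrative only: reuse the existing definition of `less`, but swap its
# ncurses dependency for a tweaked one. `.override` is available on packages
# defined via callPackage in nixpkgs; the patch file is hypothetical.
let
  pkgs = import <nixpkgs> { };
  myNcurses = pkgs.ncurses.overrideAttrs (old: {
    patches = (old.patches or [ ]) ++ [ ./my-ncurses-fix.patch ];
  });
in
pkgs.less.override { ncurses = myNcurses; }
```

The point is that the original definition is reused as-is and only one input is swapped, which is exactly the shape of change the speaker wants for pipeline jobs.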
So let's talk about how I actually did it, and what the final result is. I'd like to introduce you to mix. I'm very bad at naming, so the thing is called mix, and it is an implementation of Nix dedicated to data pipelines. I say an implementation of Nix, but most of the actual implementation work happened in the hnix library I'm using, which does all the parsing and the evaluation of Nix. So I did not actually re-implement the whole of Nix; somebody else did that for me, thank you very much.
The reason we need this separate implementation is that the definition of a derivation in Nix is not quite what we need. Nix has a very strong idea of the Nix store: everything you build will end up in the store, most of the time /nix/store. You could point Nix at something else, but you will always have one single store, and that does not work very well in our case. First of all, the Nix store lives on your local file system, and we don't want to build stuff on our local file system; we want to build stuff on HDFS. Then, where I work, we have a strong set of conventions about how things are organized on HDFS: every team has its own subdirectory, and different projects have their own subdirectories too, so having everything built into one big store that contains everything is not really an option. As a result, the mix tool redefines derivations: derivations in mix are very similar to the ones in classical Nix, but their output can go pretty much anywhere in the file system, not just into the store.
We also implement some very simple building, at a proof-of-concept level. We have these derivations and we build them and their dependencies, but we don't do any parallel building and we don't do any sandboxing; it's the simplest thing to get a proof of concept working. We don't even serialize the derivations; we don't write them to disk or put them in a database. We could, and it would definitely be nice for building tooling around this, but this version does no such thing; we are mostly interested in how to build the derivations. The last piece is that I decided to use Docker to specify how to build the derivations, once again because of pre-existing conventions where I work: we build and distribute most of our stuff as Docker containers, so the code I need to run my pipelines is already available as Docker containers, with all the tooling it needs. So this derivation, on top of being able to have its output anywhere, also has one more attribute: the container in which the derivation needs to be built. It's mostly just passed through to the builder, which ends up calling Docker.
We end up with this new tool in which we have a derivation primitive. This is Nix the language, but it is not interpreted by Nix the tool; it is interpreted by mix. It is nearly the same derivation function you would expect to see in classical Nix, except it has two attributes you don't have in Nix: a prefix, which says where the output is actually supposed to go, and a container, which says how to build the derivation. When you run mix, it does what you would expect from the derivation function: it takes the prefix, the hash of the derivation and the name, and gives you the ability to build it. I said we don't serialize the derivations; what you see here is the pretty-printing of the in-memory representation of a derivation. Nothing very fancy: it is nearly the same as what you would find in a .drv file in the Nix store, except it has a container attribute. Like a usual derivation it has an output, and it tells you, through the builder, the args and the environment variables, what to run in order to get that output.
So we have this mix tool, very close to Nix, that is suitable for defining our batch jobs, because it knows about HDFS and works around the restrictions of Nix. Now we face the task of writing a Nix codebase, the codebase being the definition of the pipeline: all my batch jobs and their dependencies. We can start with the simplest way to do it: defining a new derivation with a very raw call to the derivation function. For my first job, the candidates job, I give the name, the container, the builder (which is going to be bash), the arguments I pass to bash, the prefix, and some environment variables I need to provide for the rest to work. And I can make my first derivation like that.
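A raw definition along these lines might look roughly like this; all concrete values (image name, paths, the spark-submit invocation) are illustrative, and only the prefix and container attributes are specific to mix as described above:

```nix
# candidates.nix - one batch job expressed as a raw derivation call.
# `prefix` (where on HDFS the output should live) and `container` (the Docker
# image to build in) are the mix-specific attributes; the rest mirrors the
# classical Nix derivation primitive. Assumes mix exposes the computed output
# path as $out, the way Nix does.
derivation {
  name = "candidates";
  container = "registry.example.com/discovery/spark:latest";
  prefix = "/user/discovery/recommendations";
  builder = "/bin/bash";
  args = [
    "-c"
    "spark-submit --class com.example.Candidates candidates.jar --output $out"
  ];
  SOME_ENV_VAR = "value";
}
```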
In memory it looks like this, nothing fancy; it's pretty much the same thing I had before. I can see that the output was filled in from the prefix, the hash and the name, and I can run it: it will build it, run the actual spark-submit command, which is what we need to run our Spark jobs, and put the output in the output path. So far so good, nothing super fancy; it's not really better than hand-written bash scripts. So we want to make
this better. Because I've looked at nixpkgs, I know we can make the expression of packages much nicer, and I know we can have this override, this tweaking thing we were talking about earlier. To see how to do that, I looked into pretty much the only Nix codebase I know, the definition of all the packages around Nix, which is nixpkgs, and I saw multiple patterns we could reuse to organize our own mix codebase, to make it nicer and easier to express these derivations, these batch jobs. So here we are, back with the very raw definition. The first thing we notice is that most of the jobs we want to build, we want to build with bash, and the thing we actually want to write is just the command: the difference between this version and the previous one is that I just provide the command to run in bash, and I don't bother writing how to run bash and with what flags. To define this bashDerivation function we use a pattern that is very common in nixpkgs: we define a function, bashDerivation, which takes a set of attributes and calls derivation with those attributes, but on top of that picks some things out of the attribute set and uses them to inject new arguments. In this case we are mostly interested in the command and the container that are passed in; we pass them on to derivation, but we also add the builder, which is always going to be bash, and the args, which are always going to be "-c" plus the command. This pattern, which is very common in nixpkgs, gives us this bashDerivation, so every derivation I want to build with bash can use it instead of invoking the raw builder.
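A minimal sketch of such a bashDerivation helper, with attribute names guessed from the talk rather than taken from the actual code:

```nix
# A wrapper in the style of nixpkgs' runCommand: forward the attribute set to
# `derivation`, but fix the builder to bash and turn `command` into
# `bash -c <command>`. Concrete values below are illustrative.
let
  bashDerivation = attrs:
    derivation (removeAttrs attrs [ "command" ] // {
      builder = "/bin/bash";
      args = [ "-c" attrs.command ];
    });
in
bashDerivation {
  name = "candidates";
  container = "registry.example.com/discovery/spark:latest";
  prefix = "/user/discovery/recommendations";
  command = "spark-submit --class com.example.Candidates candidates.jar --output $out";
}
```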
But we can go further, because I know most of my jobs are going to be Spark jobs, and if I look here there is a lot of boilerplate needed just to run Spark. So I want a sparkDerivation definition, and once again I use the same pattern: a function that takes an attribute set, extracts the interesting attributes out of it, and then calls another function, in this case bashDerivation rather than derivation directly, so I can layer these abstractions on top of each other. Here I take a bunch of different arguments, because there are a lot of things I might want to configure in my Spark jobs: the jar the code is in, the class (this is the Java world) that is the entry point of the code, and some of the arguments I might be interested in. And we call bashDerivation with a command built from all these arguments, all of which we can override.
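A sketch of how such a sparkDerivation could be layered on top of the bashDerivation helper above; the parameter names and the spark-submit command line are illustrative, not the speaker's exact ones:

```nix
# sparkDerivation: extract the Spark-specific arguments (jar, entry-point
# class, job arguments), build the spark-submit command line, and delegate
# everything else to bashDerivation. Assumes bashDerivation is in scope.
let
  sparkDerivation = { jar, class, jobArgs ? [ ], ... } @ attrs:
    bashDerivation (removeAttrs attrs [ "jar" "class" "jobArgs" ] // {
      command = ''
        spark-submit --class ${class} ${jar} ${builtins.concatStringsSep " " jobArgs} --output $out
      '';
    });
in
sparkDerivation {
  name = "candidates";
  container = "registry.example.com/discovery/spark:latest";
  prefix = "/user/discovery/recommendations";
  jar = "candidates.jar";
  class = "com.example.Candidates";
}
```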
So now we have a nice way to define a single batch job. To define all of them, I can use a recursive set in Nix. I define my first derivation, my first batch job, candidates, by calling sparkDerivation, and then I define the next one, enrich candidates, also with a call to sparkDerivation; it can refer to candidates because the set is recursive. When you refer to a derivation where a string is expected, you get the output of the derivation, the output path where it will actually be built, as a string with an attached context, a marker saying that this string comes from that derivation. That is what lets Nix know that enrich candidates actually depends on candidates. And it works: I get my set of jobs. This is actually my pipeline, the set of all my batch jobs that depend on each other, and I can run it.
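Assuming the sparkDerivation helper from the sketches above, the pipeline as a recursive set might look roughly like this (job names and parameters are illustrative):

```nix
# pipeline.nix - the whole pipeline as one recursive attribute set.
# Interpolating `candidates` yields its output path as a string with context,
# which is how the dependency between the two jobs is discovered.
rec {
  candidates = sparkDerivation {
    name = "candidates";
    jar = "candidates.jar";
    class = "com.example.Candidates";
    prefix = "/user/discovery/recommendations";
    container = "registry.example.com/discovery/spark:latest";
  };

  enrichCandidates = sparkDerivation {
    name = "enrich-candidates";
    jar = "enrich.jar";
    class = "com.example.EnrichCandidates";
    prefix = "/user/discovery/recommendations";
    container = "registry.example.com/discovery/spark:latest";
    jobArgs = [ "--input" "${candidates}" ];
  };
}
```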
That is already a pretty nice language to define my pipelines: it gives me abstraction, so I can refactor, reduce the code, and keep it clear and simple. It is actually better than what I already have in production. But I want more: I want to actually tweak jobs, because so far I have done nothing about the tweaking and the flexibility I was talking about earlier. What we want is this:
I want my weekly.nix file that contains all my production definitions, and I want to import it and override it. I want to take the candidates job and say: the important parameter should be set to 11 instead of 10, because it is very important that it is 11, or at least I want to try it set to 11.
For this we can take an abstraction that is present in nixpkgs: makeOverridable. It is basically a wrapper around a function: when you call it, you still get the original result of the function, but it also injects into that result an override function that lets you call the original function again, tweaking the arguments it was called with the first time. As an example of how to use it, I have a very simple function here, makePath, that takes a prefix and a name and concatenates them with a slash in between. I make it overridable and then call it: the prefix is /user and the name is discovery, which is the name of my team, and I get the path that contains the result. But the result also has an override function I can call with additional arguments, which replace the original arguments of the function. So here I can call override, passing just the name: it keeps the original prefix, and I get my new result, my new name with my original prefix.
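A minimal reconstruction of that makePath example; nixpkgs' lib.makeOverridable is more general, this is just the core idea, with the result wrapped in an attribute set so that `.override` has somewhere to live:

```nix
# A stripped-down makeOverridable for functions of one attribute set: return
# the original result plus an `override` attribute that re-calls the function
# with tweaked arguments.
let
  makeOverridable = f: args:
    f args // { override = newArgs: makeOverridable f (args // newArgs); };

  makePath = makeOverridable ({ prefix, name }: { path = "${prefix}/${name}"; });

  teamPath = makePath { prefix = "/user"; name = "discovery"; };
  otherPath = teamPath.override { name = "recommendations"; };
in
{
  inherit teamPath otherPath;
  # teamPath.path  == "/user/discovery"
  # otherPath.path == "/user/recommendations"
}
```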
We can use this technique on our definitions of derivations. First I have to change things a bit: I need to make the definition of my derivation a function that takes the arguments I want to override. So I change the important parameter: it is now an argument of that function, it defaults to 10, and it is used in the definition of the derivation. Then I can make candidates by calling makeOverridable on makeCandidates, and that lets me take my definition, call override on it to change the value of the important parameter, and get a new derivation that is exactly the same as before except that this specific parameter has been overridden. I can do that for parameters, but I can also do it for dependencies: here I make enrich candidates overridable by making candidates an argument of the function, the same thing, and that allows me to override what candidates is in the definition of enrich candidates. I could make it another derivation, or I could make it a path, a string: maybe I have precomputed data and I absolutely want to run on that precomputed value, and I can do it that way. This returns a new derivation that is exactly the same as enrich candidates, except the candidates input is this one instead of whatever I had before. That's already pretty nice: I am now able to take any single batch job, any single derivation I have, override it, and tweak one or multiple of its parameters.
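Putting that together for the jobs themselves, a sketch along the lines described (parameter names, flags and paths are illustrative, and makeOverridable and sparkDerivation are the hypothetical helpers from the earlier sketches):

```nix
# Each job becomes a function of its tweakable arguments, wrapped with
# makeOverridable so parameters and dependencies can be overridden later.
let
  makeCandidates = { importantParam ? 10 }:
    sparkDerivation {
      name = "candidates";
      jar = "candidates.jar";
      class = "com.example.Candidates";
      jobArgs = [ "--important-param" (toString importantParam) ];
      prefix = "/user/discovery/recommendations";
      container = "registry.example.com/discovery/spark:latest";
    };
  candidates = makeOverridable makeCandidates { };

  makeEnrichCandidates = { candidates }:
    sparkDerivation {
      name = "enrich-candidates";
      jar = "enrich.jar";
      class = "com.example.EnrichCandidates";
      jobArgs = [ "--input" "${candidates}" ];
      prefix = "/user/discovery/recommendations";
      container = "registry.example.com/discovery/spark:latest";
    };
  enrichCandidates = makeOverridable makeEnrichCandidates { inherit candidates; };
in
{
  # Same job, with one parameter tweaked:
  candidates11 = candidates.override { importantParam = 11; };
  # Same job, but reading from a precomputed path instead of the candidates job:
  enrichFromPath = enrichCandidates.override {
    candidates = "/user/discovery/precomputed-candidates";
  };
}
```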
But that's not yet what I want. Remember, I wanted to tweak the entire pipeline: take a derivation, change it, and then get the final result that depends on it. I don't have that yet, so let's get it. This is what I want to write: import my production definitions and extend them by changing what candidates is in that set; not just what candidates is on its own, but what it is within that specific pipeline. The issue I ran into is that I defined all my jobs with a recursive set, which means that once Nix evaluates it, I get the set back: the recursion is part of the syntax of Nix, so I get a set that contains candidates and enrich candidates, but there is nothing in that set that tells me that the candidates attribute was used in the definition of enrich candidates. It is the case, I did use the candidates attribute in that definition, but by the time I get the set back I have lost that piece of information, which I need if I want to actually do that override.
The way to work around that is to do the recursion ourselves. Instead of defining the pipeline as a recursive set, we define it as a function that takes a set and returns a set, and that uses its input set to look into "itself", because we are going to call it by passing its own output as its own input. This technique is called fixed-point recursion; it is pretty common in lazy functional programming. It lets us represent our set this way and get exactly the same result we had before. Once I turn the representation of my pipeline into that, I can define makeExtensible. This is also something you can find in nixpkgs, although this one is a bit tweaked; it returns the fixed point of the set, so the actual set of derivations, plus an extend function that lets me modify the recursive set before actually applying the recursion.
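A simplified sketch of that machinery, modelled on nixpkgs' lib.fix, lib.extends and lib.makeExtensible, and reusing the hypothetical helpers from the earlier sketches:

```nix
# fix ties the knot; extends layers an overlay (self: super: { ... }) on top
# of a pipeline function; makeExtensible returns the fixed point plus an
# `extend` attribute. Simplified relative to the nixpkgs versions.
let
  fix = f: let x = f x; in x;
  extends = overlay: f: self:
    let super = f self; in super // overlay self super;
  makeExtensible = f:
    fix f // { extend = overlay: makeExtensible (extends overlay f); };
in
# The pipeline as a function of its own final result (`self`), so that
# enrichCandidates keeps pointing at whatever `candidates` ends up being.
makeExtensible (self: {
  candidates = makeOverridable makeCandidates { };
  enrichCandidates = makeOverridable makeEnrichCandidates {
    candidates = self.candidates;
  };
})
```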
That is exactly what I wanted: I can say "I want the set, but with something modified in its definition", and all the things that are recursively defined in terms of it will change too. And the two mechanisms I just mentioned can be used together: I use extend to change the definition of the whole pipeline, and I use override to change one parameter. This is the thing I wanted in the first place. It lets me say: take the definition I have for my production pipeline, change candidates, change the important parameter of candidates to this value, and give me the final result that depends on it. And it gives it to me; I actually have it.
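In code, the combination might look roughly like this; weekly.nix and finalPlaylists are hypothetical names standing in for the production pipeline file and its last job:

```nix
# Hypothetical usage: ./weekly.nix is assumed to evaluate to a makeExtensible
# pipeline like the sketch above.
let
  production = import ./weekly.nix;
  experiment = production.extend (self: super: {
    candidates = super.candidates.override { importantParam = 11; };
  });
in
# Every job defined in terms of self.candidates (enrichment, scoring, the
# final playlists) transitively picks up the overridden candidates job.
experiment.finalPlaylists
```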
So, in conclusion: Nix is a pretty awesome DSL for data pipelines. Not only does it give me a very nice way to express the data pipeline, it also gives me the overriding feature I wanted, which, to be frank, I have not found in any other tool that is usually used to express data pipelines. The other thing I want to conclude is that data pipelines are a great laboratory for Nix: having this small set of packages, a small set of batch jobs and their derivations, allowed me to explore the abstractions we have in nixpkgs, and actually understand them, in a much smaller scope. And to finish on a bit of a teaser: it also allowed me to explore techniques that are not yet used in nixpkgs. I took some inspiration from design documents around configuration, around expressing derivations and packages as recursive sets, and I implemented another version of my pipeline definition based on that idea of recursive sets. It works quite well; it is actually nicer to use and to express than the original thing, and it lets me, in the end, write my final derivation by saying (this is a different definition of extend, not the same wording as before): take my whole pipeline, change the important parameter of candidates to 11, and change the number of executors of another candidates job to 500. This is pretty nice because, as Eelco mentioned earlier, it is a way to get around the restrictions around override and overrideAttrs, which I have not mentioned, but which is another way to override that you also have to learn if you want to express every single override you might want. And yeah, it's actually pretty nice. That's everything I have.
Thank you very much for your attention. [Applause]

Host: Thank you so much for your great talk. Do we have questions? Yes, hands up already.

Q: Thank you, that was very interesting. You write your own logic in mix; how do you test it? How can you prove that what you intend is actually happening?
A: Thanks, that's a very good question, and one I have not solved yet. It's actually a question you have with every data pipeline configuration tool. In our current configuration we have very strict tests that check that the commands we actually run are the ones we want to run, which is too much testing, because every time we want to change the pipeline we change the tests to match without really thinking about it. I don't have a good answer on how to test this in a way that isn't just checking that the output is exactly the thing you wanted to run.

Q: I know upstream hnix doesn't have string contexts yet; how did you implement that? It does have string contexts, just not attached to anything, and I made a quick patch, so I was wondering if you had a good solution there.
A: I implemented it in a previous iteration of hnix, before the recursive things were fixed, and then re-implemented it. There's a pull request that is closed because it doesn't go in the right direction, but it's good enough to work.

Q: Are you using this in production?
A: I am not using this in production; it's mostly in a proof-of-concept state for now. Not because it isn't good, but because the quality of the code I wrote to make it work is definitely not production grade.
Q (follow-up): Did you show this to your co-workers, and what did they say?
A: I showed it to my co-workers and they said: why is this not yet in production?

Q: Do you plan to open-source it at some point?
A: Yes. I still have to untangle the parts that are open-sourceable from the parts that are tied to the actual pipelines we run, which I cannot make public, but I definitely want to publish it in the coming weeks.

Q: This looks really great for pure data pipelines. Do you have any way to integrate with asynchronous triggers, say another team provides some data, or a human has to sign off on something?
A: No, that's a limitation, or at least something that is definitely not solved by this. In a previous version of my slides I had a list of limitations. The idea is that this could be the basis on which to build a very nice job scheduling tool, not the scheduling tool itself; there are still other problems to solve on top, such as the one you mention. For now, in my proof of concept, the way I solved it is that I have a bash script that resolves external dependencies and passes them as arguments to the Nix code. You would definitely want something better for production.

Host: Alright, that's all the time we have, so thank you so much for your wonderful talk. [Applause]