Get in control of your workflows with Airflow


Formal Metadata

Get in control of your workflows with Airflow
Title of Series
Part Number
Number of Parts
Trebing, Christian
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Release Date

Content Metadata

Subject Area
Christian Trebing - Get in control of your workflows with Airflow

Airflow is an open source Python package from Airbnb to control your workflows. This talk will explain the concepts behind Airflow, demonstrating how to define your own workflows in Python code and how to extend the functionality with new task operators and UI blueprints by developing your own plugins. You'll also get to hear about our experiences at Blue Yonder, using this tool in real-world scenarios.

-----

Whenever you work with data, sooner or later you stumble across the definition of your workflows. At what point should you process your customer's data? What subsequent steps are necessary? And what went wrong with your data processing last Saturday night? At Blue Yonder we use Airflow, an open source Python package from Airbnb, to solve these problems. It can be extended with new functionality by developing plugins in Python, without the need to fork the repo. With Airflow, we define workflows as directed acyclic graphs and get a shiny UI for free. Airflow comes with some task operators which can be used out of the box to complete certain tasks. For more specific cases, tasks can be developed by the end user. Best of all: even the configuration is done completely in Python!
Computer animation
A warm welcome to our next speaker, Christian Trebing, who will talk about getting in control of your workflows with Airflow.
Thank you. Welcome to my talk, "Get in control of your workflows with Airflow". My name is Christian Trebing and I work as a software developer at Blue Yonder. We also have a booth here, so if you are interested, just drop by later and ask any questions. Imagine the following scenario, which I know from my daily work: you are a data-driven company, and each night you get data from your customers. That data wants to be processed, because that is how you make money. The processing happens in separate steps: for example, you have to load the data, you apply some machine learning steps, and you take decisions based on the results of the machine learning. Then you need an overview of what happened: why did it happen, when did it happen? Especially since most of this runs at night, you need to see the next morning what possibly went wrong. And, as you might have guessed, the schedule is tight, so processing time does matter. What options do you have to realize such a scenario? The first thing that comes to mind for most people is cron.
We also have many projects where we started with cron. It is a great way to start and it works out of the box, but you only have fixed start times: you cannot say that this step depends on that step and should start after it has finished. So, for example, you load the data at midnight, run the prediction at 1 o'clock, and take the decision at 2 o'clock. Besides that, the error handling is hard: you always have to search for the correct log files when something went wrong.
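As a sketch, such a fixed-time schedule might look like this in a crontab; the script paths and names here are hypothetical placeholders:

```
# midnight: load the data; 1:00: run the prediction; 2:00: take the decision
0 0 * * * /opt/pipeline/load_data.sh
0 1 * * * /opt/pipeline/predict.sh
0 2 * * * /opt/pipeline/decide.sh
```

Each entry fires at a fixed wall-clock time; nothing here expresses that the prediction must only start after the load has finished.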
So, as I said, the time schedule is tight and you would like to finish earlier. As we saw, each of these steps runs roughly one and a half hours, so why not compress the schedule? We could do better.
We could move the start times closer together. That works most of the time, but sometimes the database is slow, sometimes an application hangs, and then one run takes longer: the load of the data takes ten minutes longer, the prediction fails because its input is not there yet, the decision run fails, and the complete nightly run is broken. When that is discovered, it costs money, because the customer does not get the data. That is the issue with cron: with this mechanism we always had failures whenever the schedule was not tuned to the actual run times. And what about the future? Customers send more data, processing time gets longer, and we need to find better solutions.
Why not write a little tool ourselves? It seems so simple: we just have to check that the first step has finished and then start the next one. That cannot be that hard, and at the start it is indeed very easy; in multiple projects we did exactly that, and it worked for the first steps. But afterwards you hit the limits.
Maybe you need concurrency, that is, multiple tasks running at once. You need to know which task failed, and why. You might not only want time triggers, but also want to trigger runs manually or from an external system. At that point you have to take a decision: either you accept the limits, which is fine, or you maintain your homegrown implementation, which turns out to be much more complex than initially thought, and you are stuck. We were in that situation, and we wanted to harmonize all these workflow tools from our different projects, so we had a look at several open-source workflow implementations.
There are many interesting ones with many different properties. For example, we looked at one tool that lives more in the HDFS space, which is not in our technology stack, and the same was true for others. In the end we decided for Airflow, an open-source project initiated by Airbnb, hence the name. Why did we decide for Airflow? For one thing, it is written in Python, and we like that.
The one thing that is really cool is that the workflows are defined in Python code. They are not sitting in some JSON files, and not sitting in some database rows; each workflow really is Python code. You can put it into your version control system, you get all the versioning, and that is a really good way of managing your workflows. It has most of the features whose absence we had run into as limitations: you can look at the present and the past runs, it has alerting features, it is great. It is extensible: you can write your own extensions in Python code and plug them in without having to modify the open-source code itself; more on that later. It is under active development: at the moment it is an Apache Incubator project, people are working on the pull requests, there is lots of traffic, and you can see that it moves forward. It has a nice UI, which I will show you, you can define your own REST interfaces, and it is relatively lightweight: you have two processes on the server, plus a database to store the state.
So how does a workflow look like? This is the Python code I talked about: each workflow is instantiated as a DAG, a directed acyclic graph. You give it some parameters, like its name and its schedule, which you can express in cron syntax or with shortcuts. Then you define your workflow steps as operators; I will tell you more about operators later. The three steps we are doing: we load the data, we predict, and we take the decision. The connections between these steps you define with set_upstream: you say that before the prediction, the load of the data needs to happen, and before the decision, the prediction, so they run in that order. Now to more complex stuff: maybe we get more data, the prediction takes longer, and we want to parallelize. Say we do one prediction for the German customers and one for the UK locations. Then predict-Germany and predict-UK both depend on the load of the data, and the decision depends on both of them. That fits very nicely into this model, and it gives you a proper directed acyclic graph.
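A minimal sketch of such a DAG definition file, including the fan-out into a German and a UK prediction; the task ids, shell commands, and dates are invented for illustration, and the imports follow the Airflow 1.x API of that era:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    "data_processing",
    start_date=datetime(2016, 7, 17),
    schedule_interval="0 0 * * *",  # cron syntax: every night at midnight
)

load = BashOperator(task_id="load_data", bash_command="load.sh", dag=dag)
predict_de = BashOperator(task_id="predict_de", bash_command="predict.sh de", dag=dag)
predict_uk = BashOperator(task_id="predict_uk", bash_command="predict.sh uk", dag=dag)
decide = BashOperator(task_id="decide", bash_command="decide.sh", dag=dag)

# Both predictions depend on the load; the decision depends on both.
predict_de.set_upstream(load)
predict_uk.set_upstream(load)
decide.set_upstream(predict_de)
decide.set_upstream(predict_uk)
```

Because this file is ordinary Python, it can live in version control like any other code.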
You can build arbitrarily complex workflows this way. There is also the possibility for decisions, for branches, but at least we did not need them so far; most of our workflows are quite linear, with just a few parallel parts. So how does the nice UI look like that I promised you? The UI gives you an overview page.
There you see all your workflows, each with its schedule and the status of the most recent runs. It is a little bit too small to read here, but it shows how many tasks ran correctly, how many are currently running, which ones are erroneous, and which are currently up for retry. You can also look at each run explicitly: you see the sequence of tasks, color-coded, so that you can tell which task was successful, which is currently running, which is erroneous, and so forth. The tree view shows you an overview of all the runs: each column is one run, so you see that these three days ran correctly, all green, while the latest run currently has an issue, and its second step is yellow, which means it is up for retry. So you get a nice overview of how the workflow behaved in the past and how it is doing right now. There is also the runtime view, in which you can spot, for example, performance degradation. Here we have the runs, and the colors are the different tasks: let's say the blue one is the load of the data, this one is the prediction step, and this one is the decision step. Two of them stay roughly constant, and the other one changes over time.
That is very useful for seeing which of the steps might start to take longer. For each run you can also see a Gantt chart of when each step ran.
And there is a log view, which is really useful. Unfortunately it is a little bit small here, but it shows, for example, that the decision task has started a job in the backend system, and that in each following iteration it asks what the status is now, until the job is finished.
So you can follow how each task proceeds. Now, what are the building blocks of these workflows? They are the operators, and many operators are already delivered with Airflow. As examples: you can run things on Bash, you can send
HTTP requests, you can execute statements on databases, you can write Python code directly that is executed by an operator, and you can send mail; these are just a few examples, and there are more in the package. Besides operators there are also sensors. Sensors are steps in your workflow that wait for things: an HTTP sensor could, for example, repeatedly query a URL, ask whether something is finished or what its status is, and based on that either keep waiting or let the workflow proceed.
In the same way, an HDFS sensor checks for a file on the file system, another sensor can check for a database row, and so on. There are many things you can already do with these operators and sensors.
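The waiting behaviour of such a sensor can be sketched framework-independently as a poll loop; the `check` callable and the intervals here are stand-ins for whatever a real sensor probes:

```python
import time

def wait_until(check, poke_interval=1.0, timeout=10.0):
    """Poll `check` until it returns True, like an Airflow sensor poking.

    Raises TimeoutError when the condition never becomes true in time.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(poke_interval)
    raise TimeoutError("condition not met within timeout")

# Example: a fake status check that succeeds on the third poll.
calls = {"n": 0}
def fake_status():
    calls["n"] += 1
    return calls["n"] >= 3

wait_until(fake_status, poke_interval=0.01)
```

Real sensors plug an HTTP request, a file-system stat, or a database query into the `check` slot.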
But there might be situations where you need more. For us, one example was asynchronous processing against our backend systems: take the machine learning system for the prediction. We want to start a job there, so we trigger it with an HTTP request and get back a job ID; then the job runs for five minutes or often longer, we constantly ask whether it is finished, and only when it is finished can we start the next step. This would already be possible with the standard pieces of Airflow: a simple HTTP operator to start the job and a sensor, as described, to wait until it is finished. That works, but it has the disadvantage that you don't see directly how long the job takes. Remember the runtime view: I would like to see how long my decision took, so I wanted this step to have a length, namely the length of the job on the backend system. That is possible, but only with a new operator. I won't explain each line in detail; you can find this afterwards as a complete Airflow example plugin on my GitHub account and check every line there. We have an HTTP connection defined, we have an endpoint we can trigger that returns the job ID as JSON, and we have a job-status endpoint that we can poll. Within the execute method we run the POST to get the job ID and then wait for the job by polling, and once the status is "finished" we return. With that, the Airflow database knows how long this decision really took. Now, how do you get such operators into your system? As I said, we don't want to modify the Airflow code directly, but we can do it with a plugin: a Python package that declares a plugin with some operators, which are simply classes. We lay that onto the file system and tell Airflow in the configuration where the plugins are located, and it looks up the definitions there.
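Stripped of the Airflow plumbing, the polling operator described above boils down to "submit, then poll inside execute, so the task's measured runtime covers the whole backend job". A sketch with a stubbed backend client; the client API here is invented for illustration:

```python
import time

class AsyncJobOperator:
    """Sketch of an operator that starts a backend job and blocks inside
    execute() until it finishes, so the step's runtime equals the job's."""

    def __init__(self, client, poke_interval=0.01):
        self.client = client
        self.poke_interval = poke_interval

    def execute(self):
        job_id = self.client.submit()           # POST -> job id from JSON
        while self.client.status(job_id) != "finished":
            time.sleep(self.poke_interval)      # poll the status endpoint
        return job_id

# Stub backend that reports "finished" on the third status poll.
class StubClient:
    def __init__(self):
        self.polls = 0
    def submit(self):
        return "job-17"
    def status(self, job_id):
        self.polls += 1
        return "finished" if self.polls >= 3 else "running"

client = StubClient()
result = AsyncJobOperator(client).execute()
```

In a real plugin, the class would inherit from Airflow's base operator and the stub would be an HTTP client against the backend system.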
The plugins are then loaded at Airflow's startup. I won't dive into it too much, but the plugin itself is also defined in Python, as you can see here.
The plugin class just inherits from Airflow's plugin base class, and here the operators are assigned, and also a Flask blueprint. Why do you need that? We had the requirement that we wanted an endpoint to talk REST-style with our surrounding systems, so that we can also programmatically say: I want to trigger this workflow manually, or, I want to know whether this run is finished or not. This functionality wasn't there in Airflow, but you can write it as a Flask blueprint with your own endpoints, and it is detected automatically and hooked into the web server. You will see this in the example as well.
So how would such a REST endpoint look like? We have the Airflow web server running on port 8080, we have defined this endpoint, trigger, and we give it the name of the workflow, which is data-processing. We get back the name of the workflow and a run ID, which we can use afterwards to ask for the status. So this works fine. Now, what happens inside of Airflow? It works with two processes: a scheduler process that takes care of when each job should run, and the web server that gives you the UI and the REST interfaces. Both talk to one database, and several databases are supported.
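The logic of the trigger endpoint just described can be sketched independently of Flask: given a workflow name, create a run and report back its id. The run-id scheme and the in-memory store are illustrative stand-ins for Airflow's database:

```python
import json
from datetime import datetime, timezone

RUNS = {}  # run_id -> workflow name; stands in for Airflow's database

def trigger(workflow):
    """Handle POST /trigger/<workflow>: start a run, return its id."""
    run_id = "manual__" + datetime.now(timezone.utc).strftime("%Y-%m-%dT%H%M%S%f")
    RUNS[run_id] = workflow
    return json.dumps({"workflow": workflow, "run_id": run_id})

def status(run_id):
    """Handle GET /status/<run_id>: report on a known run."""
    if run_id not in RUNS:
        return json.dumps({"error": "unknown run"})
    return json.dumps({"run_id": run_id, "workflow": RUNS[run_id]})

reply = json.loads(trigger("data_processing"))
```

In the real plugin these handlers are registered on a Flask blueprint, and triggering creates an actual DAG run.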
In our company we are using PostgreSQL and SQLite. SQLite currently has the restriction that you cannot run parallel tasks on it, but we use SQLite only for local testing, so that is fine; for production you can use PostgreSQL, where you don't have that limit. You can also choose how you want tasks to be executed. Most of the time we just send HTTP requests to our backend systems.
We say: run this task in the backend system, and we wait until it is finished. The backend system itself carries the high workload, so we are happy with the simple executor that processes one task at a time directly. When we want multiple tasks in parallel, we work with the executor that forks local processes. And if you trigger heavy computations and want more power behind the execution nodes themselves, you can use Celery, a framework for distributing work to multiple worker nodes.
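In the configuration this choice boils down to two settings; the values below are examples in the style of an Airflow 1.x `airflow.cfg` (note that SQLite only works with the sequential executor):

```
[core]
# local testing: SQLite plus one task at a time
# sql_alchemy_conn = sqlite:////tmp/airflow.db
# executor = SequentialExecutor

# production: PostgreSQL plus parallel local tasks
sql_alchemy_conn = postgresql+psycopg2://airflow@localhost/airflow
executor = LocalExecutor
```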
There is already a connection from Airflow to Celery for that. So how do we use Airflow? Most of it I already mentioned along the way: we use automatic schedules as well as manual triggers.
We run one Airflow instance per customer system; for us it was easier to do it that way. As databases we use PostgreSQL and SQLite, and we use the sequential and the local executor. We are also contributing to Airflow, and that works really well: for example, triggering a run manually was not possible one year ago, and we needed it.
So we opened pull requests for that, worked on them with the community, and by now these triggers are in. We also needed some functionality for the plugin detection, so we opened a pull request there as well; there is active communication with the community. Now, with all these good things about Airflow, there are a few challenges I want to make you aware of, because these are the things that we, and other project teams using Airflow at our site, struggled with a little bit. They have to do with how the schedule is handled and how the start time is interpreted. For the scheduling there are two dates that are important.
One is the start date: this is when the processing of this task or workflow actually started on the server. That is quite easy; it is simply the time of the run. But there is also an execution date, which is shown quite prominently in the UI, and that one sometimes shows strange values. These values are consistent and explainable, but they are not always obvious. The reason is the history of Airflow: it was used in ETL scenarios, Extract-Transform-Load, which means that for each day there is data coming in that you want to process.
The data arrives over the whole day. So let's say on the 19th of July the whole day's data came in, and you want to process the data for the 19th of July. When can you process it? Only after the 19th is over, which is the 20th or later. So today the data-processing run starts, but its execution date is the 19th: it is always one iteration back. The name was chosen because originally this was more a description of the data than of the run; it is the data of the 19th that is processed. That is fine when you know it, and it should not scare you into thinking the system is doing wild things; it is consistent, but you have to get used to it. We have some workflows on a weekly schedule, which means, don't forget, that a run started now gets the execution date of the Monday of the week before. Again consistent, but you need to get used to it. Then there is the start date, which you might remember from ten minutes ago: we give a start date to each workflow.
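The execution-date rule just described, "one schedule interval back", can be stated in a few lines; the dates are the ones from the example:

```python
from datetime import date, timedelta

def execution_date(run_day, interval=timedelta(days=1)):
    """The run that starts on `run_day` carries the execution date of the
    interval whose data it processes, i.e. one interval back."""
    return run_day - interval

# Daily: the run starting on 20 July processes the data of 19 July.
daily = execution_date(date(2016, 7, 20))

# Weekly: a run starting Monday 25 July gets the Monday a week before.
weekly = execution_date(date(2016, 7, 25), interval=timedelta(weeks=1))
```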
If the workflow is scheduled automatically, the scheduler knows that it may have to fill up runs. When we say the start date is today, the 20th of July, no run is missing. But suppose we have given the 17th of July: then the scheduler detects that some runs are missing and fills them in automatically. It will first trigger a run with execution date the 17th and one with execution date the 18th, and then the regular run for the 19th, which processes the data at the current point in time. You need to check whether this backfilling is applicable for you.
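The backfill behaviour from the example (start date 17 July, scheduler first started on 20 July, daily schedule) can be sketched as:

```python
from datetime import date, timedelta

def runs_to_fill(start_date, today, interval=timedelta(days=1)):
    """All execution dates the scheduler creates when first started on
    `today` with the given `start_date`: every interval from start_date
    up to one interval before today (the last one, for 'yesterday's
    data', is the regular current run)."""
    dates = []
    d = start_date
    while d <= today - interval:
        dates.append(d)
        d += interval
    return dates

filled = runs_to_fill(date(2016, 7, 17), date(2016, 7, 20))
```

With start date equal to today, the list is empty, matching "no run is missing".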
When you have these ETL-like workflows where you really need to process each day's data, it is fine. But when you have more of a trigger-like thing, where you start a job in the backend and need to trigger it just once, because that one run processes all of the pending data, then this backfilling is a little bit strange and can lead to issues, because you trigger too much. You can work around it by setting the correct start times: you can determine them in code or in variables; there are several options I won't discuss in detail, but you should have this in mind when you wonder why a run happened that you did not expect. It is possible to handle, but you need to know about it; if you have further questions, we can discuss them afterwards. OK, that's it from my presentation; let me give you some links.
There is the Apache Incubator project page of Airflow, which has a nice documentation; also very useful is the "Common Pitfalls" wiki page, where the behaviour of the execution date is explained in more detail. And the plugin I have shown you parts of, you can find on my GitHub account; you can download it, and the README describes the steps to use it. And now let's have a look at your questions.
Q: Thank you for the presentation. You showed us the graph view: is it possible to manage the dependencies in that view, or does it just display what you wrote in code?
A: The workflow definitions themselves you do in code, so you can view the graph in the UI, but to change it you have to change your code.
Q: How scalable is it? Do you use it with large amounts of tasks or data?
A: We don't have that high volumes in our systems; for us it is more that we have nightly runs with several tasks, not thousands or millions. From the documentation, the setups at Airbnb seem to be much bigger, so you would have to ask them what the limit is; we did not reach it so far.
Q: I would like to ask about the execution date and the Monday example. Is that configurable? For example, if a run starts on Monday, can I delay it, say run it fifteen days later, or in the opposite direction run it two days earlier, or postpone a run when I see it should rather run the day after tomorrow?
A: Well, first, that logic is not configurable; it sits in the scheduling code itself. For shifting runs like that, I have no quick answer; maybe we can discuss it afterwards. I think you can do many things with the scheduling, though, because the scheduler just decides when to run: you also have the possibility to schedule a run every day and let the first task of the run decide whether it really should run that day.
That might be a first approach, where you implement the more complex logic in your first task; but maybe that is going too deep, let's take it offline.
Q: Which other tools did you evaluate when you decided, and why did you decide against them?
A: For example, we had a look at a tool that was based on an HDFS stack, which we do not run, so that was too heavy just to get a workflow system set up. We also had a look at several other open-source implementations, but their main focus was on doing heavy lifting in the execution processes and on how these are distributed. Since we have very lightweight processes but needed more UI features and more possibilities for defining operators, that was just not their main focus.
When you see that your use case is outside a tool's main focus, you have to look at other things.
Q: We already have Jenkins and use it for a lot of things, so we would have to convince our team to use something else. What would be the one thing Airflow can do that Jenkins can't?
A: I mean, Jenkins is great; we also use Jenkins, for our integration testing and for scheduling our unit tests, but not for the daily productive runs.



