
Feeding data to AWS Redshift with Airflow

Transcript
Hi everyone, my name is Federico Marani, and I come from a town not far from here, so it's very nice to give a talk so close to my own town. I work at a startup in London, where I was one of the first people hired, and it's a very data-focused company. Essentially I'm here to talk about the experience I had with data over the last seven months: the tools we used and some of the choices we made, both in terms of architecting the system and in terms of using the tools. The talk is about Redshift, which is a database, and Airflow, which is a workflow framework that I'm going to explain later. So, my background:
it is actually in web development; I've done a lot of that in the past, mostly with Django. I gave another presentation at EuroPython three years ago, and it was all about PostgreSQL, which is a database I really enjoy and encourage you to look at. But then the data requirements arrived, and we had to choose a different stack, because a lot of data processing was happening and the usual web components were not enough. We ended up using Airflow, which is a project that was born within Airbnb and was later accepted as an Apache project, plus some proprietary tools: AWS Redshift is one of them, and also S3, which essentially gives you an infinite amount of storage.
As always, we started from a product problem. The problem was about tracking the attendance of conferences worldwide, focusing on technology: we wanted to understand who was going to conferences, in order to present this data and track sponsors for conferences. It was a very interesting project, but the real technical question we wanted to answer was: can we query and cross-reference multiple external data sources? Most of the data we were working with was not produced by us; it was external. So we needed to bring the data under our control and build an integration solution that would link the data points together and give us the freedom to query and cross-reference these data sets. And really, the answer to this question
is, in principle, quite simple. You have to download the data feeds, so that you have a copy of the data under your control and can do any sort of operation on it; and then you need to load it into a system that allows you to query the data. This system can be anything: it can be Excel, it can be PostgreSQL. We decided to use Redshift, because our use case and the type of queries we wanted to run were a good fit for Redshift.
Those are essentially the two things you have to do, but depending on the complexity of the project there are sometimes multiple downloads and multiple databases, so you need to make sure you build a good system to manage the data. Here we talk about data pipelines, which is a generic term indicating the system that supervises the downloading, the loading into the database, and all the transformations of the data. We built our pipeline following three principles. First of all,
we needed a system that was able to scale to any number of inputs: a system generic enough to be easily adaptable to any sort of feed, any sort of data coming in through any sort of mechanism. You can think of it as a mixer: you have a lot of inputs and you just mix and match them the way you want. This is the opposite of, for example, a web app, where the server exposes a plurality of endpoints to many different users; here it is kind of reversed: you have many inputs and one output, which is the database.
It is also important for data pipelines to do the processing in stages, because you might have quite complex and subtle transformations. If you think in terms of stages, it becomes very easy to state what input each stage consumes and what output you expect it to produce, and you can split your pipeline so that each stage does one particular thing. If you do it that way, you get reusability of stages, and you get separate development of stages, for example between different teams. The other
principle we followed is that we should archive everything, especially the inputs to the system. Storage is cheap: you can store a lot without paying too much, and it was very helpful for us to have a historical record of everything that went into the system. The reason is that sometimes you have bugs in your code, and for data pipelines you want the ability to rerun your code on old data, say two months old, because you discover that the bug has been in production for two months and you want the ability to go back and fix the outputs. This is something you can only do with data pipelines if you archive everything; a web application cannot easily replay what its users did, so make sure you keep that possibility.
OK, so the tool we use is Apache Airflow, which recently moved into the Apache Software Foundation as an incubating project. It is a batch processing framework, as opposed to a streaming framework like Storm (I believe there was a talk about Storm); here we talk about batches. The big difference is that with a streaming architecture every transformation starts and the data goes through the stages in real time, while in a batch processing framework each step is scheduled and run exactly when needed. It is a different architecture which, in my eyes, is a bit simpler, and Airflow is probably the most mature tool of its kind at the moment: it has a big community, a lot of GitHub stars, and a lot of companies using it. Airflow allows you to build a sort of network of interconnected tasks. The networks can be very complex, and you can have things running in parallel or sequentially. In this example, a screenshot I took from the internet, you get a pool of files, then you start three scrapers that scrape different things, and at the end of every scraper there is some update step which is specific to that scraper. You can build networks of any complexity: some quite complex, some simpler. Airflow is written in Python, and you use it from Python.

Then you might ask: why not use cron? The advantage of cron is that it is very simple, but for our use case it was too simple: there is no way to define dependencies between jobs, as in "only start job B if job A was successful"; there is no retry mechanism; and the reporting is very, very basic. So it wasn't useful for us, and that's why we decided to use Airflow. Getting started is very simple: you pip install airflow and, since Airflow is a very stateful application, you initialize a database. Airflow essentially tracks the execution of everything you put into the system, and all this information is written to the database, so Airflow depends on the existence of a database. And then you have something like
this. This is the web UI of our system; we are in production, and it essentially lists all the workflows that you have in the system. Every workflow has a name and a schedule: it can be daily, hourly, whatever; you can be very flexible in defining when something is run. It looks quite simple, but it is also quite powerful: there are a lot of things you can do from the UI. Now I wanted to show you
how to use Airflow. As I mentioned before, it's in Python. You create these files called DAGs; DAG is an actual term, a directed acyclic graph, which in Airflow essentially means workflow. You create a file and instantiate the DAG class, which declares the workflow, and then you pass a few parameters; one of the most important ones is the schedule: whether you want it daily, hourly, and so on. Then you start to compose your workflow by creating tasks, which are instances of operators; the operator is the thing that tells Airflow how to run the step (more on that in the next slide). Here there are essentially two bash commands: the first generates a report, and the second emails the report. The second task always happens after the first task, and only if the first task succeeded. So in this way you declare the two steps, and at the end you say t2.set_upstream(t1), which means that t2 is always executed after t1. You put the file in the DAGs folder of Airflow, it gets picked up automatically, and every run of it will have an execution date.
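To make that concrete, here is a minimal sketch of such a DAG file; the DAG name, commands, and start date are placeholders, but the structure (a DAG object, two BashOperators, set_upstream) follows what the slide describes.

```python
# Minimal sketch of the DAG described above: two bash tasks where the
# second (emailing the report) runs only if the first (generating it)
# succeeded. Names and commands are placeholder assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id='daily_report',
    start_date=datetime(2017, 1, 1),
    schedule_interval='@daily',  # one of the most important parameters
)

t1 = BashOperator(
    task_id='generate_report',
    bash_command='python /opt/reports/generate.py',  # hypothetical command
    dag=dag,
)

t2 = BashOperator(
    task_id='email_report',
    bash_command='python /opt/reports/email.py',  # hypothetical command
    dag=dag,
)

# t2 always runs after t1, and only if t1 succeeded
t2.set_upstream(t1)
```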
There are different types of operators: you can run bash commands, and you can call Python functions (same example there); these are the two most common ones, the kind we used in the project. There are also Docker operators that you can use to launch containers, and there is something called sensors, which is essentially a special kind of operator that waits for an event to happen; that might be a file being uploaded to S3, or any other sort of thing you need to wait for before the processing can continue. Everything is templated with Jinja, so you can have, for example, templated files with variables in them that are substituted on the fly by Airflow; this is quite useful if you want to pass variables from Airflow itself to your scripts. Alternatively, you can pass arguments to the Python functions, as in the example here.
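A sketch of those other operator types, continuing the dag object from the previous example; the callable, bucket, key, and script name are made up, and the import paths are those of the Airflow 1.x line current at the time of the talk.

```python
# PythonOperator, an S3 sensor, and a Jinja-templated bash command.
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from airflow.operators.sensors import S3KeySensor


def process(feed_name):
    # hypothetical Python step
    print('processing %s' % feed_name)


py_task = PythonOperator(
    task_id='process_feed',
    python_callable=process,
    op_kwargs={'feed_name': 'conferences'},  # arguments passed directly
    dag=dag,
)

wait_for_file = S3KeySensor(
    task_id='wait_for_upload',
    bucket_key='s3://my-bucket/incoming/feed.jsonl.gz',  # hypothetical key
    dag=dag,
)

# Jinja templating: {{ ds }} is replaced by Airflow with the run's date
templated = BashOperator(
    task_id='templated_step',
    bash_command='python process_day.py {{ ds }}',
    dag=dag,
)
```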
Now I want to lift the curtain a bit on how Airflow actually works internally. Everything Airflow does is visible in the UI: you can see every execution of everything in the past, and its result. For example, if you have a workflow that runs daily, in the UI you will be able to see all the stages and all the results of every execution of the script, going back from today to the date the script started running. You can zoom in, and you can see that for every task there is a task instance, which is a database construct holding the task's start time, end time, duration, and state, i.e. whether it was successful or not. This is very informative, especially if one of the main things your company does is data processing.
Airflow runs all your tasks, and you can choose between two types of scaling. One is to scale up using a prefork model, like Apache httpd and typical Unix daemons do: you have a parent process which forks child processes, and each child process polls for new jobs; when there is a new task, a child process executes it and records the result, with all the coordination happening through the database. This works very well for a single server. If you ever need to scale to multiple servers, you can use the Celery executor; Celery is a framework for distributed task queues and needs a message broker, so it is a bit more involved to set up.
Regarding all the information the system tracks: there are a couple of things you want to know, and one of them is the state. You want to know when things are happening, when things have finished running, and whether they ran successfully or failed. If they failed, Airflow has a mechanism for retrying: you can specify how many times you want tasks to be retried, and you can also be more granular than that and specify per task how many times that specific task should be retried. Because there is a retry mechanism, there are a couple of extra states: one is "up for retry" and one is "failed". "Up for retry" essentially means it failed once and is going to be retried in, say, five minutes. Those are the common states.
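In code, the retry settings look roughly like this; the numbers are made-up defaults, with a per-task override showing the more granular option.

```python
# DAG-level retry defaults, plus a per-task override. Values are
# illustrative assumptions, not recommendations.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'retries': 3,                         # retry each task up to 3 times
    'retry_delay': timedelta(minutes=5),  # "up for retry" -> runs again in 5 minutes
}

dag = DAG(
    dag_id='retrying_workflow',
    start_date=datetime(2017, 1, 1),
    schedule_interval='@daily',
    default_args=default_args,
)

fragile = BashOperator(
    task_id='fragile_step',
    bash_command='python scrape.py',
    retries=10,  # per-task override, more granular than the DAG default
    dag=dag,
)
```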
Another concept Airflow introduces, and something that differentiates Airflow from cron, for example, is how it deals with time, and specifically with downtime. The idea is to backfill runs that were supposed to happen but did not. This essentially means that if you have a cron-like workflow that is meant to run daily, and your server stays offline for a week, when the server comes back online Airflow realizes that for the last seven days the job wasn't running, and it relaunches seven copies of your script. The difference between all these executions is a variable passed to the workflow, the execution date: it will contain the date of seven days ago, six, five, four, three, two, one days ago, and it is meant to be used from your scripts to limit the amount of data you process. This is very good behavior for, say, report generation for the business team, because the business team doesn't care whether the server was up; they just want to receive a report every day. But it is quite bad behavior for scrapers.
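A sketch of how the execution date reaches your code, so that each of the seven backfilled runs only processes its own day's slice of data; the callable and its logic are hypothetical, and it continues the dag object from the earlier sketch.

```python
# With provide_context=True (Airflow 1.x), the callable receives the
# run's context, including execution_date (a datetime).
from airflow.operators.python_operator import PythonOperator


def generate_report(execution_date, **context):
    day = execution_date.strftime('%Y-%m-%d')
    # limit the work to this run's own day
    print('generating report for %s only' % day)


report = PythonOperator(
    task_id='daily_report',
    python_callable=generate_report,
    provide_context=True,
    dag=dag,
)
```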
We run a lot of scrapers, and we don't want to run several copies of the same scraper at the same time: first of all because it can overload the target server, maybe even kill it, and in any case we would get exactly the same data. So we started using an operator that was recently released, in the latest version of Airflow I think, called "latest only": it skips all the past runs and only runs the latest one.
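A minimal sketch of that operator in use; with this wiring, a backfill marks every scrape except the most recent one as skipped. Task names are placeholders.

```python
# LatestOnlyOperator (added in the Airflow 1.8 line): downstream tasks
# are skipped for all runs except the most recent one.
from airflow.operators.bash_operator import BashOperator
from airflow.operators.latest_only_operator import LatestOnlyOperator

latest_only = LatestOnlyOperator(task_id='latest_only', dag=dag)

scrape = BashOperator(
    task_id='scrape_site',
    bash_command='python scrape.py',
    dag=dag,
)

# during a backfill, scrape_site is skipped for all past dates
scrape.set_upstream(latest_only)
```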
OK, so that was Airflow. Essentially we use Airflow to do the extract and transform steps, and, because of the tools we picked, our workflows follow a generic structure. We start by generating a batch ID, because we want to track every run of our code: we want to see what data a specific execution generated and put into Redshift. With the batch ID we create a folder structure, then we run whatever we want to run (this is normally the part that takes the longest), and whatever comes out of it we compress, tag with the batch ID, and upload to S3. That is essentially the generic structure we found useful.
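A sketch of that generic run structure stripped of Airflow terms: generate a batch ID, run the expensive step, compress, upload. The bucket, paths, and run_scraper function are hypothetical; the upload uses boto3.

```python
# One pipeline run: batch id -> long-running step -> compress -> S3.
import gzip
import shutil
import uuid

import boto3


def run_scraper(path):
    # hypothetical stand-in for the long-running step
    with open(path, 'w') as fh:
        fh.write('{"example": true}\n')


batch_id = uuid.uuid4().hex       # tracks this particular run

run_scraper('output.jsonl')       # normally the slowest part

with open('output.jsonl', 'rb') as src, \
        gzip.open('output.jsonl.gz', 'wb') as dst:
    shutil.copyfileobj(src, dst)  # compress before upload

s3 = boto3.client('s3')
s3.upload_file('output.jsonl.gz', 'my-data-bucket',
               'scraper/%s/output.jsonl.gz' % batch_id)  # batch-id-keyed path
```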
I mentioned batch IDs; for us, and I think in general, they are quite an important tool when you generate data, because you want to be able to map the generated data back to the specific run, for debugging purposes: when you fix a bug and want to go back and fix the data that the bug broke, you need to know which batch IDs were involved, so you can remove the bad data with queries. Another reason to have batch IDs is that we actually rerun some of the steps, including the loading into Redshift, so it is very important to make this operation idempotent: no matter how many times you run it, the result will be one and only one copy of the data in the table. We also timestamp the data; again, that is useful for debugging, because you want to know at what time something was downloaded or written, and so on.
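One way to get the idempotency just described (an assumption on my part, not necessarily the exact mechanism used in the talk) is delete-then-append keyed on the batch ID, so N runs leave exactly one copy. Table, column, and connection handling are invented.

```python
# Idempotent load: remove anything a previous attempt of this batch
# left behind, then append the batch again. `conn` is a DB-API
# connection to Redshift (e.g. psycopg2).
def load_batch(conn, batch_id, s3_path):
    with conn.cursor() as cur:
        # rows from any earlier attempt of this batch
        cur.execute("DELETE FROM events WHERE batch_id = %s", (batch_id,))
        # COPY requires a literal path; credentials clause omitted here,
        # full COPY options are shown later in the talk
        cur.execute("COPY events FROM '%s' GZIP JSON 'auto'" % s3_path)
    conn.commit()
```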
With all these operations we download the data from whatever feed we are tracking, we compress it, and we send it to S3. But then, because we want to query it through Redshift's SQL interface, we had to pick formats that Redshift supports. There are two families of formats that Redshift supports: one is column-based and the other is row-based. I'm not going to talk much about the column-based ones: first of all they are less common in the Python world, and the Python support is not the greatest; the row-based formats are simply better supported and generally more common. The row formats are essentially three. One is CSV. One is JSON Lines, which is a bunch of JSON objects separated by newlines. And the last one, also the most recent one, is a format called Avro, which is like a standardized, schema-carrying version of the same idea. CSV is very simple, but it is separator-dependent, it is flat, and it is hard to extend (unless you append data to the end of each row, which is still quite limiting), and it is not typed; so it wasn't ideal for us. JSON Lines, again, is very simple, and everyone knows JSON. It is easy to extend: you just add keys to the JSON objects. It tends to be verbose, unless you use compression, which removes that concern. It is schema-less: it doesn't force a key to be present, or to be of a specific type, which may or may not be a problem for you. Avro is typed, but it is not as common; it has schemas, so it can enforce keys being there and having the right types, and it comes from the Hadoop ecosystem. In the end we decided to use JSON Lines: it is simple, and it was good enough for us. So we got to the point where the data is ready; now, how do we get it into Redshift?
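JSON Lines in practice is exactly what it sounds like: one JSON object per line, gzip-compressed to offset the verbosity of the repeated keys. A tiny sketch with made-up records:

```python
# Write gzip-compressed JSON Lines: one JSON object per line.
import gzip
import json

records = [
    {'conference': 'EuroPython', 'year': 2017, 'city': 'Rimini'},
    {'conference': 'PyCon US', 'year': 2017, 'city': 'Portland'},
]

with gzip.open('feed.jsonl.gz', 'wt') as fh:
    for record in records:
        fh.write(json.dumps(record) + '\n')
```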
A few bits about Redshift. Redshift was created by Amazon on top of PostgreSQL, which is open source; essentially what they did is take that codebase and make it work well for analytical queries. They changed quite a lot: they changed the query planner and they changed the storage engine. It's like buying a car and changing the engine: quite a big change. They made it work well for columnar data, and it has very good on-the-fly compression support. But since it is a columnar database, the kinds of things you would expect a database to be fast at are actually questionable on Redshift, and the opposite is also true: things that are slow on PostgreSQL are quite fast on Redshift. For example, aggregates over a single column holding many values are quite fast on Redshift; if you do lots of aggregations and GROUP BYs, the data is stored and processed well. But if you select a single row and want all its columns, that is quite slow.
That is because the biggest difference is really the storage model: Redshift uses a so-called column-oriented layout, and PostgreSQL uses a row-oriented layout. What this really means is that when you have a table, the way PostgreSQL writes it to disk is to traverse it row by row, storing one row after another; with Redshift the traversal is exactly reversed: it stores all the values of the first column, one by one, then the second column, and so on. This is pretty fundamental, and that is why the performance characteristics are reversed. Another big difference is that the block size in Redshift is quite big, so you can load a lot more data at once, while with PostgreSQL you go through many more cycles for the loading steps. Considering that the most critical part of this kind of database is getting data from the loading tool into it, this is quite a big deal.
OK, so that was Redshift; now, how do we actually use it from our system? Redshift has very good support for S3: if you have the files on S3, loading them is just a matter of executing a SQL query. The command is the COPY command: you specify the destination table, you specify where the data should be read from, and you say whether the data is a CSV file or not; in our case it is a JSON Lines file, and it is compressed. What this query does is essentially four things: it reads the data from S3; it decompresses the data on the fly; then it does something called JSON path flattening, which I will explain in a second; and whatever comes out of that third operation is appended to the existing table, i.e. it loads the data into the table. JSON path flattening is essentially an operation that takes a nested structure like JSON: in JSON you have key-values, and a value can be another object, so another set of key-values, or it can be an array, so you can have a lot of nesting. Redshift, though, is still a relational database: you have tables, and tables are a flat structure. The way we decided to do it is using a jsonpaths file: when you apply this file to a JSON Lines file, it generates a sort of CSV where, for instance, the first field is the value of the key "id", the second one is the first element of an array, the next one a subfield of another object, and so on. At the end you have a flat structure that Redshift can just append to the table. That is just the way we did it.
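Here is roughly what that COPY looks like when issued from Python. The cluster, bucket, table, IAM role, and jsonpaths content are all placeholders; the connection is a plain PostgreSQL one, since Redshift speaks the Postgres wire protocol.

```python
# COPY from S3 into Redshift: destination table, S3 source, gzip,
# and a jsonpaths file describing the flattening.
import psycopg2

conn = psycopg2.connect(host='my-cluster.example.redshift.amazonaws.com',
                        port=5439, dbname='analytics',
                        user='loader', password='secret')

batch_id = 'example-batch-id'  # produced earlier in the pipeline

# The jsonpaths file stored on S3 drives the flattening, e.g.:
# {"jsonpaths": ["$.id", "$.attendee.name", "$.tags[0]"]}
copy_sql = """
    COPY attendances
    FROM 's3://my-data-bucket/scraper/{batch}/output.jsonl.gz'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-loader'
    GZIP
    JSON 's3://my-data-bucket/schemas/attendances.jsonpaths'
""".format(batch=batch_id)

with conn.cursor() as cur:
    cur.execute(copy_sql)  # read, decompress, flatten, append
conn.commit()
```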
One thing worth mentioning is that for all of this to work, the schema of the database is quite important, and we wanted to keep the schema under version control, storing it like you probably would with any other piece of infrastructure code; the schema is another one of those assets that you really should put in version control. The way we did it is with SQLAlchemy. We don't rely on the SQLAlchemy ORM for most operations, because that is not the use case Redshift is built for, but we still create queries using SQLAlchemy; we just don't really use models that way.
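A sketch of a schema declared with SQLAlchemy Core rather than ORM models; the table and its columns are invented for illustration.

```python
# Schema as version-controlled code, using SQLAlchemy Core.
from sqlalchemy import Column, DateTime, MetaData, String, Table

metadata = MetaData()

attendances = Table(
    'attendances', metadata,
    Column('id', String(64)),          # lengths are in bytes on Redshift!
    Column('conference', String(256)),
    Column('attendee_name', String(256)),
    Column('batch_id', String(32)),    # maps rows back to a pipeline run
    Column('downloaded_at', DateTime),
)

# Queries are still composed with Core expressions, e.g.:
#   select([attendances.c.conference]).where(attendances.c.batch_id == some_id)
```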
The benefit of using SQLAlchemy is that it integrates nicely with a schema migration framework called Alembic, which is quite complete and quite common. Essentially, with a migration framework, every time you update your schema you generate a migration file; these migration files can be run against the current schema to upgrade it or downgrade it, depending on which direction you want to go, and applying them is just a matter of running an upgrade command. It is quite a complete framework, and it supports multiple environments. Auto-generation of migrations only works sometimes, though, because Redshift supports a lot fewer ALTER cases than PostgreSQL and can be a bit weird; for example, you cannot really alter a column in Redshift: you have to drop the column and add it back, which has a lot of consequences.
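For reference, a generated Alembic migration file has roughly this shape (revision ids and the column are invented); applying it is a matter of running `alembic upgrade head`, and because Redshift cannot alter a column in place, a type change has to be written as drop-and-add instead.

```python
# Sketch of an Alembic migration file: upgrade and downgrade paths.
from alembic import op
import sqlalchemy as sa

revision = '1a2b3c4d5e6f'   # hypothetical revision ids
down_revision = None


def upgrade():
    op.add_column('attendances', sa.Column('country', sa.String(128)))


def downgrade():
    op.drop_column('attendances', 'country')
```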
That brings me to the annoyances. Redshift has a few things which are quite annoying. For example, the length of a VARCHAR is expressed in bytes, not in characters, while in PostgreSQL it is expressed in characters. So sometimes you have cases where you scrape a site with Japanese characters and suddenly the loading breaks, because Japanese characters mostly live outside the single-byte range, so one character is not one byte; that was really annoying for us. There are fewer column types, as I said before. Referential integrity is not really a thing in Redshift: primary keys and foreign keys are not enforced, but that is OK because for our use case we don't need it. And columns are nullable by default, which I think is different from PostgreSQL.
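The VARCHAR annoyance in two lines: Python counts characters, Redshift counts bytes, so the same value can overflow a column that looks big enough.

```python
text = u'こんにちは'               # five Japanese characters
print(len(text))                   # 5 characters...
print(len(text.encode('utf-8')))   # ...but 15 bytes: needs VARCHAR(15), not VARCHAR(5)
```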
Looking at the future, we are evaluating a pattern which seems very interesting, because it would mean that we could potentially skip the loading into Redshift and query the data directly from S3. There are a couple of open source projects we might look at. One is called Presto, which is a query engine able to run queries directly against files on S3; you essentially just point it at the files instead of loading them into a database, which is quite nice and might be a good thing to look at. The other project I want to mention is an extension to PostgreSQL for columnar storage, maintained by Citus Data, a company in the US: if you are fine with PostgreSQL but you want the columnar storage model, you can look at that extension.
Thanks a lot. We are also looking for people who work on exactly this sort of stuff, because I really enjoyed it; so if you are interested, come and talk to me, I'm happy to chat about it. Any questions? ... OK, thank you.

Metadata

Formal Metadata

Title: Feeding data to AWS Redshift with Airflow
Series title: EuroPython 2017
Author: Marani, Federico
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You may use, adapt, and reproduce, distribute, and make publicly available the work or content, in unaltered or altered form, for any legal, non-commercial purpose, provided that you credit the author/rights holder in the manner they specify and pass on the work or content, including in altered form, only under the terms of this license.
DOI: 10.5446/33708
Publisher: EuroPython
Publication year: 2017
Language: English

Content Metadata

Subject area: Computer Science
Abstract: Feeding data to AWS Redshift with Airflow [EuroPython 2017 - Talk - 2017-07-13 - Anfiteatro 1] [Rimini, Italy] Airflow is a powerful system to schedule workflows and define them as a collection of interdependent scripts. It is the perfect companion for extract/transform/load pipelines into data warehouses such as Redshift. This talk introduces some of the basics of Airflow and some of the concepts that are specific to data pipelines, like backfills, retries, etc., followed by examples of how to integrate it all, along with some lessons learned. The final part is dedicated to Redshift: how to structure data there, how to do some basic transformation before loading, and how to manage the schema using SQLAlchemy and Alembic.
