Taking the Hipster out of Streaming

Speech Transcript
Yeah, thanks a lot for coming, everyone, and a special greeting to everyone watching the stream at the moment. Today we want to talk a little about our experiences with applied stream processing over the years, what we've learned, and then we have an exciting announcement about how we think we can make stream processing in Python incredibly easy.

In short: we're two developers from Winton, which is a global investment management and data science company — you might have already talked to us at our stand downstairs. We've been in the game for quite a while, over 20 years, and our core belief is that the scientific method can be applied profitably to financial markets. That means we run a lot of data processing on financial and non-financial data and try to make a profit out of it, and as part of that we apply a lot of technology and data science. This talk is really about one part of that: how we do real-time stream processing.
First, I'm interested — a quick show of hands, please — who has heard of Kafka or event stream processing before? OK, quite a few of you already have an idea. Basically, stream processing means that instead of doing classic batch processing — where you go at the end of the day, collect all the data that has arrived, put it into a big file and run, for example, a huge Spark job over it — you work on your data as a stream. Data comes in throughout the entire day and you process it as it arrives. This is quite nice because you get real-time, interactive data you can put onto live dashboards, and it often makes thinking about your processing quite a lot simpler, because you don't have those awkward interactions where one batch job stops and the next one starts.
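To make the contrast concrete, here's a minimal sketch — the event shape is made up for illustration:

```python
# Batch: wait for the whole day's data, then compute once.
def batch_total(trades):
    return sum(t["size"] for t in trades)

# Streaming: keep a little state and update it as each event arrives,
# so an up-to-date total is available after every single message.
def streaming_total(events):
    total = 0
    for event in events:
        total += event["size"]
        yield total

for running in streaming_total([{"size": 100}, {"size": 50}, {"size": 25}]):
    print(running)  # 100, 150, 175
```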
One example of stream processing: in finance we've effectively been doing stream processing for a very long time — it's just a natural fit. For example, if you buy and sell stocks at an exchange, you can subscribe to its market data feed, and every time someone buys or sells something you get a little message that tells you, say, that at 10:15 someone bought some stock in Apple at a price of 144. You get this flowing in continuously — a massive number of messages every second — and then you do various things with it.
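A single tick might look something like this — a hypothetical shape for illustration, not any exchange's actual wire format:

```python
tick = {
    "time": "10:15:00.123",  # when the trade happened
    "symbol": "AAPL",        # what was traded
    "price": 144.02,         # at what price
    "size": 100,             # how many shares
}
```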
One of the things we do, for example, is feed the raw trades into various bins, which then feed into various ranking algorithms. We don't necessarily care about every single trade that happens — like in the questions for the previous talk, we often just want a bit of an aggregate. For example, we want to know, per minute, how much Apple stock has been traded in total and at what price. That's something a stream processor does well: we take the raw exchange feed of high-frequency events, run a binning process, and group it down into slower streaming events that are much closer to what we really care about. Our downstream consumers then don't need to do that much work if they're not interested in all the details.
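A minimal sketch of such a binning step, assuming ticks shaped like the hypothetical one above:

```python
from collections import defaultdict

def bin_trades(ticks):
    """Group raw ticks into per-symbol, per-minute aggregates."""
    bins = defaultdict(lambda: {"volume": 0, "last_price": None})
    for tick in ticks:
        minute = tick["time"][:5]              # "10:15:00.123" -> "10:15"
        slot = bins[(tick["symbol"], minute)]
        slot["volume"] += tick["size"]         # total size traded in the bin
        slot["last_price"] = tick["price"]     # last price seen in the bin
    return dict(bins)
```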
Over time we've found that the streaming model really works well for us, so it has expanded from just market data — the core financial data about which stock has been traded — to all kinds of data. We do web scraping, we look at news, we look at various other non-financial data; nowadays we've got an entire office in San Francisco focusing on this. The other thing we've found really useful is handling business events: something happens, and once it happens you want some other process to run, or you want to update a monitoring page. There we've found the combination of stream processing with a more traditional database really useful: you write the canonical data into the database and then write a little notification into the event stream, and that tells the various applications to do the right thing. For example, we run monitoring on those events, we do risk management based on them, and we make trading decisions. What I really want to focus on today is the part in the middle, where we get various raw events in and then process them, through various transformations, into the format we really care about.
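A minimal sketch of that write-then-notify pattern; the table, topic and field names here are all hypothetical:

```python
import json
import sqlite3

from confluent_kafka import Producer

db = sqlite3.connect("orders.db")
with db:
    db.execute("CREATE TABLE IF NOT EXISTS orders (id TEXT, symbol TEXT, size INTEGER)")
producer = Producer({"bootstrap.servers": "localhost:9092"})

def record_order(order_id, symbol, size):
    # 1. The canonical record goes into the traditional database.
    with db:
        db.execute("INSERT INTO orders VALUES (?, ?, ?)", (order_id, symbol, size))
    # 2. A small notification event goes onto the stream, so monitoring,
    #    risk and trading applications can react without polling the DB.
    producer.produce("order-events", json.dumps({"order_id": order_id}))
    producer.flush()
```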
If you talk about stream processing nowadays, what you usually mean is Apache Kafka, the project that has transformed the streaming landscape. Apache Kafka is a message broker, which means it's a little bit like a database: you install it on a server, and it provides you with topics. Data that belongs together goes onto a single topic, written by producers, and various consumers can pick it up from there. One important thing is that topics are subdivided into different partitions, and that gives you parallelism: for example, some trades go onto one partition and others onto another, and then you can run multiple consumers, each consuming one partition, so they share the work. The other nice thing about Kafka is its really nice underlying model of how it deals with data: within a partition, every message is ordered, and every consumer sees the messages in exactly the same order. It's really quite beautiful and elegant how it's implemented. And because it's so popular, there's huge support for it across the big data ecosystem — the systems we talked about before can all interact with it.
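A minimal produce/consume round trip with the confluent-kafka client, assuming a broker on localhost. Keyed messages always land on the same partition, which is what preserves per-key ordering:

```python
from confluent_kafka import Consumer, Producer

# Producer: messages with the same key go to the same partition,
# so all trades for one symbol are seen in order.
producer = Producer({"bootstrap.servers": "localhost:9092"})
producer.produce("trades", key="AAPL", value=b'{"price": 144.02, "size": 100}')
producer.flush()

# Consumer: consumers sharing a group.id split the topic's partitions
# between them, which is where the parallelism comes from.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "trade-readers",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["trades"])
msg = consumer.poll(timeout=5.0)
if msg is not None and msg.error() is None:
    print(msg.key(), msg.value())
consumer.close()
```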
There are a lot of projects that have sprung up in the last couple of years that try to make it easier for you to do stream processing of those events. We've looked at pretty much all of them, and we weren't too happy with them, basically for two main reasons — if you're interested in the details, come talk to me later. The two core reasons we had issues with those projects: first, they're usually pretty heavyweight. If you want to use, say, Spark Streaming, you need to run and maintain an entire Spark cluster, and your code runs inside that cluster — that's quite a bit of overhead just in installing it, configuring it correctly, and then doing things like scaling it up. Second, pretty much all of those projects are written for the JVM, which means Python is a bit of a second-class citizen: often all you get is a sidecar process where you talk to the JVM system via text/JSON input and output, which is really not great for performance.
One nice outlier appeared pretty much one year ago: Kafka Streams, which the creators of Kafka introduced as a new API that does very similar things to the other stream processing frameworks — but does it as just a simple library. That turns around the relationship between your code and the framework: with Spark you submit your code, the cluster picks it up, and it runs somewhere inside the cluster; with Kafka Streams there's no cluster — it's just a library, and the master of everything is still just your normal process. You can run it anywhere you want; you can run it on AWS; if you want to scale, you just start more processes. It also gives you proper real event-by-event processing — unlike Spark Streaming you don't need to micro-batch, so you can get really good latency — and it helps with a lot of the genuinely tricky things in stream processing: stateful processing (for example for the binning), joining different streams together, aggregations, fault tolerance, and distributed processing. It's nowadays part of the main Apache Kafka project. The big drawback is that it's Java-only, and it's not the kind of library where it's easy to write bindings to our interfaces. That's a bit of a
problem for us, because we're a big Python user: we're around 450 people in total, and at least half of them know some Python — not all developers; we have lots of data scientists, researchers, and operations people. If you want to know a little more about Python usage at Winton, some of our colleagues gave talks last year showing what we do. What we really wanted was to give all those people who know Python the ability to do stream processing nicely and easily. What we've done so far is essentially naive programming: you take a Python Kafka client library, connect to the cluster, and in a few lines you can subscribe and get a callback for every single message. To get started this is actually a really good approach, and we've been running quite a lot of production code just like this — various processes that update monitoring and so on.
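In code, that naive style is only a few lines — a sketch with kafka-python, assuming a local broker and the demo's prices topic:

```python
from kafka import KafkaConsumer  # the kafka-python client

def on_price(message):
    print(message.value)  # e.g. update a dashboard or monitoring page

consumer = KafkaConsumer("prices", bootstrap_servers="localhost:9092")
for message in consumer:  # one callback per message, forever
    on_price(message)
```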
If you want to go down this route, I have two recommendations. The first is kafka-python, written by Dana Powers — thanks a lot for writing that. It's a native Python client that implements all of the protocol interactions in Python, and it has a really nice, Pythonic interface that's very natural to use; it's really robust and handles things like cluster rebalancing correctly. The other recommendation is confluent-kafka-python, which is a wrapper around the C library, which means it's really high performance — you can pump something like a million messages per second through it pretty easily. They're both good; Dana was talking earlier, and there's also pykafka, but I haven't really used it that much, so I'll just mention it.

So what we've done is roll our own streaming framework — we tried a couple of times. Each time it starts as maybe ten lines of Python, then you start building on it, you run it in production, the demo setup all looks fine — and then you notice that things slightly break or go slightly wrong, because streaming is actually really, really hard. For example, if you want to do the binning from earlier, where you group some trades together, run an aggregation over them, and calculate some average or total, you really need to be careful that you handle every message exactly once, because otherwise your counts and aggregations are just wrong — and those are numbers people actually act on. Stateful processing is difficult: if you want to do joins, or if you want to cache data locally for lookups, you run into the same problems; if you keep a local database you need to be really, really careful that you don't apply things twice, and so on. Distributing load correctly can also be quite difficult: while the client supports it, you usually have to handle certain callbacks that get triggered when rebalancing happens, and that can be tricky. It's fine if you've been doing it for a while, but if you're a data scientist who just wants to write a small job and run it, it can be quite difficult. And if you want performance, with Python streaming you usually want to do micro-batching: instead of doing, say, a database write on every single message, you buffer messages up and, for example, every second you process all the messages you've got in one go — and that gives you quite a lot of speed-up.
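A sketch of that micro-batching loop with confluent-kafka; process_batch here is a hypothetical bulk operation such as a single database write:

```python
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "batched-processor",
})
consumer.subscribe(["prices"])

while True:
    # Buffer for up to a second (or 10k messages) instead of handling
    # each message individually, then process the whole batch in one go.
    batch = consumer.consume(num_messages=10000, timeout=1.0)
    records = [m.value() for m in batch if m.error() is None]
    if records:
        process_batch(records)  # hypothetical: one bulk write / aggregation
```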
Because we want all of that, we've started an open-source project, in a new open-source organization, that basically brings Kafka Streams — which we think solves these problems really, really nicely — to Python. We're doing an implementation of it in pure Python, which means you don't have any dependencies beyond a simple Python library. My colleague Rob will now show you a live demo of how it really looks and what it can do. [Hands over to Rob for the demo.] Right —
so now I'm going to start up all the components required for the application,
starting with ZooKeeper. Kafka uses ZooKeeper in the background for cluster management — storing the topics, leadership elections, and so on. So let's start our ZooKeeper. It's started.
Now Kafka. We're running just one instance here; in production you'd really want, say, three instances, with replicas for extra resilience and so on. So let's go ahead and start Kafka — that's now running. Next, let's start up a consumer. This is just the default console consumer that comes with Kafka; it will take whatever is in the topic bin-prices — sorry, can you all see this, is it big enough? — and write whatever it finds in bin-prices out to the console. There's nothing in there yet, because we're not running our application just yet.
So next, let's start the application. Once it's started, we can see that it's listening to the prices topic — if you look at the fourth or fifth line down, it says that it listens on the prices topic, will write to the bin-prices topic, and currently finds no data on the prices topic. Let's add one more consumer first, and that's here, in a notebook.
So let's do the imports, set up a consumer — this is very similar to the ones you just saw, and I'll show you the code — and get a plot ready. There we go. One more thing to do is just to start up this loop down here, which will loop on the consumer to read in the prices from the input topic. Let's start that now. Again, like the previous consumers, nothing shows up yet, because there are no inputs.
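The notebook loop is roughly this shape — a sketch, with parse_price standing in for whatever decodes the demo's message format:

```python
import matplotlib.pyplot as plt
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "notebook-plotter",
})
consumer.subscribe(["prices"])

times, prices = [], []
plt.ion()
fig, ax = plt.subplots()

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error() is not None:
        continue
    t, p = parse_price(msg.value())  # hypothetical decoder for the demo format
    times.append(t)
    prices.append(p)
    ax.clear()
    ax.plot(times, prices)
    fig.canvas.draw()
```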
So, one last thing to do: let's get some inputs. We've got this generator script that generates prices — it just produces a random walk with normally distributed returns. The prices are generated and pushed to the prices topic, and you can see the application has already begun to process them.
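The generator is essentially a random walk pushed onto the input topic — a sketch with made-up parameters:

```python
import json
import time

import numpy as np
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

price = 100.0
while True:
    price *= 1.0 + np.random.normal(0.0, 0.001)  # normally distributed return
    producer.produce("prices", json.dumps({"t": time.time(), "price": price}))
    producer.poll(0)   # serve delivery callbacks
    time.sleep(0.1)    # roughly real-time
```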
If we go to this consumer here, we see there's nothing there yet. That's because we're still collecting prices: our definition of a bin is that when we reach the end of the bin interval — for the purposes of this example, two minutes — we push the last value in the bin out to the output topic. So let's have a look inside the notebook: we can see that our input prices are being parsed and plotted, and our binned prices will appear as these red dots here once we reach two minutes. The generator runs in approximately real time, so one moment... and there we go — there are the dots. In production, if we were running this for real, we'd obviously be interested in pushing out the bin price every minute; for the purposes of the example here we just want a binned price to be available quickly. The bins keep closing, and we just leave it running. If we move back here, you can see the console consumer has also received the binned prices — two perfectly independent applications, neither knowing about the other, both consuming the same data. So, what does the code behind this look like?
The first thing you'd probably do is define a topology. A topology comprises sources and sinks: our source is the input topic, prices; our sink is the output topic, bin-prices; and you want at least one processor in between. A processor implements the processor interface.
Here, the binning class is the processor. It has an initialization first, and it has this process function. The process method is called for every value the application reads from the input topic; we convert the price and keep track of the bin we're working in. When we cross the boundary into the next bin — here, the next two-minute interval — we call punctuate. Punctuate simply forwards the binned values on to the bin-prices topic. So that's the flow from input to output, and that's what we saw in the notebook plot.
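Stripped of the framework wiring, the binning logic is small. Here is a self-contained sketch of the process()/punctuate() pattern just described — illustrative of the semantics, not the library's exact class names or signatures; the two-minute bin width is from the demo:

```python
class BinningProcessor:
    BIN_SECONDS = 120  # the demo's two-minute bins

    def __init__(self, forward):
        self.forward = forward   # callable that writes a value to the sink topic
        self.bin_end = None      # end of the current bin
        self.last_value = None   # last price seen in the current bin

    def process(self, timestamp, value):
        # Called once for every value read from the input topic.
        if self.bin_end is None:
            self.bin_end = timestamp + self.BIN_SECONDS
        while timestamp >= self.bin_end:  # crossed into the next bin
            self.punctuate()
            self.bin_end += self.BIN_SECONDS
        self.last_value = value

    def punctuate(self):
        # On each bin boundary, forward the bin's last value downstream.
        if self.last_value is not None:
            self.forward(self.last_value)

# Driving it by hand: prints 100.5 when the 125s tick closes the first bin.
binner = BinningProcessor(forward=print)
for ts, price in [(0, 100.0), (60, 100.5), (125, 101.2)]:
    binner.process(ts, price)
```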
So that was the application and its code — what comes next? The first thing, most importantly, is to finish the documentation: we're at a very early stage. We want to encourage contributions — we'd love you to check out our code, star us on GitHub, file issues, and get involved, whether it's tests, documentation, whatever you like. We'd really like to experiment more with the API, in particular with the DSL. The DSL is the domain-specific language that sits on top of the Streams API in Kafka, and we feel there's a very good opportunity to leverage Python's strengths to make it even easier to set up a topology using a domain-specific language. Python is a great language — we wouldn't all be here if it weren't — but we're under no illusions: it can hit limits, and when we hit those limits we'll work on optimizing performance further. That could include batching, including using some of the great Python libraries like numpy, or even leveraging exciting new projects. The roadmap is longer than this slide suggests: there are many more advanced features of Kafka to pick up — the 0.11 release recently added exactly-once semantics, for example.
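To give a flavour of what such a DSL could feel like, here is a toy, runnable illustration of chainable stream operations — emphatically not the project's API, just the kind of Pythonic surface the roadmap points at:

```python
class Stream:
    def __init__(self, events):
        self._events = iter(events)

    def map(self, fn):
        self._events = (fn(e) for e in self._events)
        return self

    def filter(self, pred):
        self._events = (e for e in self._events if pred(e))
        return self

    def to(self, sink):
        for e in self._events:
            sink(e)

# e.g. scale prices, drop small ones, and 'publish' to a sink:
Stream([1.0, 2.0, 3.0]).map(lambda x: x * 10).filter(lambda x: x > 15).to(print)
```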
So, as I said before — and I'll repeat it again — get in touch: there's the GitHub organization page, go there to get involved and check it out. Now, are there any questions?

Question: In the last presentation we saw tools for managing these kinds of workflows — deployment, scaling, and so on. Do you get that out of the box here, or do you have to build it yourself?

Answer: Partly you can get it — Confluent in particular will be happy to sell you an entire suite of tools. On the other hand, no: because this is just a library, and a pretty small library, it doesn't do things like spinning up processes. That means that if you already have existing infrastructure to do all of that — for example, we run a lot of code on Kubernetes nowadays — you can just run it on there, and it integrates with what you already have.

Question [partly inaudible]: Have you tried other tools for the stream processing part?

Answer: Not yet — let's talk afterwards, because this is exactly the kind of input we'd really like. I'd say that the current codebase follows the Java codebase reasonably closely, which we felt was a reasonable way to get started, staying close to the Java implementation. But yes, the DSL would introduce new code where we can leverage, as I said before, Python's strengths.

Question: One comment first: it's really great to see this project, because this has been missing in Python. My question is about the scaling approach and the state management approach — is that already in, or on the roadmap?

Answer: Scaling already works, in a simple fashion: you can run two different Python processes and they will share the load between them. Stateful processing is on the roadmap — that's basically what we're working on next. It's really tricky to implement, which is why we want to contribute it to this library rather than keep rolling our own. But it's coming.

Question: Is the programming model similar to things like Java 8 streams or reactive extensions — you get a stream input object and then you can map a function over it, group by, aggregate, and so on?

Answer: Yes, exactly — it's basically exactly that, in real time.

Question: Maybe two more questions. What binary protocol do you use to get events from Kafka — how do you talk to Kafka?

Answer: Basically, what this library
does is sit on top of an existing client, and it takes away all the boilerplate you'd otherwise have to write. At the moment we're using the confluent-kafka client — you can plug in either of the two clients — and that talks the normal binary Kafka protocol through its highly optimized C implementation, so it's really fast. I think the current bottleneck is the crossing from the C implementation into Python, one message at a time. What we've seen is that we really need to include some batching, so that you get, for example, a numpy array of events back — because if you do processing in Python over millions of records per second, every single line shows up in the profile and it gets hard to optimize further. In short: we reuse a really well-optimized client underneath, and that talks the binary protocol.

OK, one last question.

Question: Thank you. We recently implemented a very similar solution ourselves, and I was wondering if you've found any way to test applications using this library without having to stand up a full installation?

Answer: I think this library makes it a bit easier, because at the low level, if you write a processor, you're implementing a nicely defined Python interface, and you can basically test against that. But once you string small processors together, at some point — well, we've mostly gone for integration tests, because there are so many small things that are really hard to unit test correctly. And to add to that: the code I showed for the binning was the full application, so it should be — not necessarily easy, but not very difficult — to abstract it in a way that makes it more amenable to testing.

OK, that's everything — so thank you very much.
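To make the batching idea from that answer concrete: a sketch of turning a batch of decoded messages into numpy arrays, so the per-message Python overhead is paid once per batch rather than once per record (field names assumed):

```python
import numpy as np

def batch_to_arrays(messages):
    prices = np.fromiter((m["price"] for m in messages), dtype=np.float64)
    sizes = np.fromiter((m["size"] for m in messages), dtype=np.int64)
    return prices, sizes

prices, sizes = batch_to_arrays([
    {"price": 144.02, "size": 100},
    {"price": 144.05, "size": 50},
])
print(prices.mean(), sizes.sum())  # vectorised work instead of per-message loops
```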

Metadata

Formal Metadata

Title: Taking the Hipster out of Streaming
Series Title: EuroPython 2017
Authors: Heider, Andreas; Wall, Robert
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You may use, adapt, copy, distribute and make the work or its contents publicly accessible in unchanged or adapted form for any legal and non-commercial purpose, provided that you credit the author/rights holder in the manner they specify, and that you pass on the work or content, including in adapted form, only under the terms of this license.
DOI: 10.5446/33676
Publisher: EuroPython
Publication Year: 2017
Language: English

Content Metadata

Subject Area: Computer Science
Abstract: Taking the Hipster out of Streaming [EuroPython 2017 - Talk - 2017-07-12 - Arengo] [Rimini, Italy] Winton ingests data continually from the world's financial markets. We track millions of individual timeseries, with divergent formats, from disparate time zones, and whose frequencies vary from months to milliseconds. We go beyond simply reading and storing it — we stitch distinct and vast data sets together and subject them to intricate calculations in real time. This talk will focus on the way we use Python to achieve these ends, and how we are creating tools to further commoditise streaming as a service.
