Building a RESTful real-time analytics system with Pyramid

Speech transcript
Hello and welcome everybody, and thanks for making it through the day to the last round of talks. My name is Andrii, and I work at a company called CeleraOne. My Python experience is around five years so far, but this is the first time I'm giving this talk, so I'm also hoping for some feedback from you at the end. Let's start.
First of all I will say some words about the company, who we are and what we are doing; then I'll introduce the architecture of our platform at a coarse-grained level; I'll give some information on how we use Python, in our environment in general and in our software; I will describe our analytics subsystem in detail; and we'll finish with the general development process in our company.

First, the company: we are called CeleraOne, or C1 for short. The company is relatively young — it was established in 2011 and is based in Berlin — and it is quite small, around 25 people right now, but already quite international, because we come from nine different countries. The main product of the company is a platform for paid content, content recommendation, real-time decisions on content access for users, and of course analytics. We are also developing our own programming language, called COPA, which, as you might have guessed, stands for the CeleraOne programming language; it is a functional, typed language. The main customers of our company are media and publishing companies in Europe.

Trying to represent the infrastructure of our software in layers is somewhat hard, because in reality they are often interconnected, but this is roughly how it looks, going from bottom to top.

First of all — and maybe the heart of our system — is what we call the in-memory engine. The engine is a custom solution implemented in C++; it is a NoSQL in-memory database, but it is a bit special: it is not only a storage, it also provides some business logic. Engines usually come in pairs, where one is the master and the second is the replica, and they are connected to each other. This is the point where all the real-time processing happens, and it stores data in the form of events and streams. A typical use case would be real-time user segmentation: when a request comes in, we can immediately determine which user groups the user belongs to. This is not a trivial task, because each user action can assign the user to different groups, but the engine is quite fast here: it can compute the user's group membership within a couple of milliseconds and provide the result.

The next layer is the analytics system. This is a scheduling application written in Django — chosen of course for its admin panel — plus a set of workers. The workers connect to the engines, collect analytics metrics and statistics, and store them for later usage by the upper levels.

The upper level is where we finally use Pyramid. This is the level of the RESTful API, and it is somewhat of a glue layer, because it is used for integrating all the required customer systems into our platform: it exposes APIs which are then used by the customer systems, like a CMS and so on and so forth, to interact with our system. These are Pyramid applications, and they can be served as one big monolithic application running in several uWSGI processes.

The topmost layer is the communication proxy, implemented with the OpenResty framework — basically a bundle of nginx and Lua (we wrote our own extensions in Lua) — because it is super robust and super fast. Python can sometimes be slow, and OpenResty is super fast, so part of the API is also implemented in this layer, for example the endpoints for event collection. These are the most frequently triggered APIs — we get around 10,000 requests per second on them, for example — so they are implemented in this layer. Together with the engine, this layer is responsible for making the real-time decisions, for example on content access, and it is also responsible for forwarding requests to the different applications running in separate uWSGI processes.
Before installing the software on the customer side we usually do an assessment, and sometimes we face challenges. The biggest challenge, for example with our biggest customer, was the requirement to serve at least 10,000 requests per second, and for this we benchmarked our system. Depending on the customer and the expected, assessed load, the setup can come in different shapes. The most typical one is two front ends and two back ends, meaning two engine pairs. The biggest cluster so far has up to five front ends, which run our Python applications, uWSGI servers and OpenResty applications, and a back end of up to nine engine pairs — 18 machines in total, with 64 gigabytes of RAM each. Part of the data is sharded all over the cluster and part of it is copied for availability reasons, and the engines keep the events in memory, providing super fast access to this data; this is what gives us the possibility to serve 10,000 requests per second. We also use two MongoDB replica sets: the first one is used as a storage for the application data of the Python applications, and the second one is the persistence layer used by the engine internally. The logic is that the engine keeps data for a sliding window of 30 days and then starts to back its data up into the persistence layer, again for availability reasons.

So how does the stack look from a Python perspective? First of all, nginx as the web server, usually, and then uWSGI as the application server. Then we use some packages together with Pyramid, notably Colander and Cornice. Colander is a library for data serialization and deserialization — we are using JSON — and it is also suitable for some basic validation of the incoming data. Cornice is a package from Mozilla which really simplifies a developer's life when implementing RESTful services; it is also quite useful because it integrates with Sphinx to generate API documentation. Then we wrote a couple of wrappers on top of the requests library, because we are interacting with the engine over HTTP: we have classes which wrap the requests to our engine. Pyramid itself is built on top of the Zope Component Architecture, and we reuse these components in our code to implement so-called template points — I will talk about this in a moment. As the build system we use Buildout, which has also been mentioned already today, for our applications as well as for our workers, and the Robot Framework is used for testing.
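Those engine wrappers are internal code, but a minimal sketch of the idea — assuming a hypothetical base URL and query path, since the engine's real HTTP interface isn't shown in the talk — might look like this:

```python
import requests


class EngineClient:
    """Thin wrapper around an engine's HTTP interface.

    Only a sketch: base_url, the path layout and the timeout
    are assumptions for illustration, not CeleraOne's real API.
    """

    def __init__(self, base_url, timeout=2.0):
        self.base_url = base_url.rstrip("/")
        self.timeout = timeout
        self.session = requests.Session()  # reuse TCP connections

    def query(self, path, **params):
        response = self.session.get(
            f"{self.base_url}/{path.lstrip('/')}",
            params=params,
            timeout=self.timeout,
        )
        response.raise_for_status()  # surface engine errors to the caller
        return response.json()
```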
Hopefully this is readable — this is an example of a sample application using Pyramid, Cornice and Colander, and I'm going to explain the parts on this slide. First of all, we define the data schemas. They describe the parameters that our handlers will later expect — they can be query-string parameters of a request or parameters of an incoming data payload — and they are declared with specific types. The first schema is used in the GET handler: it specifies an attribute called username, which should be looked up in the query string and treated as a string parameter, and we say that if it is missing it can be dropped from the payload. Basically that means that when you try to access it in the handler it will simply be missing, and you should keep that in mind.

The second schema, which is used in the POST handler, describes a basic data structure consisting of fields, where each of those fields is also typed, like a string, and we say they should be found in the request body. At this point we can already use some basic validation: for example, we say that the field message should be from 5 to 20 characters long, and that the foo field should be one of the valid values provided.

Cornice interacts quite well with Colander: given this schema information, the data is already checked during deserialization, and a suitable error message is generated and propagated to the client, to the requester. So you don't need to handle these special cases in your handlers yourself — Cornice and Colander do it automatically for you. For more complex validation, for example if you need dependencies between the fields, you can pass custom callables as validators to the Cornice service — more about this in a moment.

Then, finally, we define our REST service: it is called the hello service and is available at its path. We register our GET and POST handlers with the created service, and we pass the schemas, and if we have validators we pass those as well. At this point we have defined handlers for GET and POST; if a request comes in for, say, PUT, Cornice handles this on its own and the proper error is generated, like 405 Method Not Allowed. This quite simplifies your life. Especially keep in mind that in a plain Pyramid application, for each of the endpoints you need to configure a route during application configuration — you need such a line for each of the views you have. Instead of doing this, you only need to include Cornice into your application at configuration time and just define services as shown. I think it's much simpler.
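The slide itself is hard to read in the recording, so the following is only a reconstruction of the pattern being described, in the Cornice/Colander style of that time (newer Cornice versions wire schemas differently); the service name, the path and the allowed foo values are guesses:

```python
import colander
from cornice import Service


class GetSchema(colander.MappingSchema):
    # Query-string parameter; if it is missing, drop it from the
    # deserialized data instead of raising a validation error.
    username = colander.SchemaNode(colander.String(),
                                   location="querystring",
                                   missing=colander.drop)


class PostSchema(colander.MappingSchema):
    # Body fields, validated by colander during deserialization.
    message = colander.SchemaNode(colander.String(),
                                  location="body",
                                  validator=colander.Length(5, 20))
    foo = colander.SchemaNode(colander.String(),
                              location="body",
                              validator=colander.OneOf(["bar", "baz"]))


hello = Service(name="hello", path="/hello", description="Sample service")


@hello.get(schema=GetSchema)
def get_hello(request):
    # request.validated holds the checked data; 'username' is simply
    # absent here when it was not supplied in the query string.
    return {"username": request.validated.get("username")}


@hello.post(schema=PostSchema)
def post_hello(request):
    # Invalid payloads never reach this point: Cornice has already
    # answered with a 400 carrying colander's error messages.
    return {"stored": request.validated}
```

With this in place, the Pyramid setup only needs config.include('cornice') and config.scan() once, rather than one config.add_route call per view — which is exactly the simplification the talk points out.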
Now, this would be an example of a Robot Framework test, where we define tests for the two endpoints, GET and POST. The Robot Framework gives you keyword-based test suites, mostly for integration testing; as I mentioned earlier, our business logic lives somewhere between the Python application and the engine itself, and that's why we mostly write integration tests. You can also combine the keywords into your own keywords to implement more complex ones. Our test starts with a setup which brings up the Pyramid application and the engines in the background, and this particular test then checks whether the response to the GET method is what we expect.

This is how a run of this test looks: at this point the engines are started locally on my machine, then the test is executed, and it has passed. It then generates a nice-looking report where you can see the logs — what happened during your test and whether there were any failures. In our case everything is green, so we're happy. OK,
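The Robot suite on the slide isn't legible, and Robot files are their own keyword DSL, so here is only a rough Python equivalent of the same integration check (port, endpoint and expected payload are assumptions) to show what the suite verifies:

```python
import unittest

import requests

BASE_URL = "http://localhost:6543"  # assumed locally running Pyramid app


class HelloServiceTest(unittest.TestCase):
    """Integration check against a running app plus engine,
    mirroring what the Robot Framework suite asserts."""

    def test_get_returns_username(self):
        resp = requests.get(f"{BASE_URL}/hello",
                            params={"username": "alice"}, timeout=5)
        self.assertEqual(resp.status_code, 200)
        self.assertEqual(resp.json().get("username"), "alice")

    def test_post_rejects_short_message(self):
        # 'message' shorter than 5 characters must fail validation
        resp = requests.post(f"{BASE_URL}/hello",
                             json={"message": "hi", "foo": "bar"}, timeout=5)
        self.assertEqual(resp.status_code, 400)


if __name__ == "__main__":
    unittest.main()
```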
let's continue. We also partition our application: we distribute the logic into different submodules, so that different features can be sold separately to the customer. We have, for example, a CMS integration module, an analytics subsystem module, and so on, and depending on the customer's demands we deploy these modules to the customer. They can be served as one monolithic application, or each of them can run separately in its own uWSGI process.

One of the challenges is how to keep the code base generic, because our customer base is quite broad — we have a number of customers and some upcoming ones — so we need to keep our code similar, but we also need to provide custom solutions, because customers' demands can differ and their systems can differ. Maybe the best example is a CMS that is quite inflexible and sometimes quite slow. For those cases we sometimes need to develop custom code, and this custom code is kept in a separate package. We try to keep our generic code base as generic as possible, and for this we implement so-called template points in our code: custom hooks which implement the customer-specific logic are wired into the generic code at runtime, and thus we are able to deliver custom solutions to customers. An example of such a case, as I mentioned earlier, is the CMS integration.

This slide shows an example of an existing API for importing catalogs; we use a SOAP interface in this case. We define an interface called catalog transformer. The whole idea is that it has a transform method which takes the catalog in whatever format the customer defines, does some transformation, and turns it into the internal, acceptable format. Then we have a generic implementation which actually does nothing — it's called the default transformer and it lives in the generic code base; it just assumes that the incoming payload is already in the internal form. During application configuration it is registered with a register-utility call. Meanwhile, in the customer-specific code in the custom package, we define a sophisticated catalog transformer which does some magical transformation, whatever it is, and brings the catalog into the internal form; in the customer code this registration overrides the default, also at runtime. By including this custom component, the generic code base becomes tailored to the customer. This brings the benefit that the API endpoints stay the same — they don't change, and you don't have to switch your API between different packages; they all still live in the generic code — but it still gives you the possibility to implement custom solutions for your customers' needs.
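A minimal sketch of such a template point, using the Zope Component Architecture utilities that Pyramid builds on — the interface and the two implementations follow the talk's description, while the customer payload handling is invented for illustration:

```python
from zope.component import getGlobalSiteManager, getUtility
from zope.interface import Interface, implementer


class ICatalogTransformer(Interface):
    """Template point: turn a customer catalog into the internal format."""

    def transform(catalog):
        """Return the catalog converted to the internal format."""


@implementer(ICatalogTransformer)
class DefaultTransformer:
    """Generic code base: assumes the payload is already in the
    internal format and passes it through unchanged."""

    def transform(self, catalog):
        return catalog


@implementer(ICatalogTransformer)
class SophisticatedTransformer:
    """Customer package: maps the customer's own catalog layout onto
    the internal one (the field name here is purely illustrative)."""

    def transform(self, catalog):
        return {"products": catalog.get("Artikel", [])}


gsm = getGlobalSiteManager()
gsm.registerUtility(DefaultTransformer(), ICatalogTransformer)        # generic default
gsm.registerUtility(SophisticatedTransformer(), ICatalogTransformer)  # customer override wins

# The generic code resolves the utility at runtime and never needs to
# know whether it got the default or the customer-specific transformer.
transformer = getUtility(ICatalogTransformer)
internal_catalog = transformer.transform({"Artikel": [{"id": 1}]})
```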
Now it's time to speak about our analytics subsystem. Schematically it looks like this: we have the engine pairs, and the data we want to collect for analytics is sharded between the engines, so we need to query each single one, then merge this data and store it for later usage. The workers connect to the engines, periodically query the data, do the aggregation, and cache the results in MongoDB for later usage. The metrics API — we call it the analytics API — is again a Pyramid application, which later reads this data according to the incoming requests from a single-page JavaScript application; that application takes the data and renders the result, so nice graphs and charts are shown. As mentioned already, we use the Django-based scheduling application to manage the workers: there it is possible to see whether there are any pending tasks and whether a given task should be restarted. Let's look at how this collection process works.
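A worker's collection step might look roughly like this — a minimal sketch assuming hypothetical engine hosts, a hypothetical /metrics endpoint and a local MongoDB, since the real engine API is internal:

```python
from collections import Counter

import requests
from pymongo import MongoClient

# Hypothetical engine shards; the real hosts and paths are internal.
ENGINES = ["http://engine1:8080", "http://engine2:8080"]

db = MongoClient("mongodb://localhost:27017")["analytics"]


def collect_page_impressions(resolution="5min"):
    """Query every engine shard, merge the partial counts, and cache
    the merged result in MongoDB for the analytics (metrics) API."""
    merged = Counter()
    for base in ENGINES:
        resp = requests.get(f"{base}/metrics/impressions",
                            params={"resolution": resolution}, timeout=5)
        resp.raise_for_status()
        merged.update(resp.json())  # per-shard {timestamp: count} map
    db.impressions.update_one(
        {"resolution": resolution},
        {"$set": {"buckets": dict(merged)}},
        upsert=True,  # create the cache document on first run
    )
```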
At this point let me show a small showcase of our analytics. Basically this is our demo system, and the graph shows web page impressions. It is possible to view a time span, for example one week, at different time resolutions, where the resolution essentially means how "real-time" the metric is. This view currently shows a time span of one week with a resolution of 5 minutes; then we can switch to a resolution of 1 hour, and to the even more coarse-grained resolution of 1 day. The tendency stays the same, but the totals represent the different time resolutions. And this is
how our Django internals look. Here is an overview of the completed tasks and the failed tasks; you can also disable metrics collection, for example while a deployment is happening. On the right side is the configuration of the meta-job itself, and in the left bar you can see the time resolutions for which we collect the data and whether the real-time metrics collection should be enabled. And this is the
last slide: I want to give an orientation of the typical development process in our company. A developer makes his changes and commits them to the code review tool that we're using; the code gets reviewed, and after some time the changes are merged to Git. Jenkins keeps an eye on the Git repositories, and after code is merged it starts all the different test jobs. We always try to keep our master branch ready for a version release, so that when everything is OK you can bump the version of the package; it gets packaged and put into an internally hosted PyPI server. Then the documentation is built, so that everything is ready for release. When release time comes, we can derive a package: all the versions are pulled in by Buildout, both from the internal egg server and also from PyPI, and combined into an RPM or DEB package, depending on the customer's operating system, and the ops guys do their magic putting it onto the servers. We usually deploy in two halves: first we upgrade one half of the cluster and then the second one, which brings virtually no downtime and is not visible to the end users of these systems. OK, thank you for your attention, thank you for coming to this talk — and now, questions.
Q: As I understand it, you use both Django and Pyramid. Can you clarify what exactly Django does and what exactly Pyramid does? Maybe you can share some experience on which is better for which use case — what are the strong sides of each?

A: I have quite some experience with Django; I think it's a really nice framework, and mostly I think everybody loves it because of its magical built-in admin interface. Django is only internal for us: it's not visible to anyone, it's just for us to control the workers — whether there are any tasks that need to be restarted, whether there are any problems. So it's only an internal tool. Pyramid is more flexible; it's used to implement the RESTful API I described and showed in the examples, and this is what is actually visible to our customers' systems. So if they have some legacy systems and they want to connect to us, they would be using our Pyramid API.

Q: Thank you, that was one of the questions I wanted to ask. And what is your development effort at the moment: is it on scaling the existing system for larger deployments and more customers, or on coming up with new analytics and new algorithms?

A: I would say we have two teams. One is the C++ team, which develops the engine, and I'd say the major computation efforts are there. In the Python team we're mostly working on bringing the different metrics data together: we need to do different aggregations and optimize them. As I showed, this is usually where we consume probably the most memory, so it's quite memory-intensive and we're trying out different techniques; for now MongoDB is fine. The workload is distributed somewhere between new features demanded by the customers and implementing more and different kinds of analytics — the charts which are shown to the customer — because those are the ones used by the business analysts, and based on this data they make decisions which can in fact affect their income. Any more questions? Then thank you again.

Metadata

Formal metadata

Title Building a RESTful real-time analytics system with Pyramid
Series title EuroPython 2015
Part 148
Number of parts 173
Author Chaichenko, Andrii
License CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You may use, modify, and copy the work or its content for any legal, non-commercial purpose, and distribute and make it publicly available in unchanged or modified form, provided that you credit the author/rights holder in the manner they specify and pass on the work or content, including in modified form, only under the terms of this license
DOI 10.5446/20080
Publisher EuroPython
Publication year 2015
Language English
Production location Bilbao, Euskadi, Spain

Technical metadata

Duration 25:39

Content metadata

Subject Computer Science
Abstract Andrii Chaichenko - Building a RESTful real-time analytics system with Pyramid

CeleraOne tries to bring its vision to Big Data by developing a unique platform for real-time Big Data processing. The platform is capable of personalizing multi-channel user flows, right-in-time targeting and analytics while seamlessly scaling to billions of page impressions. It is currently tailored to the needs of content providers, but is of course not limited to them.

The platform's architecture is based on four main layers:

- Proxy/Distribution -- OpenResty/LUA for dynamic request forwarding
- RESTful API -- several Python applications written using the Pyramid web framework, running under the uWSGI server, which serve as an integration point for third-party systems
- Analytics -- a Python API for Big Data querying and distributed workers performing heavy data collection
- In-memory Engine -- CeleraOne's NoSQL database, which provides both data storage and fast business logic

In the talk I would like to give insights on how we use Python in the architecture, which tools and technologies were chosen, and share experiences deploying and running the system in production.
Keywords EuroPython Conference
EP 2015
EuroPython 2015
