Building a RESTful real-time analytics system with Pyramid

Speech transcript
Hello and welcome, everybody. Thanks for making it through the day to the last round of talks. My name is Andrii and I work at a company called CeleraOne. My Python experience is around five years so far, but this is the first time I am giving this talk, so I also hope for some feedback from you at the end. Let's start.
First of all I will say a few words about the company, who we are and what we are doing; then introduce the architecture of our platform at a coarse-grained level; give some information on how we use Python, in general and in our software; describe our analytics subsystem in more detail; and finish with the general development process in our company.

So, first of all: we are a company called CeleraOne, or C1 for short. The company is relatively young, established in 2011 and based in Berlin, and it is quite small, around 25 people right now, but already quite international, because we come from nine different countries. The main product of the company is a platform for paid content: content recommendation, real-time decisions on content access for users, and of course analytics. We are also developing our own programming language called COPL (as you might have guessed, short for CeleraOne Programming Language); it is a functional, typed language. The main customers of our company are media and publishing companies in Europe.

Trying to represent the infrastructure of our software in layers is somewhat hard, because in reality they are quite often interconnected, but this is roughly how it looks, going from bottom to top. The first layer, and maybe the heart of our system, is what we call the engine. The engine is a custom solution implemented in C++; it is a NoSQL in-memory database, but it is a bit special: it is not only storage, it also provides some business logic. Engines usually come in pairs, where one is the master and the second is the replica, and they are connected to each other. This is the point where all the real-time processing happens, and it stores data in the form of events and streams. A typical use case is real-time user segmentation: when a request comes in, we can already determine which user group the user belongs to. This is not a trivial task, because each user action can potentially move the user into different groups, but the engine is quite fast here; it can compute the user's group membership within a couple of milliseconds and provide the result.

The next layer is the analytics system. This is a scheduling application written in Django plus a set of distributed workers; Django is used, of course, for its admin panel. What the workers actually do is connect to the engines, collect metrics and statistics, and store them for later usage by the upper levels. The upper level is where we actually use Pyramid: this is the level of the RESTful API, and it is somewhat of an integration layer, because it is used for integrating all the required customer systems into our platform. Basically it exposes APIs which are then used by the customer systems, a CMS for example, to interact with our system. These are Pyramid applications; they can be served as one big monolithic application or run in several uWSGI processes. On top of it all there is almost a separate layer: the communication proxy. It is implemented on the OpenResty framework, basically a bundle of nginx and Lua code, and we wrote our own extensions in Lua because it is super robust and super fast. The Python back end can sometimes be slow, and we wanted this part to stay fast, so part of the API is also implemented in this layer, for example the endpoints for event collection. These are the most frequently triggered APIs; we get something like 10,000 requests per second there. Together with the engine, this layer is also responsible for making the real-time decisions, for example on content access, and for forwarding requests to the different applications running in the uWSGI processes.
Before installing the software on the customer side we usually do some assessment, and we sometimes face challenges there. The biggest challenge, for example, is that our biggest customer requires us to serve at least around 10,000 requests per second, and we size our system for that; depending on the customer, the expected load is assessed and the setup can come in different shapes. The most typical setup is two front ends and two engine pairs. The biggest cluster so far has up to five front ends, which run our Python applications as well as the nginx/OpenResty applications, and the back end contains up to nine engine pairs, so 18 machines in total with 64 gigabytes of RAM each. Part of the data is sharded over the cluster and part of it is replicated for availability reasons, and the engine keeps the events in memory, providing super fast access to this data; this is what makes it possible to serve that many requests per second. We also use two MongoDB replica sets: the first one is used as storage for the application data of the Python applications, and the second one is the persistence layer used by the engine internally. The logic is that the engine keeps data for a sliding window of 30 days and then starts backing its data up into the persistence layer, again for availability reasons.

So how does the stack look from the Python perspective? First of all there is uWSGI as the application server, usually running in Emperor mode, with Pyramid on top as the web framework. Then there are some plugins we use together with Pyramid, notably Cornice and Colander. Colander is a library for data serialization and deserialization (we use JSON); it is also suitable for parsing, and some basic validation of the incoming data can be embedded in Colander as well. Cornice is a plugin from Mozilla which really simplifies a developer's life when implementing RESTful services; it is also quite useful because it integrates with Sphinx to generate API documentation. Then we wrote a couple of wrappers on top of the requests library, because we interact with the engine over HTTP: we simply have classes which send requests to the engine. Pyramid itself is built on top of the Zope Component Architecture, and we reuse these components in our code to implement so-called template points; I will talk about this in a moment. Finally, the buildout build system is used for building our applications and managing our workers, and the Robot Framework is used for testing.
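The wrapper classes on top of requests mentioned above are internal, but a minimal sketch of what such a thin engine client might look like is shown below; the class name, URL layout and endpoint names are assumptions made for illustration, not CeleraOne's actual API:

```python
# Hypothetical sketch of a thin HTTP client for the in-memory engine.
# Class name, URL layout and endpoints are illustrative assumptions.
import requests


class EngineClient(object):
    def __init__(self, base_url, timeout=2.0):
        self.base_url = base_url.rstrip('/')
        self.session = requests.Session()
        self.timeout = timeout

    def query(self, path, **params):
        """GET an engine resource and return the decoded JSON payload."""
        response = self.session.get('%s/%s' % (self.base_url, path),
                                    params=params, timeout=self.timeout)
        response.raise_for_status()
        return response.json()

    def send_event(self, payload):
        """POST an event to the (assumed) event collection endpoint."""
        response = self.session.post('%s/events' % self.base_url,
                                     json=payload, timeout=self.timeout)
        response.raise_for_status()
        return response.json()
```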
Hopefully this is readable: this is an example of a small application using Pyramid with Cornice and Colander, and I am going to explain the most important parts of this slide. First of all we define the data schemas. They describe the parameters that the handlers will later expect; these can be querystring parameters for GET requests or incoming payload parameters, and they are parsed as the specified types. The first schema is used in the GET handler. It specifies an attribute called username which should be looked up in the querystring and treated as a string parameter, and we say that if it is missing it can be dropped from the payload. Basically that means that if the parameter was not passed, it will simply not be present in the validated data, and you should keep this in mind in the handler. The second schema is used in the POST handler. It describes a basic data structure consisting of fields, each of which is also typed, for example as a string, and which should be found in the request body. At this point we can already apply some basic validation: for example we say that the message field should be from 5 to 20 characters long, and that another field should be one of a set of valid values.

Cornice interacts quite well with Colander. Given this information, the parameters are already checked during deserialization, and a suitable error message is generated and propagated back to the client, to the requester, so you do not need to treat these special cases in your handlers; Cornice and Colander do this automatically for you. For more custom validation, for example dependencies between fields, you can pass a custom callable validator in addition to the Colander schema validation; more about this in a moment. Then, finally, we define our REST service: we give it a name and the path where it is available, we wire our GET and POST handlers to the created service, and we pass the schemas (and the custom validator, if we have one). At this point we have defined handlers for GET and POST; if a request comes in for an unsupported method, for example PUT, Cornice handles it on its own and generates an error such as 405 Method Not Allowed. So this simplifies your life quite a bit. Keep in mind especially that in a plain Pyramid application you need an add_route line in the application configuration for every handler you have; instead of doing this, you only need to include Cornice into your application at configuration time and just define services as shown. I think it is much simpler.
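The slide itself is not reproduced in this transcript; the following is a minimal, self-contained sketch in the spirit of what is described (two Colander schemas, one Cornice service, GET and POST views). The service name, path and field names are assumptions, and the exact schema and validator wiring differs between older and newer Cornice releases:

```python
# Minimal sketch of a Cornice service validated with Colander schemas.
# Service name, path and field names are illustrative assumptions.
import colander
from cornice import Service
from cornice.validators import (colander_body_validator,
                                colander_querystring_validator)


class GreetingQuery(colander.MappingSchema):
    # Querystring parameter; dropped from the validated data if missing.
    username = colander.SchemaNode(colander.String(), missing=colander.drop)


class GreetingBody(colander.MappingSchema):
    # Body fields with basic validation applied during deserialization.
    message = colander.SchemaNode(colander.String(),
                                  validator=colander.Length(5, 20))
    color = colander.SchemaNode(colander.String(),
                                validator=colander.OneOf(['red', 'green']))


greeting = Service(name='greeting', path='/greeting',
                   description='Toy REST service')


@greeting.get(schema=GreetingQuery(),
              validators=(colander_querystring_validator,))
def get_greeting(request):
    # 'username' may be absent because of missing=colander.drop.
    return {'hello': request.validated.get('username', 'anonymous')}


@greeting.post(schema=GreetingBody(),
               validators=(colander_body_validator,))
def post_greeting(request):
    # Invalid payloads never reach this point; Cornice already answered 400.
    return {'stored': request.validated}
```

In the application factory, config.include('cornice') and config.scan() are enough; Cornice registers the routes itself and answers 405 Method Not Allowed for verbs the service does not define.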
This is an example of a Robot Framework test: we define test cases for the two endpoints, GET and POST. Robot Framework is a keyword-based test framework; we use it mostly for integration testing because, as I mentioned earlier, our business logic lives somewhere between the Python application and the engine itself, and that is why we are mostly writing integration tests. You can combine the built-in keywords into your own keywords to implement more complex checks. Our tests run against the deployed application, with the engines running in the background, and they check whether the response to a given input is what we expect.

Here is a recording of running this test suite. At this point the engines are started locally on my machine, then the test gets executed, and it passes. It then generates a log and a nicely looking report where you can see exactly what happened during the test and whether there were any failures; in our case everything is green, so we are happy.
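The Robot Framework suite itself is keyword based and not shown in the transcript; as a rough illustration only, the kind of check such a keyword performs could be written in plain Python with requests as follows (the URL and the expected payloads are assumptions tied to the hypothetical service sketched above, not the real test data):

```python
# Rough Python equivalent of what one keyword-driven integration test checks.
# URL and payloads are assumptions, not the real Robot Framework test data.
import requests

BASE_URL = 'http://localhost:6543/greeting'  # assumed local dev address


def test_get_returns_default_greeting():
    response = requests.get(BASE_URL)
    assert response.status_code == 200
    assert response.json() == {'hello': 'anonymous'}


def test_post_rejects_too_short_message():
    # Violates the 5..20 character rule, so Cornice/Colander answer 400.
    response = requests.post(BASE_URL,
                             json={'message': 'hi', 'color': 'red'})
    assert response.status_code == 400
```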
Let's continue. We also organize our application in a way that distributes the logic into different submodules, so that different features can be sold and shipped to a customer separately. We have, for example, an integration module and the analytics subsystem, and depending on the customer's demands we develop and ship these modules to the customer. They can be served as one monolithic application, or each of them can run separately as its own uWSGI application.

One of the challenges is how to keep the code base manageable. Our customer base is growing (we have a number of customers and some upcoming ones), so we need to keep our code similar across customers, but we also need to provide custom solutions, because their demands can differ and their systems can differ. The best example is maybe a CMS which is quite inflexible, or quite slow sometimes; for those cases we sometimes need to develop some custom code. That custom code is placed in a separate package, and we try to keep our generic code base as generic as possible. For this we implement so-called template points in our code, and the custom hooks which implement the customer-specific logic override them in the generic code at configuration time; thus we are able to deliver custom solutions to customers.

The example of such a case, as I mentioned earlier, is the CMS integration, and this shows an existing API for importing catalogs. We use a Zope interface here: we define an interface called catalog transformer, and the whole idea is that it has a transform method which takes the catalog in whatever format the customer defines, applies some transformation, and converts it into the internal accepted format. Then we have a generic implementation which actually does nothing; it is called the default transformer, it lives in the generic code base, and it just assumes that the incoming payload is already in the internal canonical form. During application configuration time it is registered by calling registerUtility. Meanwhile, in the customer-specific code, in the custom package, we define a catalog transformer with a more sophisticated transform method which does whatever magical transformation is needed and brings the catalog into the internal canonical form. In the customer code this overrides the default by registering the utility, also at configuration time, and by including this custom component into the generic code base the behaviour becomes tailored to the customer. This brings the benefit that the API endpoints and handlers stay the same: they do not change, you do not have to switch your API between different packages, and they all still live in the generic code, but you still have the possibility to implement custom solutions for your customers' needs.
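A minimal sketch of such a template point, using the Zope Component Architecture that Pyramid is built on, might look as follows; the interface and class names are illustrative assumptions, and in reality each half would live in its own package with its own includeme:

```python
# Sketch of a "template point": a default utility in the generic package
# that a customer package overrides at configuration time.  All names here
# are illustrative assumptions, not CeleraOne's actual code.
from zope.interface import Interface, implementer


class ICatalogTransformer(Interface):
    def transform(payload):
        """Return the catalog payload in the internal canonical format."""


# Generic package: a no-op default that assumes the payload is canonical.
@implementer(ICatalogTransformer)
class DefaultTransformer(object):
    def transform(self, payload):
        return payload


def includeme_generic(config):
    config.registry.registerUtility(DefaultTransformer(),
                                    ICatalogTransformer)


# Customer package, included after the generic one, so its registration wins.
@implementer(ICatalogTransformer)
class LegacyCmsTransformer(object):
    def transform(self, payload):
        # Customer-specific mapping into the canonical format.
        return {'items': payload.get('articles', [])}


def includeme_customer(config):
    config.registry.registerUtility(LegacyCmsTransformer(),
                                    ICatalogTransformer)
```

A view can then stay generic and simply ask the registry for whichever implementation is currently registered, for example transformer = request.registry.getUtility(ICatalogTransformer).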
Now it is time to speak about our analytics subsystem. Schematically it looks like this: we have the engine pairs, and the data we want to collect for analytics is sharded between the engines, so we need to query each single one of them, then merge this data and store it for later usage. The workers connect to the engines, periodically query the data, do the aggregation, and cache the result for later usage in MongoDB. Then the metrics API (we call it analytics), again a Pyramid application, reads this data according to the incoming requests from a single-page JavaScript application, which processes the data further on the client side; based on this data, nice graphs and charts are rendered. As I said already, we use a Django-based scheduling application to manage the workers: it is possible to see whether there are any pending tasks and whether a given task should be restarted. Let's look at how this metrics collection works in practice.
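A hedged sketch of what one such collection task might look like, assuming a Celery-based worker setup (the talk describes distributed workers managed from Django but does not spell out the task queue), the hypothetical EngineClient wrapper sketched earlier, and an invented MongoDB document layout:

```python
# Sketch of a periodic collection task: query every engine's shard, merge
# the results and cache them in MongoDB for the metrics API to serve later.
# Broker URL, engine hosts and the document layout are assumptions.
# EngineClient is the hypothetical wrapper sketched earlier in this page.
from celery import Celery
from pymongo import MongoClient

app = Celery('analytics', broker='redis://localhost:6379/0')
metrics = MongoClient()['analytics']['metrics']

ENGINE_HOSTS = ['http://engine-1:8080', 'http://engine-2:8080']  # assumed


@app.task
def collect_metric(name, resolution):
    # Each engine only holds its shard of the data, so ask all of them.
    shards = [EngineClient(host).query('metrics/%s' % name,
                                       resolution=resolution)
              for host in ENGINE_HOSTS]

    # Merge the shards; here a naive sum of per-bucket counters.
    merged = {}
    for shard in shards:
        for bucket, count in shard.items():
            merged[bucket] = merged.get(bucket, 0) + count

    # Cache the aggregate so the metrics API never has to hit the engines.
    metrics.replace_one({'name': name, 'resolution': resolution},
                        {'name': name, 'resolution': resolution,
                         'values': merged},
                        upsert=True)
```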
At this point I will try to showcase our analytics. This is our demo system, and the graph shows web page impressions. It is possible to view a given time span, for example one week, at different time resolutions, where the resolution basically means how often the metric is collected. The view currently shows a time span of one week with a resolution of 5 minutes; then we can switch to a resolution of 1 hour, and to an even more coarse-grained resolution of 1 day. The data stays the same, but the totals represent aggregates over the different time resolutions.
And this is how our internal Django admin looks. Here is an overview of completed tasks and failed tasks; you can also disable metrics collection, for example for the time a deployment is happening. On the right side is the configuration of the metric job itself, and in the left bar you can see the time resolutions for which we collect the data and how often the real-time metrics collection should run.
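The metric-job configuration shown in the admin could be backed by an ordinary Django model; here is a hedged sketch with assumed field names and resolution choices, not the real schema:

```python
# Sketch of a metric-job configuration model for the internal Django admin.
# Field names and resolution choices are assumptions, not the real schema.
from django.db import models


class MetricJob(models.Model):
    RESOLUTIONS = (
        ('5min', '5 minutes'),
        ('1h', '1 hour'),
        ('1d', '1 day'),
    )

    name = models.CharField(max_length=100)
    resolution = models.CharField(max_length=8, choices=RESOLUTIONS)
    enabled = models.BooleanField(default=True)  # e.g. off during deployments
    last_run = models.DateTimeField(null=True, blank=True)

    def __str__(self):
        return '%s @ %s' % (self.name, self.resolution)
```

Registering such a model with admin.site.register(MetricJob) would give the configuration screen; the completed and failed task overview in the screenshots more likely comes from the worker framework's own admin integration.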
The last slide gives an overview of the typical development process in our company. A developer makes his changes and commits them to the code review tool we are using; the code gets reviewed, and after some time the changes are merged to the main branch. Jenkins keeps an eye on the Git repositories, and after the code is merged it starts all the different jobs and the tests are run. We always try to keep our master branch ready for a version release, so if everything is OK you can bump the version of the package; it gets packaged and put onto an internally hosted PyPI server, and the documentation gets built so that it is ready for release. When the release time comes, we derive a release: all the pinned versions are pulled in by buildout, both from the internal PyPI server and from the public PyPI, and combined into an RPM package, depending on the customer's operating system, and the operations people do their magic and put it onto the production servers. We usually do the rollout in halves: first we upgrade one half of the cluster and then the second one, so there is virtually no downtime and it is not visible to the end users of the system. OK, thank you for your attention and thank you for coming to this talk. Are there any questions?
Question: As I understand it, you use both Django and Pyramid. Can you clarify what exactly Django does and what exactly Pyramid does, and maybe share some experience on which is better for which use case and what the strong sides of each are?

Answer: I have quite some experience with Django; I think it is a really nice framework, and mostly I think everybody loves it because of its magical built-in admin. For us Django is only used internally: it is not visible to anyone, it is just for us to control the workers, to see whether there are any failures and whether anything needs to be restarted if there are problems. Pyramid is more flexible; it is used to implement the RESTful API I described and showed in the examples, and that is what is actually visible to our customers' systems. So if they have some legacy systems and want to connect to us, they would be using our Pyramid APIs.

Question: Thank you, that was one of the questions I wanted to ask. And what is your development effort at the moment: is it on the analytics, on scaling the data and the deployment to a larger scale so that more customers are easy to add, or on coming up with new analytics and new algorithms?

Answer: I would say we have two teams. One is the C++ team, which develops the engine, and I would say the main computational effort is there. In the Python team we mostly work on bringing the different metadata together, so we need to do different aggregations and optimize them; this is usually where we consume the most memory, so it is quite memory intensive and we keep trying different techniques, though for now MongoDB works fine for us. The workload is distributed somewhere between new features, which are driven by the customers, and implementing more and different kinds of analytics, the charts that are shown to the customer, because those are the ones used by the business analysts, and based on this data they make decisions which can affect their income.

Are there any more questions? No? Then thank you again.

Metadata

Formal Metadata

Title Building a RESTful real-time analytics system with Pyramid
Title of Series EuroPython 2015
Part Number 148
Number of Parts 173
Author Chaichenko, Andrii
License CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
DOI 10.5446/20080
Publisher EuroPython
Release Date 2015
Language English
Production Place Bilbao, Euskadi, Spain

Content Metadata

Subject Area Information technology
Abstract Andrii Chaichenko - Building a RESTful real-time analytics system with Pyramid

CeleraOne tries to bring its vision to Big Data by developing a unique platform for real-time Big Data processing. The platform is capable of personalizing multi-channel user flows, right-in-time targeting and analytics while seamlessly scaling to billions of page impressions. It is currently tailored to the needs of content providers, but of course not limited to them.

The platform's architecture is based on four main layers:
- Proxy/Distribution -- OpenResty/Lua for dynamic request forwarding
- RESTful API -- several Python applications written using the Pyramid web framework running under the uWSGI server, which serve as an integration point for third-party systems
- Analytics -- Python API for Big Data querying and distributed workers performing heavy data collection
- In-memory Engine -- CeleraOne's NoSQL database which provides both data storage and fast business logic

In the talk I would like to give insights on how we use Python in the architecture, which tools and technologies were chosen, and share experiences deploying and running the system in production.
Keywords EuroPython Conference
EP 2015
EuroPython 2015
