Use Python to process 12mil events per minute and still keep it simple (Talk)


Formal Metadata

Use Python to process 12mil events per minute and still keep it simple (Talk)
Title of Series
Part Number
Number of Parts
Dima, Teodor
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Release Date
Production Place
Bilbao, Euskadi, Spain

Content Metadata

Subject Area
Teodor Dima - Use Python to process 12mil events per minute and still keep it simple (Talk)

Creating a large-scale event processing system can be a daunting task, especially if you want it "stupid simple" and wrapped around each client's needs. We built a straightforward solution for this using Python 3 and other open-source tools. The main issues to solve for a system that needs to be both performant and scalable:

- handling a throughput of 1 million events per minute on a 4-core AWS instance;
- following the principle of least astonishment;
- data aggregation, and how Python's standard libraries and data structures can help;
- failsafe and profiling mechanisms that can be applied to any Linux service in production;
- addressing unexpected behaviors of Python's standard library, like reading from a file while it is being written;
- tackling a sudden, spectacular cloud instance failure.

The alternative to this system would be to adopt existing technology stacks that might be too general, add more complexity, bloat and costs, and need extensive work to solve your specific problem. Moreover, our approach resulted in an over 85% drop in hardware utilisation.
EuroPython Conference
EP 2015
EuroPython 2015
Hi, my name is Teodor Dima. I am a developer at a systems company from Romania. We do a lot of things, but what we mainly concern ourselves with is big data technologies and real-time processing. For those that do not know what that really means: it means we must handle a lot of user events every second, which must be continuously persisted into a database. So what is an event? An event is represented by a simple GET request, sent by a user from a browser which is viewing advertising; it is sent to the event processing system, which analyzes the data and then pushes it into the database. The amount of data that can be generated by a single user viewing a single ad is considerable: a single view produces a whole batch of events. These events are not all sent at the same time, which means the user must keep a continuous connection with the server. And with a lot of users each sending a batch of events, you end up with a lot of concurrent connections.
This problem has been known long before this talk; it is called C10K, and it basically asks: how can you handle ten thousand concurrent connections on the same machine? In my time at the company, we averaged about 12 million requests per minute. That means about 200 thousand requests per second that must be handled across the server cluster. So the question is: how can you handle such traffic, and moreover, how can you handle it using as few resources as possible?
The naive solution would be to have the system respond to each request and send the data directly to the database. This works for a small amount of data and is quite easy and fast to implement, but it is ultimately unmaintainable and consumes a lot of resources. An alternative solution would be to use one of the existing stacks, such as Apache Storm and its relatives. However, these systems are only as good as your coherent understanding of them, and gaining that takes time; they also often do not run on Python, although some work has been put into making them more Pythonic, for example by the people at Parse.ly.
Initially, we had to ship fast, and that meant we implemented a simple solution in Python: one service which handled streams of events and sent them to the database. In order to check for data consistency, that is, in order to be sure that we had not dropped any event that reached the server but was not inserted into the database, we checked the access logs of the web server and verified that they correspond with the data from the database, because all the data you need, all the events, are there in the access log anyway. This leads to a simple idea: why not use the access log itself as a simple queue for an event processing system?
The idea was that when a request reaches the machine, it is already persisted to disk by nginx, the web server, and that would solve a lot of problems. Between the access log and the database there has to be another service which reads the log, analyzes and transforms the data, and then pushes it into the database. So we began to think about the implementation of such a project, and whether it would be resilient and feasible enough for us to build. After some prototypes and some new ideas, we came up with the final structure, and we built it as a dedicated log-processing service of our own.
This slide shows a simplified scheme of the data flow through a single virtual machine in the cloud. As I said, a single HTTP request is sent by the user, passes through a load balancer, and then reaches the nginx web server. In order to ensure that the disks were easily swappable between virtual machines, we used Amazon Web Services EBS (Elastic Block Store) volumes to store the access logs. The access log is then read by our service, which parses it and inserts the data into the database. Inside the log-processing service there are actually three processes that work at the same time.
The first of these processes is the parser. The parser reads the access log continuously, just like a `tail -f` on Unix, analyzes the events, and caches them into an internal cache structure. This cache structure uses plain Python dictionaries, which are actually quite beautiful and very simple to use. After it has cached data for a fixed period of seconds (configurable, of course), it aggregates this data and pushes it into a multiprocessing queue that makes the connection with the second process, the inserter. The inserter takes the aggregated data from the queue and inserts it into the database. Besides pushing data into the database, it also writes into a special file which we call the binlog: for every event that it has actually pushed into the database, it records the corresponding offset in the access log. So in the event of a crash, or when you actually want to reboot the system or restart the service, it knows which point was the last one inserted into the database and can restart from that point.

The third and final process is the admin process. This process checks periodically whether the other two processes are active; if they are not, it shuts down the whole service. Now why did we do that? Why didn't we just reboot the crashed process? Well, there is the possibility that if a data-processing process crashes, you may have corrupted data which has not yet reached the database but is sitting somewhere in the system. That is bad, but inserting corrupted data into the database is much worse, so we try to avoid that at all costs. Another of the functions of the admin process is fsyncing the access log files and the binlog. fsync is a function which synchronizes the in-memory contents of a file with the actual disk contents of that file, which means that the file has been persisted once the fsync call returns; that is extremely important if you want the data to survive. The most important function is serving service status data on a configurable port. That was done with a single, simple protocol: just accept every request that comes in and serve a JSON with status data. That data is also collected from the other two processes through some shared memory that is controlled by the multiprocessing module.
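The parser's cache-and-flush behavior described above can be sketched roughly as follows. This is a minimal illustration, not the actual service code: the names `aggregate_lines` and `FLUSH_INTERVAL` are invented, and the real parser works on access-log lines rather than pre-parsed pairs.

```python
import time
from collections import defaultdict
from multiprocessing import Queue

FLUSH_INTERVAL = 5.0  # seconds; in the real service this is configurable

def aggregate_lines(events, out_queue):
    """Aggregate events in a plain dict and flush periodically.

    `events` is any iterable of (event_type, count) pairs; in the real
    service these would come from parsing access-log lines.
    """
    cache = defaultdict(int)
    last_flush = time.monotonic()
    for event_type, count in events:
        cache[event_type] += count
        if time.monotonic() - last_flush >= FLUSH_INTERVAL:
            out_queue.put(dict(cache))  # hand the batch to the inserter
            cache.clear()
            last_flush = time.monotonic()
    if cache:  # flush whatever is left at the end
        out_queue.put(dict(cache))

q = Queue()
aggregate_lines([("view", 1), ("view", 1), ("click", 1)], q)
print(q.get())  # {'view': 2, 'click': 1}
```

In the real system the queue connects two long-lived processes; here both ends run in one process only to keep the sketch self-contained.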
One of the first issues that we thought about was how stable the service would be when tailing the access log. It turns out that it is completely stable and very easy to implement, but there is one fine problem: the offsets are hard to calculate efficiently. Like I said, we need those offsets to map a database insert back to an individual line of the log. Using buffered text files for this is almost useless, because with a buffered text file you cannot know the exact offset of a single line; and unbuffered text files are really slow and would not help us. So it is actually easier to open the file in binary mode, just read a number of bytes, count the number of bytes contained in each line, and compute the offset yourself.
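A minimal sketch of that binary-mode approach, with an invented helper name, might look like this: iterate over the file opened with `"rb"` and advance the offset by the exact byte length of each line.

```python
def read_lines_with_offsets(path):
    """Yield (byte_offset, line) pairs from a log file.

    The file is opened in binary mode so that the offset of each line
    can be computed exactly by counting bytes; with a buffered text
    file the mapping from lines to byte offsets is not available.
    """
    offset = 0
    with open(path, "rb") as f:
        for raw in f:
            yield offset, raw.decode("utf-8", errors="replace")
            offset += len(raw)  # advance by the line's exact byte length

# usage sketch (path and callbacks are hypothetical):
# for off, line in read_lines_with_offsets("/var/log/nginx/access.log"):
#     handle_event(line); remember_offset(off)
```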
Between the parser and the inserter process we have a multiprocessing queue. The queue uses a pipe in the background, a Unix file, and it is quite robust: we never had problems with corrupted data. But there is a problem with the data transfer speed. When you actually insert some data into the queue, it spawns a feeder thread, which then transfers that data gradually from the queue's internal buffer into the pipe. If that thread does not get enough CPU time, you are going to have data sitting in the buffer which has not actually reached the pipe, so the connection between the processes lags behind. There are ways to minimize the damage, and I will talk about them in a later slide. So how can a catastrophic crash be handled securely and efficiently? If you have a catastrophic crash, there are two essential files which can be corrupted or incomplete: the access log and the binlog. You can manage that somewhat by just fsyncing them as often as you can, so that the data actually reaches the machine's disk. And if you have a crash and cannot recover the data, you just carry on without the data you lost.
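The "fsync as often as you can" advice can be shown in a few lines. Note the distinction the talk relies on: `flush()` only moves data into the OS page cache, while `os.fsync()` asks the kernel to push it to the physical device. The helper name `durable_append` is invented for this sketch.

```python
import os

def durable_append(path, data):
    """Append bytes to a file and force them to disk.

    flush() alone leaves the data in the OS page cache; os.fsync()
    makes the kernel write it to the device, which is what matters
    for surviving a machine crash.
    """
    with open(path, "ab") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
```

The trade-off, as mentioned later in the talk, is that fsync on network-backed storage such as EBS can occasionally be very slow.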
So, is Python fast enough for doing all this? Well, yes, actually. With some performance optimizations, CPython would get about 20 thousand requests per second on a single machine, one of the 4-core Amazon Web Services instances; that means all this processing power from one machine with 4 cores and a modest amount of RAM. However, after we went into production, we re-implemented the data processing part in Cython. That was quite easy to do, just a week of work excluding testing, without any previous Cython experience, and it doubled the performance; so it is a fine way to go.

There was another problem. Like I said, we keep the logs on EBS, the network-backed storage offered by AWS, and that affects the performance of fsync. We hit a problem where, periodically, once every two days or so, the network inside Amazon Web Services lagged, and an fsync went from beneath 0.1 seconds to almost 25 seconds, which meant that the whole service was blocked for that time. But it was actually pretty rare and it did not affect the system too much.

During one of the testing phases we saw some strange behavior: the parser was reading continuously from the access log and caching data, the cache was full, yet the inserter was not receiving data with the efficiency it would have under normal conditions. What was happening was that the parser process was starving the feeder thread, the thread whose actual purpose is to take the queued data and insert it into the pipe. The only way we could manage to avoid this problem was to force the parser to sleep periodically; it could not be fully fixed, and it is an odd sort of problem, right?

So how easy is it to maintain such a system? A single machine that answers this amount of requests, let's say 20 thousand requests per second, contains only two essential processes: the nginx web server, which takes the requests and just needs a configuration file, and the log-processing service, which also uses a single configuration file. Those machines are small: they only have 4 cores and they are quite cheap, commodity hardware. And with the whole cluster you can answer 12 million requests per minute by using just 15 virtual machines. That means there is no single point of failure: when the machines are that small, you still have enough processing power for this throughput, and if you lose one machine, the system will not be affected as much as if you lose a gigantic, vertically scaled machine with hundreds of gigabytes of RAM.
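A quick back-of-the-envelope check of the numbers quoted in the talk (12 million requests per minute, 15 machines, roughly 20 thousand requests per second of single-machine capacity) shows why losing one small machine is tolerable:

```python
# Figures as quoted in the talk.
requests_per_minute = 12_000_000
machines = 15
per_machine_capacity = 20_000  # req/s one 4-core machine handled in tests

total_rps = requests_per_minute / 60       # 200,000 req/s across the cluster
per_machine_rps = total_rps / machines     # ~13,333 req/s per machine
headroom = 1 - per_machine_rps / per_machine_capacity

print(round(per_machine_rps))   # 13333
print(f"{headroom:.0%} headroom per machine")  # 33% headroom per machine
```

So each machine runs well under its measured capacity, and the remaining 14 machines can absorb a failed node's share of the load.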
Another thing to say about maintenance is that these machines are not throttled to 100 percent. These machines run at a maximum of 50 percent in the normal situation, and let's say 75 percent at peak. This gives us a lot of reliability and a degree of freedom, because if you run a machine at 100 percent you will have more hardware failures. And even if the peaks reach 100 percent of system capacity, you have an event queue that is represented by the access log file. That access log file just keeps being appended to by the nginx web server, which will happily continue to serve requests; and if the parser process cannot keep pace with nginx, nginx does not even have to know about it. The latency from the point when the data reaches the machine to when it is inserted into the database may grow to, let's say, a couple of minutes, but that is not critical: you will never lose data, you will just continue pushing it into the database when the peak has passed. The same thing applies to the database connection. The database can actually lag quite a lot; and if at some point in time you have problems with the database, say you have a cluster of databases and one or several of them crash, then you will have less write performance. If the database connection lags, the inserter will just lag in taking the data from the queue, and the data will accumulate in the cache; the cache will simply hold more and more data, and after the database is restored to its full throughput, the service will just continue inserting and catch back up to 100 percent. OK, so that was it. Thank you for your attention. Any questions?
(Moderator) Thank you. We have time for questions.
Q: I have two questions. The first: between your parser and your inserter, why don't you use a message queue like RabbitMQ or ZeroMQ, or some other queuing system, instead of the multiprocessing queue?

A: Well, first because it was easier to implement, and because it did not affect the performance as much as we thought. Even with the feeder-thread problem you still reach about 20 thousand requests per second, and the limit of the nginx web server on that Amazon instance type is about 20 to 24 thousand requests per second, so we did not reach the limit. And like I said, we kept our servers well below capacity so that we lower the probability of hardware failures. We could have also implemented a message queuing service, but that meant implementing something with more moving parts; we may implement it in the future, we don't know. For now, the multiprocessing queue is just a few lines of code, well, a little more than that.

Q: Thanks. My second question is about crashes. You talked about what happens if a process crashes, but what if the whole virtual machine crashes? What happens to the log messages in the access log which have yet to be stored in the database? Are they gone forever?

A: They are not gone forever, because the access log is kept on the EBS volume. So what that means is that, temporarily, while the virtual machine is down, you cannot access that data, and you will have some data that does not reach the database. But once the EBS volume is available again, you can reattach it to one of the other instances, or to another machine, and recover it.

Q: Thanks. And another question: why are you not using something like Logstash, or an equivalent, to do your parsing? What is the motivation to do it manually, implementing a dedicated service to pick up information directly from the logs, when dedicated utilities and existing solutions do that exact job? I have seen many things written about log parsing, and to me it seems that Logstash or something equivalent would be a very good solution to extract information from your logs, because it is structured information.

A: So that would mean you have a cluster of servers with nginx on each one of the machines, and the logs would be redirected to another service which actually runs Logstash. But you can install Logstash on the same machines where you run nginx; yes, that's true. However, I'm not really sure what the performance of those tools is. We could implement our own solution really quickly and get some really good performance out of it, and we could use Cython for it. So I don't know; I cannot speak for the performance of those tools.

(Moderator) Thank you. OK, I think that is all the questions we have time for.



