
Scaling a Cloud Based Analytics Engine

Hello, and this presentation is about scaling a cloud-based analytics engine. It was originally titled Mission Impossible; a marketing person chose that name, not me, and it was not my favorite, so I went back to basics. You probably don't know me. I've been around for quite a while; I've been a Postgres DBA for about 15 years now, and this is my first conference talk. Currently I'm a database manager elsewhere, but the project I'm speaking about here was actually for a startup called Message Bus, which sadly didn't quite make it. We did some really cool things there, though, the kind of things only a startup lets you do: you can start from nothing, build things the right way, and spend lots of cash. Not all companies have that.
A little bit about the platform: it was cloud native, and it was on Rackspace, not Amazon. The entirety of the company was cloud; it wasn't just the database architecture, everything that we did was in the cloud, so there was no actual physical hardware.
What we found out, well, what I found out, is that that type of environment really matters. What the cloud meant for us was that we needed small to mid-size instances, and we needed many of them; you don't want large instances, for many different reasons. But with your many instances you get a lack of control over your physical resources. You don't know what else is running on your hardware, you don't know what you're contending with for resources. That's typically not supposed to be a problem, but everybody knows that actually it is. There were many times at 4 o'clock when the entire system became slow, because everybody just got home from school and started playing a game. On the other hand, there is often no delay in expanding: when you find you need more, you get it in minutes, which is fantastic. That's the whole benefit of cloud, besides not having to pay for your hardware; you just get more, and you expand and contract easily. At the same time you don't have options over your storage. It's the same lack of control of physical resources: you don't necessarily know how your storage is attached to your data. So those are the tradeoffs of cloud. Redundancy in the cloud is extremely easy, but it's also absolutely necessary, because it's not just a possibility that you might lose an instance; it happens, and it happens a lot. And it's not just a possibility that you can lose your entire cloud; if you're on one vendor, at some point you probably will. These are things you have to architect around. Unlike physical hardware, where you know your limitations, the cloud adds an extra layer of vulnerabilities.
To reiterate: the same principles you build around for cloud, out of necessity, still work for hardware too. You don't want to overcommit your resources; you want small backups, not large backups; you don't want things taking over 24 hours and then skipping a run. The resource limits you design around for cloud are still good rules of thumb on physical hardware. So, what does your data look like? In our case we were dealing with passive transactional log data. We were an e-mail service provider of sorts, and the analytics that came with sending e-mail were what we handled: how many messages per hour, what channel they went over, whether they were marketing or transactional types. How do you store that volume of data? Not in flat files, because we wanted the ability to comb over it; you want it in a database where you have full SQL to go over it. By the nature of that type of data, logs from distributed sources, it's already distributed. We're not dealing with, say, a web application that has a single entry point for writes; we can take every single entry point where these logs are created and get them into a local warehouse. That's absolutely fantastic, because one of the major issues that comes with an application is dealing with a single master, and with distributed data that problem simply goes away. We had both short- and long-term storage. Short- to mid-term storage gave us the ability to go through everything and analyze it quickly; the archive was for legal reasons, or whatever the driver is. It's common practice to keep the hot data usable and store the rest long term, and by long term here we mean three months, not necessarily forever. Then there's eventual versus immediate consistency in the approach. Eventual consistency worked for us because we didn't need
immediate consistency. Again, this is log data: logs can get held up, endpoints go absent for a while, they take time to write. Nothing about this type of system necessitated immediate consistency. For other types of systems this wouldn't necessarily work, but here it was a nice property and it solved a lot of problems. The important thing that we needed over all of this was speed of retrieval, because the endpoint of this data is an API handing customers statistics, and that's a very common, high-demand thing: the ability to tell customers what you've done, how fast you did it, and give them proof that what they're paying for is valid. Marketing people just love that. The one thing consistently throughout my career that people have wanted is data; marketing wants data, and the eventual output of this is being able to show your customer the value of what you've got. So I put this slide in here
because I realize not everyone here is going to be at a high level; this is for the developers and the newer DBAs. The two hot terms we're dealing with in this talk are OLTP, online transaction processing, which is the log data that's just flowing in, the one-time actions that you need to capture, heavy on writes; versus OLAP, online analytical processing, which is combing over all that data and deriving your statistics from it.
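As a minimal illustration (the table and column names here are mine, not from the talk), the two workloads look like this in SQL:

```sql
-- OLTP: capture one event as it happens; a short, write-heavy statement
INSERT INTO send_events (message_id, channel, occurred_at)
VALUES (42, 'marketing', now());

-- OLAP: comb over all of that data at once to derive statistics
SELECT channel,
       date_trunc('hour', occurred_at) AS hour,
       count(*)                        AS sends
FROM   send_events
GROUP  BY 1, 2;
```

The rest of the talk is essentially about keeping the first workload fast while feeding the second.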
The requirements in this case were, first, no data loss, hence the redundancy: at no point could we have a single point of failure. That also meant we needed the ability to change masters, to pause replication, to pause writes, and to promote at any critical moment. Historically, if you're not designing for this from the start, it will be a major issue in the future. That safety net, building redundancy into your system so that when a database goes away your system doesn't crash, is a huge point all around. Who here has come across a web application where the database goes down and the attached application completely stops? Yeah. Whether it's HA or a cache, this was designed in from the start. Secondly, we could not be in the way: this was the output of the service, so in no way can your analytics stop or block your actual service from running. If at any point the writes slowed down or we couldn't take inserts, there was no way we could let that stop the service. Again, the idea is that the data itself is a service to the rest of the company. The other thing that comes with this type of transactional data is out-of-sequence events. If a node becomes unresponsive and comes back later, that's valuable data; you don't want to just throw it away. It's also possible things just got delayed, or you had networking issues. So the ability to take that data and fold it back in at the proper point in time was a requirement. Along with that, you're going to have bugs. Has anyone seen an absolutely perfect piece of software, with any degree of complexity, that was bug free? Yeah, neither have I. So what we needed was the ability to go back in history and restate analytics, because over the course of the product's life there were issues, and there always will be. That was a fundamental requirement. Which raises an obvious question:
I mean, really, the ability to go over logs, to perform ad hoc queries, to show data: why not Splunk? Splunk is really expensive, and the more data you have the more expensive it gets, though that was not the deciding issue here. More importantly, this was customer facing, and Splunk is not something you want to throw in front of a customer behind a REST interface. These were also predefined analytics: preset things that we wanted to specifically search for, analyze, condense, and hand out immediately to a customer. The difference is your internal debugging and log combing versus what you're going to expose as a service, as a business-class deliverable.
For tools, we chose, well, not all of them were the best choices, but some of them were absolutely fantastic. Chef for server automation, which, being fully cloud, is such a wonderful thing to have. Who here uses Chef? Not as many people as I would assume. Chef allowed us to bring up servers in minutes, and they were identical, and that was fantastic. You could say the same for Puppet or CFEngine to lesser degrees; some type of automation tool, especially when you're working with several different clusters across different locations, is incredibly useful. Liquibase is the schema tool of choice for myself, for schema management; I assume it's barely known here? Yes, barely. Liquibase gives you a changelog for your database, similar, I would suppose, to git, but specifically for schema: it handles merges, changes, updates, and rollbacks, and keeps consistency in your schema. Chef set up the original database and the users, Liquibase would run on top of that and set up your full schema, so you get fully automated rollout with tracking, and it made single-line schema updates over however many servers you might have incredibly easy. I'm surprised it's not better known. Then there was the queue that sat in front of Postgres. This could have been anything; we used a Scala implementation, but it could have been a Redis queue, it could have been RabbitMQ. It was simply a queue, though looking back, ours caused more problems than it solved. repmgr was in place for remote failover and for redundancy, with automatic failover turned off, and I don't actually think it's necessary when used that way. And then finally Postgres, the star of this, which I assume you all use.
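For anyone who hasn't seen Liquibase, it can track plain SQL in a "formatted SQL" changelog; a minimal sketch (the changeset id and table are invented for illustration) looks like this:

```sql
--liquibase formatted sql

--changeset sbillington:1-create-send-events
CREATE TABLE send_events (
    event_id    bigserial   PRIMARY KEY,
    message_id  bigint      NOT NULL,
    channel     text        NOT NULL,
    occurred_at timestamptz NOT NULL
);
--rollback DROP TABLE send_events;
```

Each changeset is recorded in a tracking table in the target database, which is what makes identical, tracked rollouts (and rollbacks) across many clusters a one-line operation.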
[inaudible audience question about repmgr]
I don't actually need it, and I've had several conversations about that. Everything that repmgr did can either be done in batch or with straightforward scripts and cron. Unless you really need automated promotion of a master, and in our case we really didn't, you don't need it. And even if you do, honestly, I would rather wake up at 2 AM or 4 AM and do a manual failover than have it automated. I'm sure some people feel differently about that, but if my database goes down in the middle of the night, I want to know it, I want to see what's going on, and it's more critical to me to be able to do a hands-on failover than to hand it to a computer and possibly get more severe side effects. So: writing the raw data to distributed OLTP. I went into this a little bit earlier: by nature, all of that log-file transactional data created a distributed architecture that could be split per cluster and per Postgres instance, so it was very elegant and very easy to create redundancy within. Because it's log
data, it's not relational. Actually, nothing about this project was relational, which was a rather big shock for me, being a relational database person; this was the first product at any scale I had worked on that wasn't specifically tied together with foreign keys and related data. But it was rather elegant: your disparate types can be written to separate instances. If you have a lot of send events, a lot of errors, a lot of updates, all of these can go to different databases on different instances, because consolidation happens at a later level. You don't need synchronous multi-master, again because you're not dealing with data that has to be relational; you're not querying one master expecting the same result from every other place, you're holding the results from your single instance for your single type. The next piece: batch-process the writes with COPY, because it was absolutely essential. That was one of those
things where you'd think, well, when inserts get slow we'll add COPY, it'll be an optimization for when we need it. No. This was absolutely not premature optimization: from the start, at a very low volume, we encountered issues without COPY. Who here has benchmarked COPY against inserts? We found the system simply couldn't keep up otherwise, because individual inserts, even at a very, very small scale, were drastically slower. So this is a must-have from the start when dealing with any type of volume, and you must have a safety net: when COPY breaks, as COPY does, we reverted to individual inserts. But be careful. Developers going through this would take the COPY batch and find the line that broke, because COPY is all or nothing, and their logic was: find the broken line, throw it out, go back to COPY. Except everything below that line got thrown out too. So test it, with a hammer, and test it again, because developers don't always make the right choices. That goes without saying, but in this case it's one of those logic things that makes perfect sense to a DBA or anyone familiar with how COPY works, while to a developer moving quickly, the naive version seems completely logical.
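The shape of it, sketched in SQL (the fallback loop itself lived in application code, and the table name is illustrative; the application actually streamed batches via COPY ... FROM STDIN):

```sql
-- Fast path: load the whole batch with COPY. COPY is all or nothing:
-- one bad row aborts the entire batch.
COPY send_events (message_id, channel, occurred_at)
FROM STDIN WITH (FORMAT csv);

-- Safety net: if the COPY aborts, replay the SAME batch as individual
-- inserts, skipping only the rows that actually error, never the tail.
INSERT INTO send_events (message_id, channel, occurred_at)
VALUES (42, 'marketing', '2014-05-23 16:00:00+00');
```

The bug described above came from treating the COPY failure like an insert failure: with row-by-row inserts you lose one row per error, with COPY you lose the whole batch, so the fallback has to replay everything, not just everything above the broken line.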
This is the thing I love most about Postgres, and that's a big statement after 15 years with the same software: inheritance is what I love most, it really is. You have disparate event types, and each type gets its own parent; all of my send events have a single parent. At this volume you need to shard, and I'm assuming you all know sharding. Sharding by time was the approach we took; you can shard in a multitude of different ways depending on the data, but this was write-dependent. The volume we chose to start with was one child per day, and that could easily move to a child per hour, or per half hour, or whatever your volume demands. By keeping those children small you create the ability to take individual backups, which is crucially more efficient, and it creates sanity. Those individual children could be archived off one at a time, and that kept things stable. Because here's the catch: say in three months you go back to a single child that was archived, and you try to restore it to go over that data again, but your schema has changed; the restore fails. One of the ways you can fix that is to restore the backup into its own database, bring the table itself up to the current schema version, and take the data from there. The last thing you want, when you have a critical need and you're up at 4 AM again, is for your restore to fail, so plan for that: you're not backing up the entire database, and an old child is not going to restore cleanly unless the schema is in the same state. The other thing about inheritance is the constraints on the children. Those constraints are type specific: if you hand a timestamp WHERE clause to a constraint on dates, constraint exclusion will ignore it, and you'll end up going through your entire dataset. Is everyone familiar with constraints on child tables and why they work so well? Right: make sure you use the same type, otherwise it's going to take a very, very long time. Our event ID was the other key piece:
a serial primary key, which allowed us to close the shard. And always EXPLAIN ANALYZE the queries that are going to be going over this data, compressing this data, because this is your primary write master; aggregation runs against it, and you don't want it bogged down by something scanning your entire dataset. This might seem redundant, it might seem elementary, but EXPLAIN ANALYZE and make absolutely sure you're not going through the full dataset. This is the thing that doesn't show up in tests; it shows up in production like a black box unless you test under scale. Small-scale testing, where you build a test database and run through a small dataset, will never find it. So specifically look at EXPLAIN ANALYZE and do load testing; those are the only two places you'll see it.
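The type-matching point above is checkable directly with EXPLAIN: the planner only skips children when the predicate's type lines up with the CHECK constraint's type. A sketch, assuming a parent table `events` sharded on a timestamptz column (names invented here):

```sql
SET constraint_exclusion = partition;  -- the default: exclude by CHECKs

-- Predicate typed the same way as the child constraints: the plan should
-- show only the matching child being scanned.
EXPLAIN ANALYZE
SELECT count(*)
FROM   events
WHERE  occurred_at >= timestamptz '2014-05-23 00:00:00+00'
AND    occurred_at <  timestamptz '2014-05-24 00:00:00+00';

-- If the plan instead lists every child, the predicate type does not
-- match the constraint type, and you are walking the entire dataset.
```

This is exactly the check that only shows up under EXPLAIN or load testing, never in a small test database where a full scan is cheap anyway.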
When using inheritance like this, you need to make sure that you always have a place to write, which means that in advance of the time when you're going to be using a shard, it has to exist. We used scheduled jobs for this because of the environment we were in, but you can use whatever fits; it really doesn't matter. Here's the code to create the actual shard, including all of the indexes and check constraints. A caveat: I did not optimize any of these functions, and they're actually kind of simplistic, for the sake of fitting on a slide; there are no pre-declared variables or other optimizations.
The other important part is closing the shards. By that I mean: while you're actively using a child, its constraint runs from a base event ID up until, effectively, forever. If a query is using an ID to constrain itself and you never pin down the final ID for that time period, you're again going to go through your entire dataset. So when you're done with a shard, nightly, hourly, whenever you have passed the time during which it is a viable table to write to, you must close the shard. This was the function created to do that: select all of the tables, parse out the names, loop through everything that is not closed yet, determine the minimum and the maximum, and then tidy up, replacing the current constraint, which only has a start, with one that has both a start and an end. The reasoning behind this is more subtle than it first looks.
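A minimal sketch of both halves, creating a day's shard in advance and later closing it. All names are illustrative, and, as with the slide originals, nothing here is optimized:

```sql
-- Pre-create a day's child, with its time constraint and an index,
-- so there is always a place to write before the day begins.
CREATE OR REPLACE FUNCTION create_event_shard(day date) RETURNS void AS $$
BEGIN
    EXECUTE format(
        'CREATE TABLE IF NOT EXISTS events_%s
             (CHECK (occurred_at >= %L AND occurred_at < %L))
         INHERITS (events)',
        to_char(day, 'YYYYMMDD'), day, day + 1);
    EXECUTE format('CREATE INDEX ON events_%s (event_id)',
                   to_char(day, 'YYYYMMDD'));
END;
$$ LANGUAGE plpgsql;

-- Close a finished child: pin its actual event_id range so that
-- ID-constrained queries can prune it instead of scanning every shard.
CREATE OR REPLACE FUNCTION close_event_shard(child regclass) RETURNS void AS $$
DECLARE
    lo bigint;
    hi bigint;
BEGIN
    EXECUTE format('SELECT min(event_id), max(event_id) FROM %s', child)
    INTO lo, hi;
    EXECUTE format(
        'ALTER TABLE %s ADD CONSTRAINT %I CHECK (event_id BETWEEN %s AND %s)',
        child, child::text || '_event_id_range', lo, hi);
END;
$$ LANGUAGE plpgsql;
```

The talk's version also looped over all not-yet-closed children by parsing table names; the sketch above closes one child at a time to keep the idea visible.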
Next: star and snowflake schemas for the analytics. Again, because this was a client-facing analytics engine, it had to be fast, it had to be sub-second, so a star schema was basically the only choice. The added bonus is being able to use partitioning, tablespaces, and the like: very basic, plain data warehousing techniques.
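The shape of a star schema, reduced to a sketch (the dimension and fact names are invented here): small dimension tables around one narrow fact table, so client-facing lookups become a join on small integer keys over pre-aggregated counts.

```sql
CREATE TABLE dim_channel (
    channel_id serial PRIMARY KEY,
    name       text   NOT NULL UNIQUE
);

CREATE TABLE dim_day (
    day_id serial PRIMARY KEY,
    day    date   NOT NULL UNIQUE
);

-- The fact table holds nothing but dimension keys and additive measures
CREATE TABLE fact_sends (
    channel_id int    NOT NULL REFERENCES dim_channel,
    day_id     int    NOT NULL REFERENCES dim_day,
    send_count bigint NOT NULL DEFAULT 0,
    PRIMARY KEY (channel_id, day_id)
);
```

Because the measures are additive, coarser views (per week, per account) can be condensed from this finest-grained fact without going back to the raw logs.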
Again, we were using scheduled jobs. It was a three-part process to get data from our transactional warehouse into the analytics warehouse where we hold the aggregates: reach out to all of our transactional data, update and add it into the base fact table, just the very base facts that you want, and then cascade. So there are analytics
databases, our children as it were, set up for pulling by sequential ID: position the pull, update, record the work log, all in a single transaction, saving the state so that you can go back and do analytics on how much you've done and what your performance is. That was a pretty important thing for us, so we could determine what our resources were and when we needed to spin out another cluster. So this is the code for
an aggregate sample. This was done in PG 9.1; now we would use foreign data wrappers for it. This is also the slide where I point out that with writable foreign data wrappers in 9.3, the ability to write back into your primary sources would be far superior. The reason is keeping track of your IDs. This was a point of contention: where do you keep track of your high-water mark? Whatever ID your last pull stopped on is the point where your next one starts. Keeping that work log in your transactional database means that if your analytics database goes down and another one takes over for it, say in a multi-head situation where one cloud picks up for another and you cannot access your analytics engine, the mark is still reachable; for that case, keeping it on the transactional database is absolutely essential. But if you keep it in your analytics database, it's easier for your own processing to get to. So there was back and forth, and we ended up keeping it in both places.
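A sketch of the work-log idea (the table and worker names are invented): the high-water mark is advanced in the same transaction as the pull, so a replacement worker always knows exactly where to resume.

```sql
CREATE TABLE aggregate_work_log (
    worker        text        NOT NULL,
    last_event_id bigint      NOT NULL,
    pulled_at     timestamptz NOT NULL DEFAULT now()
);

BEGIN;
-- 1. Find the resume point:
--      SELECT coalesce(max(last_event_id), 0) FROM aggregate_work_log;
-- 2. Pull and aggregate everything with event_id above that point.
-- 3. Advance the mark inside the same transaction as the pull,
--    so a crash never leaves the mark ahead of the data.
INSERT INTO aggregate_work_log (worker, last_event_id)
VALUES ('olap-worker-1', 1500000);
COMMIT;
```

Keeping a copy of this table on both the transactional and the analytics side, as the talk describes, trades a little duplication for survivability in either direction.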
I touched on this before: when you aggregate data, dblink FTW. From your transactional data you only go into your finest-grained fact first. If your star schema has five dimensions, you don't want to aggregate into every single one; you go to your finest one, and then continually condense down from it.
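A sketch of the dblink pull into the finest grain (the connection string, tables, and columns are illustrative):

```sql
CREATE EXTENSION IF NOT EXISTS dblink;

-- Pull rows from the OLTP side, already condensed to the finest grain
-- the star schema stores (here: sends per channel per hour), starting
-- from the work log's high-water mark.
SELECT channel, hour, sends
FROM dblink(
       'host=oltp1 dbname=events',
       'SELECT channel, date_trunc(''hour'', occurred_at), count(*)
          FROM events
         WHERE event_id > 1500000
         GROUP BY 1, 2')
     AS t(channel text, hour timestamptz, sends bigint);
```

Doing the GROUP BY on the remote side keeps the expensive part, shipping rows over the wire, down to the already-condensed result.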
This is where the query, or the code, gets a little big: updating into the base fact. The idea is that when you aggregate, you create, not a temporary table, but an actual table in your analytics database that holds all the aggregated data, and from that table you select and condense into your fact tables. It is not a temporary table because temporary tables are private to a session, and we wanted several workers to be able to come along and work that aggregated table into the facts. And if the transaction was interrupted, we don't want to have to re-aggregate, because that's the expensive part of this: going to your transaction processing side, grabbing all that data, sending it back over the wire, and pulling it in. So that staging table is the point where everything has to be settled and solid; it's the atom of the whole pipeline.
And it gets bigger still: this is where you take the base facts and, in two phases, update everything down the line and insert all the new data down the line. That includes all of your data that might have come in delayed, getting aggregated back into the proper time period it came from, or the proper whatever it came from; in our case we had multiple different ways of aggregating, based on channel, sessions, and dates, but the idea is that historical data needs to get filed properly, historically.
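On PG 9.1 (INSERT ... ON CONFLICT only arrived in 9.5), the two phases are a plain UPDATE followed by an INSERT of the rows the UPDATE missed. A sketch, with invented names, of folding a staging table into a fact table:

```sql
-- Phase 1: add the new counts onto periods that already exist in the fact
UPDATE fact_sends f
SET    send_count = f.send_count + s.sends
FROM   staging_sends s
WHERE  f.channel_id = s.channel_id
AND    f.day_id     = s.day_id;

-- Phase 2: insert the periods the UPDATE did not touch, including any
-- late-arriving data being filed back into its proper historical slot
INSERT INTO fact_sends (channel_id, day_id, send_count)
SELECT s.channel_id, s.day_id, s.sends
FROM   staging_sends s
WHERE  NOT EXISTS (
         SELECT 1 FROM fact_sends f
         WHERE  f.channel_id = s.channel_id
         AND    f.day_id     = s.day_id);
```

Run inside one transaction, this also restates history correctly: a delayed event simply lands in the UPDATE branch of whatever period it belongs to.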
What we found was that, aside from being a service we provided to customers for analytics and marketing information, everything showed up here. Every little problem surfaced here, because it's not just marketing data; this is performance data. Something went wrong, and I could see it, because we apparently sent 55 thousand e-mails in an hour in an environment that couldn't have; it turned out they were text messages, which is what happens when your texts get routed through the wrong channel. So don't always trust your data, and if something looks funny, don't immediately assume the data is broken or that you have a bug; you might, but it may also be somebody else's problem up the chain, because this is the final sink where you will find everything. And it's also really kind of nice to be able to say: nope, not my problem.
So, putting it all together: data flows from the inputs into our transactional processing databases, with different event types going to different databases, then into your first warehouse, and out through your queries. This was the base system: one or more OLTP databases and a very simple, very basic cluster configuration. The next scalability point comes when your log files, or your transactions, grow beyond what you originally planned for: each of them can scale independently and still feed back into the same analytics warehouse.
Further scaling is to go into separate warehouses: multiple transactional and multiple analytics warehouses, all served out through one or more front ends. In our case everything was keyed on the data: the harvester gathered all of our data, and the data had to flow; it was our spine, and because our analytics side knew everything, we had a clean way in and out through the whole system. I can't take credit for that design; I just made a couple of tweaks.
So, questions? [Audience question about how replication met the no-data-loss requirement] We were using streaming replication as it existed at the time, and we were reading off the master as well as writing to the master; the slaves were for hot-swap redundancy, not necessarily for read scaling. Because the instances were small, we were able to scale that way. So again, the slaves were there for hot-swap redundancy. What gave us no data loss was primarily the queue: everything wrote to the queue, and the queue was journaled, written through to disk rather than held in memory only, so we had durability on it, writing to disk across instances, and it had slaves as well. Then from there, the COPY batches and the postprocessing: if an insert was not seen it would retry, and it retried at the COPY level as well, so if an entire COPY failed it reverted back to inserts. And once the data is there, we were reading and writing off the master; the master was small enough and light enough that we didn't necessarily have to go to the slave, but if we did, we had a very, very quick failover and could go back over the recent data. [Audience question about repmgr] This was a year ago, and at the time the major thing I disliked about repmgr is that its timeouts assume you have a very long window between starting a standby and it catching up, which in our case was absolutely not the case; you can set that lower, but for us a manual process was simpler. [Audience question about scale] This system was tested up to 200 gigs a day; it was under load for several days during load tests and then went to production. Sadly, the company went under before it got to see that level of volume, so I don't know what the final numbers were.
The scale was spiky, but usually you can see it coming by projecting ahead, at least in our case; and that would be an easy thing to automate, because once you have the Chef tooling in place, it's really easy to set limits on write rates and spin up a new cluster. [Audience question about who wrote the application code] Being a startup, this was a company of 30 people; I wrote all of the database side, and handed it over to the other developers, who built the REST API and the write endpoints and whatnot, which is how we ran into the COPY problem. I was lucky enough that it sat behind what is effectively a data access layer; it wasn't something that had grown to the point where developers wanted to bolt statistics on all over. It remained sequestered behind its own access layer with the REST endpoints, and everything that needed to be pulled out of it came through that. [Audience question about the COPY failure] No, it was an invalid row. The logic was: the COPY failed, and COPY is all or nothing, as set out in the requirements, so the fallback was individual row inserts; the developer who wrote that went through insert by insert, starting from the top, until he found the error, and then stopped. So it was not an actual COPY bug, it was a logic bug: not continuing on from the point of error. [Comment about the tutorial] If you were in the tutorial yesterday, there was a neat way of handling COPY where everything goes into a sequential ID column plus a single raw-data column, as one straight copy in one large block, and a second pass then parses it out based on what's in your data. So it's a two-part copy, and that avoids a lot
of these problems. I thought that was a nifty little trick. It would have been a lot slower in this case, so I wouldn't have built around it, but I might have used it for my safety net: my safety net being, when my COPY broke, since COPY is all or nothing, copy the whole rows in as a single raw column and then parse that column out afterward, rather than falling back to individual inserts.


Formal Metadata

Title Scaling a Cloud Based Analytics Engine
Alternative Title Mission Impossible
Series Title PGCon 2014
Number of Parts 31
Author Billington, Samantha
Contributor Crunchy Data Solutions (Support)
License CC Attribution 3.0 Unported:
You may use, adapt, and copy, distribute and transmit the work or content in unchanged or adapted form for any legal purpose, provided you credit the author/rights holder in the manner specified by them.
DOI 10.5446/19087
Publisher PGCon - PostgreSQL Conference for Users and Developers, Andrea Ross
Release Year 2014
Language English
Production Place Ottawa, Canada

Content Metadata

Subject Area Computer Science
Abstract Scaling a 100% Cloud Native Analytics Engine. Your mission, should you choose to accept it: create a data storage system that can handle 200 gigs of data per day on cloud servers with heavy analytics. GO! The architecture plan of a real-time logging system built to handle 200 GB/day of data and hand it off from mid-term OLTP storage into an OLAP Postgres data warehouse. It was built with heavy reliance on inheritance, dblink, and streaming replication.
