
Optimizing queries for not so big data in PostgreSQL


Transcript
Hello everyone, I'm Stephanie and I work on the development team at Hotjar. As was just explained, Hotjar is an all-in-one tool for user analytics and feedback, so we get loads of user data. But is it really big data? How big is this data?
At Hotjar we have about 400 million recordings, with 4.5 billion records in a single supporting table even, recordings coming in at around 1,000 per minute, and an overall request rate of 750 thousand requests per minute. But is this really big data? What defines big data is basically that you start outgrowing the normal set of tools and need to move on to a new set of tools which are more suited for bigger data. At Hotjar we're not at that limit yet, but sometimes we have to play a few games with Postgres, because we are nearing the limits of either storage or performance. So this talk is about the good decisions we made which kept our data happily in Postgres, which I would like to share with you, and also about all the things you have to figure out in production, which were quite surprising and adventurous at times, and the tools we use to handle those situations. We're going on a journey together: imagine we're going to start building a database from scratch. I'll first go over the good decisions we made, and then I'll extend that with the tooling which helps us once we deploy live. OK, let's go.

The first thing we need to consider is normalization. Textbook data normalization is a step-by-step process by which we eliminate fields from a table which are not directly dependent on the primary key. However, in real life we need to strike a balance between the consistency of the data on one hand and its storage and performance on the other, so we need to make decisions in a way which makes sense for us. Postgres offers us a bit of the best of both worlds, because we can have schemaless data inside JSONB fields and we can also have relational tables. What makes us choose one or the other is basically the data itself. At Hotjar we have data relating to our users, subscriptions and accounts, and in this case it's very important that the data remains consistent, so we use the referential constraints of the database and have well-normalized tables. However, when it comes to user data from surveys or polls, the data is a little more dynamic and not so well defined, so in that case we can afford to put it in JSONB fields and do the validation at the model level or at the API level, using JSON Schema or using Marshmallow schemas. This gives us kind of the best of both worlds, at the cost that the storage doesn't verify the contents every time.

OK, so now that we've defined more or less how things are going to work, the product owner needs to make a few changes: where are we going to add this new field? It's very nice to know beforehand how the product is planned to evolve, so we can plan future changes, but that's not always possible. The idea is that as the product evolves we need to keep our models tidy, to achieve code maintainability and readability, and so that people can understand the journey of our models, because the database might still hold data from previous definitions. The way we do this at Hotjar is by versioning: when models are normalized into tables we can add a version column, and when they are nested inside JSON fields we can also add a version field inside the JSON. So now when we add a new field, we can also identify and handle the behaviour of the different versions already at the API layer.
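To make the validation step concrete, here is a minimal sketch, assuming marshmallow 3 and invented field names, of checking a dynamic survey payload at the API layer before it is stored in a JSONB column; the version lives inside the JSON itself, as described above:

```python
# Hypothetical sketch: validating a dynamic survey payload before it goes
# into a JSONB column. Schema and field names are made up for illustration;
# assumes marshmallow 3, where load() raises ValidationError on bad input.
from marshmallow import Schema, fields, ValidationError


class SurveyResponseV2(Schema):
    version = fields.Integer(required=True)       # version stored inside the JSON
    survey_id = fields.UUID(required=True)
    answers = fields.List(fields.Dict(), required=True)


def validate_survey_payload(payload: dict) -> dict:
    """Validate incoming JSON so the database only ever sees checked data."""
    try:
        return SurveyResponseV2().load(payload)
    except ValidationError as err:
        # Reject at the API layer instead of relying on database constraints.
        raise ValueError(f"invalid survey payload: {err.messages}")
```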
OK, so what's next? Now we are ready to use our database from the code, because we've defined pretty well what we want to do.
So, should we use ORMs? ORMs give us a few benefits. One is that we get a higher level of abstraction: we don't need to think about this table being joined with that table, we can think at a higher level, more about the business logic. ORMs also generate the SQL for creating, reading and updating items for us, and frameworks are another very nice benefit, because some frameworks are very tightly knit to an ORM. For example, the Django admin interface is very heavily based on the Django ORM, and it gives us nice tools for free, so we can basically view the data from an admin interface without having to write our own create, read, update and delete code. When it comes to handling transactions, there are libraries like Flask-SQLAlchemy which manage the transactions for us within the web request and ensure that they're committed at the end; again, this is heavily based on the SQLAlchemy ORM.

When it comes to ORMs, though, there are some things which are advertised as benefits but may not be, so I want to put a little note here to be careful. One of the advertised benefits is that you can switch between different backends, so it's tempting to use Postgres as your production database and SQLite as a test database, because the ORM allows it. However, this may lead to discrepancies, and you might get surprises when you deploy your code to production, so it's best to keep your test environment as close to your production environment as possible. Another one is automatic migrations. Some ORMs support generating migrations, but if you have a hot table which is receiving lots of requests, the SQL generated by the ORM for the migration might not be optimal, in the sense that it might take locks on the table and cause request times to spike. So you might need to rewrite that SQL to make it more performant, for example to create indexes concurrently. Don't just trust the migration code generated by the ORM; check that it's what you need.
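As an illustration of rewriting an ORM-generated migration for a hot table, here is a hypothetical Alembic sketch (table and index names are invented; the autocommit block is needed because CREATE INDEX CONCURRENTLY cannot run inside a transaction, and is available in newer Alembic releases):

```python
# Hypothetical Alembic migration, hand-edited so the index is built without
# taking a long write lock on a busy table.
from alembic import op

revision = "20170713_recordings_site_idx"   # placeholder revision id
down_revision = None


def upgrade():
    with op.get_context().autocommit_block():   # needed for CONCURRENTLY
        op.execute(
            "CREATE INDEX CONCURRENTLY IF NOT EXISTS ix_recordings_site_id "
            "ON recordings (site_id)"
        )


def downgrade():
    with op.get_context().autocommit_block():
        op.execute("DROP INDEX CONCURRENTLY IF EXISTS ix_recordings_site_id")
```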
OK, so what about performance? ORMs do give us a performance hit, so how do we handle that situation? Should I use stored procedures? At Hotjar we use the ORM in most instances, and whenever we require an extra performance edge we look into stored procedures. However, there are downsides. First of all, we spread business logic from a single module into two different places: half of it lives on the database inside the stored procedure and half in the web server, in the API, which makes things less visible. Also, stored procedures run within a single transaction, so if you have lots of things going on in them they might increase your chances of deadlocks. The good practices are: if you go for stored procedures, keep them small, and version them as well, so that if something goes wrong with the new version of a stored procedure you can roll back your code easily and rely on the previous version of the procedure.

OK, so I think now we're ready to go live, right? We've made some performance considerations and we're happy with our model. Yet once it's running live, things are different, because we cannot predict how requests are going to use the database. Once we were running live, we identified that we have slow queries. So how are we going to go about knowing which are the slow queries? I hope, for your sake, that you have monitoring set up; in that case you'd be able to see which web requests are not performing as well as you'd like. But this doesn't necessarily point you to the query itself, because a web request might be doing many different things. In this case we basically pull pg_stat_statements out of our toolbox, and this gives us the frequency and the timing of each query, to help us identify which one is the slowest.
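A minimal sketch of pulling the slowest queries out of pg_stat_statements with psycopg2 — the DSN is a placeholder, and the column names assume a pre-13 Postgres (newer versions call the column total_exec_time):

```python
# Sketch: find the most time-consuming queries. The pg_stat_statements
# extension must be installed and preloaded for this view to exist.
import psycopg2

TOP_QUERIES = """
    SELECT calls,
           total_time,
           total_time / calls AS mean_time_ms,
           query
    FROM pg_stat_statements
    ORDER BY total_time DESC
    LIMIT 10;
"""

with psycopg2.connect("dbname=app user=app") as conn:   # placeholder DSN
    with conn.cursor() as cur:
        cur.execute(TOP_QUERIES)
        for calls, total, mean, query in cur.fetchall():
            print(f"{calls:>8} calls  {mean:8.2f} ms avg  {query[:80]}")
```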
Once we know the query, EXPLAIN and EXPLAIN ANALYZE are the next step. When we run EXPLAIN, we get back the query plan which Postgres will use to actually give us the results. When we run EXPLAIN ANALYZE, Postgres will actually run the query plan and give us the exact timings it takes. When we add the BUFFERS option, it will also tell us how many pages were read to give us the data back. There are two numbers given to us there: if there is a high hit count in the shared pages, it means the results are at least coming from cache; however, if there is a high number of pages being read, it means the data is coming from disk. As a general rule, the number of buffers hit should never be too high, because in that case you might be going over too much data and you might need to optimize. Other things to look out for are nested loops and sequential scans; these might require indexes to improve. Also take a look at how your query is using the data: if you're always filtering on two specific fields together, you might benefit from a composite index, where the index takes both fields into consideration. And if you're doing some computation on a field before querying it, you might benefit from an expression index, which stores the computed value within the index; so if you do some transformation to the data before using it in the query, consider putting that transformation into the index.
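A short sketch of the same ideas run from Python, with invented table and column names: get the plan with EXPLAIN (ANALYZE, BUFFERS), then add a composite index and an expression index concurrently:

```python
# Sketch of the tools above. Autocommit is needed because CREATE INDEX
# CONCURRENTLY cannot run inside a transaction block.
import psycopg2

conn = psycopg2.connect("dbname=app user=app")   # placeholder DSN
conn.autocommit = True
cur = conn.cursor()

# 1. Ask Postgres for the real plan, timings and page (buffer) accesses.
cur.execute("""
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM recordings
    WHERE site_id = 42 AND created_at > now() - interval '1 day';
""")
for (line,) in cur.fetchall():
    print(line)

# 2. If two columns are always filtered together, a composite index helps.
cur.execute("CREATE INDEX CONCURRENTLY IF NOT EXISTS ix_recordings_site_created "
            "ON recordings (site_id, created_at)")

# 3. If we always query on an expression, index the expression itself.
cur.execute("CREATE INDEX CONCURRENTLY IF NOT EXISTS ix_users_email_lower "
            "ON users (lower(email))")
```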
OK, so now we're lighter with our queries, but what if Postgres is still not choosing the optimal plan? If the query plan is still not performant enough, is Postgres at fault? Well, let's start educating the query planner — but at this stage I have a big disclaimer to make: the planner is a very smart guy, so he's usually right, and we need to be careful here. Let me quickly go into how the query planner actually reaches its decisions. When ANALYZE runs on a table — not EXPLAIN ANALYZE now, but analyzing the table — the query planner's statistics get updated. The planner samples a limited number of rows from the table and bases its query planning on those. However, if you have a table in which columns have a very irregular data distribution, you might benefit from giving the query planner more statistics to base its decisions on, so that it can come up with a better query plan. In that case we can raise the statistics target on the column, basically increasing the number of statistics collected, so that the planner can take better decisions. But other than that, he's usually right.

So now we've done quite a lot of improvement to our query plans; however, some table joins are still very expensive when running the requests. What can we do about that? Materialized views can help us at this stage. A materialized view is created from a query: the result of the query is computed once and stored in the view, and whenever we need it updated with fresh data we run a refresh. This makes a lot of sense whenever you generate reports daily or hourly, where you can take a bit of time to refresh the data. However, when the data needs to be updated continually, it can be expensive to refresh the materialized view, especially if it's a big query. In that case we can get around this by building our own "materialized" tables, which we populate at the same time that we populate the normalized tables. Let me give an example: if I'm entering subscription details and I need to populate the account table, the customer table and the subscription table, I would also, in the same transaction, insert data into the materialized table which combines the data from these three, so that when I query the materialized table I don't have to join, and this gives me the performance benefit. Another way of doing this is by having a trigger on one of the normalized tables which then propagates the changes to the materialized table.
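A minimal sketch of the two server-side knobs mentioned above — raising a column's statistics target for a skewed distribution, and materializing an expensive join — with all object names invented for illustration:

```python
# Sketch: more planner statistics for an irregular column, plus a
# materialized view refreshed on whatever schedule the reports need.
import psycopg2

conn = psycopg2.connect("dbname=app user=app")   # placeholder DSN
conn.autocommit = True
cur = conn.cursor()

# More statistics buckets for a column with an irregular distribution,
# then re-ANALYZE so the planner actually picks them up.
cur.execute("ALTER TABLE recordings ALTER COLUMN site_id SET STATISTICS 1000")
cur.execute("ANALYZE recordings")

# Materialize an expensive join once...
cur.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS subscription_overview AS
    SELECT a.id AS account_id, c.name, s.plan, s.valid_until
    FROM accounts a
    JOIN customers c ON c.account_id = a.id
    JOIN subscriptions s ON s.account_id = a.id
""")

# ...and recompute it whenever the data is allowed to go stale.
cur.execute("REFRESH MATERIALIZED VIEW subscription_overview")
```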
OK, so we have quite a few tricks so far; however, aggregations deserve a special mention. Would you run SELECT count(*) in a web request? If you always know that the results returned are going to be of limited size, it's fine to do it, but usually you might get surprises, because people tend to use APIs in ways which you cannot predict, so it might give you a performance hit on each request. The idea is that although it's hard to cache aggregations, because we always want an updated value, we might not require an up-to-the-second value. In that case we can basically cache the result of the aggregate for five minutes and refresh it every five minutes. At Hotjar we do this by having a web request selectively update the cache: a web request will take a look and say, "I'm in charge of updating this value now, because I can see it's going to expire soon", and in that way we avoid the cache stampede problem, where everyone tries to update an expired value at the same time. Another way to do it is by having a background worker take care of this, so it doesn't happen in the web request at all. Also, while we're speaking about aggregations: when we're continuously updating counts by incrementing fields inside the database, it also makes sense — if our requirements allow it, if serving five-minute-old data is not a problem — to accumulate the increments in a cache first and sync them to the database periodically, to avoid creating too many updates, because if your update rate is very high, Postgres won't be very happy.
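The talk describes the idea rather than the implementation; here is a deliberately simplified, in-process sketch of letting a single caller refresh an aggregate shortly before it expires (a real deployment would use a shared cache rather than a module-level dict; the five-minute TTL matches the example above):

```python
# Sketch: cache an aggregate for 5 minutes, and let one caller take
# responsibility for refreshing it shortly before expiry so the value
# rarely actually expires under load (which is what avoids the stampede).
import threading
import time

TTL = 300          # serve cached aggregates for 5 minutes
EARLY = 30         # one caller starts refreshing this long before expiry

_cache = {}        # key -> (value, expires_at)
_refreshing = set()
_lock = threading.Lock()


def cached_aggregate(key, compute):
    now = time.time()
    with _lock:
        value, expires_at = _cache.get(key, (None, 0.0))
        expired = now >= expires_at
        closing_in = (expires_at - now) < EARLY
        i_refresh = expired or (closing_in and key not in _refreshing)
        if i_refresh:
            _refreshing.add(key)        # this caller "wins" the refresh
    if i_refresh:
        try:
            value = compute()           # e.g. run the real SELECT count(*)
            with _lock:
                _cache[key] = (value, time.time() + TTL)
        finally:
            with _lock:
                _refreshing.discard(key)
    return value

# Usage sketch: count_active_users would run the real query.
# count = cached_aggregate("active_users", count_active_users)
```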
OK, so there's still something lurking in our database which might cause trouble intermittently, and it's a bit of a beast: welcome to deadlocks. I explain deadlocks this way: imagine you have two kids, and they each have a toy, and of course each wants the other kid's toy without giving up its own. This is what happens when we have a deadlock: there are two transactions going on, and each holds a resource which is required by the other transaction, but neither of them wants to let go of the one it's using. In the case of the kids, an adult would come in, take one of the toys and give it to the loudest kid. In the case of deadlocks, Postgres does mediate: it kills one of the queries, and one of your users will get a failed request. Due to the nature of deadlocks they're not very predictable, and repeating the request would probably make it work, so it's not too bad for your end users. However, one of the symptoms is that queries slow down when we have deadlocks, and if the total increases it can really bog things down, so we need to take care of them. One of the ways to take care of this is ensuring that you're running a recent Postgres version, because updates sometimes come out which can prevent certain classes of deadlocks, but mostly we need to take a bit of care about how we write our code.
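The talk doesn't show code for this part, but one common way to write deadlock-safe code is to always take row locks in a consistent order; a hypothetical sketch with an invented accounts table:

```python
# Sketch: avoid a classic deadlock by always locking rows in the same order
# (here: ascending primary key). Table, columns and the credit-transfer
# scenario are invented for illustration.
import psycopg2


def transfer_credits(conn, account_a, account_b, amount):
    # Every transaction locks the two rows in the same (sorted) order, so
    # two concurrent transfers between the same accounts can never each
    # hold the row the other one is waiting for.
    first, second = sorted((account_a, account_b))
    with conn:                                  # commit or roll back the transaction
        with conn.cursor() as cur:
            cur.execute("SELECT balance FROM accounts WHERE id = %s FOR UPDATE", (first,))
            cur.execute("SELECT balance FROM accounts WHERE id = %s FOR UPDATE", (second,))
            cur.execute("UPDATE accounts SET balance = balance - %s WHERE id = %s",
                        (amount, account_a))
            cur.execute("UPDATE accounts SET balance = balance + %s WHERE id = %s",
                        (amount, account_b))
```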
All right — you probably can't read this query, but I'll make the slides available later. It's a very useful query for identifying deadlocks, and it uses the pg_stat_activity view and pg_locks within the query. These are two tools provided to us by Postgres to help us identify which transactions are holding which locks, and if we see that locks are being held for too long, we can possibly identify that there's a problem.
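The exact query from the slide isn't reproduced in the recording, but a query in the same spirit joins pg_locks with pg_stat_activity to show which sessions are waiting on locks and how long their transactions have been running:

```python
# Sketch: list sessions waiting on locks, with transaction age and the
# query they are running. DSN is a placeholder.
import psycopg2

WAITING = """
    SELECT a.pid,
           now() - a.xact_start AS xact_age,
           l.locktype,
           l.mode,
           a.state,
           a.query
    FROM pg_locks l
    JOIN pg_stat_activity a ON a.pid = l.pid
    WHERE NOT l.granted
    ORDER BY xact_age DESC;
"""

with psycopg2.connect("dbname=app user=app") as conn:
    with conn.cursor() as cur:
        cur.execute(WAITING)
        for row in cur.fetchall():
            print(row)
```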
OK, so at this stage we need to speak about table and index bloat. What happens when we have table bloat specifically is that, when searching for results, Postgres has to jump over a lot of rows which have been deleted or updated, but whose space has not been returned to Postgres yet. Bloat basically happens due to how Postgres manages transactions, and it can be fixed by running VACUUM FULL; however, the drawback is that it holds an exclusive lock on the table, so it might not be advisable to do this in a production environment. There are a number of extensions which can be used to improve the situation, remove the dead rows and basically give the free space back to Postgres. One of the symptoms of having dead rows is that the disk space in your database goes down really fast, faster than you would expect at the current rate of data coming into your database. So, just a quick rundown: pg_squeeze is something you set up once and it takes care of bloat continuously, because you configure the level at which the bloat needs to be controlled; pgcompacttable is advised for TOAST tables, which are basically tables which have blobs inside them and which affect the storage of other tables under the hood, but it needs to be run specifically, with a parameter which specifies the number of pages it's going to clean up; and pg_repack can just be run once, whenever the need arises. We use pgstattuple to actually identify whether we have table or index bloat.

So, we have put quite a few problems onto background workers by asking them to solve things for us, so let's speak a bit, at the end, about how to keep these background workers healthy. Ideally, background workers pick their jobs from a queue, so that if one of the background workers dies the job can be rescheduled. Also, we should keep the duration of the jobs predictable and not too variable inside the background worker, so that we can distribute the tasks between multiple servers, so that we're not relying on one machine for an outsized task, and so that we can monitor and profile the background workers easily. Another tip is to use server-side cursors when you're obtaining a large amount of data; this does incur a performance hit on the database, but it protects memory within the background worker. So the advice is: don't use them in a web request, but do use them inside the background worker (a short sketch of this follows below).

This is the cheat slide, because it's not about Postgres, but I mentioned caching, and caching can be done in two different ways: we can either cache small tables within the web application itself, where we need to be careful with memory, or we can use dedicated caching stores to keep that data for us.
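A minimal sketch of the server-side cursor tip for background jobs: a named psycopg2 cursor streams rows from the server in batches instead of loading the whole result set into the worker's memory (DSN, table and batch size are placeholders):

```python
# Sketch: stream a large result set inside a background job.
import psycopg2


def export_site_recordings(dsn, site_id):
    conn = psycopg2.connect(dsn)
    with conn:
        # Passing a name makes psycopg2 create a server-side (named) cursor.
        with conn.cursor(name="recordings_export") as cur:
            cur.itersize = 5000        # rows fetched per round trip to the server
            cur.execute(
                "SELECT id, created_at FROM recordings WHERE site_id = %s",
                (site_id,),
            )
            for recording_id, created_at in cur:
                # Hypothetical per-row processing step in the background job.
                print(recording_id, created_at)
    conn.close()
```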
The last thing I want to mention is how to handle deletes; we haven't spoken about those. At some point we realize that our database is getting full and we say, listen, we have loads of data which we haven't used in ages, can we delete it? First of all, deleting it all at once will not make Postgres too happy, and second of all, your users might not have been informed that this data might be removed. So planning your data retention policy upfront will help in this regard, because as soon as data is no longer covered by the retention policy — if the policy is one day, you can start deleting data which is at least one day old — you can spread the deletions out by doing them continuously, instead of having to do them all at once. OK.
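A sketch of spreading retention deletes over time, as suggested above: delete expired rows in small batches, one short transaction each, until nothing old enough remains (table, column and retention period are invented):

```python
# Sketch: incremental retention cleanup instead of one massive DELETE.
import time
import psycopg2

BATCH_DELETE = """
    DELETE FROM recordings
    WHERE id IN (
        SELECT id
        FROM recordings
        WHERE created_at < now() - interval '365 days'
        LIMIT 10000
    )
"""


def purge_expired(dsn):
    conn = psycopg2.connect(dsn)
    while True:
        with conn:                          # one short transaction per batch
            with conn.cursor() as cur:
                cur.execute(BATCH_DELETE)
                deleted = cur.rowcount
        if deleted == 0:
            break                           # nothing older than the policy is left
        time.sleep(1)                       # leave room for regular traffic and autovacuum
    conn.close()
```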
That was all from my end. I hope we have a little time for questions, but if not, feel free to drop by the Hotjar booth — we're mostly there during the breaks, because there are many interesting sessions we want to attend at this event — but feel free to grab us outside of the talks.

[Moderator] We do have time for questions.

[Question] Thank you, very interesting. Your abstract mentions sharding in Postgres; I was wondering what your experience with that was.

[Answer] OK, I left it out because of time, but I can speak a bit about that as well. When we came to think about how to expand our Postgres installation, one of the decisions we could have taken is sharding, because our data is basically organized by sites: all our features are site-oriented, heatmaps are specific to one site, recordings are specific to one site, so our site ID would be a very good key to shard by. However, when we actually came to the migration, we ran into performance problems, because transforming the data and changing its format was basically too much for our current Postgres installation. So the tip is: if you think your data will need sharding, think about it early, because the migration will bite you later.

[Question] You mentioned using JSON structures in Postgres. I'm using them as well, and I see problems with performance. Did you ever have the need to use something more document-oriented, like MongoDB?

[Answer] We use Postgres with JSON structures and, from a performance perspective, it has been fine for us. However, I also need to say that we don't create very nested structures, and when the API requests the JSON we're interested in, it requests the entire field; we don't search by things which are nested within it. So the answer I'm giving you is within this context. However, if you take a look at benchmarks, you'll see that Postgres and MongoDB are comparably performant in this regard, so no, we did not feel the need to use Mongo for this.

[Question, partly inaudible] As you get more successful and the data keeps growing, how do you deal with approaching these limits?

[Answer] First of all, it's a good problem to have — we tell ourselves the same thing whenever we need to scale out. Basically, take care of monitoring and keep tabs on what's going on, both in the database and in your web requests. When you see things getting a little slower, little by little, don't ignore it, because it may grow a bit too much at some point. You should start putting in monitoring which keeps tabs on the bottleneck, because the more you know about a bottleneck, the more you are in control of how to fix it, and it doesn't take you by surprise.

[Moderator] No more questions, so thank you once again.

Metadata

Formal Metadata

Title Optimizing queries for not so big data in PostgreSQL
Series Title EuroPython 2017
Author Mifsud, Stephanie
License CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You may use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor and the work or this content is also shared, in adapted form, only under the conditions of this license
DOI 10.5446/33803
Publisher EuroPython
Publication Year 2017
Language English

Content Metadata

Subject Area Computer Science
Abstract Optimizing queries for not so big data in PostgreSQL [EuroPython 2017 - Talk - 2017-07-13 - Arengo] [Rimini, Italy]

Hotjar's user recordings count above 400 million, with supporting tables containing 4.5 billion records. This 5TB of data fits nicely into Postgres and doesn't quite merit the full big data suite of tools. However, at the rate of 1000 recordings per minute, and an overall request rate of 750K per minute, the penalty of inefficient queries and updates can quickly cause nasty performance spikes if not thought out well. This talk is about the challenges we faced at the lower end of big data: the good decisions which helped keep our application running, and other lessons we had to learn the hard way.

Considerations for Database Design
- Design entities for the domain
- Balance normalization with performance
- Sharding later has big migration costs, consider designing for this early

Speak to the database from your Web Application
- Why use ORMs and at which level of abstraction?
- Stored Procedures are fast, should we have more of those?

Bringing data closer to the application
- Materialized Views
- Defer aggregations
- Application Level Caching

Handling Operational Troubles
- Explain (analyze, buffers) is your friend
- Detect and manage Index Bloat
- Reduce Deadlocks

Reducing Impact of Background Maintenance Jobs
- Keep impact on the database low with cursors and streaming
- Plan data retention policies early, so cleaning can be an ongoing process
