
Performance and Scalability Enhancements in PostgreSQL 9.2


Formal Metadata

Performance and Scalability Enhancements in PostgreSQL 9.2
Alternative Title
Performance Improvements in PostgreSQL 9.2
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

Bigger servers, bigger problems The upcoming PostgreSQL 9.2 release features a large number of performance enhancements by many different authors, including heavyweight lock manager improvements, reduced lock hold times in key hot spots, better group commit, index-only scans, better write-ahead log parallelism, sorting improvements, and a userspace AVC for sepgsql. In this talk I'll give an overview of what was changed, how it helped, lessons learned, and the challenges that remain.
OK, so when I say hello, that's my subtle cue that I'm about to start talking. Thank you all for coming out. This is my talk on performance and scalability enhancements in PostgreSQL 9.2. I told people that I was really thinking of attending Alexander's talk on regular expression searches in the other session, but I thought maybe I'd better come to this one; I was afraid I was going to be the only one here, in which case I wouldn't have had a very good time, so I'm glad to see you're all here. I just want to go into a little bit more detail about some of the things that the PostgreSQL community, including but not limited to me, was able to do in PostgreSQL 9.2 to improve performance and scalability. By way of introduction: I'm a PostgreSQL major contributor, and also a senior database architect at EnterpriseDB.
We improved performance in a lot of different areas in PostgreSQL 9.2. One of those areas, and the one that makes the best graph, is high-concurrency systems, with many users all trying to do things at the same time. But we had a variety of other performance improvements as well. Index-only scans are another feature I'll talk a little bit more about as we go through. Tom did some great work on parameterized plans. There was work done by a couple of different people, including Alexander, who is over in the other room, on indexing; several people worked on sorting, in an effort led by Peter; and there was also a lot of work on power consumption, although it's debatable whether power consumption qualifies as a performance improvement. And there are a couple of miscellaneous things I'll mention briefly at the end. Since this is a little bit more of a hacker-oriented conference, I've also added a few slides, compared to the previous time I gave this presentation, about some of the techniques that I found helpful as I was working on performance for this release: understanding what the performance problems were and which areas needed improvement. So for those of you in the audience who are PostgreSQL developers, or just generally interested in benchmarking, there's a little bit at the end about things I've learned about measurement over the course of this release. This is a graph that I made back in September of last year, and it essentially shows what happens when you hit PostgreSQL with lots of really small read-only queries: primary key lookups on a small table. You have basically one table, it's not that big, and you're just trying to do primary key lookups against it as fast as you possibly can,
and these results were taken on a 32-core AMD Opteron 6128 box. What you can see is that in PostgreSQL 9.1 the performance topped out at around 24 clients, degrades a bit after that, and then levels off at around 32 clients. This is a 32-core server, so what we would like to see, ideally, is performance going up in a straight diagonal line until it gets to 32 and then turning flat and staying flat across the rest of the graph, because obviously once you've saturated the server, which we'd hope to do with the minimum possible number of clients, you shouldn't see any improvement after that. In fact you can't quite achieve that, because if you pile on a massive number of clients, far larger than the number of CPUs you have, you're inevitably going to get some degradation; you can see that all of these lines slope off at varying rates as you go further out. In 9.1 it's a pretty gradual curve; that curve is very flat, not very high, but very predictable. The first patch that I committed to the 9.2 development branch was actually the one that got the biggest bang for the buck in terms of improving performance on this particular test; it got us from that blue line up to the green line. What was essentially happening is that on the blue line we were having a lot of lock contention while trying to figure out that we didn't have any lock contention. All these different backends were trying to access the same table and the same index at the same time, so they all needed to lock the table, but not in a way that conflicted with each other. They only needed a lock strong enough to guarantee that, for example, nobody else could drop the table while they were in the middle of reading it. They didn't need a really strong lock that excluded other people reading data from the table; they just needed a lock that excluded a very forceful operation like a drop or
an ALTER TABLE or something like that. So there was no actual locking conflict; there was no problem with all these backends proceeding at the same time. But they didn't know that, and they spent a lot of time trying to verify that no lock conflict was present. The patch basically made that common case, where the table locks don't conflict, cheap, and as you can see you get a pretty significant performance boost from that. The rest of the improvement, the red line, came from some further work in that vein, which I'll talk about a little bit more on the next slide. One of the things that's easy to miss on this slide is that you look at the improvement at 32 clients on a 32-core machine, a huge factor, and you might say, well, that's great, but I don't have a 32-core server. But if you look down here, even in the vicinity of 8 cores, even though it gets a bit lost on the scale of the graph, the gap between the blue line and the red and green lines is actually pretty significant. If you ran this on an 8-core server similar to this one, you're not going to see multiples, but you can definitely see double-digit percentage improvements, even on a box which is not that big by today's standards. The other thing to notice is that while that initial patch made the absolute numbers a lot higher on this test, you can also see that the green line slants down fairly quickly as you add additional clients, where the red line stays basically flat; beyond 32 clients the green line has a much steeper negative slope. I didn't really pay much attention to that until somebody found a case where the green line dropped down so fast that it fell below the blue line, and I said, hmm, I'd better see if I can figure out what's going on there. So we made a couple of different improvements that fixed that.
So again, here's how we did it. The main part was the first bullet point: adding a fast path to the lock manager that allows weak relation locks to bypass the main lock manager in most cases. We then extended that to allow virtual transaction ID locks to also bypass the main lock manager. And we did some optimization of something called invalidation messages, which are only sent when some DDL operation happens. We were spending a lot of effort checking whether there were any invalidation messages present, and the answer was almost always no; but we still spent a lot of energy checking, and now we spend much less. It's a very small improvement if you have only one client, but when you have lots of concurrent clients it starts to matter.
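As a rough illustration of the fast path described above (this uses the `pg_locks` view as it exists from 9.2 onward; the exact rows you see depend on your session), locks taken via the fast path are flagged in the `fastpath` column:

```sql
-- Weak relation locks taken via the lock manager fast path show
-- fastpath = true; locks that had to go through the main lock
-- manager show fastpath = false.
SELECT locktype, relation::regclass AS rel, mode, fastpath
FROM pg_locks
WHERE pid = pg_backend_pid();
```

Running this inside a transaction that has just done a simple SELECT on a table typically shows the table's AccessShareLock with `fastpath = true`.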
All right, I said something about scaling up to 32 cores; this graph was actually taken on a 64-core system, and was done later, in April. I posted it on my blog after I gave the first version of this talk, and it became my most popular blog post ever, which I found kind of ironic since I'd written it in ten minutes and really wasn't expecting it to interest anybody. But it turned out that people not being able to scale up to 64 cores was a real pain point, and being able to was pretty cool. Admittedly this is a very simple test, but it seems that if you've got the latest version of PostgreSQL, a recent version of Linux, and everything set up right, you can scale all the way up to 64 cores doing lots and lots of really small read-only queries. One interesting thing that you may have noticed on the other graph, and it's even more visible here, is this kind of dish shape between about 16 and 32 clients. You'd really like that to be a diagonal line: you're looking for linear scalability, and when you see scalability that starts out less than linear and then turns around and becomes miraculously more than linear, you feel a bit nervous about it. It turns out that the 64-core value here is something like 63.7 times the single-client value, so at the limit it is linear; but there's this place in the middle where it looks not quite linear. It's close, but it's not there. I mentioned this, and Greg Smith told me that he has seen the same effect on tests that he's run, and some other people subsequently showed me tests from different operating systems. I have reason to believe that this saddle shape between 16 and 32 clients is actually an artifact of the way the Linux scheduler does process scheduling, which I would like to prove, and I'd like to convince the Linux folks to see if they can do something about it; but that's where it stands for now. In addition to optimizing for high read concurrency, there are also some good optimizations in this release for high write concurrency. A colleague of mine at EnterpriseDB, at the time anyway, had this idea of taking some of the members of a data structure called PGPROC, moving them out of the main array, and putting them into a separate, smaller array. He said, if you take this very heavily accessed data and put it into a smaller array, you'll see a performance improvement. And I said, that's the most ridiculous thing I've ever heard in my life; why are you wasting your time; please pick a project that has some hope of success. It worked great. So I guess that just underscores that it's a good idea to actually test these things. And it turns out it sort of makes sense; I was just amazed that the effect is large enough to be noticeable, but it's very noticeable. I knew, but didn't fully appreciate, the importance of the way CPUs move data around between the different CPUs in a multi-CPU system in chunks of 64 bytes, 128 bytes, sometimes even 256 bytes; they're
called cache lines, and the entire cache line is moved from one CPU to another as the different CPUs access the data. So if you've got some really important data mixed into the same cache line with some unimportant data, you move the unimportant data around along with the important data, and it turns out that if the ratio of important data to unimportant data is low, it costs you something material in terms of performance. This is obviously not going to matter for data structures that are less heavily trafficked than our PGPROC structure, which gets very, very hot; but when you have something that gets that hot, it does matter quite a bit. So that was one of the biggest improvements we got, and a lot of credit goes to him for persisting with that idea in the face of my stubborn insistence that it had no chance of working. We also made various improvements, by a bunch of different people, around CLOG. CLOG is a data structure that we use to keep track of which transactions are committed, which transactions are aborted, and which transactions are still in progress. When a transaction commits, we have to go to CLOG and mark it committed, and when we need to know whether some transaction that we're interested in committed or aborted, we also have to consult CLOG for that information. So there was a series of optimizations there; this turned out to be another one of our big contention points. Simon Riggs had a patch that he had been cooking up, and when I posted some benchmark results showing that there was an issue in this area, he said, oh, I have a patch which fixes that problem; and indeed it did. So, the first bullet point in the second section here: we improved the responsiveness of the background process called the WAL writer, so that we get commit records for
asynchronously committed transactions flushed more quickly than we did before, and it turns out that if you're using asynchronous commit, that can actually improve performance significantly. We also increased the amount of memory used for caching CLOG: we previously had a hard-coded limit of 64K, and now there's an adaptive algorithm that can raise that value, so there's more memory available for this cache, which does help. We fixed what I would consider to be pretty much a bug in the SLRU buffer replacement algorithm, which could cause the whole system to grind to a halt with everybody waiting on an I/O to complete when that I/O was basically irrelevant to what anyone was actually trying to do; so we fixed that bug. Beyond that, we eliminated some redundant CLOG lookups during index scans, and Simon and Peter also worked on improving the piggybacking of WAL flushes, which results in better group commit behavior. There was a talk on that earlier today, so I won't belabor the point too much, but that was a very, very significant improvement for certain cases. And we also had an optimization to reduce the volume of write-ahead log information generated when copying data into a table.
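For reference, the asynchronous-commit behavior discussed above is controlled by a few settings in postgresql.conf. This is an illustrative config fragment, not a recommendation; the values shown are examples:

```
# postgresql.conf fragment (illustrative values only)
synchronous_commit = off     # async commit: COMMIT returns before the WAL
                             # flush; the WAL writer flushes the commit
                             # record shortly afterward
wal_writer_delay = 200ms     # how often the WAL writer wakes up
commit_delay = 0             # group-commit tuning knob, in microseconds
```

With `synchronous_commit = off`, a crash can lose the last few transactions but never corrupts the database, which is why it pairs naturally with the WAL-writer responsiveness work described here.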
This graph here doesn't reflect all of those optimizations; the results were done in February, so it's got the main ones, but it doesn't show the benefit of the group commit work, because this was done with asynchronous commit. You can see, mostly because I have a slow disk subsystem on the machine I was using to benchmark this (I don't have a battery-backed write cache or anything like that), that the throughput of the system is not linear; but between 9.1 and the in-progress version of 9.2 that I tested, at 24 or 32 clients, you have something close to a 50 percent improvement. This is just a straight pgbench test, the median of three 3-minute runs, to average out the noise. So clearly we still have a good ways to go before we really get this to scale linearly all the way out to as many CPUs as you can afford to buy, but I think we made some good progress in this release, and hopefully we'll continue to nibble away at it, an extra few thousand transactions per second at a time. The next graph is commit scalability; this shows the effect of the work that was done on group commit. I had a little trouble replicating this with stock pgbench, so this is a test that somebody pointed me at: it basically just does a single INSERT in a transaction by itself, so it's the most maximally commit-bound workload that you can come up with. You're unlikely to see this much benefit in a real-world use case, but on the other hand it really does show you how much better the new implementation is than the old implementation. The green line again is 9.1, and the red line is 9.2 as of March; there hasn't been much change in this picture since then. You can see that down at the
bottom of the graph, with smaller numbers of clients, the lines bunch together and there's not a huge benefit; but as you start to ramp up the number of clients, and particularly as you get to around 250 clients, which was the sweet spot on this particular machine, which has 16 physical cores and 64 hardware threads, you have a massive improvement in performance: something on the order of a factor of 6 or 8 times faster than it was in 9.1. So that's definitely significant.
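A minimal sketch of that kind of maximally commit-bound workload, written as a pgbench custom script (the exact script from the talk isn't shown; the table `t` here is hypothetical and must be created first):

```sql
-- save as insert.sql; assumes: CREATE TABLE t (x int);
-- run e.g.: pgbench -n -f insert.sql -c 250 -T 180
INSERT INTO t VALUES (1);
```

Because each transaction does nothing but one INSERT and one commit, throughput is dominated entirely by how efficiently commit records reach the WAL, which is exactly what the group-commit work targets.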
Let me pause and ask if there are any questions on anything that I've gone over so far. No? OK, apparently that was all very clear. Moving on.
Index-only scans are the other major performance feature in 9.2. Since 8.4, we've had this data structure called the visibility map, and basically what it does is store one bit for every 8K table page, so it's very, very small: an 8K page of visibility map covers half a gigabyte of table space. The visibility map bit is set only if we know that every tuple on that page is visible to all current and future transactions, at least until somebody modifies the page. The reason we didn't have index-only scans in earlier releases, even though we had this map, is that the visibility map was not crash-safe, and, as I recently discovered, it also had some rather subtle race conditions, so it was possible for the visibility map bits to be wrong. In those older releases we only used the visibility map as a way of accelerating vacuum, which it does very well; so if you're on a release prior to 8.4, I would definitely recommend that you upgrade, unless you hear the word "vacuum" and think, oh, I never have a problem with that. But in 9.2 we have properly flushed out all of the bugs that made the visibility map potentially inaccurate, and that means we can use it not only for vacuuming, where it's not a disaster if we occasionally fail to vacuum something that we should have, or, even less of a problem, occasionally vacuum something that we didn't need to, but also for answering queries, where we really do need to be sure that we are going to give people right answers all the time. The idea is pretty simple: if all the data we need to answer the user's query is available from the index tuple, then instead of reading the table page, we first probe the visibility map, and if we find that the page is all-visible, we don't actually need to read the table page; we can just return the data directly from the index tuple, because we know
that the tuple is going to be visible to our scan. But if we find the page is not all-visible, we have to go check the heap, because it could be that the tuple was inserted by a transaction that only recently committed, or was recently deleted, and might not be visible to our snapshot. So this is optimizing for a pretty common case: if you have a table that is very, very heavily updated all the time, you're not going to get much benefit from index-only scans; on the other hand, you may find that heavily updated tables have much worse performance problems in other areas anyway. We'll see how this works out. I think this version of index-only scans is definitely a big step forward over what we had in the past, but I also expect it to be fine-tuned over the next couple of releases before it really does all the things that we'd like it to do.
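The coverage arithmetic above (one bit per 8K heap page) can be checked directly; this is just illustrative arithmetic, evaluated by the server:

```sql
-- One visibility-map page is 8192 bytes = 65536 bits, and each bit
-- covers one 8192-byte heap page, so one VM page covers 512 MB of table.
SELECT pg_size_pretty(8192::bigint * 8 * 8192);
```

That ratio is why the visibility map is cheap enough to keep cached even for very large tables.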
So here's an example, a test case that I ran on my MacBook Pro, hardly a high-performance server; it's got 4 gigabytes of RAM. I usually set shared_buffers to 400 megabytes, mostly because it's a round number. I initialized a grossly oversized pgbench instance at scale factor 1000, which is far too large to test comfortably on a machine with only 4 gigabytes of RAM, and I ran a 5-minute select-only test with a handful of concurrent workers, which just hammers it with read-only primary-key lookups using the primary key index on the accounts table. Then I created a covering index: a second index on the table, alongside the primary key index, that includes not only the column being used to look up the value but also the column being fetched. The query is SELECT abalance FROM pgbench_accounts WHERE aid = some value, so I put both aid, which is what we're using for the lookup, and abalance into the second index. And interestingly, I found that the two indexes were exactly the same size: the single-column index and the two-column index both came out to 2,142 megabytes, which conveniently enough is smaller than RAM, whereas the whole dataset is larger than that. So the idea here is that by hitting only the indexes I'm able to fit the entire test case in memory, whereas if I had to access the table I would have to go to disk, and things would presumably be much slower. And what happened was: over 5 minutes, with the stock pgbench configuration I got 63 transactions per second, but with the covering index I got a bit over 300 transactions per second. Now, the thing about index-only scans is that I've seen widely varying test results with this, ranging from people who said it hardly helped them at all, to one case where a test was 700 times faster; I think he had a test case where the index fit in shared_buffers and the table didn't even
fit in RAM. So, as with many of these performance improvements, some people are going to see massive gains, other people are going to see no gain at all, and you might even be one of those unlucky people who hit some kind of regression where it's actually slower; hopefully not. You'll see widely varying results depending on exactly how you test. My traditional advice, when I give my planner talk, is that indexes help less often than you think they do, and I think that advice probably still applies; but now we at least have the potential for somebody to make a covering index and have it help, which had zero chance of working in any previous release. On the pg_upgrade question: yes, Bruce modified pg_upgrade so that when you upgrade to 9.2 you get new, empty visibility maps. That may not be great, in the sense that you'll have to do a full-table vacuum to rebuild them, but it seems better than your queries returning wrong answers. An index-only scan shows up in the EXPLAIN plan, and EXPLAIN ANALYZE will also tell you the number of times that we had to fetch a tuple from the heap in the process of doing it; so you can actually tell (a) whether you're getting an index-only scan and (b) whether it's managing to actually be index-only, or whether it's having to go check the heap anyway because the pages have been recently modified. One of the big weaknesses here is that autovacuum doesn't know that vacuuming makes index-only scans work better: it will vacuum your table because it knows that getting rid of dead tuples is worthwhile, but it will not vacuum your table because it knows that setting visibility map bits would make this optimization work better. So if you're relying on index-only scans, I'm a little concerned that
people will find themselves saying, oh, I'd better go run a manual vacuum in order to get the benefit, where that hasn't been necessary for a few releases because of improvements in autovacuum; unfortunately nobody had a chance to address that for this release.
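As a sketch of the covering-index setup described above (table and column names are the standard pgbench ones; sizes and timings will of course differ on your hardware):

```sql
-- pgbench's accounts table already has a primary key on aid.
-- Add a covering index that also includes the fetched column:
CREATE INDEX pgbench_accounts_cover
    ON pgbench_accounts (aid, abalance);

-- With the visibility map mostly set, this query can be answered
-- from the index alone:
EXPLAIN (ANALYZE)
SELECT abalance FROM pgbench_accounts WHERE aid = 12345;
-- Look for "Index Only Scan" and a low "Heap Fetches" count
-- in the output.
```

If "Heap Fetches" is high, the pages involved were not marked all-visible, and a VACUUM on the table is what sets those bits.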
Yes, so it turned out that the index on just aid was the same size as the index on aid and abalance; I believe that just means that there was padding space in the single-column index that got used by the second column, so we were actually wasting some padding space in the single-column index. But one thing to keep in mind is that this won't work out that way for everybody: in general, adding more columns to an index makes it bigger, and this is a big potential pitfall. If you make the index bigger than your table, in the worst case, then scanning the index doesn't necessarily save you much over just reading the table. So I have big hopes for this feature, but I wouldn't expect that it's just magic fast sauce you pour over your database to make it go like lightning. I think it's going to be a good tool for people who have the right workload, which means mostly read-only, and a dataset that is bigger than RAM. There are more and more people whose datasets don't fit in RAM, because datasets keep getting bigger; but there are still a lot of people who can fit their data in a terabyte of RAM, and you can buy that now. It's not cheap, but you can buy it. And there are even more people who can fit in 128 gigabytes of RAM, which is affordable for a lot more of us. So this is definitely not the be-all and end-all; it's nice, but it only applies in certain situations, and especially in this first release, where it's a little rough around the edges, you'll probably have to fiddle with it a little to figure out whether you can benefit. As for where the index blocks read by an index-only scan are stored: yes, those go through shared buffers, just as they would for a regular index scan. So the question of
how to set shared_buffers is a really good one that I don't have time to answer right now. I do have a long blog post on shared_buffers and wal_buffers, so go look at my post on that topic; it's got everything that I know, and if you know something that I don't, then please let me know, because I want to know more. Any other questions? So, on how the planner chooses: there's an index scan, there's a sequential scan, there's a bitmap index scan, and now we have this new way, which is called an index-only scan; we estimate the cost of each technique and we pick the one that we think is going to be cheapest. The existing costing model for index scans guesses how many index pages we're going to have to read and how many heap pages we're going to have to read, and uses that, with the costing variables, to estimate the cost. This is the same thing, except that the number of heap pages we think we're going to have to read is going to be smaller, because we have a new piece of information in pg_class which tells us what fraction of the blocks we think are all-visible, and we use that to estimate how many heap fetches the index-only scan will be able to skip. On the name: you're absolutely right, and the only thing I can really say in its defense is that there are other products out there that all have this exact same problem and still call it an index-only scan; it really should be called a hopefully-index-only scan. If it turns out not to be index-only, it's going to be worse: if we have to fetch all of the tuples anyway, then we also wasted the energy of checking the visibility map, so it will be slower; you can't do more work and have it be faster. I think this is a case where, as we get some experience with this
I think as we get some experience with this technology, we'll be better able to adjust the cost model to figure out which cases are going to win and which are going to lose, and hopefully patch up some of the cases that lose right now and make them work better later on. But it's just a complicated enough feature that getting it perfect in the first release would require more smarts than any of us have. [Question about tuning statistics targets.] That depends on the problem — I don't want to get into too many questions like that in this talk. I have a planner talk where I cover that kind of stuff, but this isn't it, so grab me afterwards. I'm afraid if I don't move on we won't get through the rest of the slides, and since I made the slides, I think they're good, and I'd like to have time to tell you what they say. So let's say we're done with index-only scan questions, and I'll talk about some other stuff, which you may think is totally boring — and maybe you're right. Let's head off in a little different direction.
OK. So Tom did some work on a feature called parameterized plans — and by "did some work" I mean he did basically all of it, more than everything I worked on in this release put together. He had help from a couple of other people, but this was basically Tom cranking away. And again, like some of the other things we've talked about, it's only going to benefit you in a limited set of cases, but if you hit one of those cases, you're going to be very happy. This is the same setup as before: pgbench at scale factor 1000. I came up with a pretty artificial test query here: a left join between generate_series(1, 10) and two copies of the pgbench_accounts table. What this is really intended to represent is a join between a small table — in this case generate_series(1, 10), a set-returning function — and two large tables that are joined to each other. Here it happens to be the same table joined to itself, for the sole purpose of making the planner have to think hard, but in general it could be any two large tables joined to each other. The important part is that you've got an inner join on the nullable side of a left join — and if you know what that means, you've probably hit this. You've got another join happening underneath the left join, maybe just because you have a view that looks something like this, or because you have complicated queries. In earlier releases, the only way we could execute this query was to join the two big tables to each other first, and then do the join against the small table afterwards. That turns out to be slow, because joining two big tables to each other in their entirety, just so you can pull out a very small fraction of the result, is not efficient. So Tom rejiggered things so that now, instead of joining the two big tables to each other once, in their entirety, we repeatedly join very small subsets of the big tables to each other and pull out just the rows we need.

So this is a somewhat artificial test case, but I think it actually is representative of the kind of improvement you can see when this kicks in in real-world situations. On 9.1, this query repeatedly took about 4 minutes and 10 seconds to execute on my laptop. On 9.2, the first execution took 580 ms, and after that it was a little faster still. So if you have these kinds of queries, where you've been hurting because you didn't have this optimization, this is huge. The reason we haven't had this optimization before now is, of course, that most people don't have these kinds of queries — that's why it got done last; not a big surprise there. But if you do have them, this is for you.

So here's the plan. On 9.1, we basically do a full index scan on both tables, a merge join between them, and then a merge join against the generate_series output. On 9.2, we scan the output of the set-returning function, and that drives a nested loop: we repeatedly index-probe each of the other relations. So in the first plan we end up reading the entirety of the big table twice — or at least a substantial fraction of it — whereas in the 9.2 plan we surgically fetch just the rows we need and ignore all the rest.
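The shape of the win can be shown with a toy simulation — this is just an analogy in Python, not how the executor actually works; the names and sizes here are invented:

```python
# Toy analogy (invented numbers, not executor code): why joining whole big
# tables first loses when an outer left join only needs a handful of rows.

N = 100_000
big_a = {k: ("a", k) for k in range(N)}   # stands in for an index on table A
big_b = {k: ("b", k) for k in range(N)}   # stands in for an index on table B
outer_keys = list(range(10))              # the generate_series(1, 10) side

# Pre-9.2 shape: join big_a to big_b in their entirety, then join the result
# to the small outer rel -- one probe per big_a row, most of it thrown away.
full_join_probes = len(big_a)

# Parameterized shape: for each outer row, index-probe both big tables and
# fetch only what's needed.
param_probes = sum(2 for k in outer_keys if k in big_a and k in big_b)

print(full_join_probes, param_probes)     # 100000 20
```

Roughly that ratio of wasted work to useful work is why the measured difference was minutes versus milliseconds.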
All right — next, some miscellaneous improvements in indexing. One of them, which I've actually seen come up in the field: if you have an expression of the form indexed-column operator — usually equals — ANY (array), we previously could not handle that with a plain index scan. You could get a bitmap index scan for that, but not a plain index scan, and it turned out there were cases where that mattered and the plain index scan would have won. We also get better selectivity estimation now for some of the operators: overlaps, contains, and is-contained-by. And we had a number of improvements to GiST indexing. GiST indexes now build more quickly than they did before, and they're of better quality. B-tree indexes don't really have a problem with quality — when you make a B-tree index, you basically just take all your data and put it in sorted order, and that's it. But GiST indexes are used for things like spatial indexing, where the way you index your data is that you try to group related items together, put bounding boxes around points that are in the same region of space, and put those data elements on the same page. That's the kind of thing that humans are good at and computers suck at, because it's hard: if you have a random set of points and no knowledge of how they're distributed, you have to figure out the most efficient way to group them into related areas — and especially if you have things with more extent than a single point, like circles or lines or polygons, you have to figure out a set of reasonably tight bounding boxes that will let you answer queries efficiently, and it becomes harder still. So we've had some improvements in the algorithm that not only decrease the index build time but also tend to result in a better-quality index, where you have to do less groveling around in the index to determine whether or not the key you're searching for is actually present, and on which pages it might be.
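To see why grouping quality matters, here's a tiny sketch of my own — illustrative only, not the actual GiST page-split code:

```python
# Toy sketch of the grouping problem GiST faces (my illustration, not the
# real picksplit algorithm): each index page is summarized by the bounding
# box of its entries; tight, well-separated boxes mean a search has to
# grovel through fewer pages.

def bbox(points):
    """Minimum bounding rectangle (x0, y0, x1, y1) of a set of points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))

def overlaps(a, b):
    """Do two rectangles (x0, y0, x1, y1) intersect?"""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

# Two well-separated clusters, one page each:
page1 = bbox([(0, 0), (1, 2), (2, 1)])
page2 = bbox([(10, 10), (11, 12), (12, 11)])

# A query box near the first cluster only has to visit one page.
query = (0.5, 0.5, 1.5, 1.5)
print(overlaps(page1, query), overlaps(page2, query))   # True False
```

If the points had been split badly across the two pages, both bounding boxes would overlap the query and both pages would have to be searched — which is exactly the quality problem the new build algorithm reduces.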
That's definitely cool. We also have a new index access method called SP-GiST, and that's pretty much everything I know about it — well, I know slightly more than that, but not much. As for its feature set: GiST in general can index anything you can imagine, if you can figure out how to index it; SP-GiST is a more specialized thing for indexing things that occupy a particular point in space. So you can't use SP-GiST to index lines or polygons or circles, but you can index points, because a point doesn't have any dimensionality to it. It can also be used for some text-search applications, though I have not yet been able to demonstrate a performance improvement in a text-searching case over GiST — so maybe there's more work to be done there, or maybe that's not really what it's intended for. I've heard it's pretty cool if you have the right case, but I haven't found one yet. [Question about the first item on the slide.] I don't think it really matters whether the index is multi-column or single-column; it's just a matter of whether you have a construct of that form — we can talk more about your specific case afterwards.

Moving on to sorting. This work was really something that many people were not that excited about, but Peter kept beating us over the head, saying this is a disgrace, sorting is too slow, we have to speed it up. So he got up and did it — with help from some other people, as is often the case. We have this new infrastructure called sort support, and it's basically a fast-path mechanism. We have some very nice, very general mechanisms in PostgreSQL, and another theme of the performance work in this release is that sometimes you throw your general mechanisms out the window and install some very specific hacks to alleviate problems in your hot spots. That's basically what was done with sorting in this release, with Peter leading the charge and really insisting that this was a problem we needed to care about. So we've seen some real improvements in sorting speed in this release. I was lazy and don't have much more detail on it, sorry.
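As a loose analogy for what stripping per-comparison overhead out of a hot loop buys you — this is illustrative Python, nothing like the actual C sort-support code:

```python
# Loose analogy (my illustration, not PostgreSQL's C code): routing every
# comparison through a generic dispatch layer is slow; letting the sort
# call the type's comparison directly is the flavor of "specific hack"
# that sort support adds.
import functools
import random
import timeit

data = [random.random() for _ in range(50_000)]

def generic_cmp(a, b):
    # Stands in for trampolining through a generic function-call
    # mechanism on every single comparison.
    return (a > b) - (a < b)

slow = timeit.timeit(
    lambda: sorted(data, key=functools.cmp_to_key(generic_cmp)), number=3)
fast = timeit.timeit(lambda: sorted(data), number=3)  # direct comparisons

print(f"indirect: {slow:.3f}s  direct: {fast:.3f}s")
```

Both paths produce the same sorted output; only the per-comparison overhead differs, and on a hot enough loop that overhead dominates.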
We did also have some improvements in power consumption. Peter was also the driving force behind getting our power consumption down; he did a lot of work on this, and Tom did some work on it too. In PostgreSQL 9.1 there are approximately 11.5 auxiliary-process wakeups per second, whereas — and I have to recast this from the last time I gave this talk, because the number has changed a bit — as of last night it's approximately 0.4 auxiliary-process wakeups per second on a system that's been idle for a few minutes. Now, if you're looking at this and wondering why you should care, that was my reaction too. But it turns out that if you're a big hosting provider and you have a lot of virtual machines floating around, with a lot of virtualized PostgreSQL instances floating around on them, and they're all randomly eating CPU for no reason even though the users aren't doing anything, it costs you money, and you don't like that, because you're probably a business and not just virtualizing things because you're generous. So this is pretty cool. There's an increasing number of people out there — Heroku is one example, and there are others — who are just spinning up zillions of copies of PostgreSQL, and having the system be able to quiesce down into a low-power state when it's not in use is a very good thing. This is an improvement of more than 20x. I was testing this last night on a virtual machine on my laptop — that's where the 0.4 number comes from — and the VMware tools daemon is now actually the top producer of wakeups on that machine when it's otherwise idle; there are not that many things left above PostgreSQL. I can guess that at some point five years from now, people are going to say: you have 0.4 wakeups per second? That's completely unacceptable — you can't have more than 0.04 wakeups per second. But for now, we're way off the list of things that the people who get upset about these kinds of problems — people like Red Hat, and other people running a lot of guest processes — are going to complain about.
Then there's a bunch of other random performance improvements, as there are in almost every release. We have an improved plan cache now, which reduces the danger of getting a bad query plan when using prepared queries. More work is probably still needed there, but we made a start on it in this release, and it's been a long-standing pain point for many users. sepgsql, which is the PostgreSQL-SELinux integration, now has a userspace access vector cache, which makes it only really slow, instead of slower than you can possibly believe. There's probably some more room for optimization there, but it's definitely better — you always pay some price for security, but it's smaller now. We've got faster array assignment in PL/pgSQL, due to some improved caching. And our spinlock implementation on x86-64 did not conform to the chip vendor's published best-practices document; now it does. That's about all I can think of in terms of performance improvements.
I'd like to take the last few minutes to talk about some things that I learned while working on some of these performance improvements, testing other people's performance improvements, and testing theories about what would or would not improve performance. I learned a lot during this release. I learned a lot from other people in the PostgreSQL community — Peter, Simon, Greg, Tom, a lot of people who pointed me at things I should be looking at — and I also learned a lot just by doing a lot of benchmarking, staring at the results, and going, why does it look like that? So here are some things I learned that you may find interesting.

First lesson: plain old pgbench is a pretty good test. In some cases it's a very artificial test, but sometimes artificial tests are good, because they take some particular aspect of the platform and stress just that. You can get a pretty good sense of how fast a machine is with a select-only pgbench test in about 5 minutes, but you really need to do something like 40-minute runs if you're doing a read-write test, because there's a lot more throughput variability in a write test than there is in a read test. I found it was pretty important to repeat every test about three times, so that you could identify outliers, because sometimes you get a result which is randomly higher or lower than the normal result for that test. On read-only tests there was very little variation from test to test in the rate at which transactions were processed — it basically just chugs along — but on write tests, the throughput as a function of time shoots up and down by huge amounts. You can use the pgbench -l option to record the latency associated with every transaction the server processes during the test, and I found that quite helpful, because you can then construct a graph that shows the number of transactions per second on the vertical axis and time on the horizontal axis, and you can actually see the throughput rate going up and down over the course of the test. There are a couple of problems with that, though. One is that you have to write a script to grovel through these enormous files that pgbench spits out. The other is that the files get so big that writing all this instrumentation actually lowers the measured performance of the system significantly; there's probably some room for improvement there. There's also a source-level option called LWLOCK_STATS: if you build with it turned on, it dumps out information about lightweight lock acquisitions and contention. I found that very helpful. Even better, I found, was counting how many times we had to spin to acquire the spinlock protecting each lightweight lock. For those in the audience who are not hackers: lightweight locks are a fixed-size set of locks that we use to protect shared data structures in PostgreSQL. They have a shared mode and an exclusive mode, their internal state is protected by a spinlock, and if you need to wait for one, you sleep on a semaphore. Lightweight locks are where most of our contention problems actually are, and unfortunately we don't have a good set of user-visible tools to see where that contention is happening; LWLOCK_STATS is something you can turn on for debugging purposes that can track it down.
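For what it's worth, the kind of log-groveling script I mean looks roughly like this — a minimal sketch, assuming the six-column per-transaction line format that pgbench's -l option wrote in this era; check your version's documentation for the exact layout:

```python
# Minimal sketch of a pgbench "-l" log groveler (a throwaway-style script
# of my own, not something that ships with PostgreSQL). Assumed format:
#   client_id  xact_no  latency_us  file_no  epoch_sec  epoch_usec
from collections import Counter

def tps_by_second(lines):
    """Bucket transactions by the epoch second they completed in."""
    buckets = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue                  # skip malformed/partial lines
        buckets[int(fields[4])] += 1  # field 5: completion time, epoch sec
    return dict(sorted(buckets.items()))

sample = [
    "0 1 1500 0 1339000000 100000",
    "1 1 2100 0 1339000000 200000",
    "0 2 1800 0 1339000001 300000",
]
print(tps_by_second(sample))          # {1339000000: 2, 1339000001: 1}
```

Feed each (second, count) pair to gnuplot or the like and you get exactly the throughput-over-time graph described above.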
On profiling: I found that CPU profiling with gprof was useless, because it has too much overhead — the profiling itself perturbs the workload way too much. perf, on the other hand, is a newer Linux tool, and it has very little overhead, almost none, which is amazing. I have no idea how it's possible to gather call-graph information on a 64-core system running full tilt without slowing the system down, but apparently somebody out there is smarter than me, because it works. The big problem with just plain old perf report is that it's not a really great way to measure scalability, because typically when PostgreSQL isn't scaling well, it's because you've got lock contention, and processes that are asleep aren't using CPU time — so the CPU profile looks exactly the same as it would if you didn't have a problem, and you can't find anything that way. But if you profile on context switches instead — something like perf record -e cs -g — you can actually see which call paths are causing the system to go to sleep, and that tells you where your lock contention is. I'm not sure how useful that is to a user, but for a PostgreSQL developer looking for things to optimize, it was pretty great. I also wrote a bunch of throwaway custom instrumentation, which is just: take a random piece of code you're wondering about — why is this code slow? — sprinkle some gettimeofday() calls in there, subtract, and then drill down another level, and repeat.
Some of the more general lessons I learned, apart from the specifics of testing: extreme workloads like pgbench are not necessarily a great way of answering the question "how will this perform on my real-world workload?", because pgbench is artificially extreme in some particular way — not necessarily the most demanding workload overall, but a read-write pgbench run is pretty extreme in terms of write volume. Often the result will be worse — much worse, in some cases — than what you'd see in real-world applications. For development, though, it's great, because it exacerbates the problems that real users see and turns them into even larger problems, which can be seen more clearly in testing. A lot of problems are quite easy to fix — not all, but many problems have easy fixes once you understand what's really happening. But there were a number of cases where it took me months and months of testing and fiddling with different things and trying different approaches before I actually understood what the real problem was. In at least one notable case, I committed a fix for the problem, and it turned out the problem was imaginary and the fix was working for a largely unrelated reason, which I figured out three months later when I fixed the real problem. So figuring it out is often the hard part. It's very useful to measure system performance in as many ways as you can: not just TPS — that's useful — but also latency, lock-wait frequency and duration, system calls. You find different problems when you look at system performance in different ways, and it's sometimes hard to choose: the most important problem from a TPS perspective is often not the most important problem from a latency perspective. It's not always easy to decide what to work on, but the more data you can get, the better off you are. One other kind of interesting developer-level thought: these LWLocks we have — I have a feeling that they're actually very poorly suited to many of the synchronization tasks that we need to do inside PostgreSQL, though I don't really have a theory yet on what a good replacement would look like. So: what's next?
WAL insertion is still mostly single-threaded. I don't have time to talk about this in detail because I'm out of time, but there's a graph in one of the older posts on my blog that shows it pretty clearly: you get much better scalability when your workload fits in shared_buffers, and scalability deteriorates significantly when the workload does not fit in shared_buffers, because WAL insertion is single-threaded. It's particularly nasty just after a checkpoint. I'm thinking this is something that's going to get fixed in 9.3, since Heikki is cooking up a patch for it. Second, the fsync system call can take an amazingly long time to complete, even on a single, really small file, so every place in the system where we issue fsyncs is a potential cause of performance problems, and we need to spend some more time smoothing those out. And a final point: there's a lock called ProcArrayLock, which is taken during snapshot acquisition and also during transaction commit. We've fixed the worst of the other bottlenecks, and I think if we get the WAL insertion problem fixed, this one is going to move so far ahead of everything else, in terms of how hot it is, that we can just forget about any other performance work until we figure out a solution to it. That's all I've got, and I'm out of time, so I'm not going to ask for questions in the big group here, because we need to make way for the next talk — but I'll hang around up front afterwards if people have questions. Thanks.