Performance and Scalability Enhancements in PostgreSQL 9.2
PGCon 2012
Transcript: English (auto-generated)
00:22
Okay. Am I still muted? No. Okay. So when I say hello, that's my subtle cue that I want to start talking now. Hello! All right. So thank you all for coming.
00:41
This is my talk on performance and scalability enhancements in PostgreSQL 9.2. I told people that I really was thinking of attending Alexander Korotkov's talk on regular expression searches this session. They told me they thought maybe I'd better come to this one.
01:00
I was afraid I was going to be the only one here because I think that's probably going to be a pretty good talk. But I'm glad to see you're all here. So I just want to go into a little bit more detail about some of the things that the PostgreSQL community, including me but not limited to me, were able to do in Postgres 9.2 to improve performance and scalability.
01:28
So that's what I'm going to do. By way of introduction, my name is Robert Haas. I'm a PostgreSQL major contributor and PostgreSQL committer and also a senior database architect at EnterpriseDB.
01:43
So we improved performance in a lot of different areas in PostgreSQL 9.2. One of the areas, and the one that makes the best graphs, is in the area of concurrency, high concurrency, systems with many CPUs all trying to do things at the same time.
02:06
But we had a variety of other neat performance improvements as well. Index-only scans are another feature which I'll talk a little bit more about as we go through. Tom Lane did some great work on parameterized plans.
02:22
There has been some work done by a couple different people, including Alexander Korotkov, who's talking in the other room, on indexing. Several people worked on sorting. That was led by Peter Geoghegan, who's back there in the back. He also drove a lot of the work on power consumption along with Tom Lane, which I guess it's possible to
02:43
argue about whether power consumption qualifies as a performance improvement, but I thought it was neat, so I threw it in. And a couple of miscellaneous things I'll mention briefly at the end. This is a little bit more of a hacker-oriented conference. I've also added a few slides to the end from the previous time I gave this
03:00
presentation, talking about some of the techniques that I found helpful as I was working on performance for this release and kind of understanding what the performance problems were and what areas needed improvement. So, well, for those of you in the audience who are PostgreSQL developers or just generally interested in benchmarking, I've got a little bit
03:22
at the end about some things that I've learned about measurement over the course of this release, so I'll talk about that at the end. So this is a graph that I made back in September of last year, and this is essentially showing what happens when you hit PostgreSQL with lots of really small queries.
03:47
These are read-only queries, they're doing primary key lookups on a small table. So you have basically one table, it's not that big, and you're just trying to do primary key lookups on it really, really fast.
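For reference, the transaction that a select-only pgbench run issues is essentially just this one lookup, with :aid being a random account number that pgbench picks for each transaction:

  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;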
04:03
These results were taken on a 32-core AMD Opteron 6128 box, and what you can see is that in PostgreSQL 9.1, the performance tops out around 20 cores. And after that, it degrades a bit, and then, at higher client counts, it levels off.
04:23
This is a 32-core server, so what we would like to see ideally is performance go up on a straight diagonal line until it gets to 32, and then turn flat and stay flat across the rest of the graph. Because obviously, once we've used up all the CPUs in the server, which we would hope to do with the minimum possible number of clients, we're not going to see any improvement after that.
04:46
And in fact, you can't really achieve that because if you pile on a massive number of clients that is far larger than the number of CPUs you have, inevitably you're going to have some degradation. So you see that all of these lines slant down at varying rates as you go further out.
05:01
In 9.1, it's a pretty gradual curve. This curve is very flat, actually. It's not very high, but it's very predictable. You're going to be right about there. So the first patch that I committed to PostgreSQL 9.2 Devel was actually the one
05:22
that got the biggest bang for the buck in terms of improving performance on this particular test. And it got us from that blue line there up to the green line there. And what essentially was happening there is that on this blue line, we were having lock contention while trying to figure out that we didn't have lock contention.
05:46
So all of these different threads were trying to access the same table and the same index on that table at the same time. And so they all needed to lock the table, but not in a way that conflicted with each other. They all needed a lock that was strong enough to guarantee that, for example,
06:02
nobody else could drop the table while they were in the middle of reading it. So they didn't need a real strong lock. They didn't need a lock that excluded other people trying to read data from the table. They just needed a lock that excluded a very forceful operation like a DROP or an ALTER TABLE or something like that.
06:21
So there was no actual locking conflict. There was no problem with all of these people proceeding at the same time, but they didn't know that. They spent a lot of time trying to verify that no lock conflict was present. So the fast locks patch basically took that common case, where the table locks don't conflict, and made it uncontended.
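If you want to see this in action on a 9.2 server, the pg_locks view grew a fastpath column, so a quick look at your own session's locks, purely as an illustration, is something like:

  SELECT locktype, relation::regclass, mode, fastpath
  FROM pg_locks
  WHERE pid = pg_backend_pid();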
06:41
And as you can see, you get a pretty significant performance boost out of that. The red line came from some further improvements in that vein, which I'll talk about a little bit more on the next slide. One of the things that I think is easy to miss on this slide is, you know, you look at the improvement here at 32 clients on this 32 core machine, and it's a factor of more than three.
07:06
So it's a huge improvement. And you might say, well, that's great, but I don't have a 32 core server. But if you look down here, even in the vicinity of eight cores, even though it gets kind of lost in the scale of the graph, that bump between the blue and the red and the green is actually pretty significant.
07:22
So, you know, Jignesh actually did some testing on an eight-core server that was similar to this. And, you know, you're not going to see multiples, but you're definitely going to see a double-digit percentage improvement, even on an eight-core box, which is not that big by today's standards anymore. The other thing to notice about that green line is that the initial patch made the absolute value go a lot higher on this test.
07:47
But you can also see that this green line is slanting down scary quickly as you add additional clients. This blue line is basically flat. The green line has a much deeper negative slope beyond 32 cores.
08:01
And I didn't really pay much attention to that until somebody found a case where they actually made this green line drop down so fast that it went below the blue line. And I said, oh, I probably better see if I can figure out what's going on there and fix it. So we made a couple different improvements that got past that. So again, this is how we did it.
08:21
The main part of it was that first bullet point, adding a fast path to the lock manager to allow weak relation locks to bypass the main lock manager in most cases. And then we extended that to allow virtual transaction ID locks to also bypass the main lock manager. And we did some optimization of something called shared invalidation messages, which are only sent when some kind of DDL operation happens.
08:46
And we were spending a lot of effort checking whether there were any shared invalidation messages present. And the answer was almost always no, but we still spent a lot of energy checking on that. So we now spent much less energy checking for shared invalidation messages.
09:04
Very small improvement if you've only got one client, but when you have all this concurrency, it starts to matter. All right. Did I say something about scaling out the 32 cores? So this graph actually was taken on a 64 core system.
09:22
And you can see that it was done later. It was done in April. I posted this on my blog after I gave the first version of this talk. And it became my most popular blog post ever, which I found kind of ironic since I'd written it in 10 minutes and hadn't even spell checked the thing. And really wasn't expecting anybody to be that impressed.
09:44
But it turned out to be people thought being able to scale out to 64 cores was pretty cool. Admittedly, this is a very simple test. But it seems that if you've got the latest version of Postgres and the latest version of Linux and you got everything set up right, you can get all the way out to 64 cores and do lots and lots of really small read-only queries very, very quickly.
10:06
One kind of interesting thing that you may have noticed on the other graph, and it's even more visible here, is you've got this kind of dish shape here between 16 and 32 clients. And you'd really like that to be a diagonal line. I mean, you're looking for linear scalability, not scalability that starts out being
10:24
less than linear and then it turns around and it's miraculously more than linear. You kind of feel nervous about that. I mean, it turns out that the 64 core value here is something like 63.7 times the single core result. So in the end, at the limit, it's linear, but there's this place in the middle where it looks kind of not that linear.
10:45
It's close-ish, but it's not there. I mentioned this and Greg Smith mentioned to me that he's seen this effect on tests that he's run as well. And I subsequently, some other people showed me some tests from a different operating system.
11:03
And I have reason to believe that this saddle shape here in between 16 and 32 cores is actually an artifact of the way that the Linux scheduler does process scheduling, which I would like to prove and I'd like to, and if it is that, I'd like to convince the Linux guys to see if they can do something about that.
11:24
But that hasn't happened yet. So in addition to optimizing for high read concurrency, there were also some good optimizations in this release for high write concurrency.
11:47
A colleague, at that time anyway, of mine at Enterprise DB, a guy named Pavan, had this idea of taking some of the members of a data structure called the pgproc data structure
12:03
and moving them out of the main array and putting them in a separate array, which would be smaller. And he said, well, you know, if you take this very heavily accessed data out of this big array and put it into a smaller array, you'll see a performance improvement. And I said, that's the most ridiculous thing I've ever heard of in my life.
12:25
Why are you wasting your time on that? Please pick a project that has some hope of success. It works great. It's fantastic. So I guess that just underscores to me that it's a good idea to test these things. It turns out, I mean, it sort of makes sense.
12:42
I was just amazed that the effect was actually large enough to be noticeable, but it's very noticeable. It turns out that, you know, and I knew this, but I didn't appreciate the importance of it. CPUs move data around between the different CPUs in a multi-CPU system in chunks of like 64 bytes or 128 bytes or sometimes even 256 bytes.
13:07
They're called cache lines, and the entire cache line is moved as a unit from one CPU to another CPU as the different CPUs access the data. So if you've got your really important data mixed into the same cache line with some
13:21
unimportant data, you have to move around all that unimportant data along with the important data. And it turns out that if the ratio of important data to unimportant data is low enough, it costs you something material in terms of performance. This is obviously not going to be true for data structures that are less heavily trafficked than our PGProc data structure, which gets very, very hot.
13:44
But when you have something that gets that hot, then it matters quite a bit. So that was one of the biggest improvements that we got, and a lot of credit to Pavan for persisting with that idea in the face of my stubborn insistence that it had no chance of working.
14:05
We also did some improvement, some various improvements. There are a bunch of different people who worked on this. I was one, and there were a few others who I'll mention, around CLOG.
14:22
CLOG is a data structure that we use to keep track of which transactions are committed and which transactions are aborted and which transactions are still in progress. So when a transaction commits, we have to go to CLOG and mark it committed. And when we need to know whether some transaction that we're interested in committed or didn't, we also have to consult CLOG for that information.
14:47
So there were a series of optimizations, and this turns out to be another one of our big contention points. So Simon Riggs had a patch that he had been cooking up, and I posted some benchmark results showing that there was an issue in this area.
15:05
And he said, oh, I have a patch which fixes that problem, and sure enough, it did. So that's the first bullet point in that second section there. We improved the responsiveness of a process called the WAL writer so that we get commit records for asynchronously committed transactions on disk more quickly than we did before.
15:24
And it turns out that if you're using asynchronous commits, that can actually improve performance significantly. We also increased the number of CLOG buffers. We previously had a hard-coded limit of 64K of memory for caching that.
15:41
Now it's an adaptive algorithm, but it caps at four times that value. So there's more memory available for this cache, which does help. We fixed what I would consider to be pretty much a bug in the SLRU buffer replacement algorithm, which could cause the whole system to grind to a halt with everybody waiting on an
16:02
IO to complete when that IO was basically irrelevant to what anyone was actually trying to do. So we fixed a bug in that algorithm. We eliminated some redundant CLOG lookups during index scans. Simon and Peter and Heikki all worked on improving the piggybacking of WAL flushes, which results in better group commit behavior.
16:27
There was a talk on that earlier today, so I won't belabor the point too much. But that was a very significant improvement for certain cases. We also had an optimization to reduce the volume of write-ahead log information
16:46
generated when you copy data into a table, which was done by my colleague Heikki. So this graph here doesn't reflect all of those optimizations. The results were done in February.
17:02
This has got the main optimizations in it. It doesn't show the benefit of the group commit, because this was done with asynchronous commit, mostly because I have a slow disk subsystem on the machine I was using to benchmark this; I don't have a flash-backed write cache or anything like that. But you can see that the throughput of the system is not linear, but it is better.
17:25
So you can see that between 9.1 and this in-process version of 9.2 that I tested, at 32 clients, you have something close to a 50% improvement on this.
17:42
This is just a straight pgbench test, a median of three 30-minute runs to average out the noise. And clearly we still have a ways to go before we really get this to scale linearly all the way out to as many CPUs as you can afford to buy.
18:02
But we're getting there. I think we made some good progress in this release and hopefully we'll continue to nibble away at it and an extra few thousand transactions per second is pretty nice. This graph is commit scalability. So this is showing the effect of the work that was done on group commits.
18:23
I had a little trouble replicating this with stock pgbench. So this is a test that Peter pointed me at. It basically just does a single insert in a transaction by itself. So it's the maximally commit-bound workload that you can come up with. So you are unlikely to see this much benefit in a real-world use case.
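If you want to approximate that kind of workload yourself, a custom pgbench script containing nothing but a single-row insert does it; the table name here is just a hypothetical example, created beforehand with CREATE TABLE commit_test (id int):

  -- insert.sql, run with something like: pgbench -n -f insert.sql -c 250 -T 300
  INSERT INTO commit_test (id) VALUES (1);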
18:44
But on the other hand, this really does show you how much better the new implementation is than the old implementation. This green line again is 9.1 and this red line is PostgreSQL 9.2 as of March, and I don't think much has changed in this picture since then.
19:02
So you can see that down with smaller numbers of clients, things are mushed together here at the bottom end of the graph. There's not a huge benefit, but as you start to ramp up the number of clients, and particularly as you start to get up to like 250 clients, was kind of the sweet spot on this particular machine, which is 16 physical cores, 64 hardware threads out at around 250 clients.
19:28
I mean, you just have a massive improvement in performance, something on the order of a factor of six or eight times faster than it was in 9.1. So that's definitely pretty nice.
19:46
Let me ask if there's just any questions on anything that I've gotten over so far before I sort of plunge into the next section. Questions? Okay. Either that was very clear or everyone's asleep.
20:03
Okay. Index-only scans, another major performance feature in 9.2. Since 8.4, we've had this data structure called the visibility map. And basically what it does is it stores one bit for every 8K table page.
20:22
So it's very, very small. An 8K page of visibility map covers half a gigabyte of table space. And the visibility map bit is set only if we know that every tuple on that page is visible to all current and future transactions, at least until somebody modifies the page.
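(As a quick sanity check of that arithmetic: an 8K visibility map page holds 8192 × 8 = 65,536 bits, and 65,536 heap pages of 8K each come to 512 megabytes, which is half a gigabyte.)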
20:45
So the reason why we didn't have index-only scans in earlier releases, even though we had this, is because the visibility map was not crash safe. And it actually, as I recently discovered, it also just had some plain old race conditions.
21:02
So it's possible for the visibility map to have errors. And in those older releases, we only use the visibility map as a way of accelerating vacuum, which it does very well. So if you're on a release prior to 8.4, I would definitely recommend that you upgrade, unless you hear the word vacuum and you're like, oh, yeah, I never have a problem with that.
21:24
But in 9.2, we have hopefully flushed out all of the bugs that made the visibility map potentially inaccurate. And that means that we can use it not only for vacuuming, where it's not a disaster, if we occasionally fail to vacuum something that we should have vacuumed,
21:42
or even less of a problem if we occasionally vacuum something that we didn't really need to. It means we can actually use this for answering queries, where we really need to be sure that we are going to only give people right answers all the time. So the idea is pretty simple. If all the data we need to answer the user's query is available from the index tuple,
22:04
then instead of reading the table page, we just probe the visibility map. And if we find that the page is all visible, then we don't actually need to read the table page. We can just return the data directly from the index tuple, because we know that that tuple is going to be visible to our scan.
22:25
But if we find that the page isn't all visible, then we have to go check, because it could be that that tuple was inserted by a transaction that aborted, or it's recently been deleted, or it's recently been inserted, and it might not be visible to our MVCC snapshot. So this is optimizing for a pretty common case,
22:42
where you have a table that's fairly static, right? If you have a table that's very, very heavily updated all the time, you're not going to get much benefit from index-only scans. On the other hand, you may have much worse performance problems in other areas. So we'll see how this works out.
23:01
I think this version of index-only scans, I think it's definitely a big step forward over what we've had in the past. I also expect there to be fine-tuning over the next couple of releases before we really get it to do all of the things that we'd like it to do. So here's an example.
23:21
This is a test case that I whipped up on my trusty MacBook Pro here, a high-performance machine, I assure you. It's got 4 gigabytes of RAM. I usually set shared_buffers to 400 megabytes, mostly because it ends in a couple of zeros, and I like round numbers. I initialized a grossly oversized pgbench instance, scale factor 1000,
23:48
which is far too large to test with comfortably on a machine with only 4 gigabytes of RAM. And I ran a five-minute select-only test with eight concurrent workers,
24:04
and it's just banging it with read-only primary key lookups using the primary key index on the table. Then I created a covering index. I created a second index on the table alongside the primary key index that includes not only the column that is being used to look up the value,
24:25
but also the column that's being returned. So this query is SELECT abalance FROM pgbench_accounts WHERE aid equals some value. So I put both aid, which is what we're using for the lookup, and abalance into this second index. And interestingly, I found out that the two indexes were exactly the same size.
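Concretely, the covering index was something like this; the index name is my own, but the columns are the standard pgbench ones:

  CREATE INDEX pgbench_accounts_covering
      ON pgbench_accounts (aid, abalance);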
24:44
The single-column index and the two-column index were the same size down to the byte. They were each 2,142 megabytes, which, conveniently enough, is smaller than RAM, whereas the size of the whole dataset is larger than RAM. So the idea here is I'm setting up a test case
25:02
where hopefully by only needing to access the indexes, I'm able to fit the entire test case in memory, whereas if I had to access the table, I would have to go to disk, and things would presumably be much slower. So what happened? Well, it was about five times faster. With the default pgbench configuration, I got 63 transactions a second.
25:23
With the covering index, I got 302 transactions per second. Now, the thing about index-only scans is that I've seen widely varying test results with this, ranging from people who said, this hardly helped me at all,
25:40
to one case where Thom Brown said, yeah, my test case was 700 times faster. I think he had a test case where the index fit in shared buffers, and the table didn't even fit in RAM. So, you know, as with many of these performance improvements, some people are going to see massive gains,
26:01
other people are going to see no gains at all. You might even be one of the really lucky people who hit some kind of regression, and it's actually slower. But hopefully not. Bruce? Yes.
26:34
Yeah, that's true.
26:51
Fair enough. Okay, so yeah, you'll see widely varying results in this,
27:00
depending on exactly how you test. You know, my traditional advice to people when I give my query planner talk is, indexes are slower than you think they are, and I think that advice probably still applies. But now we at least have the potential for somebody to make a covering index and have that help, which had zero chance of working in any previous release,
27:22
so that's kind of cool. Yeah? You mentioned that the new- Yes.
27:44
No, because I made Bruce modify pg_upgrade so that when you pg_upgrade to 9.2, it nukes your visibility map, which may not be great in terms of the fact that you'll have to do a full-table vacuum of all of those tables, but it seems better than having your queries start returning wrong answers.
28:05
Awesome. Yes, index-only scans are shown in the explain plan, and explain analyze will also tell you the number of times that we had to fetch a tuple from the heap in the process of doing the scan,
28:24
so you can actually tell A, whether you're doing an index-only scan, and B, whether it's managing to actually be index-only, or whether it's having to go and check the heap anyway because the pages have been recently modified and you haven't vacuumed yet. One of the big weaknesses here is that AutoVacuum doesn't know
28:41
that vacuuming makes index-only scans work better, so AutoVacuum is only going to vacuum your table because it knows that getting rid of dead tuples is important. It's not going to vacuum your table because it knows that setting visibility map bits to make this optimization work better is also important,
29:02
so if you're relying on index-only scans, I'm a little concerned that we may find that people say, oh, I've got to schedule manual vacuums in order to get the benefit out of this, where that hasn't been necessary for a few releases because of improvements in AutoVacuum. Unfortunately, nobody had enough round tuits to address that for this release.
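Coming back to the EXPLAIN ANALYZE point for a moment, the output looks roughly like this when the optimization is actually kicking in (abbreviated and illustrative, not captured from a real run):

  EXPLAIN ANALYZE
  SELECT abalance FROM pgbench_accounts WHERE aid = 12345;
  -- Index Only Scan using pgbench_accounts_covering on pgbench_accounts ...
  --   Index Cond: (aid = 12345)
  --   Heap Fetches: 0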
29:24
Can you tell us about the index sizes? Yeah, so it turned out that the index on just aid was the same size as the index on aid and abalance. I believe that just means that there was padding space
29:42
in the single-column index that got eaten up by that second column, so we weren't actually losing anything; we were just wasting some padding space in the single-column index. Of course,
30:00
in some other cases it might be that the index size will be different when we add a column. Oh, yeah, I think one of the other things that's going to cause this to not work for some people is the fact that adding more columns to the index makes it bigger. And this has been a huge pitfall of every Postgres release in recent memory: if you make your index bigger than your table, in the worst case,
30:24
then scanning the index doesn't necessarily save you that much over just reading the data out of the table. So I think, you know, I have big hopes for this feature, but I wouldn't expect that this is just magic go-faster sauce,
30:40
where you just like, you know, you just like pour it out and your database is like lightning speed. I think it's going to be a good tool for people who have the right workload, which means mostly read-only, and a data set that's bigger than RAM. And there are more and more people who have data sets that don't fit in RAM
31:00
because data sets keep getting bigger, but there's still a lot of people who can fit their data in a terabyte of RAM, and you can buy that now. It's not cheap, but you can buy it. There's even a lot of people who can fit it into 128 gigabytes of RAM, which even more people can afford to buy. So there are, you know,
31:22
this is definitely not the be-all and the end-all. It's nice, but it's only going to apply in certain situations, and you're going to have to, especially in this first release, where it's still a little rough around the edges, you're probably going to have to fiddle with it a little to figure out whether you can get a benefit and how big it's going to be.
31:44
So the index blocks that the index-only scan needs to look at, yes, those are stored in shared buffers, just as they would be for a regular index scan. So the question of how to set shared buffers
32:06
is a really good one that I don't have time to tell you about right now, but I do have a long blog post on how to set shared_buffers and wal_buffers. So go to rhaas.blogspot.com and look at my post on this topic, and it's got everything that I know.
32:22
And if you know something that I don't, then please let me know, because I'd love to know more than I do. Yep. Yes. Any other questions?
32:46
What you have mentioned is that if the index is there, people will try to go to the index-only scan? Well, so just like every other scan type, there's an index scan, and there's a sequential scan,
33:00
and there's a bitmap index scan, and now we have this new fourth way, which is called an index-only scan. And we just estimate the cost of each technique, and we pick the one that we think is going to be cheapest. So our existing costing model for index scans guesses how many index pages we're going to have to read,
33:21
and how many heap pages we're going to have to read, and it uses that with the costing variables to estimate the cost. This is the same thing, except that the number of heap pages that we think we're going to have to read is going to be smaller, because we have a new piece of information in pg_class, which tells us what fraction of the blocks we think are all visible,
33:44
and we use that to estimate how many heap pages we're going to get to skip. So our index-only scans are not totally index-only? See, the thing about this is you're absolutely right,
34:03
and the only thing I can really say in defense of the name is there's other products out there, and they all have this exact same problem, and they still call it index-only scans. So it really should be called hopefully index-only scans, because if it turns out not to be index-only, it's going to be worse, right?
34:22
Like, if it turns out that we have to fetch all of the heap tuples, and we also wasted this energy checking the visibility map, which says, oh, you have to do the regular thing anyway, it's going to be slower. You can't do more work and have it be faster. So I think this is where, as we get some experience with this technology,
34:44
we'll be better able to adjust the costing model to figure out which cases this is going to win and where it's going to lose, and hopefully patch up some of the cases that lose right now and make them win later. But it's just a complicated enough feature that to think we were going to get it perfect the first time through would require more smarts than any of us have.
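Incidentally, the all-visible fraction the planner consults is kept in pg_class, in the relallvisible column, so you can look at it yourself with something like:

  SELECT relname, relpages, relallvisible
  FROM pg_class
  WHERE relname = 'pgbench_accounts';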
35:13
It depends on what the problem is.
35:23
So, again, I don't want to get into tuning questions in this talk. The question is about tuning statistics targets. I have a query planner talk where I talk about that kind of stuff, but that's not this talk, so I'd be happy to catch up with you afterwards. But I'm afraid if I don't move on, we won't get through the rest of the slides.
35:41
And since I made the slides, I think they're good, and I want to have time to tell you what they say. So, if it's okay with everybody, let's defer any more index-only scan questions to the end, and I'm going to talk about some other stuff, which you may think is totally boring, and maybe you're right, but hey, I made the slides, so let's go.
36:01
Oh, wrong direction. Wait, what's happening? Okay. So, Tom Lane did some work on a feature called parameterized plans, and when I say he did some work, I actually mean he did all of it. A lot of the stuff I worked on this release, I had help from a couple of other people, but Tom basically just cranked this one out himself.
36:22
And again, like some of the other things we've talked about, it's only going to benefit you in a limited set of cases, but if you hit one of those cases, you're going to be really happy. So, this is the same MacBook Pro as before, same pgbench -i -s 1000, and I came up with a pretty artificial test query here.
36:43
I did a left join between generate_series and two copies of the pgbench_accounts table. So, what this is really intended to represent is a left join between a small table, which in this case is generate_series, so it's a 10-row set-returning function,
37:05
and two large tables that are joined to each other. In this case, it happens to be the same large table joined to itself for the sole purpose of making the query planner have to think harder, but in general, it could be any two large tables joined to each other. The important part is that you've got an inner join
37:21
under the nullable side of a left join. Okay, if you don't know what that means, you probably don't have this case. But, yeah, so if you have a left join, and then on the nullable side of it, which means the side that's farther down in the query in this case, you've got another join happening inside there.
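Just to make the shape of that concrete, the test query was something along these lines; this is a rough reconstruction, not the exact query from the slide:

  SELECT *
  FROM generate_series(1, 10) g(x)
  LEFT JOIN (pgbench_accounts a1
             JOIN pgbench_accounts a2 ON a1.aid = a2.aid)
         ON g.x = a1.aid;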
37:42
Maybe you have this because you have a view that does something like this, or maybe you just have a big complicated query. Then, in earlier releases of Postgres, the only way that we could execute this query was to join the two big tables to each other, and then do the join to the small table after we'd joined the big tables.
38:04
And that turns out to be slow, because joining big tables to each other so that you can pull out a very small fraction of them is not efficient. So Tom rejiggered some things, so that now what we can do is, instead of doing the join between the two big tables once,
38:24
and joining the entirety of the two tables to each other, we repeatedly join very small subsets of the big tables to each other, and pull out just the information that we need. And so on this somewhat artificial test case, but I think it actually is representative of the kind of improvement
38:44
you'll see when this kicks in in real-world situations: on 9.1, this repeatedly takes about 4 minutes and 10 seconds to execute on my laptop. On PostgreSQL 9.2, the first execution took 580 milliseconds, and then after that it took about one millisecond.
39:06
So it was a little faster. So yeah, if you have these kinds of queries where you're hurting because you don't have this optimization, this is huge.
39:20
The reason why we haven't had this optimization sooner, of course, is because a lot of people don't have these kinds of queries, so that's why this is the case that got fixed last, right? Not a big surprise there, but boy, if you have it, this is huge. So here's the plan in 9.1. We basically do a full index scan of both tables, do a merge-join between them, and then do a merge-join to generate series.
39:42
Not very good. In 9.2, we scan the output of the set-returning functions, and then that drives this nested loop here, where we repeatedly index scan each of the relations. So in the first plan, we end up reading the entirety of the big table twice,
40:01
or the entirety of both big tables, if they were two separate big tables. In the 9.2 plan, we just pull out the rows that we need surgically and just ignore all of the rest of that data. So it's a lot faster. We've had some miscellaneous improvements in indexing.
40:22
One of them, which I've actually seen come up in the field, is if you have an expression of the form indexed-column, operator (usually equals), ANY, and then some array. We could not previously handle that with a plain index scan. You could get a bitmap index scan for that plan,
40:43
but not a plain index scan. And it turned out that there were cases where that sucked. So Tom fixed that; now you can get a plain index scan. We've got better selectivity estimation now for some of the array operators, like overlaps and is-contained-by.
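The ANY construct in question looks like this, with table and column names made up for illustration; in 9.1 you could only get a bitmap index scan for it, whereas in 9.2 a plain index scan is possible:

  SELECT * FROM orders WHERE customer_id = ANY (ARRAY[17, 42, 99]);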
41:01
We've had a number of improvements to GiST indexing. GiST indexes now build more quickly than they did before, and they're of better quality. B-tree indexes don't really have a problem with quality. I mean, when you make a B-tree index, you basically just take all your data and put it in sorted order, and that's it. But GiST indexes are used for things like spatial indexing,
41:23
where the way that you index your data is you try to group it together and put bounding boxes around points that are in the same region of space, and put all those data elements on one page. So that's the kind of thing that humans are good at, and computers kind of suck at. Because it's hard, right?
41:42
I mean, if you have a random set of points and you have no knowledge about how they're distributed, and you have to figure out the most efficient way to group them into related areas, and especially if you have things that are more than a single point, like you have circles or lines or polygons or something, and you have to figure out a set of reasonably tight bounding boxes
42:01
that can allow you to answer queries efficiently, that turns out to be kind of hard to do. So we've had some improvements in the algorithm that not only decrease the index build time, but also tend to result in a better quality of index, where you have to do less grovelling around in the index to determine whether or not your key that you're searching for is actually present,
42:22
and what pages it might be on. So that's definitely cool. And we have a new index type as well called spgist. That's pretty much everything I know about it. I know slightly more about it than that, but not much. spgist, just in general, can index anything.
42:42
I mean, if you can imagine it and you can think about how to index it, you can index it with GiST. SP-GiST is a more specialized thing for indexing things that sort of occupy a particular point in space. So like you can use GiST to index lines or polygons or circles or points.
43:01
With SP-GiST, you can index points, because a point doesn't have any dimensionality to it. It can also be used for some text searching applications. I have not yet been able to demonstrate a performance improvement in the text searching case over a B-tree.
43:21
So I think maybe there's more work to be done there, or maybe that's just not really what this is intended for. I've heard that if you have the right cases, this is pretty cool, but unfortunately I don't know what they are. The very first one on the slide.
43:52
I think it doesn't really matter whether the index is multi-column or single column. It just matters whether you have a construct of that form.
44:06
We could talk more about what your specific case that you're worried about is afterwards, maybe. Sorting. This work was really something that nobody was...
44:24
Well, not nobody. Many people were not that excited about, and Peter kept beating on us and saying, this is a disgrace. Our sorting is too slow. Speed it up. So he got it done. With some help from some other people, as is often the case,
44:42
but we have this new infrastructure called sort support, and it's basically a fat trimming exercise. We have some very nice, very general mechanisms. And actually another theme of performance in this release is sometimes you need to throw your general mechanisms out the window
45:00
and install some very specific hacks to alleviate problems in your hot spots. And so that's basically what got done with sorting this release, with Peter kind of leading the charge on that and really insisting that this was a problem we needed to care about. So we've seen some improvements in sorting speed in this release.
45:25
I was lazy, and I don't have a benchmark, sorry. But we did have some improvements. Peter was also the driving force behind getting our power consumption down, and he did a lot of the work on this.
45:41
Tom Lane also did some work on this. In PostgreSQL 9.1, there are approximately 11.5 auxiliary process wake-ups per second, whereas as of a snapshot last night, I had to retest this from the last time I did this talk because it changed, but as of last night, there are approximately 0.4 auxiliary process wake-ups per second
46:04
on a system that's been idle for a few minutes. Now, if you're looking at this and going, why should I care? That was my reaction too. But it turns out that if you are a big hosting provider and you have a lot of virtual machines floating around
46:21
and you have a lot of virtualized copies of Postgres floating around and they're all randomly eating CPU for no reason, even though the users aren't doing anything, it costs you money. And you don't like that because you're probably a business and not just virtualizing things because you're generous. So this is pretty cool.
46:42
I think there are an increasing number of people out there. Heroku is one example, and there are others who are just running gazillions of copies of Postgres. And so having the system be able to quiesce down into a very low power state
47:01
when it's not in use is a very good thing. It's something like, it's an improvement of more than 20x. I was testing this last night on a virtual machine on my laptop. That's where that 0.4 number comes from. And now VMware tools is actually the top producer of wakeups on that machine
47:24
when it's otherwise idle, about 10 wakeups per second. But there's not that many things above Postgres, so I can guess that at some point, five years from now, people are going to say, you have 0.4 wakeups per second. That's completely unacceptable. You can't have more than 0.004 wakeups per second.
47:42
But for now, we're way down the list of things bothering the people who are upset about these kinds of problems, people like Red Hat and Heroku and other people who are running a lot of copies of Postgres. There's a bunch of other random performance improvements, as there are in almost every release.
48:05
We have an improved plan cache now that reduces the danger of getting a bad query plan when using prepared queries. More work is probably still needed there, but we've made a start on it in this release.
48:21
That's been a long-time thorn in the side of many Postgres users. sepgsql, which is the PostgreSQL SELinux integration, now has a userspace access vector cache, which makes it only really slow instead of far more slow than you can possibly believe. There's probably some more room for optimization there,
48:42
but it's definitely better. Hey, any price for security, right? We have got faster array assignment now in PL/pgSQL due to some improved caching. Our spinlock implementation on HP Itanium did not conform to the best practices document published by HP.
49:01
Now it does. And that's about all I can think of in terms of performance improvements in 9.2. I'd like to just take the last few minutes to talk over some things that I learned while working on some of these performance improvements, testing other people's performance improvements,
49:20
testing theories about what would or would not improve performance. I learned a lot from this release. I learned a lot from other people in the PostgresQL community. Peter, Simon, Greg, Tom, lots of people who kind of turned, Jignesh, people who kind of turned me on to things that I should be looking at. And I also learned a lot just by doing a lot of benchmarking and staring at the results and going,
49:43
oh, why is it like that? So here are some of the things I learned. You may or may not find them interesting, but here we go. First lesson, plain old pgbench is a pretty good test. In some cases, it's a very artificial test, but sometimes artificial tests are good because they take some particular aspect of the platform
50:04
and just stretch it to the limit. You can get a pretty good sense of how fast your machine is with a select-only pgbench test in about five minutes, but you really need to do like 30-minute runs if you're doing a write test because there's a lot more throughput variability on a write test than there is on a read test.
50:23
I found that it was pretty important to repeat every test about three times so that you could identify outliers because sometimes you get a result which is randomly higher or lower than the normal result for that test. I found that on the read-only tests, there was very little variation throughout the test
50:45
in the rate at which transactions were being processed. It basically just sits there and hums along. On write-only tests, the throughput as a function of time during the test shoots up and down by a huge amount. You can use the PGBench minus L option to record the latency associated with processing every transaction
51:06
that the system processes during the test. I found that quite helpful because you can then construct a graph that shows the number of transactions per second on the vertical axis and time on the horizontal axis, and you can actually see the throughput rate going up and down throughout the test.
51:21
There's a couple problems with that. One is you have to write a script to grovel through these enormous files that -l spits out. The other is that the files are so darn big that it actually lowers the performance of the system significantly from writing all of this instrumentation data. So there's probably some room for improvement there.
51:42
There's a source-level option called LWLOCK_STATS. If you turn it on, it dumps out all of this information, showing you information about LWLock acquisitions and contention. I found that very helpful. I also found it even better if I counted how many times I had to spin
52:03
to acquire the spinlock protecting each lightweight lock. For those in the audience who are not hackers, lightweight locks are a sort of fixed-size set of locks that we use to protect shared memory data structures in Postgres. They have a shared mode and an exclusive mode,
52:21
and their internal state is protected by a spinlock. So if you need to wait for a lock, you sleep using a semaphore. So lightweight locks are actually where most of our contention problems are, and unfortunately we don't have a good set of user-visible tools to really let you know where this contention is happening,
52:40
but the LWLOCK_STATS thing you can turn on for debugging purposes to kind of help track it down. I found that CPU profiling via gprof or oprofile is useless because they have too much overhead. oprofile has a lot less than gprof, and it's still way too much. perf, on the other hand, is a new Linux tool, and it is awesome.
53:02
It has very low overhead, almost no overhead, which is amazing. I have no idea how it's possible to gather call graph information on a 64-core system running full tilt without slowing the system down, but apparently somebody else is smarter than me because it works.
53:22
The big problem with just plain old perf record is that it's not a really great way to measure scalability because typically when Postgres isn't scaling well, it's because you've got lock contention and processes are going to sleep. While they're asleep, they're not using CPU time, so a CPU profile looks exactly the same as it would if you didn't have a problem.
53:42
So you can't really find anything that way. Context switch profiling is better. If you do perf record -e cs -g, you can actually see which call paths are causing the system to go to sleep, and that tells you where your lock contention is. I'm not sure how useful this is as a user, but as a PostgreSQL developer looking for things to optimize,
54:02
that was pretty great. I also wrote a bunch of throwaway custom instrumentation, which was awesome. Just take a random piece of code. You're wondering, why is this code slow? You just stick a bunch of gettimeofday() calls in there, and subtract, and you're like, oh, it's all in that chunk. And then you drill down another level and rinse and repeat.
54:26
Some more general lessons I learned, apart from specific kinds of testing, extreme workloads like pgbench are not necessarily a great thing to do as a way of answering the question,
54:40
how will this perform on my real-world workload, because it's an artificially extreme workload in some particular way. It's not necessarily the most demanding workload, but a write-heavy pgbench test, for example, is pretty extreme in terms of I/O, so often the results will be much worse than what you're going to see on a real-world application.
55:00
than what you're going to see on a real world application. But again, for developing, it's great, because it exacerbates the problems that real users say and turns them into even larger problems, which can be seen more clearly and therefore fixed. A lot of problems are quite easy to fix, not all, but many problems are quite easy to fix. Once you understand what's really happening,
55:21
but there are a number of cases where it took me months and months of testing and fiddling with different things and trying different approaches before I actually understood what the real problem was. In fact, in at least one notable case, I committed a fix to fix the problem, and it turned out that problem was imaginary and the fix was working for a largely unrelated reason
55:41
that I figured out three months later when I fixed the real problem. So, you know, figuring it out is often the hard part. It's very useful to measure system performance along multiple axes, not just TPS, although that's useful, but also latency, frequency of lock contention, duration of lock stalls.
56:01
You find different problems when you look at system performance in different ways. You know, it's sort of, it's sometimes hard to tell, should I look at the most important problem from the TPS perspective or should I look at things that are causing the lock stalls? It's not always easy to decide what to work on, but the more data you can gather, the better off you are. Another kind of interesting developer-level thought is that these LWLocks we have,
56:23
I'm starting to have a feeling that they're actually very poorly suited to many of the synchronization tasks that we need to do inside of PostgreSQL. I don't yet really have a theory on what would be a good replacement for them. So what's next? Well, buffer replacement is still mostly single-threaded. I don't have time to talk about this because I'm like out of time,
56:42
but there's a graph on my blog in one of the older posts that actually shows it pretty clearly. You get much better scalability if your workload fits in shared buffers. Scalability deteriorates significantly when your workload does not fit in shared buffers. WAL insertion is single-threaded, and this gets particularly nasty just after a checkpoint.
57:01
I'm thinking this is something that's going to get fixed in 9.3, since Heikki's been cooking up a patch. On a busy system, fsync can take an amazingly long time to complete, like 10 seconds sometimes, to fsync a really small file. So every place in the system where we do fsyncs is a potential cause of performance problems,
57:23
and I hope to spend some more time trying to flush those out. And a final point: we have this lock, ProcArrayLock, which is used during MVCC snapshot acquisition. It's also taken during transaction commit. The more other bottlenecks we fix, the worse this one looks.
57:41
I think if Heikki gets the WAL insert lock problem fixed, this is going to move so far ahead of everything else in terms of how bad it is that we can just give up on doing any other performance work and figure out a solution to this problem. So that's all I've got. I'm out of time, so I'm not going to ask for questions in the big group here
58:02
because that's not fair to people who want to go to the bathroom before the next talk. But I'll hang around up here, and if people want to come up or ask questions or catch me in the hallway, I'll talk to you. Thanks.