
Query Planning Gone Wrong


Formal Metadata

Title
Query Planning Gone Wrong
Number of Parts
25
Author
Robert Haas
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared, also in adapted form, only under the conditions of this license.
Production Place
Ottawa, Canada

Content Metadata

Abstract
An analysis of common query planner failure modes. The PostgreSQL query planner does an excellent job with most queries, but no query planner is perfect. In this talk, I'll analyze some of the commonly reported cases where it fails to produce an acceptable plan, based on posts to pgsql-performance and other cases I've encountered on the job.
Transcript: English (auto-generated)
Hi, how are you? I'd ask you if you can hear me, but it seems like the answer is yes. Okay. Wow, someone figured out how to make this mic work between the earlier sessions and this one. So that's good.
So, my name is Robert Haas. I am the chief architect for the database server at EnterpriseDB. I've been involved in PostgreSQL development for four or five years now, getting close to five years. And today, I'm going to be talking to you about query planning gone wrong.
A few people asked me earlier in the day, with some apprehension: are you coming to my talk? And I think they were afraid,
for reasons that I can't really understand, that I might heckle them. So turnabout is fair play. So please do feel free to ask questions or just heckle. The only thing I would ask is that if you're heckling more than
anyone else in the room, then you might need to heckle less so that other people also have a chance to get in their fair share of heckling the speaker. So, I usually like to start with a slide on why I'm giving this talk. So here it is.
In 2010, I gave a talk here at PGCon called PostgreSQL Query Planner. And in that talk, I talked about how the query planner actually works from a user's perspective. If you are a user and you're typing in queries, what is happening? What is the query planner doing for you? Why do you even need a query planner?
And that talk went over pretty well. In fact, it's the most popular PostgreSQL talk that I've ever written. So I guess first time was the charm and it's all downhill from there. But I did get a very common question when I gave that talk to which I didn't have an entirely satisfactory answer,
which is, okay, it's nice to know about how the query planner is supposed to work, but what do I do when it doesn't actually work that way? How do I fix my query? How do I fix the query planner? How do I get the query planner to do what I want? And I didn't really have what I considered to be a satisfactory answer to that question,
because having been using PostgreSQL now for well, getting close to 15 years now, I fixed a lot of broken queries in that time. I've worked around a lot of query planner mistakes and deficiencies, but I didn't really have a good sense of
what are the things that typically come up? If you were going to talk for, I don't know, say 45 minutes or an hour about common failure modes of the query planner, what things would you want to be sure to cover? I didn't really have a sense of what the most common failure modes were.
That didn't stop me from writing a talk, by the way, but it wasn't nearly as good as the first one. In 2011, Tom Lane gave a rather well-attended talk on hacking the query planner, and he sort of attacked the same topic from a developer perspective. Instead of talking about
how the user perceives what the query planner does, he talks about how the source code is organized and what the different phases of processing are and what, from the perspective of the system, the query planner is doing. And he included in that talk a plea for help
to improve the query planner, which may have fallen on deaf ears. But again, I at least felt that the open question there was what should we be improving? And I decided I wanted to find out. So I did a bunch of what might loosely be called research, and that's mostly what I'm going to be talking about here today.
So I hope that some of you will find that interesting. We'll see. So here's what I did. I read hundreds and hundreds of email threads on PostgreSQL performance over a period of nearly two years now.
And I disregarded all of those that were not about query performance problems. So if the user was saying which kind of hardware is best, or my system in general is slow, or has anybody tried such-and-such, I just ignored all of that for purposes of gathering the data that I'm going to be talking about in this talk.
And then, since I'm a super genius and I know everything, I decided what I thought the root cause of each complaint was. So you can see this is a totally objective methodology, and there's no room for error here. I skipped a very small number of things where I couldn't form an opinion
that I thought was worth anything, and I wrote down in some detail what I thought the cause of the problems were, and then I just counted how many times each of those causes came up. So as I already sort of said, there's a lot to hate about this methodology.
First of all, PostgreSQL performance, the PostgreSQL performance mailing list is not the world, much as I had to admit that. And so the problems that are reported there are not necessarily representative of all the problems PostgreSQL users commonly encounter or encounter in general.
And in particular, I think that confusing problems are more likely to get reported on that mailing list, because if something happens and it's easily fixed, even if it's super annoying, it may not actually get reported to the mailing list. And then, too, some classes of users are much more likely to report problems to the mailing list than other classes of users. If somebody has a support contract, they're likely to complain to their support provider rather than to the mailing list.
And then, of course, the other problem with this methodology is me. You know, most obviously because I'm looking through all of these problem reports, I might not have correctly identified the cause of each problem because I'm stupid.
Or more subtly, I might have chosen to describe the cause of the problem in a way that was different from the way that somebody else would have described or classified the same thing, because there's some fuzziness in here. And probably many of you can think of other problems with this methodology, but I think for all of these defects,
I found the results of this to be pretty interesting. They didn't necessarily match up with my expectations, and I think they may be useful to some of you either as users in knowing what to expect or as developers in thinking about what problems we might want to try to improve on
from the coding side. So, with those disclaimers, statistically speaking, why is my query slow? I wrote down
the causes as I saw them in a fairly high degree of granularity, but that led to a very long list that I can't put on one slide. So what I did is I grouped those causes into buckets, and so here's the high-level summary of what I found out. I had 168 data points,
and this is how I grouped them. There were 23 cases in which I thought the problem was caused by your settings. So either you did something wrong in postgresql.conf, or you needed to execute some DDL that you didn't execute, or maybe you needed to change an operating system setting. I think that only came up once.
And then there was the category of people who had essentially unreasonable complaints. There were 23 cases where I decided that the problem was really
that the user was complaining about the slowness of something that could never really be all that fast, right? So there was one complaint that I remember pretty well that basically said, you know, I have this query that returns 10 million rows. Why doesn't that go faster? And you know, of course, you could have a query that was returning 10 million rows
that was slower than it had to be, but in this particular case, the issue appeared to be pretty much, well, you're joining two tables, and you're getting 10 million rows, and that just takes about the amount of time that it's taking, which is not to say there's nothing we could do about any of these problems in these categories. I mean, there's always room for improvements in efficiency.
If you made the database 10% faster overall, then these things would be 10% faster, too. But in many cases in this category, the user had some consternation about something because they didn't really understand the level of effort that was required to give them the results that they asked for.
And my third category is actually an amalgamation of a couple of subcategories. Again, to make it fit on this slide, I'll go into a bit more detail on this later, but my third sort of major category was we're bad at that. So in this category, I placed everything that I felt could be fast in some hypothetical database system,
which was designed around making that specific thing fast, but which was not particularly fast in PostgreSQL, either because we haven't implemented that feature yet, or we thought about implementing that feature and decided that it would penalize other things too much, or sort of deeply rooted pieces of the architecture that would make the system a very different kind of system than it is.
So again, I'll go into those. That category in particular is kind of all over the place. There were a lot of different things that sort of fell into that bucket. Then there was planner error.
And as you can see, planner error accounted for half of the total number of cases that I looked at. And that basically meant bad decisions about the cost of one plan versus another plan due to limitations of the optimizer.
So I didn't include things that someone, usually Tom, admitted were bugs; I have a separate category for bugs. So in this planner error category, I'm not talking about cases where some formula got typoed and, as a result of that, we got a bad plan. I'm talking about cases where the optimizer was working as designed, but it wasn't giving the right answer for that query.
And I also just wanted to say one thing about the distinction between settings and planner error. If you have like a really bad row count estimation, many of you probably know that bad row count estimations are the causes of a large number of query planner problems.
And we'll see that in more detail as I dig into these numbers a little bit more. But if it was something like a bad row count estimation, I put that under planner error rather than under settings, even if you could put in some crazy settings that didn't match reality, that sort of hacked around it.
Whereas if the problem was that the user settings really didn't match the characteristics of their system, then I put that under settings. Again, this is somewhat subjective, but hopefully it will become more clear as I dive down into this in more detail. There were 14 cases where I attributed problems to bugs.
Mostly I attributed those problems to bugs because Tom said, that's a bug, I'm going to go fix it now, which was usually followed by Tom fixing it. And then there were three reports that I couldn't attribute to anything other than the user. Like the user typed in a query thinking that they were asking for one set of rows,
but the query they actually typed did not mean what they thought it meant. And so somebody said, why don't you run this other query instead that does something different? And the user was like, success! So in that case, other than the fact that SQL is perhaps a confusing language,
that's not really something we can do anything about on the PostgreSQL side. That's a problem that exists outside of the technical world. So when I originally wrote this talk, after displaying this coarse-grained information,
I tried to do a top ten list, and it didn't work very well because it bounced around to a hundred categories and it was confusing. So what I'm going to do instead is go category by category and break down the things that fell into each category, and hopefully that will be more clear. It felt more clear when I was rehearsing the talk. Believe it or not, I actually rehearsed this talk.
I hope you can tell. So, settings. The numbers in parentheses throughout the talk are the number of reports. So 23 overall reports for settings, and then the breakdown is shown here. So the really recurring theme under the settings category was,
you've got to get your planner cost constants correct. That was a third of the total reports in this category. And one of the highest overall causes of problems. And so generally, the issue here is people had databases that were fully cached in memory,
or almost fully cached in memory, and the default planner cost constants don't really assume that the database is fully cached in memory. And so in many cases, what you need to do is lower seq_page_cost and random_page_cost in order to actually model the behavior of your system.
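As a rough sketch of that kind of adjustment (the values here are illustrative only, not from the talk, and the right numbers depend on how much of the database is actually cached):

    # postgresql.conf -- illustrative settings for a mostly-cached database
    seq_page_cost = 1.0
    random_page_cost = 1.1    # default is 4.0; lower it when reads rarely hit disk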
Now, there were also some suggestions, I believe all of them from Kevin Grittner, that you should raise cpu_tuple_cost. And I could not determine, based on looking at the reports on the mailing list, to what extent raising cpu_tuple_cost was more or less effective than lowering seq_page_cost and random_page_cost.
I have no particular reason to doubt Kevin's belief that that's a good thing to do, but I just didn't have enough data points here, or couldn't get enough information about what actually happened when the user tried the different suggestions to know for sure whether or not that was a good thing.
That could be. The sample size is somewhat small here. So the question is, is there some kind of way we can auto-tune these settings?
That's a good question. I don't know of one.
I believe I've seen a report, although it wasn't included within the data that I sampled for this, I believe I've seen one report of somebody who actually needed to raise it. In almost all cases it needs to come down. But I don't actually know of anyone who's really tried to write a tool to do what you're talking about.
I think I remember Peter saying that it had been tried at some point in the past and it hadn't really worked, but I don't know of any recent attempts to do the kind of thing that you're talking about, so I can only speculate on how well such a thing might work.
I think a lot of us who have been through the process of tuning this have found that you can usually converge on it with a few iterations of cranking the knob broadly, but I don't know of a more systematic way than that to do it.
I can do that. The only thing I'm slightly worried about is either embarrassing or angering people who may feel that my categorization of their problem as, for example, user error is not a categorization which they entirely welcome, but let me think about that and talk about it with a couple of people and maybe I can do that.
I think certainly for some of them there's no problem, but there may be a few cases in which people might not prefer to be publicly humiliated.
Right. Yeah, that's actually a good point. I've been thinking about starting a Wiki page with this on the PostgreSQL.org Wiki, so maybe that's a good thing for me to go do. Yeah, Jan.
So the question is, you know, Jan is offended by TOAST compression. It's slow. That may or may not be related to his having written the decompression code.
In his defense, I'll say that any decompression algorithm is going to be slower than not decompressing things. Now the benefit is that they take up less space. So I have certainly seen other cases where having to decompress things turned out to stink,
because having them smaller was less important than having them not need to be decompressed. I don't remember the exact details of this one, but I will be happy to hook you up to a link with the relevant thread
and then you can tell me that I've totally mischaracterized what the problem was in that case. Tom, stop jumping three slides down the road.
Peter. Because there is a setting, the question is why did I put "ndistinct estimates aren't accurate on large tables" under settings.
I put it under settings because there is a setting that you can change to adjust that. Well, okay, well, look, I mean, arguably all of these are things that we're bad at, right? Like the cost for the @@ operator being too low.
Well, who created that operator and put it in the system catalogs? Oh, right, that was us. So maybe we should go change the default value of that, right? And work_mem being too low, or the statistics target being off. I don't have the foggiest idea how to tune seq_page_cost and random_page_cost in a sensible way.
I do have some ideas about how we could tune the statistics target, because I'm pretty convinced that you want it to be bigger on large tables and smaller on small tables. So, you know, I think there are actually things that we can do in the database that would reduce the need to tune many of these settings, if not all of these settings.
But, you know, as we'll see as I get farther in the talk, these are actually the good cases. Because however much you might be unhappy about having to, for example, change the storage type of your table so that a particular column doesn't get compressed when it's toasted,
you'll be even sadder if you have one of the problems where no setting that you can change has any hope of curing the problem. So, you know, at least there's something you can do about these.
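For reference, the storage-type change mentioned a moment ago looks roughly like this (table and column names are hypothetical):

    -- keep large values out-of-line in the TOAST table, but skip compression
    ALTER TABLE items ALTER COLUMN payload SET STORAGE EXTERNAL;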
Right, yeah, so the comment was that sometimes for seq_page_cost and random_page_cost,
the correct value might actually be different for different tables. I remember Kevin talking about that a couple of years back in the context of needing the reporting user to use different settings than the interactive user, because the properties were different. There is a per-tablespace setting for these values.
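A sketch of that per-tablespace knob, with a hypothetical tablespace name:

    ALTER TABLESPACE fast_ssd SET (random_page_cost = 1.1, seq_page_cost = 1.0);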
The guy who added that option is a real nice guy. He's giving this talk. I didn't find any reports of anyone benefiting at all from the fact that I put that setting in. I hope there's someone out there who benefited from the fact that we have a per-tablespace setting for that.
Per table we don't have; maybe we should, but hey, if nobody's even using the per-tablespace setting, then again, maybe we shouldn't. I'm using it for tablespaces. Oh, you're using it for tablespaces. Woohoo! Well, see, this gets back to sampling bias, right? I can only tell about the things that people complain about. So it could be that that setting works great and everybody just uses it
and nobody complains about it because it's perfect. And then I have no idea. I think that's probably an optimistic assumption, but I have no idea. Okay. So, next category. Just plain slow.
These first two causes here may be issues that you've run into at one time or another in your environment. We had six complaints from people who were surprised to find that the more data you have, the longer it takes to process that data.
And we also had some complaints that boiled down to disks are slower than memory. And lest I make these people sound silly should their names ever come out later, the things that they were actually complaining about weren't phrased in a way that made the silliness of the problem quite so obvious, right?
So one of the complaints was of the form, the first time I execute this query after rebooting, it's really slow. And then the second time I execute it, it goes faster. Well, maybe a novice question, but it turns out, of course, the first time you were reading the data from disk and after that it was cached, right?
So these weren't necessarily phrased in a way that made it sound quite as silly as the way that I've made it sound here, which is because I'm a jerk. Generally, everything else was something that was only reported once in my sample. You can kind of read through the things here.
You know, replanning isn't free. This was a case where somebody was repeatedly executing the same query using EXECUTE, so it was replanned every time instead of planned just once. Scanning more tables is slower than scanning fewer tables. Somebody thought they were proving some other point, but what actually happened is that they
had a query where join removal was pulling out one of the tables in some situations. And the query that looked at fewer tables ran quicker because it only had to look at two tables instead of three. The only thing besides those top two sort of obvious ones that came up more than once is clauses involving multiple tables can't be pushed down.
So if you have a condition in your where clause that involves comparing data from one table to data from another table, you're not going to be able to evaluate that clause until the two tables have been joined.
If you have something like where a dot x equals one, you can evaluate that while you're scanning table A. But if you're comparing a dot x plus b dot y equals three, you're not going to
be able to actually evaluate the truth or falsity of that condition until you've joined A and B. And you're going to have to pick your join strategy for A and B knowing that you have to get every row that might possibly satisfy those conditions. So there's no real way around that. It's just going to be less efficient.
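A minimal illustration of the difference, with hypothetical tables a and b:

    -- a.x = 1 can be checked while scanning a by itself
    SELECT * FROM a JOIN b ON a.id = b.a_id WHERE a.x = 1;

    -- a.x + b.y = 3 can only be checked after a and b have been joined
    SELECT * FROM a JOIN b ON a.id = b.a_id WHERE a.x + b.y = 3;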
Any other heckling? I mean, questions? No questions. All right. So we're bad at that. So I divided this into three subcategories. It was a little bit scatter shot.
The first subcategory was plan types we can't generate. Tom will perhaps be gratified to know that parameterized paths were something that came up more times than anything else in this category. Parameterized paths are this new 9.2 feature that allow us to consider certain kinds of paths that we couldn't before.
There were seven cases where I attributed the slowness of the query to lack of the proper parameterized path. Two of these seven, unfortunately, were actually post-9.2 complaints where the query planner was able to parameterize the path, but it wasn't able to come up with a parameter.
Well, no, sorry. They were cases where the query planner was new enough to know that such a thing as parameterized paths existed, but it wasn't able to come up with parameterized paths in the particular situation that would have been needed to make this user's particular query fast.
I did not go back and try to figure out whether the older reports would have been fixed by 9.2. Merge append deals with situations where you have a partitioned table and you're trying to do an order by limit on the partitioned table.
So select star from giant partitioned table, order by some column that has indexes on it, limit 10 or whatever, and then return 10 rows. Prior to 9.1, we would actually take all the data in all of the partitions, ignoring the indexes, sort all of it, and then return 10 rows.
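The shape of the query in question, with hypothetical names, is roughly:

    -- measurements is partitioned, with an index on captured_at in each partition
    SELECT * FROM measurements ORDER BY captured_at LIMIT 10;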
People didn't like that very much. So with 9.1, we got merge append, and that fixed it. You're able to actually use the indexes on that table and combine the data from the index on this table, this partition
with the index on that partition and come up with the results without having to scan the entirety of every single table. So that was good. The other plan type that I found that we can't currently generate was the batch sort of data already ordered by leading columns.
So you imagine that you have a nested loop or a merge append, sorry, a nested loop or a merge join or something like that that's producing data that is ordered by column A. But the user has included order by A comma B in the SQL query that they've written.
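A hypothetical query of that shape; if the planner picks a merge join on a, the join output is already ordered by a:

    SELECT * FROM t1 JOIN t2 USING (a) ORDER BY a, b;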
Well, what we do today is we ignore the fact that the data is already partially sorted and we just throw it all into a big bucket and we sort it all again by A and B even though it was already sorted by A. And in some cases, this really stinks because some people have queries of this form where A is mostly
unique already and so only in a minority of cases is there more than one row in the group. So sorting smaller groups is a lot more efficient than sorting larger groups. Particularly, this would be great to have in cases where there are limits and you don't
actually need to sort all of the groups, but you could just sort the first few groups. It's the type of plan we can't currently generate. Then, executor limitations, architecture, there wasn't really a lot of rhyme or reason to this stuff.
These are a bunch of sort of limitations. You see also the sort of sample bias here, right? The only thing in these other categories that came up more than once was no parallel query which came up twice. I'm reasonably certain there are more than two people in the world who care about PostgreSQL having parallel query,
but everybody kind of already knows or almost everybody except two people already knows that we don't have parallel query and so people don't complain about it every week because you kind of know we don't have that. So, I'm not sure the rest of this is that interesting, but I'll pause in case anyone wants to say that
one of these things is interesting or ask a question or complain that their favorite thing isn't included in this category.
After-trigger queue size, so the problem there was the user had a query that got really slow because the after-trigger queue got enormous because of something that they were doing. I think it was a per-row trigger and the query was processing a lot of
rows and the after-trigger queue just gobbled up crap loads of memory and that was bad. I don't remember whether the trigger was good. I don't think it matters.
All of these things on here, even the things that only came up once are incredibly brutal if you're the one person that they happened to.
It was specifically "autovacuum not smart about inherited tables." Remember, I was only looking at query performance issues, so what happened in this particular case is somebody had a query against a table that had inheritance children.
And the statistics, and Tom added a feature in which release? 9.1, 9.2, where Analyze actually gathers two sets of statistics on inheritance parents.
It gathers one set on just the inheritance parent and another set on the inheritance parent plus all of its children. So this happens whenever you run an Analyze on the parent, but autovacuum doesn't know that it needs to run an Analyze on the parent. Autovacuum will run an Analyze on the parent when there have been enough changes to the parent itself
to justify analyzing the table, but it will not analyze the parent because of changes to the children. So if you do a manual Analyze, then the problem goes away and you're happy, except for the fact that you kind of wish you hadn't had to do that manually.
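The manual fix in question is just this (parent_table is a hypothetical name):

    ANALYZE parent_table;   -- gathers both parent-only and parent-plus-children statistics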
No, we do not update the statistics on the parent when we Analyze the child. But Steven, Steven, Steven, it's an amalgamated set of statistics, right, across all the children.
So there's no way to tell when you Analyze only one child what should be changed in that aggregated set of statistics. You'd have to go back and, so what we need is some mechanism to fix that complaint. I love how we're digging into the things that there's one report of. But, you know, what we'd have to do to fix this complaint is have something where when we Analyze,
maybe when we Analyze the children, we somehow sort of file a credit against the parent and say, hey, you know, you kind of need to think about doing an Analyze on the whole inheritance structure as well. Chris? No, actually I had no reports of that in any category.
Also, this is PGSQL performance, not PGSQL novice, that might matter too. Josh?
Oh, I sort of had something like that in just plain slow.
Linearly scanning an array is order N in the length of the array, and it really is. But I only had that once. Okay, yeah, wow.
Alright, so, planner errors. This is the big category. 83 reports of planner errors. Alright, so I'm not going to give anybody any credit at all for guessing something in this bucket because there's 83 things in this bucket. But anybody want to take a guess at what the, there were two things, there
were two things in this category that came up much more often than anything else. Stephen? No. Yes, Kevin got the first one. Correlated statistics.
We'll talk about, I got a couple, I got to actually have a whole slide on that one. Anyone want to guess what the other really common one was? More specific. Didn't use the right index? Nope, not didn't use the right index. Didn't use anything? Nope. Why don't we have hints?
No, that's not a cause, that's a solution. Function cost?
Function cost, nope. Nope. What's that? Just indexes. Nope, not just indexes. Common table expressions, the optimization fence? That one's an honorable mention, but not one of the big two.
Nope, nope. Never even came up. Alright, those were all fascinating guesses, but like most of my guesses, they were wrong!
So there's going to be some groaning when you see it because you're all going to go, oh yeah, that does happen a lot. But before I get to that, let me just break it down a little bit more. So 83 planner errors, here's how they broke up at a slightly higher level of granularity.
There were 28 what I call conceptual errors and 55 what I called estimation errors. So what I meant by conceptual error is that the planner wasn't even able to think about using the plan that the user actually wanted.
If the user had typed the query in in some other way, using some other wording of the same query, the query planner would have gone, oh, okay, I know how to do that. But the query planner was not sufficiently powerful to transform between things that we as human beings can recognize are equivalent. You've got to remember, people are really good at recognizing that two things are equivalent.
If we take Kevin and we spin him around or we show me half of him or we beat him up a little or put him in a different shirt, I'm still going to know it's Kevin every time. The query planner is not nearly as good at recognizing equivalent queries as I am at recognizing Kevin.
Or it's also not as good at picking on Kevin as I am. So there were 28 what I call conceptual errors and then there were 55 estimation errors. And this probably accords with your experience. It definitely accords with mine. Almost all of those were row count estimation errors.
So what I mean by an estimation error is the planner thought about doing the query the right way and decided not to because it thought that the actual best plan was more expensive than some other plan. It either overestimated the cost of the correct plan or underestimated the cost of some other plan.
And in the overwhelming majority of cases, the problem was that it misestimated the row count of some portion of the plan tree. It got wrong the number of rows that were going to be returned by some scan, join, or aggregate.
In the other seven cases, it got the number of rows right but nevertheless drew the wrong conclusions about the cost of that plan. Stephen? No. Nope. Not that. Yep. Lack of statistics for two tables. No.
That actually came up I think three times. Correlations between tables. So Kevin's earlier guess was correlations between columns.
That was a really big one: correlations between columns in the same table. The query planner doesn't know about that. Much less common, but also on the radar screen, were correlations between two columns in completely different tables, which is obviously a much harder problem. Fortunately it didn't come up as often, but it was on the list.
Alright, so. What's that? Row width estimates? That did not come up once. Alright. The second one is the one that you guys didn't guess.
Select * from foo where a = 1 order by b limit N. Almost that exact query, with slightly different values for foo, a, 1, b, and N, showed up 11 times.
This turns out to be a really common problem. Let me talk about the first one, the other one that came up even more frequently first, just in case there's somebody in this room who has not yet had the pleasure of fighting with this issue. Selectivity of filter conditions involving correlated columns is estimated inaccurately.
Suppose we want, typically these problems weren't like two columns. Like most of these reports, somebody had this where clause on columns of a particular table that was like 160 characters long. They were like, where a equals 1 and b is less than 3 and b is more than minus 7 and
c is in 1, 5, 9, and 13 and d is this and e is greater than f and g and h. And it's an obscenely long list of qualifications on the same table. So everyone who has proposed a solution to this problem that handles two quals but does not handle eight can go stand in the corner.
Because your solution will not actually fix most of the things that are really killing people here. And there's probably about eight people in this room to whom that applies.
So the general problem here is if we want all the rows from a table where a equals 1 and b equals 1 and c equals 1 and d equals 1 and e equals 1, we have statistics that analyze gathers that estimates what fraction of the rows in the table are going to match each of those conditions taken individually.
But if the columns are correlated, we can't necessarily correctly infer anything, or much of anything, about what's going to happen when we take all of those conditions together. So to take a simple example, suppose that each of those five conditions figures to be true 10% of the time taken by itself.
Well, if the 10% of the rows that have a equals 1 are the same 10% that have b equals 1, which are the same 10% that have c equals 1, which are the same 10% that have d equals 1, which are the same 10% that have e equals 1,
then the correct estimate for all five of those things put together is still 10%. On the other hand, if a equals 1 is over here in these 10% of the rows and the 10% with b equals 1 are over here and the other three don't matter, then the correct estimate is zero. What the query planner actually does is pretty reasonable.
It just says, well, let's assume they're independent, and so it multiplies all the percentages together and it gets some answer. And sometimes that answer is right, and when it's wrong, then people post on PostgreSQL Performance and say, how do I fix my query? Yes? So your question was, why does this matter if the query runs really fast on the big table? The answer is it doesn't.
No bug in the query planner matters if your query runs fast enough for you anyway. And that's why we only have 168 reports in the time period I surveyed rather than 5 million.
Because there are many cases where the query planner estimates things wrong, but it doesn't matter because it comes up with the best query plan anyway. Typically what happens is as the complexity of the query goes up, there are more and more possible plans for the query to consider.
If you've only got a single table involved in the query, the only thing we can really get wrong is we can decide to use a sequential scan when we should use an index scan or the other way around, or we can pick the wrong index to use. So there's some possibilities for things to go wrong there, but there's maybe ten plans we can consider.
When you have a join with six tables, there's probably a million plans that can be considered. And so the estimation errors start to matter a lot more because you only have to be off by a small amount to shift the query planner's notion of the best plan from this bucket to the next bucket over. And sometimes shifting over to the next bucket is fine because the second best plan is still pretty good.
But sometimes it's not. And sometimes something that the query planner thinks is the best plan and is only a small distance away in the costing space from some other plan, the plan that it actually picks is much worse.
So typically this matters when you have a big join, or at least a medium-sized join, three, four, five, six or more tables. If you have a 20-table join and you're doing stuff like this and it works at all, you're really lucky and you should probably redesign your schema before someone invents a second query,
which will almost certainly not work as well as the first one did. Okay, so let me talk about... No. In fact, I doubt we're going to get to solutions in this talk, but yeah, some solutions would be great.
So the other case that came up a lot, which is related to the first case, but it seemed distinct enough that I gave it its own identity, is this case where you're selecting from a table and you have some filter condition, in this case A equals 1, and then you're ordering by some other column,
B, and then you only want the first few rows. And let's just suppose for the sake of argument that both A and B are indexed. We can't use both indexes. If you are unconvinced that there isn't a good way of using both indexes,
then think about it harder. There's not a good way of using both indexes. You have to pick. There's two things you can do. The first thing you can do is you can scan the index on B and filter for rows where A equals 1, and when you find 10, you're done. Or when you find N, I guess, in this case,
when you find the number the user asked for. The other way to do it is you can scan the index on A and find all of the rows where A equals 1, no matter how many there are, and sort them with a top-end sort, and then return however many it was that the user asked for.
The query planner really likes to do the first one. Really likes it a lot. All 11 of these reports consist of picking the first one where the second one would be better. I did not find any reports of the reverse situation.
The somewhat good news about this is that you can often fool the query planner into doing the right thing by creating one extra index. If you make a composite index or a functional index, the query planner will say, gee, I think that the bad index is better than the good index,
but this new index you created is ever so slightly better than the bad index, so I'll pick that one, and then everything's fine. Tom? Well, Tom's saying it's considerably better, but I don't actually agree with that.
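A sketch of that workaround, continuing the hypothetical example from above:

    CREATE INDEX ON foo (a, b);   -- rows with a = 1 come out already ordered by b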
But in many cases, the actual runtime of the really good index is only a little bit better than this second plan here we're finding. Totally, right? The thing about this is I don't want to give anyone the illusion
that I'm complaining about these problems because it's really stupid that it works this way, and we should just go and find some smarter guys to work on our optimizer, because that's clearly not the case. This is a hard problem, and there definitely are cases where this is really, really bad.
If A equals 1 is going to match 90% of the rows in the table, then this probably isn't going to be too great. Do we consider if it's an MCV? Yeah, I'm sure we do. Josh?
Right, so there's lots of different kinds of row count estimation errors,
and this is only one of them. The thing that's sort of interesting about this one is that it isn't just the total number of rows that matters. It also matters the relative position of those rows in the table. So if we pick this first plan where we scan for the index on B
and filter the rows where A equals 1, we're going to do well if the rows where A equals 1 are either uniformly distributed or if they're clumped together early on where we're going to hit them quickly in the index scan. But if they happen to be clumped at the end because of the ordering of the index and the ordering of the table
where we're not going to hit them until we've already scanned basically the entire index, then our life is going to suck. And I actually had a slide in here at one point that had a test case demonstrating how you could create this situation, but I took it out because it wasn't interesting enough.
Oh, yeah. Yeah, so Josh's point is that there are other cases where we pick plans based on the idea that we're not going to have to execute the entire plan
because there is a limit, and then it turns out for one reason or another we do have to execute the entire plan. And what he's saying is that this is only the most common case, and yeah, that's kind of what I thought too. This was actually the only case that came up in my sample, so I submit that if we just fixed this one, we would be a lot better off.
Yeah, this is not every email over the last two years. This is about two chunks of, like a chunk of six months here and a chunk of eight months there or something like that. So it's possible that wasn't in your sample
or the researcher may have screwed up. Yes, it is. It sure is vicious in that case. So, food for thought.
Alright, so I apologize to anyone who's feeling severe pain at actually seeing this example up on the monitor because this really does suck.
Okay, so these two things together made up half of the row count estimation errors, which might lead to the question, well, what about the other half? There wasn't really anything else that was close,
but there were a few things that came up more than once. I had a bucket for "using WITH results in a bad plan." That bucket in retrospect is not consistent with my usual standard of how fine-grained I made these buckets because there were actually,
when I went back and looked at it, there were actually two different kinds of failures in this bucket. Some of these failures are query flattening issues and some of them result from failure to dig out variable statistics. And that's very technical jargon, so if anyone just went, that's fine.
But basically, WITH, in two independent ways, acts as a complete black box in terms of the optimizer. And the optimizer just says, I'm going to do everything in the WITH, and then after that I'll kind of hope that I know what the output of that is going to look like and plan the rest of the query.
And there's actually two separate fine-grained causes in there. Generic plans can have wildly wrong estimates if you prepare a query and then execute, execute, execute, execute, execute. And each time you execute it, you pick a sort of ordinary value in the table
and then one time you pick one of the most common values in the table to execute with instead of a value that has about average frequency, then suddenly you may have a bad plan, because the query planner, with a generic plan,
has to use something for estimation purposes, and it estimates that that value is going to occur about as frequently as the average value in that table occurs. And if that's not true, then you may have a problem. Tom Lane has done some work on this for 9.2.
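A hypothetical illustration of the prepared-statement trap being described:

    PREPARE q(int) AS SELECT * FROM orders WHERE customer_id = $1;
    EXECUTE q(42);    -- a typical customer: the generic plan's average-frequency guess is fine
    EXECUTE q(7);     -- an unusually common customer: the same generic plan can be badly wrong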
So the query planner is now smarter about trying to figure out whether or not it can get by with a generic plan or whether it needs to keep replanning. There may be more work that needs to be done there or there may not. I had four cases where people had sort of arbitrary gobbledygook
that the query planner was not able to estimate what percentage of rows would match that condition. So the example that I use in one of my other talks is, suppose you have where A plus zero equals A. What's the selectivity of that?
Where it's not null. Right. It's not 100 percent. It's 100 percent of the not null ones, which is not the same thing. But the query planner has no idea. It looks at that expression and says, that looks like some kind of complex expression.
One out of 200. Done. Okay. And if you put A plus one equals A, you get the same estimate. It's got no idea. And obviously those examples are simple enough
that you could hypothetically imagine some sadist or masochist, you could imagine some masochist who might write code to handle those cases, even though you really shouldn't do things like that in your queries because that's just being mean to the query planner. But there were more complicated cases that came up here,
where people had totally legit reasons for including things like where my func of A in the query. So query planner now has to guess, if I feed all the values of the table into this function, that I have no visibility into what that function actually does,
how often is it going to return true and how often is it going to return false? No clue. So the question is, could we learn from experience?
And I think that question applies to this. It probably also applies to some of the other problems that we've talked about.
One problem with learning from experience is that it still means you take the hit in the nose the first time the problem comes up. I don't know if that's a sufficient reason to throw out that idea because there might be people out there who are like, I can totally run slow the first time as long as we figure it out after that.
I don't have the foggiest idea how to implement that, but it's an interesting thought. There were three reports of join selectivity not knowing about cross-table correlations. So this is the problem, the first problem on the previous slide, the most common one,
and this is just like kicking it up to an even higher degree of difficulty. It's hard enough to imagine a query planner that's smart enough that it knows that A and B and C and D and E are correlated columns, but then when it's like, those columns from this table are correlated with these columns in some other table,
and so you've got to get the join selectivity right considering those kinds of relationships. It's a very hard problem, but it did come up three times. Foreign keys, I'm not sure.
I don't think the query planner knows about foreign keys. I don't think that it does. Actually, the code for how many rows we're going to get out of a join is pretty crude. I think there's a lot of room for improvement there. Unfortunately, the cases where
it's actually bad enough that it hurts you, are the cases that are actually really hard to think of a solution that's good enough. The only other problem on this list that came up more than once is uncommitted tuples don't affect statistics. So a couple of places where people had, for example,
a large bulk insert that hadn't committed yet. Meanwhile, somebody else is trying to do a query, and they think that the table, they know the size of the table, because that's available on the fly. We pull that when we're actually planning the query. But their idea about the frequency of different values in the table can be very skewed
if there's a whole pile of rows that are in the table, and therefore in the indexes, but are not yet visible. The query, when it's executed, is still going to have to look at all those rows, potentially, and ignore them. It may come across index pointers to those rows,
but it's just gonna follow the index pointers and then go, oh, that's not visible. Same problem as excess bloat. Yeah, I guess you could say that is the same problem as excess bloat. The solutions might be different, but yeah.
The last three problems on this slide are fixed. Okay, so those are the row count estimation errors. Now we get to the one Tom was alluding to, cost estimation errors.
This wasn't a really big category, but there was definitely a winner in terms of what came up the most often. When the row count estimates were right but the costs were wrong, in four out of seven cases the plan failed to account for the de-toasting cost, right? So it thought that, for example,
a sequential scan wouldn't be too bad because the table didn't have that many rows or pages in it, but unfortunately, the sequential scan resulted in de-toasting a lot more tuples than you would have had to de-toast if you'd used an index scan, and so it was slow.
Since I see at least one or two people with a funny expression on their face, toast is how we store values that are too large to fit within a single database block. They kind of get pushed off to a side table, and it turns out the question of where in the plan you do the de-toasting is something that we don't currently consider in any way
when we're doing costing. It just kind of happens wherever in the plan tree it happens. There was one very interesting case where the plan that won actually managed to force the de-toasting to happen at a lower level of the plan tree, which meant that we de-toasted each row one time
instead of 10 times. And so it was way faster, because we didn't repeatedly de-toast the same tuples.
For estimation purposes.
Yeah, that's a little different category of...
Well, in one really interesting case, somebody inserted a window clause that did nothing, or was supposed to do nothing, but the window aggregate clause that they artificially forced into their plan tree
caused everything to get de-toasted before it got fed into the nested loop. And so instead of de-toasting it for every execution of the nested loop, it de-toasted it once, effectively materialized it, and then pushed the result through the nested loop, and that was better. They were like, why would I make the query more complicated and add more stuff to the plan?
Does it get faster? Help? Yeah? I didn't realize that the table was clustered.
Yeah, I haven't run across that one. I mean, one of the things that we do track in the table statistics is the correlation between the logical ordering of the column and the physical order of the table. So I can imagine the scenario you're describing if the table has been clustered, but it has not been subsequently analyzed.
But I didn't happen to run across an instance of that. Yeah, I haven't run into that one myself.
I suspect that may be another kind of case that is underrepresented in this sort of sample because the person's like, that sucks. Yeah, let me try analyzing. Oh, it went away. Next, right? So that may be something that occurs more frequently than this would indicate because people just try some random analyze and it fixes the problem.
Or they come back in the morning and the problem's gone and they don't know what happened, so they just move on. Yeah, Bruce, am I out of time?
Oh, shit. Oh crap, I'm really sorry. I'm 10 minutes over. Let me take one more minute to go through the last slide really quickly and then I'm done because I've only got one more slide.
So the last thing were conceptual errors. There were 28 things in here. 10 of them came up just once. But there were a couple of things that came up like three times. Cross data type comparisons are not always indexable. Inlining the same thing multiple times can lose.
NOT IN is hard to optimize and we don't try very hard. Those things came up three times each, as did target lists being computed too early or unnecessary targets being computed. And then there were a bunch of other things here and you can see what they are, but I gotta stop because I'm way over time. So sorry about that. I was thinking I had until 4:15
but I was just totally wrong. So sorry about that. Happy to talk about this afterward if people want to but I gotta get off the mic now. Thanks.