
What's in a Plan?


Formal Metadata

Title
What's in a Plan?
Subtitle
And how did it get there, anyway?
Number of Parts
37
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Genre
Lecture/Conference
Abstract
PostgreSQL's EXPLAIN command is a powerful tool, allowing users to inspect and understand query plans in detail. However, the output produced by EXPLAIN is not a direct representation of the plan; it omits some information and displays other information in modified form. In this talk, I'll discuss the differences between the plan as displayed by EXPLAIN and what is actually stored in the server's internal data structures. I'll also discuss why the plan includes the information it does, how that information is computed, and how that information is used at execution time. In my 2010 talk "The PostgreSQL Query Planner", I focused primarily on the types of plans that the planner was capable of generating and the reasons why it was necessary to compare various alternatives. Tom Lane's 2011 talk "Hacking the Query Planner" discussed the overall order of operations in the planner and explained some of the motivation behind it. In this talk, I will focus less on the overall operation of the planner and more on the data structure which it produces as output, namely, a tree of Plan nodes. I hope that this will be helpful in understanding PostgreSQL's execution-time behavior as well as in learning to write patches that touch the planner and executor.
Transcript: English (auto-generated)
I'm going to be talking about something that does exist today, and about some of the things about it that I find interesting. The title of the talk is "What's in a Plan? And how did it get there, anyway?" — and by a plan I actually mean the data structure whose name is capital P, small l, small a, small n. Many of you are probably thinking: well, I know how to see the information that's in a plan — you just run EXPLAIN on the query and it prints everything out. And you're kind of right, but it turns out that EXPLAIN might be slightly lying to you. That's one of the things you'll see as a little bit of a recurring theme as we go through these slides. Since I'm talking about the Plan data structure, in one sense the answer to the question "what's in a plan?" is just the fields of that struct. I'm going to group these fields into a couple of categories, and I'm going to go through each category individually.

I want to mention that on many of the slides throughout this talk I've got
either EXPLAIN output or C code taken from the PostgreSQL source code. In many of those cases I have made cosmetic changes to make things fit on the slide better. I don't think those changes are of a nature that will confuse anybody — and certainly not mislead — but if the output that you see here doesn't exactly match what you see in the source tree, that's the reason.

So, actually, I'm not quite correct when I say that this is the stuff in the plan. This is the typedef struct Plan definition from plannodes.h, but it's only the common information that appears in every kind of plan. This Plan structure is embedded in some larger structure that has additional fields which depend on what kind of plan node it is — it might be a Sort or a sequential scan or an index scan or a Gather node or whatever.
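For reference, here is a simplified sketch of that common part of the struct. This is abridged and lightly paraphrased, not a verbatim copy of plannodes.h — List and Bitmapset are reduced to opaque stand-ins, and different server versions have slightly different fields — so check the header in your own source tree for the authoritative definition:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for PostgreSQL's real supporting types */
typedef int NodeTag;                 /* really a giant enum of node types */
typedef double Cost;                 /* abstract cost units */
typedef struct List List;            /* generic list of nodes */
typedef struct Bitmapset Bitmapset;  /* set of integers (parameter IDs) */

/* Abridged sketch of "typedef struct Plan" from plannodes.h */
typedef struct Plan
{
    NodeTag     type;            /* tag saying which kind of plan node */

    /* costing information */
    Cost        startup_cost;    /* cost before the first tuple comes out */
    Cost        total_cost;      /* cost to return all tuples */
    double      plan_rows;       /* estimated row count (a double!) */
    int         plan_width;      /* estimated average row width, in bytes */

    /* parallel query support */
    bool        parallel_aware;  /* does this node itself know about parallelism? */
    bool        parallel_safe;   /* OK to run this in a worker process? */
    int         plan_node_id;    /* unique ID within the whole plan tree */

    /* the "meat": output columns, filter, and child subtrees */
    List       *targetlist;      /* what this node produces */
    List       *qual;            /* filter conditions applied to the output */
    struct Plan *lefttree;       /* outer subtree, or NULL */
    struct Plan *righttree;      /* inner subtree, or NULL */

    List       *initPlan;        /* init plans attached to this node */
    Bitmapset  *extParam;        /* params this node (or children) depend on */
    Bitmapset  *allParam;        /* extParam plus params set by our init plans */
} Plan;
```

For leaf nodes like a sequential scan, lefttree and righttree are simply NULL; the node-type-specific fields live in a larger struct whose first member is this Plan.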
And so there are going to be additional fields, effectively beyond the end of this structure, that pertain to whatever specific kind of plan you've got. The way that you know what kind of plan you've got is that the very first field of the Plan is a node tag. I'm not going to talk further about the node tag, except to say that it's a giant enum that tells you what kind of plan you've actually got, and I'm also basically not going to talk very much at all about any of the node-type-specific data. I'm just going to focus on the common structural information that's part of every plan node.

We can break that up into a couple of categories. There's the node tag, which I just mentioned. There's some costing information, and some things related to parallel query support. There's a target list, something called a qual, and left and right subtrees — and I think the target list, the qual, and the left and right subtrees are really the meat of this. Then there's a list of init plans, and there are two fields called extParam and allParam — and if you have no idea what those fields store, you're not the only one. And then there's the type-specific information.

So, if you think about it, in one sense it's a little strange that we store
costing information in the plan because actually the way that planning happens in PostgreSQL is we first start by generating paths which are in a sense candidate plans and then we pick the cheapest one and we convert that into a final plan so why do we need the costing information we already know that
this one was the cheapest one otherwise we would have thrown it away so that's kind of an interesting question I'll talk the big answer of course is explain right we like to be able to run explain and actually see the costing information and if we throw it away and we wouldn't have it it's
actually used in a few other places as well and I'll show that on the next slide but there are four fields that pertain to this in the plan data structure there's the startup cost the total cost and the startup cost means the amount of effort in abstract cost units that will be required to
generate the first tuple and then the total cost is the effort to generate all the tuples then we have an estimate of the number of rows that this plan node will produce which as you can see is actually a double value not an integer and then we have the plan width in bytes per row again estimated because we do a lot of floating-point math when we're
estimating things and we don't want to throw away the precision actually we do throw away the precision in a lot of places but but not everywhere also the other advantage of the double is you can represent numbers which are really
really big using the double like bigger than 4 billion so how does this get used explain is a big one if you run explain you see all that stuff there's a few other places where this gets used in the executor if we're building some kind of a hash table specifically if we're doing a hash join or a hash sub plan the row count and the width are used to set the initial size
of the hash table, which is not totally critical, but getting it approximately right at the beginning makes things slightly more efficient. For a hash join, we use this information to decide whether we should fetch the first outer tuple before building the hash table or after building the hash table. The reason that's important is that it might turn out that one of the inputs to the hash join is empty. If we try to pull the first tuple that's going to be probed into the hash table and we don't get one, then we can completely skip building the hash table. On the other hand, if we try to pull the first row that goes into the hash table and we don't get one, then we can also skip building the hash table, and we can skip generating any row from the other side. So we actually use the cost to decide which of those two optimizations to try.

There's a plan shape called AlternativeSubPlan, where we actually generate two different plans for a subplan. How many people have actually ever seen an AlternativeSubPlan? Is that a question, or you've seen one? You've seen one — okay, yeah, a few people. There are a couple of examples in the regression tests; that's the only place where I've ever seen an AlternativeSubPlan, but apparently they happen. And we also use this costing information to decide between custom plans and generic plans.

Okay, I've got to speed up. So, parallel query: there are basically
three fields here. There's a field called parallel_aware, which answers the question: does this plan node know anything specifically about parallelism? There's a parallel_safe flag, meaning: could this part of the plan potentially be executed in a worker, or can it only be safely executed in the leader process? And there's a plan node ID.

To understand why a couple of these fields are here, I've got some very simple examples of plans. In this first plan here on the slide, we've got a Gather node on top of a merge join, and under the merge join are a parallel index scan on a and an index scan on b. Both of those are index scans, but the parallel_aware flag is set on the first one, and what that means is that the rows in a are going to be divided between all of the participants in the parallel phase of the operation, whereas for b, since the parallel_aware flag is not set on the plan, every participant is going to see every row. If you think about it, that is a hundred percent necessary for correctness: if you have some workers compare a subset of the rows in a to everything in b, and another worker compares a different subset of a to everything in b, and so on, you will get the right answer; if you take random subsets of a and compare them with random subsets of b, that won't work. So we have to know which one of those things we're doing, and that's the need for the parallel_aware flag.

The plan node ID is necessary because of plans like this one: here we have a Gather on top of an Append with three parallel seq scans underneath. In each of these parallel seq scans we need to divide the rows between all of the workers, and for that to happen, the workers have to talk to each other. But about what? They need some way to identify which plan node they are talking about, and the plan node ID serves that function. You could say, well, why not do it by table OID? But it turns out you can reference the same table multiple times in the same query, so the same OID can be there more than once; the plan node ID is unique. This could maybe be
used for things besides parallel query, but right now, not so much.

Okay, so now we get to the heart of this — or the start of the heart of it. We've got the target list, which is the set of things that this node is going to produce. PostgreSQL has a kind of Volcano-style executor, where tuples start at the bottom of the tree, percolate up the tree, and then explode out the top. There's actually a system called Volcano, and I think it's called Volcano because its executor works that way as well. Then there's a qual: the qual is one or more filter conditions that apply to this node, so it's a list of expressions. Any tuple that would be produced by this node but which fails to match those expressions gets thrown out. And then we have a left and a right subtree for the plan.

Here's an example of what this looks like in the EXPLAIN output. You can see here that the merge left join, which is at the top of this plan tree, has an output list in the EXPLAIN VERBOSE output — it's shown in gold here — and that node is producing a.q2 and b.q1. That's the format of the tuples it's producing; that's the target list. The filter condition which you see there is the qual; that's what's being tested at the very end, before we emit the tuple. You'll notice this node also has a type-specific expression, a merge condition, and that's something that's part of the join processing itself — part of the way a merge join works — but whatever the merge join actually produces, we apply this filter to it afterward and throw away anything that doesn't match. And then this plan has a left tree, which is shown here in blue, and a right tree, which is shown here in purple.

Not all of these fields are actually used for every node. For example, the sequential scans that appear in this plan have neither a left tree nor a right tree, because they have no descendant nodes. The sort has a left tree but no right tree, because it only has one child. So what happens here? Here I have an Append node with four sequential scans underneath — left tree, right tree, center-left tree, center-right tree? So what are the left tree and the right tree in this example; anybody know? They're NIL, right. There are a number of nodes that can have an essentially unlimited number of children, and those nodes keep the children in a flexible-sized list that is part of the node-type-specific data, and they leave the left and right trees completely empty. And EXPLAIN lies to you: it makes it look exactly the same as the other case, but it's not.

When you use a subquery in your query and you don't put it in the FROM clause,
then you get these things called init plans and subplans. I've written an example here: in the WHERE clause I have WHERE f1 = (SELECT min(abs(f1)) FROM int4_tbl), and that turns into an init plan — InitPlan 2 here — with a plan associated with it that computes the value. And then I also have another subquery in the target list, where I say (SELECT odd FROM tenk1 WHERE unique1 = f1), and that turns into a subplan. The difference — to condense a very complicated topic down to a few words, in a slightly misleading way that omits some important details — is that the init plan only needs to be computed once, whereas the subplan needs to be re-executed for every row.
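As a toy illustration of that once-versus-per-row difference — this is invented simulation code, not PostgreSQL source; the function names and values are made up — the init plan's result is computed a single time and reused as a parameter, while the correlated subplan runs again for each outer row:

```c
#include <assert.h>

/* Hypothetical toy model: an "init plan" result is computed once and stashed
 * as a parameter; a "subplan" is re-executed for every outer row because its
 * result depends on that row. */

static int initplan_evals = 0;
static int subplan_evals = 0;

/* stands in for e.g. (SELECT min(abs(f1)) FROM int4_tbl) */
static int run_init_plan(void)
{
    initplan_evals++;
    return 42;                      /* made-up value */
}

/* stands in for a correlated subquery referencing the outer row */
static int run_sub_plan(int outer_f1)
{
    subplan_evals++;
    return outer_f1 * 2;            /* made-up computation */
}

static int execute_outer_scan(const int *rows, int nrows)
{
    int matched = 0;
    int param0 = run_init_plan();   /* evaluated once */

    for (int i = 0; i < nrows; i++)
    {
        (void) run_sub_plan(rows[i]);   /* re-evaluated per row */
        if (rows[i] == param0)
            matched++;
    }
    return matched;
}
```

Running the outer scan over three rows evaluates the init plan once but the subplan three times.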
So naturally, if we're going to have init plans and subplans attached to plan nodes, there must be a list of each of these things associated with the plan node, right? No — wrong. There's a list of init plans associated with any plan node, but there's no list of subplans. The SubPlan nodes just appear in the expressions that are attached to the node — the filter condition, the merge condition, whatever it is — and the executor, when it starts up, builds a list of all of the ones that it finds while it's walking down the execution state tree. EXPLAIN shows that list to you, and it also shows you the init plan list, which actually does exist in the plan.

[Audience question] Yeah — if it referred to the outer copy of the table, you'd end up with a subplan rather than an init plan, which is why it couldn't be an init plan; it would have to be a subplan. But here it only gets executed once, and we end up with an init plan.

Now, you may have noticed also on this slide we have this thing where it says InitPlan 2 (returns $1). What's $1? Well, $1 is a parameter.
So basically there are these two fields, extParam and allParam, which are associated with every plan node, and they're basically the same thing — like 95% the same thing. They tell you which parameters that plan node cares about. The only difference is that if a plan node has an init plan attached, the output parameters — the parameters that the init plan sets — get included in allParam but are excluded from extParam. I have yet to figure out exactly why we do it this way.

So here's an example of a case where extParam is empty and allParam is not. This is a really useful query from the regression tests: SELECT 1 = ALL (SELECT (SELECT 1)). It ends up producing this plan, and as you can see, the Materialize node here has an init plan attached. That init plan returns $0, so the allParam set of the Materialize node includes parameter 0, but the extParam set is empty, because the parameters set by init plans are excluded as a special case.

allParam is used at execution time to decide which nodes to reset when we need to rescan. So, for example, suppose we sorted some data and we looked at the output of the sort,
and then we need to look at the output of the sort again. There are two things we can do. One is, if we still have the results of the previous sort, we can just return those results over again — like, if we sorted in memory and the sorted data is still in memory, we can just back up to the beginning and return all the same results over again. The other thing we could do is throw away the sorted data that we already have, re-run the portion of the plan that generated that data, sort everything again, and then return those rows. Obviously, the first option only makes sense if the input to the sort hasn't changed, because if some parameter value has changed, and that changes the input to the sort, you can't just reuse the results — now everything's different.

That was probably totally confusing, so look at this sequential scan here. If somehow — and we haven't gotten to how this can happen yet — the value of $1 changed, then the results of this sequential scan would also change, because the sequential scan has a filter condition that says that f1 has to be equal to $1. So if you changed $1 somehow — don't worry about how that could happen; in this example it can't — then the output of the sequential scan would change. And if there were a Sort node on top of it, every time $1 changed, the output of the sequential scan would change, the output of the sort would change, and so you'd have to redo the sort with the new set of input rows.
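A minimal sketch of that rescan decision, with parameter sets modeled as plain 64-bit masks rather than the server's Bitmapset type (the names and representation here are mine, not PostgreSQL's):

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: param sets as 64-bit masks; bit N stands for parameter $N. */
typedef uint64_t ParamSet;

#define PARAM(n) ((ParamSet) 1 << (n))

/* If any parameter this node depends on (allParam) is among the parameters
 * that changed since the last execution (chg_params), any cached results are
 * stale and the node must be re-executed; otherwise a simple rewind-and-
 * replay of the old output is good enough. */
static int must_reexecute(ParamSet all_param, ParamSet chg_params)
{
    return (all_param & chg_params) != 0;
}
```

If the changed-parameter set overlaps the node's allParam set, the node has to be re-run; otherwise rewinding is enough.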
Okay. So the executor turns out to rely on extParam as well as allParam, but only barely. The only thing that the executor does at runtime with extParam is test whether it's empty, and it only does that in like one corner case. The only query in the entire regression test suite where the fact that we have both extParam and allParam in the finished plan makes any difference to what happens at execution time is the one I showed you on the previous slide. That's the only case where it matters whether we test extParam or allParam, and I'm pretty sure it would work if we tested allParam — it would just use a few extra CPU cycles. So actually, having both extParam and allParam in the final plan isn't really pulling its weight, as far as I can tell. I think the main reason we actually have both of these is that they're used while we're assembling the final plan: we incrementally convert the paths that we generate into plans, and in the process of building up the final plan we end up using both of these sets in the calculations. But in the final plan, the distinction between the two almost doesn't matter.

So, parameters are pretty important.
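Here's a rough sketch of the sideways data flow that parameters enable — again a toy model with invented names, not executor code — in which a nested loop writes each outer join key into a shared parameter slot before driving the inner scan:

```c
#include <assert.h>

/* Toy model: the nested loop stuffs the current outer value into a shared
 * parameter slot ("$0" here), then runs the inner scan, which reads the slot
 * to know which rows to fetch. */

static int param0;                  /* the slot behind "$0" */

/* stands in for an inner scan whose filter is driven by the parameter */
static int inner_scan_count_matches(const int *inner, int n)
{
    int hits = 0;
    for (int i = 0; i < n; i++)
        if (inner[i] == param0)
            hits++;
    return hits;
}

static int nested_loop_join_count(const int *outer, int nouter,
                                  const int *inner, int ninner)
{
    int total = 0;
    for (int i = 0; i < nouter; i++)
    {
        param0 = outer[i];          /* stuff the outer value into the param */
        total += inner_scan_count_matches(inner, ninner);
    }
    return total;
}
```

In the real executor the slot is an entry in an array of ParamExecData, and the inner side is typically an index scan keyed on the parameter; here the inner scan is just a filtered array walk.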
Everybody see the parameter in this plan? Anybody see the parameter in this plan? Yeah, where is it? Yeah — it's this int4_tbl.f1; that's actually a parameter. So how many nodes in this plan have a non-empty allParam set? Which nodes? Somebody said two. Which two? The two index scans — close, but no cigar. Also the Append. The Append depends on $0, because if you change the value, clearly the index scans are going to produce different results — you've changed which value you're looking for — but that also means the Append is going to produce a different result.

In this case it's the nested loop that's generating the $0 parameter, and that $0 parameter is being used to communicate between the sequential scan on int4_tbl, which is on one side of the nested loop, and the index scans on t3i, which are on the other side. As we read each row from int4_tbl, the nested loop stuffs that value into a parameter called $0, and then when we do the index scans, we use that value of $0 to drive the index scan so that we get the right rows. [Audience question] It's just f1, yeah — I mean, I think you can construct a query where you end up stuffing a whole tuple into a parameter, but in the normal case you just put in the column value that you need.

Okay. So now I'd like to revisit the topic of EXPLAIN versus reality, up to this point in the talk. parallel_safe is not
displayed. The plan node ID is not displayed. Init plans and subplans are displayed in the same way, but only init plans are really attached in the way that they're displayed. extParam and allParam are not displayed, although the init plan display lets you infer something about them, especially if you also know how nested loops and other parts of the planner work — but it's tricky; you can't entirely see what's going on there. Those are the small lies — the small things that EXPLAIN is hiding; it's not telling the whole truth. Where we really get into large lies, in my opinion, is when we come to expressions.
The way that expressions are represented here is, in some of these cases, totally unlike what is actually stored in the plan, and there are hints of that in this display. If you look at this nested loop join here, you see that the second output column is the constant 666. If you go up one level, it's now parenthesized: (666). Where do the parentheses come from? What's going on here, right? And there are other things about it that will start to bother you if you spend as much time looking at plans as I do — which maybe you shouldn't, because then you might end up giving a talk like this. The sequential scan on int4_tbl i1 produces as output i1.f1, and I think we can all feel confident saying that that sequential scan really has access to the value of i1.f1 — it's scanning that table. But this nested loop here: how does it have access to i1.f1? It's not touching the table; it's just a nested loop. Of course, the answer is that things bubble up, right? It's a Volcano-style executor; things bubble up. So 666 bubbles up, and i1.f1 bubbles up, and all of the things that start at the inner layers of the plan tree actually move outward and eventually get to the top of the plan tree — or else they get filtered out, and they don't.

So here's a more intellectually honest representation of this plan. Down here at the bottom is the thing we talked about before with nested loops: we're not really comparing i2.unique1 to i1.f1 directly; we're really comparing it to this $0 parameter, which got populated over here. That makes sense: this is a scan on i2, not a scan on i1, so it has to get the value from i1 from someplace, and it gets it from the parameter. But also in this display you can see how the information is bubbling up to the higher levels of the plan. We can see here that it's actually this nested loop left join that first produces the value 666 — that's where the 666 really comes from. The higher levels of the plan aren't producing 666; they're just passing through what they got from the lower levels of the plan.
So you can see that the Materialize node's output is just the two columns that it gets from its outer subplan, which is the nested loop left join. The nested loop left join produces two columns, one of which it takes from its own outer subplan and the other of which is a constant that it generates. Then those two columns, once they've been produced, get passed up to the Materialize node, and it just passes through those same two columns: it takes the first and second column of its outer input. And then when you go up and look at the very top of the plan — which I can't reach, because I'm too short — you have the join filter, which is comparing the first column of the outer tuple to the first column of the inner tuple. It doesn't have any direct access to the underlying tables; it doesn't really know what underlying tables produced those values. It just knows that the Values scan and the Materialize node handed it these tuples, and it looks at the first column of each one and compares them to see whether that join filter condition passes. If it does — does someone have a laser pointer? I don't know how to use it — if it does, then it assembles a new output tuple: it takes the first column from its outer tuple, which corresponds to what was originally values.column1, and then it takes the first and second columns from the plan on the inner side, which is this Materialize node.

So you can see how the data is flowing up the plan tree through what I've labeled here as inner and outer references, and when the data flows in some direction other than straight up, it goes by way of these parameters. The parameters allow us to pass data sideways, or in strange ways, through the plan tree, and everything else just bubbles up. The EXPLAIN output turns around and tells you where the values came from originally: in the actual EXPLAIN output we have values.column1, because that value originally came from column1 of the Values scan; and i1.f1, which originally came from the i1 table — it was produced by the sequential scan and bubbled up to here; and this 666 here originally came from this 666 down here. But when we're executing this, we're not really doing it that way; EXPLAIN has kind of masked the real data flow here a little bit.
And one thing that I think is kind of interesting about this is that when we initially generate paths, the references to the table columns — which internally are called Var nodes — and also the references to expressions in the target list still, in the initial phases of planning, refer to the table that will really provide the value. So when we're initially doing the planning, the data structures that we're generating actually look very much like the final EXPLAIN output. But at the very end of the planning process, after we've figured out how we're finally going to execute this thing, we no longer care about the original origin of the value. We need to know where the plan node that's running is going to get the value at execution time: is it coming from the outer plan, is it coming from the inner plan, is it coming from a parameter — where is it coming from? So one of the last stages of planning is to run through the plan and replace Vars and expressions with new Vars that refer to the outer or inner plan, and that produces this kind of thing, which is what the data structure really looks like internally.

I'm a little ahead now, so that's actually all I've got on what's in a plan. We have a few minutes, so I can take a few questions if anyone wants to ask one. [Audience question] Yeah — I'm going to go with no; the question is, can I get a summary
of this non-technically? No, sorry. [Audience question] Uh, yes — that query, I don't think I have it handy, but you can find it in the regression tests, because this came straight out of the regression tests. Yeah, Greg. [Audience question] Okay, so you have encountered something called the use-physical-tlist optimization — or sometimes the use-physical-tlist de-optimization. What happens is that the way the data is actually passed up the plan tree is in something called a tuple table slot, and a tuple table slot can hold the data for a tuple in one of several different formats. If we're doing something like a sequential scan, the easiest thing to do is actually to cram the entire tuple, with all of its columns, directly into the tuple table slot, and let the next node up get all of the columns and then pull out only the ones that it needs. So if we determine that that optimization makes sense in a particular case, then the sequential scan will be seen to output every single column in the table, and at the next level those columns will be projected away. If we determine that the optimization isn't a good idea in a certain case, then we'll make the sequential scan do the extra work of pulling the tuple apart, pulling out the stuff we need, and passing up only the things that we need. For example, if there's a Sort directly on top of the sequential scan, we don't want the sort to get all the columns, because then we'd be sorting really wide data instead of hopefully narrow data, and that would stink. So in that case we'll tell the sequential scan: sorry, you have to give us exactly the columns that we really need. But when we don't care, we'll say to the sequential scan: hey, if you want to just produce everything, that'll be fine with us; we'll deal with it at the next level. [Audience question] Right — projection is constructing a new tuple, yep; projection can involve computing expressions and throwing away columns.
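That trade-off can be sketched like this — a toy model, not executor code, with tuples reduced to small integer arrays and invented function names:

```c
#include <assert.h>

/* Toy sketch of the use-physical-tlist trade-off: a "physical" target list
 * hands the whole stored tuple up unchanged, while projection builds a new,
 * narrower tuple containing just the needed columns. */

#define MAX_COLS 8

typedef struct Tuple
{
    int ncols;
    int values[MAX_COLS];
} Tuple;

/* Physical tlist: no per-row work, but output is as wide as the table. */
static Tuple pass_through(const Tuple *stored)
{
    return *stored;
}

/* Projection: copy out only the requested columns (0-based positions). */
static Tuple project(const Tuple *stored, const int *cols, int ncols)
{
    Tuple out = { .ncols = ncols };
    for (int i = 0; i < ncols; i++)
        out.values[i] = stored->values[cols[i]];
    return out;
}
```

With pass_through, the node above pays for the full width (bad under a Sort); with project, the scan pays per-row work to emit only what's needed.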
Peter. [Audience question] Yeah, I had thought of the same thing myself. I think having something like a RAW option for EXPLAIN — we can bikeshed about the name — but having a mode that displays some of these internal details would be really helpful, at least for developers, because it took me a long time to actually understand some of what was going on here, and there are bits and pieces of it that I'm still not sure I fully understand. I have definitely had cases, when I was working on parallel query, where I was like: wait, which parts of this plan does the planner think are parallel-safe? And you can't tell. You can tell what it actually chose to put below the Gather node, but that can be a costing decision. So yes, I think there would be value in exposing some more of these bits. Well, no, I'm not talking about that — that's showing completely alternative plans, which would be a whole different thing. But if you're just trying to judge which things could have been put under a Gather, knowing whether those plan nodes were marked parallel-safe would be enough to tell you whether that was hypothetically possible. Knowing the plan node IDs is probably mostly useful for debugging, but we do do debugging sometimes. extParam and allParam are definitely interesting, again at least if you're doing development. So I think it would be cool to have an extra mode that basically says: don't deparse my expressions so cleverly, just tell me what it really looks like — and show some of these other details that are probably mostly boring for users, although I think it's kind of useful to have some idea how this works so that you don't get confused by the output. But certainly for developers, yeah, I think that would be handy.

It's definitely the case that if you do something like this, even on a plan tree this size, but certainly on a really long plan tree, if you have OUTER and INNER references all over, every instance of the word OUTER refers to something different and every instance of the word INNER refers to something different, unless they're on the same line. So yeah, there's definitely a possibility of confusion there. I'm not sure that would be an improvement,
but we can talk about it. Yeah — an AlternativeSubPlan: we pick two different plans for executing a subquery. I think usually one of them is a hashed subplan, which will be useful if we're going to execute it a whole bunch of times, and the other one is a non-hashed subplan, which will be better if we're only going to execute the subplan once or twice. And instead of figuring that out at plan time, we cheat and leave it to the executor to decide which one it's going to do. It's kind of a hack — and I say that having asked Tom Lane about it yesterday, who said: that's kind of a hack. [Audience question] Right, so there's debug_print_plan, and it dumps the raw node tree. That is sometimes useful, but it is really verbose.
Right — no, that's true: if you turn on debug_print_plan, you get something that will absolutely tell you exactly what's in the plan tree. But my problem with that is that for something like this, it's like 20 pages of output. So I'm no longer confused because EXPLAIN is hiding stuff from me; I'm confused because I can't navigate through this enormous node tree. Just the target list for a typical query is two or three pages long, and it gets hard to figure out: wait, am I still in the same list of Var nodes, or am I now in the next list of Var nodes? [Audience question] Right — well, that's another way you could do it, yeah. I mean, we could certainly do nothing; we've managed to make PostgreSQL as good as it is today without doing anything here. And, you know, you have an idea of how to make it better, and that's cool, and I have a slightly different idea, and that's also cool. I'm happy to hear anybody's ideas — maybe not get into a long discussion right now — but I'm happy to hear anybody's ideas on how to improve this. I just think that this isn't the kind of output that you probably want to use for everyday debugging, but I think it's useful to know that this stuff is happening, even during ordinary debugging, because it can make you less confused about what's really going on. And certainly for development there are times when you'd like a little bit more — but maybe not something quite so "throw pages and pages of ASCII at me" as debug_print_plan, which I have used, but only under duress. Yeah, we have done some things over
the years to make it more possible to hook into EXPLAIN output — for example, we have custom scans, and custom scan providers have some methods that they can call to show things in EXPLAIN output. Doing this kind of thing, though, is definitely beyond the extension capabilities that we have today. Part of it is that this goes beyond EXPLAIN: in order to generate the output that I'm showing you here, I had to hack ruleutils.c, not explain.c, and so it wouldn't be as simple as just providing an API into EXPLAIN. If you really want this, you actually need to pass some flags down to something much, much lower level, or rewrite all of ruleutils.c in your own extension — which, I mean, you could do. So I'm not actually sure there's a very convenient way to turn this into an extension. It could certainly be done: you could write a function, and you pass it a thing, and it generates a plan, and then you copy all of ruleutils.c and all of explain.c and change everything so that it produces this output. You could do it in a couple of days; it would just be hard to maintain, because as the upstream code changed, things would diverge. If you have an idea of how to refactor it so that we can plug into it, I'm game for that too. I don't immediately see a way to do it, but there are lots of things that are good ideas that I don't know how to do. I think we have time for maybe one more question, if anybody has one. Good — everything is perfectly clear. Okay, thanks, everyone.