Hash Joins: Past, Present and Future
Formal Metadata
Title: Hash Joins: Past, Present and Future
Number of Parts: 19
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifier: 10.5446/48964 (DOI)
PGCon 2017, 16 / 19
Transcript: English (auto-generated)
00:06
Hi, everybody. My name's Thomas Munro. I work for EnterpriseDB, and I'm going to be talking about hash joins. Very briefly about me, the main thing I've been working on in the past few months is
00:23
a proposal to make hash joins parallel-aware. This talk is going to have a small component about my proposal, but also just generally about hash joins and their implementation in Postgres. I work on the database server team at EnterpriseDB, Robert Haas's team.
00:46
Cool. This is the format of the talk, just a bit of an introduction to joins, hash tables, and then simple hash joins, and then we'll get into the hairy details of multi-batch hash joins and then talk about parallelism. So basically, a bunch of people at IBM in the 60s and 70s invented relational algebra, and
01:04
the System R implementation sort of showed the world how to do a SQL database. In SQL, we have a whole bunch of ways of writing join queries, and I'm going to
01:23
go ahead and assume that you all know what these things all do. So we've got, these are all equi-joins, they're all something equals something. So they're all taking two relations, joining them together, and spitting out a new relation. And there's outer joins, inner joins, semi-joins here, where you just test whether
01:45
a matching row exists, and so on. So there's three basic strategies for executing joins in a relational database. Nested loops, almost from the definition of what a join is, sort of fall out of that.
02:05
You just walk through one relation, and for each one scan the other relation for a match. That could be via a sequential scan, or it could be via an index scan. Then there's merge joins, where both of the input relations are in the same order, or have been sorted if necessary, to put them in the same order.
02:21
And then you just walk through them sort of in sync, finding the matches. And then there's hash joins, which were the last to be discovered by people implementing databases. And that's, you build a hash table from one of the relations, and then walk through the other relation, probing the hash table to find matches.
02:44
So if you step back a bit and squint, you can see the hash join from my description is a little bit like having a nested loop with an in-memory index that you build on the fly for the inner relation. So a couple of things we can say about hash joins, they need a lot of RAM.
03:02
I mean, the basic idea, because you're building this temporary hash table in memory, you obviously need a bunch of memory. Whereas those other strategies don't necessarily need any RAM. RAM might be helpful if you're sorting for a merge join, but it's not necessary for index scans, for example.
03:22
So it was the invention of large RAM systems that came along a bit later in database history that led to hash joins being invented. But of course, when you have more RAM, that's also good for sorting. I think I've just lost a cable. No, I haven't. So the choice of join algorithm can, in some cases, be limited by the join type and join conditions.
03:44
Has anyone ever seen this error before? Actually, I believe somebody has something in the next commit fest to fix that particular problem, which is very cool. Okay, so I'm just gonna talk a little bit about hash tables as they exist in Postgres.
04:05
So there are at least three hash table implementations inside Postgres. There's the DynaHash, which is sort of a general purpose hash table used in back-end local memory in some cases and in shared memory in other cases. That's a chaining hash table, meaning there's a linked list of
04:24
pointers for the conflict resolution mechanism. Then there's SimpleHash, which Andres Freund added to the most recent release. And that is an open addressing system with a different conflict resolution system. And then the hash join operator has its own hash table
04:43
implementation. Why, you might ask, when we have two perfectly good general purpose ones. Well, the hash table that's used for hash joins is extremely simple. It's really nothing more than an array. And one of the things about this particular hash table is
05:01
that it has to deal with tuples that have the same key. Not just because of an unintentional hash collision, but because the tuples actually had the same key. So this is a hash table that's not like a Python dictionary or a C++ unordered_map or whatever, because multiple
05:21
copies of the key can finish up being inserted and you need them all there. So if you did use one of these other hash table implementations, you'd finish up having to do your own chaining to deal with the multiple tuples at each value. And if you have to do that anyway, what are you really getting from using a general purpose hash table?
05:42
There's a couple of other properties of this hash table that make this very simple array approach appropriate. One is that the only thing we ever do with this hash table is insert everything in one phase, then probe it in another, and then just throw the whole thing away. So I started out thinking, when I first started looking
06:01
at this code, why aren't we using the general purpose hash table, but then I came around to the view that this really is a case where just using an array does make a lot of sense. So in memory, it simply looks like an array where you hash your key somehow and produce a number and find the right
06:20
slot and there's a chain of tuples in each bucket. Another feature of this hash table, it's not really the hash table itself, but the way that the hash join operator deals with memory management is that it loads tuples into chunks of memory and that reduces the overhead associated with individual allocations of tuples.
06:43
Those chunks also provide a convenient way to iterate over all the tuples when we need to do a couple of different operations that I'll talk about in a minute. So, moving on to straightforward simple hash joins.
07:01
When you use EXPLAIN on a hash join, you see the two different relations, what we call the outer plan and the inner plan, and the inner plan is the one that's hashed. Usually that's the smaller of the two, because we're looking for something that's hopefully gonna fit in memory. So, at a very high level, the algorithm has at least
07:23
two phases, for an inner join you have the build phase where you load all the tuples from the inner relation into the hash table and then the probe phase where you scan the outer relation trying to find matches in the hash table. If it's an outer join, you also need to find unmatched
07:42
rows, so that third phase comes into play if it's a certain type of outer join, where we need to scan through all the tuples that are in the hash table in memory and look for unmatched rows. There's a couple of optimizations: we can skip scanning either the outer or the inner relation
08:04
under certain conditions, but obviously outer joins prevent that. So, the hash table consists of a certain number of buckets, it's just this array; at planning time we figure out what size that should be and we try to make sure that the load factor is one.
08:21
In earlier versions of Postgres we tried for a different number, and I think that tension is because ideally we'd actually like to have one bucket per distinct key value, not per tuple. But it's kind of hard to figure that out, I think.
08:43
So, the planner estimates the number of rows in the inner relation, and the hash table gets sized to the nearest power of two greater than the number of rows that's expected. And after loading the hash table, if that turns out to be too high, then we do a reasonably efficient thing
09:00
where we choose a new size if we need to, because we found out there were too many things in each bucket, and then we can just scan through all those chunks I talked about to reinsert all the tuples. So, here's an example where originally the planner thought there were gonna be 1024 or fewer keys
09:23
in this hash table, but it turned out that in order to meet its goal of a load factor of one or below, it really needed 2048. And I did that by tricking it, by writing an expression which is true for every row, i mod five is less than five, that's always true, but the planner looked at that and said, yeah, that's gonna produce a smaller number of rows.
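The sizing rule just described, the nearest power of two at or above the planner's row estimate, is easy to sketch (hypothetical helper name):

```python
def choose_nbuckets(estimated_rows):
    """Smallest power of two >= estimated_rows, for a load factor of about one."""
    n = 1
    while n < estimated_rows:
        n *= 2
    return n
```

With an estimate of 1024 or fewer keys you get 1024 buckets; once the executor sees the estimate was too low, it picks a new size with the same rule and reinserts the tuples by walking the chunks.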
09:42
So, cardinality estimations being wrong can lead to tuples having to be reinserted into a reallocated hash table. So, that's a very simple overview of straightforward hash joins when no batching's involved.
10:03
But in order to respect work_mem, and to make hash joins fit into the finite space that you set with work_mem, we need to use partitioning to reduce the amount of data that's loaded into memory at once.
10:21
The way we do that is by the planner estimating how many batches it needs to chop the inner relation into so that each one will fit in work_mem. So, this approach is known as the GRACE algorithm. I think GRACE was actually a particular database machine
10:40
developed in Japan, I think, that first tried this approach of having the planner estimate basically how many times to chop the input relation in half, and there's a slight refinement to that idea, which is called hybrid, which is that the very first partition or batch is loaded directly into the hash table,
11:01
so we avoid writing one of the batches out to disk. Now, it's possible that the planner could be wrong about how many batches it needs to chop the data into to stay under workmen. If that turns out to be the case, we have this adaptive approach,
11:22
which will increase the number of batches, and I'm gonna work through an example of that in a moment. There's an optimization whenever multi-batch hash joins are involved, which tries to move the most common values from the outer relation and make sure that they get processed at the same time as the first partition
11:42
so that we can avoid having to write those out to disk, so this is an IO reduction technique, and I'm gonna work through an example of that in just a moment. So, when the hash join begins, if the planner has determined that it's gonna need to chop the inner relation into batches
12:00
in order to fit it into memory, the build phase sucks in all the tuples from the inner relation and then fires them to one of, in this case, five different places. It's decided in this case that it wants to have four partitions or batches,
12:20
so it loads all of those tuples that hash to partition zero into the hash table in memory, but if it sees any of the most common values from the outer relation, then it'll put them into this special side table, the skew hash table. Everything else goes into one of these batch files, and these are files on disk.
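A sketch of that build-phase routing, with hypothetical names, using hash modulo number-of-batches as the batch number (the real code carves bucket and batch numbers out of different bits of the hash value):

```python
def route_inner_tuple(hashval, key, tup, nbatches,
                      hash_table, skew_table, batch_files, outer_mcv_keys):
    """Send one inner tuple to the hash table, skew table, or a batch file."""
    if key in outer_mcv_keys:
        skew_table.append(tup)            # most common outer values stay in memory
        return
    batchno = hashval % nbatches
    if batchno == 0:
        hash_table.append(tup)            # batch zero is processed immediately
    else:
        batch_files[batchno].append(tup)  # everything else spills to "disk"
```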
12:45
After it's finished doing that, it then needs to probe, which means scanning the outer relation, pulling in every tuple, and first of all saying, well, which batch does this tuple need to go in? And if it's batch zero, then it can immediately probe
13:00
either the hash table or the skew hash table to see whether there's a matching tuple there, and that may lead to emitting a tuple from the hash join operator. Tuples from the outer relation that need to go in any of the other batches get written out
13:20
to a file on disk, so this operation is generating a ton of disk writes. So after it's finished processing batch zero, it now needs to process batch one, so the first thing to do is to load all the tuples that we wrote into batch one on the inner side into memory, and then we proceed,
13:42
and we do that for each batch. We then have another probe phase on the outer side. Now here's where it gets interesting. When we come to process batch two, in this case, while loading tuples in from the batch file on disk, we may actually discover that work_mem is full. So what's happened here is that the planner decided
14:02
we were gonna have four batches, but the executor discovered, when it got to partition two in this case, that that was insufficient: it hasn't got enough memory, and it really wants to try and avoid using more than work_mem. So in that case, the adaptive partitioning algorithm kicks in
14:24
and decides to double the number of batches, so here you can see that on both sides, conceptually, every single batch has been split in two. So when we have one of these, I call it a shrink operation, that word doesn't appear in the source code anywhere, but the contents of the hash table,
14:41
which at this point in time has crashed into the work_mem limit, so it's a large amount of data sitting in the hash table in memory, gets hopefully split in half. The goal is to split it in half, because when we double the number of partitions, hopefully half of the data that's in memory will be able to stay there,
15:01
but the other half of the data, we can write out to disk to the new partition we created when we split all the existing batches. Now it's possible that doubling the number of batches has no effect on how much memory we need, because it's possible that all the tuples might hash to the same batch, in which case we have a problem on our hands,
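The doubling-and-splitting step can be sketched like this (hypothetical names; note that with batch = hash mod nbatches, a tuple in batch b can only land in b or b plus the old batch count after doubling, so tuples only ever move to later batches):

```python
def increase_batches(in_memory, nbatches, cur_batch, batch_files):
    """Double the batch count and split the current in-memory batch."""
    nbatches *= 2
    kept = []
    for hashval, tup in in_memory:
        if hashval % nbatches == cur_batch:
            kept.append((hashval, tup))   # still belongs here: stays in memory
        else:
            # Moved forward: written out to the newly created batch's file.
            batch_files.setdefault(hashval % nbatches, []).append((hashval, tup))
    return kept, nbatches
```

If every tuple happens to stay in the current batch, `kept` doesn't shrink at all, which is exactly the case where splitting stops helping.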
15:21
which I'll talk about in a moment. So then after we've expanded the number of batches, and we reach the probe phase again, and we start reading in tuples that we previously wrote out to outer file, the outer file for batch two, we might encounter some tuples that we can immediately,
15:42
that still belong in batch two, so we can use them to probe the hash table that's in memory, but we'll also encounter some that now need to be moved forward to a future batch that we've just created when we split all the batches, so it's fairly complicated. So I'm stepping back from all that. That gives terminology I made up
16:03
to describe these sort of four different ways that a hash join can go. We have the optimal case, which is where the planner thinks that the hash table's gonna fit in memory, and when we come to execute the hash join, we find that that's true. That's the optimal case when the hash join
16:20
is doing a very good job. Then there's the good case, which is where the planner thinks that work_mem is not enough to hold the inner relation in memory, so it decides to use, say, four batches, like in that previous example, and then when we come to execute the hash join, the executor finds that that's true.
16:42
It's still pretty good, because if that data didn't fit in memory, some alternative plan, like a merge join, would probably have a similar problem: if the data doesn't fit in memory and it needed to sort it, it would need to start doing some disk IO too, so it's still a good plan.
17:02
When things start getting bad is when the planner thought it had an optimal or a good case on its hands, but when we come to execute the hash join, the executor discovers that, like in the case we worked through with pictures, work_mem is not enough.
17:21
So now we need to start dumping tuples out to disk and it starts increasing the amount of disk IO considerably, potentially. And then there's a special case of bad, which I call ugly, where it's reasonably unlikely, but it's possible and it's certainly easy to contrive,
17:42
a case where no amount of repartitioning is gonna help and the data is never gonna fit in memory. In that case, we have no choice currently but to stop respecting work_mem. And I don't know if you've ever seen that before. That might happen if you're unlucky. Of course, in many common cases,
18:01
it's still just gonna work and no one will ever notice, but from time to time on the mailing list, you'll see someone complaining about this and it's because of this kind of problem. Yeah, yes, okay. So just really quickly,
18:21
here are some examples of the optimal, good, bad and ugly cases. In the optimal case here, I don't know if you can see that it's in bold there, the planner said that there would be one batch, and at execution time that was true. We set work_mem to 64 megabytes and the memory usage turned out to be 43 megabytes.
18:41
That was a perfectly good hash join. Everything went fine. And here we have the good case. In this case, the planner determined that 64 batches would be enough to stay underneath work_mem. I set work_mem to one megabyte here and the hash join ran perfectly fine. It used 64 batches and it never used more than a megabyte of memory. So all good.
19:03
Here's the bad case. The planner determined that one batch would be sufficient, but it turned out at runtime that the executor needed to split many times to reach 64 batches.
19:20
But it still managed to stay at the goal: a target work_mem of one megabyte, and memory usage was 808 kilobytes. That's the peak memory usage. And then in the ugly case, I actually made a really badly skewed table here, which is called awkwardly skewed, and I made sure that the stats would be wrong. There's various ways that the stats can be wrong.
19:41
In this case, I just altered the table to say, don't ever vacuum this thing, please, and then I did some tricks to make sure that the stats would be wrong. But the stats could easily be wrong for many reasons that are not contrived. And in this case, the planner determined that one batch would be sufficient. At execution time, it repartitioned
20:03
and it observed that further repartitioning would not help, so it gave up, threw its hands in the air and then continued the hash join. And even though I said we had one megabyte of work_mem, it managed to use 35 megabytes at peak. So obviously, in some case, it could decide to use 48 gigabytes and we only have one, and kaboom.
20:22
I've got an open problems section at the end where I'll talk about some potential solutions to that problem. Okay, so now I'm gonna talk about parallel hash join. But first, let's do a quick recap of the relevant bits of parallel query for this problem space.
20:43
So parallel query is based on the idea of partial plans. Partial plans are query plans that can be run by many workers in parallel, and each worker will see a fraction of the total results. But together,
21:01
they'll generate all of the results. So usually at the bottom, the sort of leaf nodes in a parallel query plan, you've got some source of parallelism, which is usually a parallel sequential scan or a parallel index scan. In future, there could be more ways of producing parallelism, but that's kind of the source of parallelism.
21:22
And at the moment, the granularity is always pages. And somewhere above that, there's gonna be a Gather or Gather Merge node, and everything in between that node and the parallel scan nodes is a partial plan.
21:41
So hash joins can be involved in parallel queries at the moment, in 9.6 and 10. But those hash joins are parallel-oblivious, meaning that they're not doing anything special; they don't know anything about parallelism. I mean, the only reason it works is because the outer relation they're seeing is partial,
22:03
as in it's receiving a fraction of the total set of tuples. The hash join node is completely unaware of that, and it simply builds a copy of the hash table from the inner side in each worker. And the planner has proven that that's safe. And there's various cases where it wouldn't be safe.
22:21
For example, I said here, problem two, since there are multiple hash tables, which are all kind of copies of the same hash table, certain kinds of outer joins can't be run this way because there's many different sets of the match flags, for example. But the main problem with this is that each worker is gonna produce a copy of the hash table.
22:41
So it's run a plan that does a whole bunch of work, and then it's used up a whole lot of memory holding the results in a hash table. So I've seen quite a few slides today that have an Amdahl's law slide. I decided to make an Amdahl's outlaw slide, because at first this seems,
23:03
you know, running the probe phase in parallel but not running the build phase in parallel seems like a straightforward case of Amdahl's law. Like you've got the bit that you made faster by adding more concurrency, you know, parallelism, and the bit that you can't make faster. But actually it's worse than that
23:21
because running n copies of the same plan actually creates damage. You know, it actually generates a ton of contention on all kinds of resources, buffer locks and so on, and also uses up a ton of memory for nothing, you know, all these copied hash tables.
23:40
It's actually okay if the inner plan is very small and the resulting hash table's very small, you know, it doesn't use many resources. But the interesting cases here are ones where the hash table would be large. And none of these externalities are included in our costing model. So I have a picture of this polar bear on, you know, the melting Arctic Circle situation,
24:03
because, like, economists talk about externalities, you know, you're actually damaging your environment by doing this, right? So, the basic approaches to solving this problem,
24:20
making the hash join completely parallel and not just parallel in the probe phase: first, partition-wise joins, and we have a project in development for that. My colleague Ashutosh Bapat is working on that with others. So that idea is basically just saying, if you've got a partitioning scheme
24:40
on both of the tables involved and the partitioning schemes match, then you can simply, you know, plan a parallel-oblivious hash join on each partition and everything will be just fine with no communication. And that's great and we will have that, but it will only help you if your partitioning scheme is set up just right.
25:04
Another approach is dynamic repartitioning and there are a whole bunch of different strategies for that. One strategy is that it might be that one side of your hash join is suitably partitioned and the other side can be repartitioned on the fly to match.
25:23
There's also the possibility that no partitioning is involved at all, so you need to repartition both sides to match. And then the third approach is using a shared hash table; in the literature, they refer to this as the no-partition hash table. And I have a proposal to do that.
25:40
So how do we choose between all these different approaches? Well, you don't actually have to choose because certainly the first one, partition-wise joins, we should definitely have and we will have. It's just that that can't deal with all of your join requirements. It's not a completely general solution. The repartitioning schemes are really interesting.
26:02
So there's a state-of-the-art cache-aware repartitioning algorithm called radix join; there's a lot of stuff written about it and there are systems out there doing it. It does a really expensive multi-pass partitioning phase before it even begins the build phase of the hash join.
26:21
And the goal of that is to minimize cache misses during probing. So it actually knows about the size of your L1 cache and your L2 cache and knows things about NUMA nodes and all that kind of stuff. And it does a ton of work up front. And it makes the probe phase so cheap that it manages to win back that time,
26:40
which is really interesting that that really says something about how expensive cache misses are. And hash tables, of course, are prone to cache misses because you're randomly accessing memory all over the place and there could be gigabytes of hash table in memory and you're, you know. I think you just killed your mic. Ooh, can you hear me now?
27:02
I think the battery might have died, yes. Can you hear me now? I can hear you. You can, okay. Repeat the question, okay. So Peter asked if it would be fair to say
27:22
that this is a trade-off between memory bandwidth and compute bandwidth. Absolutely it is, yes. Are both microphones working now? Can you guys hear me? No. Okay, I'll speak loud. Okay, so dynamic repartitioning hash joins
27:45
are really interesting. They're also extremely complicated. And several researchers, I've got some references for this if you want to look this up later, which will be at the end. Several researchers have claimed and shown that just a really simple shared hash table based system
28:02
is usually about as good in many interesting cases and it can be better in skewed cases. And the funny thing about data is that your data is skewed. So researchers who are working with non-skewed data are not necessarily, I mean, it's very interesting, but you have to consider skewed data as a very common case.
28:23
And there are people arguing with each other right now, trading papers saying the other guy is wrong. This is quite a hot topic. But one thing that I figured out from reading about this is that the bar for beating a plain old big shared hash table is really high.
28:42
Like, in terms of engineering challenges, I don't think we can build a dynamic, state-of-the-art, cache-aware, hardware-aware repartitioning algorithm that would work anytime soon. In terms of communication between backends, in terms of all kinds of things that are non-portable,
29:03
well, portability is an issue. Yeah, so that is an interesting topic. I don't think I can do that, and I think it would take us years; in years to come, you know, it could be interesting to look into that.
29:22
I decided to take the advice of a number of researchers who say, if you don't have a partitioning phase, you don't have to do that work, and in many common and interesting cases it works out better anyway. So, yeah,
29:44
simple repartitioning algorithms always lose. Well, we already have a simple repartitioning algorithm, which is the one we do for batching, right? So if you don't have enough work_mem, we already do this repartitioning. People who have built systems like that, well, the research papers that I've got links to at the end,
30:05
see, those types of systems aren't actually trying to reduce cache misses. They're just trying to chop the data up arbitrarily so that it fits, right? While maintaining the property that the partitions are disjoint on both sides
30:21
so that the logic still works and you get the right answer, but they're not trying to reduce the cache misses. If you have a partitioning phase, but you don't get anything back in terms of cache hits, then you'll lose. That seems to be the economic trade-off, the way it works.
30:57
Right, so, yeah, so Peter's question was,
31:01
could we imagine a future where we would want to do the radix? Eventually we'll have to as hardware advances. Okay, so it's certainly possible, yeah. I mean, one of the, I don't think I've read anywhere near as much of the database literature as you have, Peter,
31:22
but I think that from what I have read, one of the factors in this whole thing is the constantly changing economics, like, for example, the sort versus hash debate, which has been raging since about 1980 or something, whether it's better to sort stuff into a merge join or do a hash join. As far as I know, hash joins are basically in the lead,
31:41
but there's always a lot of papers saying that, you know, in about two years' time, they'll have SIMD vectorization that's this wide and then we'll win. But we just have to wait and see how that pans out. Yeah. Okay, so, yeah, my proposal is the relatively simple
32:01
approach of just sharing the hash table. Emphasis on relative. Relatively simple, yeah. But there have been a lot of complications and the actual shared hash table hash join, I got working really quickly, but then there were so many sub-problems and special cases and things
32:21
that I spent a lot of time on. So the basic idea is to load the hash table into shared memory, into DSM segments, DSM segments being Postgres's abstraction of basically memory-mapped files or something like that, something equivalent on each platform, with the property that you can always make more
32:42
or ask the operating system for more, unlike the traditional Postgres shared memory, which is fixed at startup. And also, you can't naively exchange pointers, because the memory might be mapped at a different address in each backend process. Ancient decisions in Postgres's design history that we code around.
33:03
Okay, so the basic idea is that during the build phase, we insert stuff into buckets using compare-and-swap. So there's no lock partitioning or anything like that; it's just compare-and-swap to insert stuff. So hopefully you're inserting different keys and that's going to work nicely, because you'll just be
33:20
doing compare-and-swap operations on different buckets. However, between the build and the probe phase, you have to wait for all workers to finish building the hash table. Nobody can begin probing until every tuple's been loaded, right? Otherwise the answer will be wrong. So, unlike those partitioning systems,
33:41
we have to introduce this wait point, a synchronization point, at the end of build. And that's done using a barrier IPC mechanism, which was one of the things I had to write to get this thing going. Now, the competing repartitioning systems also do a ton of communication and waiting for each other and stuff as well.
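Python has no real atomic instructions, but the lock-free insertion idea can be modelled with a pretend compare-and-swap helper (everything here is illustrative, not Postgres's actual C code):

```python
class Node:
    """One tuple in a shared bucket chain."""
    def __init__(self, tup, nxt):
        self.tup, self.nxt = tup, nxt

def compare_and_swap(buckets, i, expected, new):
    """Pretend-atomic CAS: install `new` only if the slot still holds `expected`."""
    if buckets[i] is expected:
        buckets[i] = new
        return True
    return False

def shared_insert(buckets, i, tup):
    """Push a tuple onto the head of bucket i, retrying if we lose a race."""
    while True:
        head = buckets[i]                         # read the current chain head
        if compare_and_swap(buckets, i, head, Node(tup, head)):
            return                                # otherwise another worker won; retry
```

Because each insertion only touches one bucket head, workers inserting different keys mostly hit different buckets and rarely retry.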
34:01
So it's not as though they're communication-free: once they finally get to the build phase, they can run the build and then the probe without any communication, hooray, that's great, but they had to do a lot of communication to get to that point. So adding one synchronization point between build and probe doesn't seem to be that bad.
34:22
And certainly in practice, I can't really measure any effect from that. So some of the things I had to build to get this thing going were the shared memory allocator. So that's the thing that can create all these DSM segments, these MMAP files, essentially, and allocate space and manage the space within them.
34:43
That's been committed into Postgres 10 and is also used by the parallel bitmap heap scan. Then I had to deal with shared temporary files, and I still haven't quite finished arguing about those. And that involves quite a bit of discussion
35:01
with Peter Geoghegan, whose parallel CREATE INDEX can also benefit from shared temporary files. Shared tuple stores, a shared record typmod registry, and a couple of other bits and pieces. So here's a really simple example that just tries to show the hash join operator.
35:21
I'm not showing TPC-H results or anything like that, this is just a very simple three-join example. And it's doing an aggregate, so there isn't really any processing above the Gather node that you have to worry about. I'm just trying to capture the raw hash join speed on a very simple table that has about 10 million rows; I think they're integers.
35:42
So on the x-axis you can see the number of workers, that is, the number of processes in addition to the backend that you're logged into and running queries on. So where it says zero, there's one process; where it says one, there's two processes; and so on. And speedup is shown on the y-axis. Now, in the unpatched code,
36:03
so this is basically Postgres 10, the green line, you can see that it's not actually getting much speedup as we add more workers. Now, thinking back to the Amdahl's law thing, you've got the build phase, which we know is run as a complete copy
36:21
in each backend in master, but the probe phase should be run in parallel. So you'd expect this line to be going up and it isn't really going up, is it? I mean, it's going up a little bit at one and two, but then it sort of flattens out. So on my external costs slide from a few slides back,
36:42
I think what we're seeing here is that even though the increased parallelism is making the probe phase run faster, it still sucks overall, because as you add more workers, each worker is building another one of these gigantic hash tables
37:01
and thrashing the page cache and using RAM and all those things. On the other hand, if you look at the patched line, you can see the speedup is basically a line, but when we add each worker, we don't get another 100% speedup as you would hope.
37:26
And I haven't researched this fully yet, but I think what we're seeing there is just the underlying sequential scan not being as good as it could be. And there is a patch from David Rowley that would fix that, which I'm planning to test
37:41
with some simple examples like this, to see if I can get that line to become pure linear speedup. So here are the query plans. In the unpatched version, you can see, sorry, it cuts off at the end, but nothing interesting is on that side. You can see that there are three joins here.
38:01
Each one uses, let's call that 500 megabytes of memory. But because there are four workers, five processes in total, and three of those hash join nodes, or hash nodes, that actually leads to a total of 7.5 gigabytes. Did I add that up right?
38:20
That's a lot of memory. So yeah, we've actually asked this computer to generate and then insert 7.5 gigabytes of duplicate junk into memory, right? Whereas with the patched version, we see all of the hash nodes changed into parallel shared hash nodes in the query plan.
38:44
The amount of memory used by each of these hash nodes is still about 500 megabytes. It should actually be exactly the same number, but it's not, because in this patch I changed the recipe for how that memory is accounted for. So it's changed what EXPLAIN shows; the actual usage hasn't changed.
39:01
It's just made the answers more truthful, but it's basically about 500 megabytes. And so the total comes out to 1.5 gigabytes instead of 7.5 gigabytes. And I think you can see that the computer had to do an awful lot more work to deal with the 7.5 gigabytes of data. With the patch, you get the answer sooner.
39:24
Okay, so how are we doing for time here? Got five minutes.
39:44
So Ashutosh asks how it would affect things if we didn't have the aggregate there. The reason I put the aggregate there is because I wanted to measure just the hash join speed. I didn't wanna deal with the fact that there'd be a Gather node that has to spit out millions and millions of rows.
40:03
I think that would just pollute the measurements, make them harder to understand. That's the reason I put the count there. Okay, so with just a few minutes left, I'm gonna talk very quickly about some open problems. These are not to do with parallelism, just to do with hash joins in general.
40:22
These are just some things on my list, things that I think are interesting that we should fix or improve or contemplate for hash joins in general. The first one: I talked before about good, bad, and ugly cases. In the ugly case, Postgres will hopefully produce your answer, but it might not if you don't have enough RAM, right?
40:40
So I've thought of a couple of different ways of attacking the problem; I've mentioned this on the list and exchanged email with Tom Lane about it. The first is that you could switch to a sort-merge for the problematic partition when you realize it's never gonna fit in work_mem. It seems like a solution, but the problem is, firstly, that dynamically changing to a different query plan for one partition
41:02
is kind of weird. We don't have any other examples of that, and I don't know how you'd set it up; it would be a bit complicated, but that's just programming, right? It seems like it should work. Unfortunately, not every query that Postgres is happy to run as a hash join can be run as a merge join
41:20
because of missing operators, which is a bit of a pain. So we could just say we don't care about that, or that from now on we require all data types to have the right set of operators, but that's kind of hard to do. Or you could say, okay, your computer could still run out of memory if you don't have the right operators, but otherwise we'll switch to a sort-merge.
41:40
I don't know, maybe that's a solution. Another approach would be to invent a new algorithm for processing the batch in multiple passes. That has some complications for dealing with those matched bits, which you have to keep track of, and I haven't really figured out the answer to that yet. So yeah, a couple of different ideas there. Also, there's a closely related problem
42:01
with hash aggregates. You can quite easily write a query that will make your computer melt using hash aggregates. Okay, another thing which comes up every couple of years on the mailing list is: why can't we use Bloom filters to make things faster? So there are a couple of different places
42:20
where you could use them. We know that other databases do this. They plan a hash join, they build a hash table, and while doing that work they also do the computations required to make a relatively small Bloom filter for filtering out rows
42:40
on the outer side. Then they can push that filter all the way down to a scan, so that they can basically filter out some tuples closer to the data, closer to the disk, right? But no one's ever figured out how to do that and make it a win in Postgres. Peter Geoghegan here pointed me to a paper from a student, I think it was an undergrad project,
43:00
from a student in Singapore who wrote this, got it completely working, showed some nice graphs that appear to win in some cases, and then didn't send us the patch, and no one's heard anything since. So maybe I should email him and ask him to send us the patch. But anyway, that's an interesting case.
43:22
And there's another idea here, which again is from Peter Geoghegan. He and I spent a bit of time talking about all this stuff, because there is some code we can share between this and his project for parallel create index. That got us talking about various topics. And he had the idea that you could use Bloom filters
43:40
to filter the data that you write out to those outer relation batch files, preventing a bunch of disk I/O. Now it might be that Bloom filters in general can't speed things up enough for the optimal, happy, smaller hash joins, but they might still be able to save you from doing a whole bunch of really expensive disk I/O. So that's an interesting case to look into.
44:12
Right, yes, Peter says quite rightly that Bloom filters only save you cycles if they actually filter stuff out; otherwise they just cost you cycles for nothing.
44:26
Okay, so while looking into all this stuff, I figured out that in Postgres we usually produce left-deep join stacks,
44:40
but we can produce bushy plans and right-deep plans for stacks of joins. Typically you'll see left-deep plans. And one interesting thing that I figured out is that most databases produce left-deep plans. The original System R implementation only did left-deep plans,
45:03
and in Oracle, SQL Server, and DB2 you'll quite often see left-deep plans. But strangely, we seem to be the only ones who call the side that we hash the right-hand side for a hash join. Everyone else, as far as I can tell,
45:21
every database I've found information on, hashes the left relation, and they also call it the driving relation even though it's the hash table. For us it's the other way around, because we're thinking of the hash table as being like the inner side of a nested loop: it's the thing you're probing, so we call it the right-hand side, the inner side.
45:41
This is all very confusing, but everyone else is doing it the other way around, which means their left-deep plans have the property that when R and S are joined together in, say, Oracle or SQL Server, the join's output feeds the hash table above that join,
46:00
and then that feeds the hash table above that join, so you only ever need two hash tables in memory at once. Maybe most of the time you don't care about that, but if they're gigantic hash tables, like gigabytes, and there's a whole stack of them, you really don't wanna have them all in memory at the same time. Those guys only have two of them in memory at once; we have all of them in memory at the same time. I think that's kind of interesting.
46:21
I know that whole-query-plan memory usage is a gigantic can of worms. I've never discussed it publicly, but I've read the archives on this, and there are great big long conversations that never go anywhere, because it's really hard to constrain memory globally in a useful way; it's got all kinds of circular problems in it.
46:42
But this is thinking about total memory usage without having to actually model it. It's just saying, hey, we only need two hash tables at once, let's do it that way. That's something we could perhaps consider.
47:01
Yeah, so we don't even try to minimize this: by typically doing a left-deep plan, we finish up with all the hash tables in memory, maximizing peak memory usage. Here I'm talking about SQL Server and Oracle.
47:20
I think DB2 is probably the same, but I haven't looked into it. They would only have two at a time. Now, even if we did a right-deep join, which I believe we're capable of generating, although I haven't seen one recently, we still don't actually free each hash table as soon as we could, so you still wouldn't gain anything. But if you did both of those things, then you could have only two hash tables in memory at once.
47:41
That might enable things that would otherwise not be possible.
48:07
Okay.
48:23
That is very interesting and ties these two slides together; I didn't know they were related, thank you. So Jim says that the choice between left-deep and right-deep joins has an impact on whether or not you can push Bloom filters down to scans.
48:40
That is a very interesting thing which I now plan to look into, thank you. Okay, a much more low-level thing. We have these chunks of 32 kilobytes. One thing that's wrong with that is that on some operating systems, when you ask for 32 kilobytes plus a tiny header,
49:01
it actually eats 36 kilobytes, and that's 12.5% extra. Whatever it is, it's just a waste of memory. There may also be other reasons to increase the chunk size; I don't know, that's something to look into. Actually, that's all I have, and I've actually run over time.
49:21
But if anyone wants to ask any questions, please do. Oh, I've answered all of your questions? Good, thank you.