
Implementing Parallelism in PostgreSQL


Formal Metadata

Title: Implementing Parallelism in PostgreSQL
Author: Robert Haas
Number of Parts: 31
License: CC Attribution 3.0 Unported. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Production Place: Ottawa, Canada

Content Metadata

Abstract
Where We Are Today, and What's On The Horizon

PostgreSQL's architecture is based heavily on the idea that each connection is served by a single backend process, but CPU core counts are rising much faster than CPU speeds, and large data sets can't be efficiently processed serially. Adding parallelism to PostgreSQL requires significant architectural changes to many areas of the system, including background workers, shared memory, memory allocation, locking, GUC, transactions, snapshots, and more. In this talk, I'll give an overview of the changes made to background workers in PostgreSQL 9.4 and the new dynamic shared memory facility, which I believe will form the foundations of parallelism in PostgreSQL, and discuss some lessons I learned while implementing these features. I'll also discuss what I believe is needed next: easy allocation of dynamic shared memory, state sharing between multiple backends, lock manager improvements, and parallel algorithms; and highlight what I believe to be the key challenges in each area.
Transcript: English (auto-generated)
Can you guys hear me in the back? Yeah? Good. All right, that clock still says it's a couple of minutes to 11, but my time disagrees, so I think we'll go ahead and get started.
My name is Robert Haas. I'm a major contributor and committer to PostgreSQL, and I work at EnterpriseDB, where I'm the chief architect for the database server. And today I'm going to talk to you about implementing parallelism in PostgreSQL, which is something that I've been working on for the last year or so. Yes, Steven? So the question is, is it done yet, and the answer is no. So please do feel free, as I go through this presentation, to jump in with questions.
If you want to have a protracted argument about my sanity, I'd be happy to have that with you after the session. So I guess the first question about parallelism is: why do we need parallelism? Or, in other words: Robert, why are you working on such an impossibly hard project, at which you will probably fail and be publicly humiliated and shamed?
And the answer to that question is: I really think PostgreSQL needs parallelism to remain relevant. I have had a feeling for a while now that single-threaded performance on CPUs is just not increasing nearly as quickly as it did when I was growing up, when you bought a new machine every two years and it blew the old one completely out of the water. And so I went and looked for some statistics to back that up, and I found a few. The first one there is from somebody's blog post: he went and looked at single-threaded SPECint and SPECfp benchmarks, which was kind of a funny concept, because there's implicit parallelism stuff going on there, but he did his best to separate it out. He found that between 1996 and 2004, performance increased by more than 50 percent year over year, which is a less-than-two-year doubling time, but between 2004 and 2012 it increased pretty steadily at 21 percent a year. Now, I personally would be pretty happy if all my things got 20 percent faster every year,
but it is definitely a slowing of the rate that we used to see, and I think on a lot of real-world tests you actually see less gain than that. Another example that I found was from the AnandTech website: they did some benchmarks when they upgraded their web server. Only one of those was a single-threaded benchmark, which right there tells you something about why we need parallelism. But for the single-threaded benchmark, I worked it out: it's an annualized rate of 7.7 percent per year, based on the dates those CPUs were initially released, a 39 percent improvement over roughly four and a half years.
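(As a quick check on that arithmetic, compounding the total gain over the elapsed time:

    \[ 1.39^{1/4.5} \approx 1.076 \]

which is roughly the 7.7 percent per year quoted; the exact figure depends on the precise release dates used.)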
So if your data is growing faster than that, and you're stuck in this single-threaded world, then you're getting slower with every year that passes by. I think the most compelling evidence I found of this trend is, you know, we sometimes tell people, or at least I sometimes tell people, when they're specking out PostgreSQL servers: don't buy more cores, buy faster cores. And it's good advice up to a point, but there's a growing problem in this area. I went to the Dell configuration tool, which I've always found to be a really great way of putting a machine together and understanding how much the components will cost, and on their high-end servers
you've got these two processor options, one at 2.8 GHz and one at 3.4 GHz, overall fairly similar. At 3.4 GHz you're paying $8,400 instead of $7,700, but that's okay, because it's faster. What's not so okay is that you've got six cores on the more expensive one with the higher clock speed and fifteen cores on the lower-speed one. And that's something where a lot of people are either going to make the wrong decision, or just say, you know, that's crazy, you can't expect me to pay that much more for my hardware if I want to run PostgreSQL. And really, how far ahead of the problem are you going to get by having a 3.4 GHz processor rather than a 2.8 GHz processor? That's just not enough of a difference to really get out from under this problem, especially when you consider that the memory and interconnect are the same speed. And I want to add here that I'm primarily concerned with things that take minutes or hours to run. If your query runs in a couple of seconds and you want it to run in fewer seconds, that's probably a problem that's worth solving, but it's not really the use case that I'm concerned with here, at least not at first.
So I think it'd be useful also to just do a brief overview of what kinds of things we might be able to parallelize in PostgreSQL. Thinking it over, I think there are kind of three main categories: two related to queries, and then one other group. So this is a plan that looks vaguely like something that might come out of EXPLAIN. We've got a sequential scan on foo, and then we're doing a hash join against bar: we scan bar and build a hash table, and then we do a hash join between the two. A simple plan, but imagine that the foo table is really, really big, so that this hash join is going to take a long time to run.
What could we do to make it faster? Well, one thing we could think about doing is pipelining the execution of these two operations. You've got a hash join and you've got a sequential scan, so why not have one process do the sequential scan and the other process do the hash join, and feed tuples from one to the other? This is what I'm calling inter-node parallelism, because it's parallelism between one node and another node: you do some nodes over here, and you do some nodes over there, and they talk to each other. Now, the other way of looking at this same plan tree is, you know, maybe we need more than two processes working on this plan tree. In that case, we need to actually do parallelism within single nodes, or intra-node parallelism. So, sequential scans, for example. I'm supposing that we have a filter condition here, which is the well-known something_complicated filter condition, which, as you guys all know, takes a lot of CPU time to execute, because it's so complicated. So we might not be I/O-bound doing that sequential scan on table foo; we might actually be CPU-bound. And if we are, or if our database fits in memory, which, when you can get a commodity server with a terabyte of memory, does not strain credulity too much,
we might like to actually have multiple backends processing this foo relation: different backends process different blocks, each backend applies the filter condition and the MVCC visibility checks to its own blocks, and passes the results up to the join. And hey, while we're at it, maybe we could parallelize the join, right? I mean, if we've got multiple streams of tuples coming in from somewhere, and we parcel those out to a series of backends that have access to a shared hash table that we built over the bar relation, each of those guys can do independent probes into that shared hash table and do part of the hash join separately. We don't even need locking on the hash table, because it's not changing once it's initially built. So there are a lot of interesting things that could be done here, and none of them work yet. The third category, moving outside the realm of parallel query, is parallel maintenance or DDL commands.
I think these may actually be good early candidates for parallelism, for two reasons. One is that it's easier, and this is a really hard project, so anything that's even a little bit easier is always nice. And two is that some of this stuff takes a really long time, and that, I think, is the case where you can get the most win from parallelism. So I'm thinking about things like CREATE INDEX, where you could perhaps parallelize the heap scan, perhaps parallelize the sort that we need to put the data in the right order; or VACUUM, where again, obviously, parallelizing the heap scan could be a very useful thing to do.
I was having dinner with Andres and Heikki the other night, and they were pointing out that once you finish the first scan through the heap, you have to go and scan all of the indexes, however many of them there may be, and remove the tuples that you've identified as dead from each of those indexes. Well, right now we do that one index, and then the next index, and then the next index, however many indexes there are. If you've got one index on your primary key, that's one thing, but if you've got ten indexes on the table, which people sometimes do, there's a really obvious opportunity for parallelism there, by having different workers scan different indexes. Now, maybe you could even make it more fine-grained than that and have multiple workers cooperate on a single index, but even just getting to one worker per index would help people who have indexed the heck out of their tables. So those are some of the kinds of things we can probably parallelize.
I'll just pause briefly here and ask if anyone wants to ask a question before I move on to technical ramblings. Yeah?
Right, so, right. This is something I was going to allude to later. The question is basically: if all my CPUs are already in use, and then I start doing parallelism of whatever kind, I might actually be making things worse for the system overall, even if performance for one particular process gets better; so what am I doing about that? And my answer is that I'm not doing anything about that yet, because I haven't gotten that far. But it is a really good question, and I'll allude to it briefly a little bit later on. Yeah?
Right, so we're talking about CLUSTER and VACUUM FULL. You know, CLUSTER, and in some ways VACUUM FULL, are very similar to these commands, because once again you have the same opportunities for parallelizing heap scans and sorting. The issues are a little bit different in every case, but yeah, I think once we get the ability to parallelize some of these common operations, we'll probably go through a period where we try to plumb that infrastructure into various different parts of the system, which will no doubt be a
barrel of laughs. All right, so I want to talk a little bit about architecture, and some of the architectural decisions that I've made early in the project. Obviously, I'm not saying that I've made those decisions for the project, because as much as I might wish that I had that power, I totally don't. I'm talking about the decisions that I made in terms of what I was going to work on and how I was going to pursue this. So if people disagree with these decisions, that would be great to hear about; maybe after the talk would be better than during the talk, just in the interest of time. I didn't bring any kind of shielding device, so if people feel the urge to throw things at me when they hear some of these things go by, I would sure appreciate it if that could be something that you would refrain from.
So, the first decision I made is: we're gonna use processes, or at least I'm gonna try to use processes for this, not threads. People often say, oh, you know, it would be so much easier for you to do parallelism if you were using a threaded model rather than a process model, and they might be right, because there are certainly plenty of challenges doing it this way. But I don't really feel like rewriting the entire backend and having it become totally unstable and not work anymore, so I just can't see that there's any joy there. I mean, if you start multiple threads inside an individual Postgres backend, the good news is the rest of the system, all the stuff that's in shared memory, doesn't have to know. The bad news is almost every piece of backend-local code and every backend-local data structure has to know, and there are actually a lot more of those than there are things in shared memory. So, changing all of that around: we could have a great argument over alcohol about whether that is a good thing to do on general principle, but I came rapidly to the conclusion, I think it took about six microseconds, that if I embarked on that approach with this project I would be dooming it to certain failure. So I decided to pick the approach that was highly likely to fail, rather than the one that was absolutely certain to fail.
The second decision that I made, at least for myself, is that I was going to pursue the idea of doing parallelism with processes that were started by the postmaster, rather than having individual backends fork to make a bunch of siblings that would then all cooperate on the query. One of the big reasons for that is that we have a lot of users on Windows, and as much as I hate Windows, and I really do, we would not, as a project, enjoy the success that we have today if we didn't support Windows as a platform. And I'm not excited to be the first guy who implements a major, major feature that can never work on Windows under any circumstances whatsoever, where virtually none of the code that we use for that feature on other systems could ever be adapted to work on Windows. I just don't want to be that guy. The other issue, of course, is that right now, today, all backends are direct children of the postmaster. Linux, and Unix systems in general, don't really handle grandchildren the same way that they handle children, and that's what we would end up with if we tried to have individual backends fork. I'm not saying those issues couldn't be fixed, because they probably could be, but it just felt painful. So I think that's a lesser consideration than the fact that it would just completely break on Windows,
but it was another thing that leaned me in that direction. All right, so that's the process model. Then what about IPC? How do these things communicate with each other? Well, you know, there are a lot of IPC tools available on the platforms that we support; there are relatively few of them that work on every platform we support, which is kind of the pits. But pretty much every system has some way of doing shared memory, some way of doing pipes, and of course everybody's got files. Of course, if you're on Windows, the API calls are all different, but, you know, never mind that.
So I kind of rejected files pretty quickly. There might be something we want to use files for within this set of things around parallelism, a parallel tape sort or something, but for the most part, you know, writing to a file is a system call, and you might end up doing I/O, and then somebody else has got to read back in the data that you wrote out, and it just doesn't seem like a good way to build a high-performance system. Pipes are a good paradigm; there are a lot of things we want to do in terms of streaming data between one process and another process for which pipes would be a good fit. But shared memory is a lot more flexible, and as evidence of that I offer the fact that you can much more easily use shared memory to emulate a pipe than you can use a pipe to emulate shared memory. Now, Noah pointed out to me that there actually probably are ways to use pipes to emulate shared memory, but I was a little scared of some of those things, because some of them were operating-system-dependent.
So what I did instead is I wrote a module called shm_mq that uses shared memory to emulate a pipe, and one of the things that I think is nice about that is I've got one set of platform dependencies, around shared memory, instead of two sets, one around shared memory and one around pipes. And that shm_mq module actually provides some nice semantics. You can make your pipe as big as you want it to be, or as little as you can get away with, depending on what performance characteristics you need. It's totally platform-independent. It never splits up messages into smaller messages, which operating system facilities would probably do on some platforms and not others. So that seems to be working out kind of nicely.
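For readers following along in the source, here is roughly what using shm_mq looks like in C, based on the API as it shipped in 9.4. This is a minimal sketch only: the surrounding DSM setup (the dsm_segment *seg), error handling, and the worker side are elided, and this only makes sense inside backend code.

    #include "storage/shm_mq.h"
    #include "storage/proc.h"       /* for MyProc */

    #define MY_QUEUE_SIZE 65536     /* the queue size is whatever the caller wants */

    /* Carve a queue out of (part of) a dynamic shared memory segment. */
    shm_mq *mq = shm_mq_create(dsm_segment_address(seg), MY_QUEUE_SIZE);
    shm_mq_set_receiver(mq, MyProc);            /* this process will read */
    /* ... the cooperating process calls shm_mq_set_sender(mq, MyProc) ... */

    /* Each side attaches; the handle deals with waiting and detach. */
    shm_mq_handle *mqh = shm_mq_attach(mq, seg, NULL);

    /* Sender side: a message is delivered whole, never split. */
    shm_mq_result res = shm_mq_send(mqh, 6, "hello", false);

    /* Receiver side: */
    Size    nbytes;
    void   *data;
    res = shm_mq_receive(mqh, &nbytes, &data, false);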
I also decided that we needed, specifically, dynamic shared memory, rather than the main shared memory segment. And the big consideration here is that if you want to do a parallel sort, and you want to do it like a giant quicksort in memory, you might need like a terabyte of memory. And you can't pre-reserve a terabyte in the main shared memory segment on any system that I'm aware of, and not have users complain about the fact that when you're not doing a parallel sort, you're, like, not using one terabyte of their RAM. So that just didn't seem like there was any way to make that work. A more painful decision, and this one was really painful, and the pain will no doubt be with us for a long time,
is that we would live with the fact that these dynamic shared memory segments could be mapped at different addresses in different cooperating processes. And that kind of sucks, because it means you can't use absolute pointers inside these dynamic shared memory segments and have things work the way you want them to. I hate that; it's awful. On the other hand, only one hacker has told me that he thinks it might be possible to actually map a segment at the same address in every process, and every other hacker I've talked to, including me when I talk to myself, has said that that is never going to be reliable, and it's not going to work, and you'll die if you try. So I'm trying to deal with that complexity in the individual users of the dynamic shared memory facility, rather than baking it into the facility itself.
So those are some architectural decisions that I've made. Right, so the question was: what's the difference between dynamic shared memory and regular shared memory? There is a regular shared memory segment that we create at postmaster startup, and we actually do manage to get that at the same address in every process. On Unix we do that by forking; on Windows there's some thing that kind of makes that work, and it's not a hundred percent reliable, but we do it, right?
Dynamic shared memory segments are different, because you can spin up a dynamic shared memory segment anytime you want during the execution of the server. So the main shared memory segment is created once at startup, and that's what you've got forever. Dynamic shared memory segments, you can create them on the fly and you can tear them down on the fly while the server is running, and there's no limitation that it has to be at backend startup or anything like that. Whenever you want one, spin one up; when you don't want it anymore, throw it out.
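In C terms, the facility he's describing looks roughly like this in 9.4 (a sketch only: real callers also have to get the handle to the cooperating process somehow, and decide who detaches when):

    #include "storage/dsm.h"

    /* Spin one up: create a 1 MB segment on the fly. */
    dsm_segment *seg = dsm_create(1024 * 1024);

    /* A small token that can be handed to a cooperating process. */
    dsm_handle handle = dsm_segment_handle(seg);

    /* ... in the other process ... */
    dsm_segment *seg2 = dsm_attach(handle);
    void *base = dsm_segment_address(seg2);   /* may be a different address! */

    /* Throw it out: the segment is destroyed once everyone has detached. */
    dsm_detach(seg2);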
Other questions on the architecture stuff? Yeah. So the question is about non-uniform memory access. Right now there's no code at all anywhere in PostgreSQL which is NUMA-aware. There might be a good argument for making some part of the code NUMA-aware; what we lack is benchmark results showing that the lack of NUMA awareness is actually hurting us. I don't necessarily doubt that it is, but it's one thing to think that there might be a problem with something, and another thing to have real, reproducible test cases that demonstrate that you do have a problem with something. The closest I'm aware that we've come is Kevin had a benchmark that did show a significant overhead on one workload, but I think you were never able to reproduce that reliably? So there may well be smarter things that we can do there, but I think we just don't know enough to be that smart. I don't think this particular project is the right place to start adding those smarts; if there are smarts to be added,
it's probably much more useful to add them to the existing stuff that everybody uses all the time. Yeah, John? Right. So the question is: can we safely assume that dynamic shared memory uses the same sidestep around the System V shared memory limits that I added in 9.3? The answer is no, you can't assume that, because it's completely false. The trick we used in that case does not work for this case, because the only way of propagating that kind of shared memory segment from one process to another is via fork, which is not an option here. So what we have for dynamic shared memory is a configuration variable that lets you pick from among the shared memory implementations that work on your operating system. Hopefully most people will have POSIX shared memory, which doesn't suffer from the System V limits, but if you're running on a BSD system, you might find that you haven't got POSIX shared memory, in which case you should, number one, ask your BSD kernel guys very nicely if they wouldn't mind adding that, and, number two, configure that variable to point to, say, System V. Actually, initdb will take care of it for you automatically.
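The variable in question is dynamic_shared_memory_type; in postgresql.conf that looks like the following (posix is the usual choice, and the accepted values depend on what the platform supports):

    # postgresql.conf
    dynamic_shared_memory_type = posix    # or sysv, windows, mmap

initdb probes the platform and picks a working default, which is why he says it's handled automatically.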
I'm going to move on here just in the interest of time, because I think people will have questions about some of the later slides as well. So the basic question here is, you know, what do we need to build, and how far along are we in building it, in terms of actually making something that works, as opposed to something you can give a conference talk about? I've kind of divided the work here into five areas, and the parenthesized comments give you some idea of where we are with those areas. There are some basic facilities, which I think are pretty much done in 9.4, that are sort of the very fundamental building blocks of parallelism. Then there's what I'm calling plumbing, which is a bunch of things that are not technically part of those basic facilities, but are actually needed in order to use them in an effective way. I'll talk a little bit more about each of these over the coming slides, so don't get too hung up if you don't totally understand what I'm saying now.
Then the next thing is establishing what I'm calling a parallel environment, and I'll talk a little bit more about what that means. Internally at EDB we've done a little hacking on this; we've got some ideas; there's a lot more work to be done there. And then, once you have this parallel environment, the goal of the parallel environment, on top of the plumbing, on top of these basic facilities, is to create a space where you can reasonably do things in parallel. And then, of course, you have to write the code to actually do the things in parallel, which is the parallel execution piece. And eventually, for parallel query, as opposed to for maintenance commands, you're going to need parallel planning. The idea here is that the first three things on this list are stuff that mostly you build once, and then you've got the infrastructure; and then those last two things you do over and over again, for each new thing you want to parallelize. So getting through that first pile of stuff is sort of a prerequisite for actually being able to parallelize individual things. Yes? You're right, yeah, I'm not talking about doing the planning in parallel;
I'm talking about planning what parallelism to actually do for a particular query. Yeah? So the question is: why does the planner need to know? The planner needs to know because the planner has to tell the executor what to do. It has to tell the executor: take this part of the plan and run it in a separate process from this other part of the plan. It just has to decide where to put those breakpoints. Right, like, if doing the query in parallel is cheaper than doing the query not in parallel, then you want to know that, so you can pick the parallel plan. The planner has to know in order to correctly choose between plans, and also to tell the executor exactly which things we are parallelizing in this particular query, how many parallel processes we are using, what each one is going to be responsible for, all that stuff. The planner will have to tell the individual nodes whether to run in parallel, because otherwise they won't know.
Right, so the question is: does XC or anything else have any tie-in to this work? Are there APIs or such that we could reuse? I think that there aren't, because I think that the communication within a cluster is going to be fundamentally different than the communication between multiple clusters. Let me actually just move on through the next couple of slides, because I've got a slide or three on each of these,
and then people can ask questions on each slide as we come to the topic, rather than jumping around. So the basic facilities that we've got in 9.4 are dynamic background workers and dynamic shared memory. Dynamic background workers means you can tell the postmaster, please start a process and have it run this function, and, behold, the postmaster will start that process and have it run that function. What happens after that is up to you, right? So it's a really low-level facility.
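Here's roughly what that request to the postmaster looks like with the 9.4 C API. This is a sketch only: my_extension and my_worker_main are hypothetical names, seg_handle is assumed to be a dsm_handle obtained earlier, and error handling is omitted.

    #include "postmaster/bgworker.h"
    #include "miscadmin.h"              /* for MyProcPid */

    BackgroundWorker        worker;
    BackgroundWorkerHandle *handle;
    pid_t                   pid;

    memset(&worker, 0, sizeof(worker));
    snprintf(worker.bgw_name, BGW_MAXLEN, "example worker");
    worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
                       BGWORKER_BACKEND_DATABASE_CONNECTION;
    worker.bgw_start_time = BgWorkerStart_ConsistentState;
    worker.bgw_restart_time = BGW_NEVER_RESTART;
    snprintf(worker.bgw_library_name, BGW_MAXLEN, "my_extension");
    snprintf(worker.bgw_function_name, BGW_MAXLEN, "my_worker_main");
    worker.bgw_main_arg = UInt32GetDatum(seg_handle);  /* e.g. a dsm_handle */
    worker.bgw_notify_pid = MyProcPid;

    /* Please start a process and have it run this function. */
    if (RegisterDynamicBackgroundWorker(&worker, &handle))
        (void) WaitForBackgroundWorkerStartup(handle, &pid);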
And dynamic shared memory, similarly: you can spin up a dynamic shared memory segment whenever you want, and it will eventually go away. And the "it will eventually go away" part was actually where most of the work went here, because, it turns out, you know, those of us who are used to coding inside the PostgreSQL backend are used to the fact that when, for example, your transaction aborts, all of your resources get released. Well, it turns out, as I found out the hard way, that if you add a whole new kind of resource and you don't write any code to make it automatically go away, it doesn't automatically go away, because it turns out that the transaction abort machinery is not magic. That was kind of when I had the wake-up moment that made me realize that I was building really low-level facilities here, and then I had to handle things like transaction abort, process termination, cluster termination; there are like three different levels of making sure we clean up dynamic shared memory segments at the appropriate time, and if you got rid of any of them, then you would have leaks. So that was the complicated part. But the good news is it's done. In fact, I've already got bug reports on these things, and the great thing about that is that people are using them. Right, we barely got 9.4 beta 1 out the door, and I think even before beta 1 went out the door I had bug reports related to these facilities. People don't find bugs in things that they're not using. So even though we are a long way from having real parallelism here, people are latching on to these C APIs and doing stuff with them that's significant enough that they actually care about whether it works, which I think is fantastic. I think that's really fantastic. Part of my goal here was to build the individual pieces of this project in a pluggable way, so that they could be used for parallel query but also for other stuff, and in each layer I'm trying to make that possible.
So next is plumbing, right? And this is the stuff where it's like, oh gee, I can get a big chunk of undifferentiated bytes called a dynamic shared memory segment, but what the hell do I do with it after that? And so one option is: do whatever you want with it. They're your bytes, have fun. But that's not necessarily an easy programming model to do complex things in, so one of the things I did in 9.4 is I created this concept of a DSM table of contents. We have a similar table of contents, although it's more complex, for the main shared memory segment. And the basic idea here is that this provides some dead-simple infrastructure, so that when you map a dynamic shared memory segment, you can actually do some discovery, figure out what kind of data structures are in there, and locate the particular data that you're interested in within the overall dynamic shared memory segment.
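Concretely, the 9.4 shm_toc API works roughly like this (a sketch: the magic number and key are arbitrary values the caller picks, MySharedState is a hypothetical payload type, and seg, seg2, and segsize are assumed to come from the DSM setup shown earlier):

    #include "storage/shm_toc.h"

    #define MY_MAGIC        0x50475121   /* identifies this segment layout */
    #define KEY_SCAN_STATE  0

    typedef struct MySharedState         /* hypothetical payload */
    {
        int64   next_block;
    } MySharedState;

    /* Creator: lay a table of contents over the raw bytes of the segment. */
    shm_toc *toc = shm_toc_create(MY_MAGIC, dsm_segment_address(seg), segsize);
    MySharedState *state = shm_toc_allocate(toc, sizeof(MySharedState));
    shm_toc_insert(toc, KEY_SCAN_STATE, state);

    /* Attacher: discover the same structure by key, whatever address
     * the segment happens to be mapped at in this process. */
    shm_toc *toc2 = shm_toc_attach(MY_MAGIC, dsm_segment_address(seg2));
    MySharedState *state2 = shm_toc_lookup(toc2, KEY_SCAN_STATE);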
Several people said on-list that they weren't sure that I got this part right, and they might well be correct, but we won't know until we try it. I think it's pretty good; I think I can see how to use it for a bunch of useful stuff. But, you know, as with any of this fundamental infrastructure, and particularly once you get beyond the absolutely most basic stuff that you know you're going to need, there's some room for maneuvering. Another thing that got done in 9.4 is a simple message queuing facility; I mentioned it earlier, shm_mq is what it's called. And this is intended to answer the question: how do cooperating processes communicate? If you have a background worker that's running, and it generates tuples that need to get to another process so that they can be further processed, or if it just errors out and the other backend needs to get that error so that it can decide to stop doing whatever it's doing, or so that it can print the error out, or whatever it's going to do, there has to be some easy way of doing that. And so that's what I invented this for: as a way that you would hopefully be able to send tuples and errors and notices and whatever other control information you needed to send between the cooperating processes. This is still kind of a low-level facility. I
did get a bug report on this already, so again, somebody is using it, right? The error propagation piece of this is sort of what I think the next thing is here. The idea is that we don't want every person who uses background workers and dynamic shared memory and message queues to have to reinvent their own way of sending error state, for example, from one process to another process. We want to invent some ways of doing that that are somewhat standardized, and people can, if they so choose, just use them. Of course, some people may want to roll their own for whatever reason, and that's fine too. So that's something that I've just begun to work on; I have some ideas about that that I think are promising. We will see how many rotten tomatoes get thrown at me when I post the patch.
Another thing that I'm sort of putting in this category of plumbing is a shared memory allocator. This is an idea about which there is considerable skepticism, which is not without some foundation; indeed, I have some skepticism myself, despite the actually very large amount of time I've spent trying to figure out how to make this work. At the same time, if you look for other pieces of software out there that do things in parallel but do not have a shared memory allocator, which today we do not, I think that you will find possibly no such projects, and certainly not many that have been as successful as we are. So, you know, I think there's very legitimate concern about how much shared memory allocation it really makes sense to do; how dynamic you can really make this without having the behavior of the system become complex and difficult to understand; and how you reconcile it with the fact that both our main shared memory segment and these dynamic shared memory segments have a size that is fixed when you create them. On some platforms there actually are ways of growing a dynamic shared memory segment once you've created it, but there are tricky problems there as well. I don't really know what the answers to all the questions in this area are, but I do think that
there are going to be some things that are really hard to do if you don't have this. So I'm working on it, and we'll see how it turns out. I think someday we'll want other data structures in shared memory, too, that have common implementations that can be shared by multiple users; for example, a shared hash table. On the one hand, you can imagine using this to create an in-memory data store extension for PostgreSQL that just makes a big hash table in dynamic shared memory, stores stuff in there, and spits it back out. On the other hand, you can also imagine using it for something like the combo CID hash, so that you can actually do writes in multiple backends at the same time and not lose track of your combo CIDs. But I'm not planning to work on that right away, because, you know, you've got to draw the line somewhere. Yeah?
No, I have not thought about hash aggregates. That is a good example of where the hash table would need to change as you were working your way through, so that would be a good use case for a shared hash table. Yeah. You know, that's very hard; that particular case of an aggregate is very hard to make work, because if you run out of memory in your shared memory segment, you have no workable fallback strategy. For sorting, you do, right? If you fill up your shared memory segment, which is fixed-size, you don't have to worry about trying to grow it or add another one; you can just say, all right, we're switching to an external sort. But, you know, obviously if you are doing an aggregate, you can't say, let's just not read the second half of the data and emit the results based on the tuples we've read so far. I'm fairly sure that someone would complain about that patch. Yeah, it's hairy.
All right, so these are some things that I think are useful. Some of them are done; some of them are not done. There is debate about which ones are useful and to what degree, which is fair. All right, so: setting up a parallel environment. What I really mean by this is making a background worker look enough like a regular user backend to do useful work, and specifically, making it look enough like the backend it's cooperating with, the one that caused it to be started, for it to do useful work. This is a complex problem. I think that a lot of it can be done by basically copying relevant state, like the user, which database we're connected to, our snapshot, to the background worker, using dynamic shared memory as the vehicle for that. Obviously, this is going to be somewhat expensive, so this is why I say I think this kind of parallelism is probably going to be mainly suitable for long-running operations. If it takes 100 milliseconds to copy all of your state over from one backend to another, your query, or maintenance operation, or whatever, has to be long enough running that you really don't care about the fact that it took an extra hundred milliseconds to get things started. If you're doing a hundred-millisecond query, you will care; if you're doing a five-minute index build, you won't care, provided that there is at least some speedup. Useful work doesn't mean everything, right? Getting the background worker to be able to do useful work does not mean that everything that you could do
inside the user's actual dedicated backend can be done equally well within the worker. Some operations seem fundamentally unsafe in a parallel context. For example, if you have a user-defined function that's not a security definer function, and it sets a GUC, that change in the GUC value will actually persist after that function returns. That is arguably really weird, but it is a good example of behavior where, in a parallel context, I'm not really sure what that means; I can't really make sense in my head of what the semantics of that in a parallel context are, but they'll certainly be different than whatever they are in a single-threaded process, and so you probably don't want to let people do that. Some operations could theoretically be made safe in a parallel context, but we probably won't bother.
Magnus, you suck. Is that you? No? All right, maybe that's my own fault: I wrote "bother bother", or else he changed it to "bother bother" when I went to the bathroom.
So it's parallel bother, yeah. So, a good example of something that I think might not be worth making parallel-safe, even though it could be done, is setseed and random. If you call setseed, you're setting the random seed; you then expect that you're going to get the random numbers in the same order that you got them the last time you configured that particular seed. Well, obviously, if you have parallel background workers and they each have their own copy of the random state, they're each going to generate that same series, which will produce a user-visible behavior difference, whereas if you're running in a single process, then you'll get one sequence, right? Now, you could fix this, right? You could provide code that allows the random state that random uses to be relocated into dynamic shared memory, and build all the plumbing so that this works exactly the same way, or enough like the same way, in a parallel context that maybe nobody would care. I can't imagine we're gonna bother. Maybe you can't even make it safe, but even if you can, who cares? Also,
even if we make a lot of the state that things use shared between all of the cooperating backends, arbitrary user-supplied code can never be safe, because if you're running some user-defined function that somebody has loaded into the backend, or even some built-in function that some hacker wrote and put in there, you can't inspect the code of that function and see whether it uses a backend-private piece of persistent state that needs to be copied from one backend to another. There's just no way of doing that. So I don't see a way around the need to have some kind of labeling of functions to say: these functions are parallel-safe, these functions are not parallel-safe. If you label your function parallel-safe when it's really not, that's your problem. And the planner will have to look over the query tree and say, oh, I might have considered parallelizing this, but it contains a function that's not safe in a parallel context. Sorry? Yeah, it would be great if we had a program that could analyze another program and determine whether that program
was safe, but Alan Turing had something to say about that. All right, so what do we actually need to copy? This is my tentative list; actually, I should more properly say this is kind of Noah's tentative list, because he did a lot of research on what the main things were here. I think we can probably build a basic parallel environment that can do an interesting level of stuff by copying these things. Obviously, you need the background worker to have the same notion of the user ID and the current database as the original backend. It probably needs to have the same values for all the GUCs. It probably needs to know what transaction it's in; maybe you could do some really simple read-only things without too much in this area, but we probably need to copy some elements of the transaction state. If it's going to do any database access, it's going to need the same current and active snapshot that you have in the original backend. And if you've got a combo CID hash, you're gonna have to share that too, because otherwise you might not be able to correctly read data that has been previously written by your own backend in the same session. I'm watching the reactions of the other hackers in this room as they groan in pain looking at that list.
Yeah. Correct: for an initial version of this, I'm not trying to target anything that involves insert, update, or delete. Those things are obviously valuable; it'd be nice to have them eventually; it doesn't seem absolutely essential for the first go-around. I have a separate slide on locks. Writes, I'm thinking, probably not in the initial cut at this, because if you have writes, then the combo CID hash can't just be copied; it's got to actually be a shared hash table, and I'm pretty sure there's a bunch of other stuff that is going to need to be fixed as well. This is hard enough, so I want to get this done, and then we'll look at extending it. Okay, now, you also need to prohibit some things. A lot of these are things that could theoretically be allowed if you had the right state in your dynamic shared memory, but we probably won't; some of them are things that probably just can't ever be made to work.
Basically, these are a bunch of things, again, this is research by Noah, that, you know, have backend-local state associated with them. Sequences have a backend-local sequence cache. Cursor operations: the position of the cursor is backend-local. Large objects have cursors. LISTEN and NOTIFY have associated backend-local state. Temporary buffers are stored in backend-private memory. Prepared statements are stored in backend-private memory. Some of this stuff could be synchronized, but it seems okay to me, for a first cut at this, to say, well, you can't do those things in a parallel context, and we'll figure out later which of those restrictions actually rub on people really badly, and see if there's a reasonable way of relaxing them. Assuming I don't die first. Generation of invalidation messages seems like something that can probably never work. I think that's really only going to be an issue if you have, like, one user-defined function that's going to create a table, and then another function that gets called later in the same query is going to try to read from that table, or something like that. If those two operations are actually happening in different backends, then the invalidation messages from the first backend, the one that created the table, would have to somehow be immediately perceived by the other backend in time to read the table. It's a mess; I don't want to go there. I don't care about that use case, and you probably don't either. It just needs to not crash.
Lock management is a whole other kettle of fish. This is what Andres was asking about. For the most part, the system overall, I think, doesn't really need to care that much about the fact that a group of processes are cooperating to do something; most of the data structures that we have in shared memory really don't need to worry about that. Like, the processes themselves need to know that they're doing it, and they need to coordinate with each other, but the system overall doesn't really need to know. But locks are an exception. You can think about a couple of ways to try to handle this, and they don't work. The background workers can't take no locks at all and just rely on the user backend to hold the locks, because the user backend might die or be killed before the background worker finishes doing whatever it's doing that requires that lock. That would be a disaster. You could try to fix that by saying the user backend is never allowed to die and release its locks until all of the other workers are dead, and I have no faith at all that that kind of solution can be made robust. Now, the other obvious option is: well, the user backend takes the locks initially, and then background workers just also take locks on those things. The problem with that is that parallel query might self-deadlock. The user backend locks some relation with, say, an access share lock; another process comes along and tries to start a CLUSTER operation on that table, which means no more locks on that table can be granted; then the background worker comes along and tries to acquire its lock on that relation, and that background worker is now just going to sit there until the CLUSTER completes. Which it won't, because the CLUSTER can't start, because the main backend is already holding a lock on that relation, which it is not going to release until the background worker finishes working, which is not going to happen, because it's waiting for the lock. Okay, so something obviously needs to be done about this problem. One option is just sort of
to have the background workers try to grab all the locks without blocking, and if any of them fail, then they just die and you fall back to non-parallel query or something. But I think that's a pretty unappealing option, and I think it would be as complex to implement as what I think the real solution is here, which is probably to have some kind of concept of locking groups inside the lock manager, so that we can say that a group of backends are sharing their lock state: either they all collectively hold whatever locks they hold, and individual processes acquire and release locks on behalf of the whole group, or maybe there's some way where we just tell, you know, the deadlock checker and related machinery that processes in the same locking group are allowed to queue-jump, and they don't have to wait for other, unrelated lockers in order to get their own locks. I haven't figured out the details here yet, but I'm sure it will be exciting.
Does that answer your question, Andres? At least partly, right; we'll get to the additional issues at the appropriate time. Okay, so, parallel execution. In some sense, this is the easy part.
I think if you've got all of that plumbing, and all of that state synchronization code, and you've prohibited the things that aren't safe, and all that kind of stuff, and you've got your basic infrastructure in place, you've got the tools that you need to actually code a parallel algorithm. Well, you know, the parallel algorithms are pretty well described in the literature; they just assume you're using threads, and that you can knock them off with a few easy locking primitives. You know, parallel sorting algorithms are described in the literature; they're pretty well understood; you can go read about how parallel quicksort works. Or, you know, for parallel sequential scan, you hardly even need to think about what to do there too much. I mean, there are a couple of different things you can do in terms of how you interleave the different workers, but it's fundamentally, I think, not that complicated. The major thing that comes up here that's a little bit subtle is Amdahl's law,
which says that if alpha is the fraction of running time a program spends executing serially, the maximum speedup from parallelism, if you have infinite workers, is one over alpha. In other words, any time that your background workers spend not being busy is really bad; it will rapidly erode all of the gains that you got from parallelism. If ten percent of the time is spent executing serially, you can't get more than ten times faster, and you can only get ten times faster if you squish everything else down to zero.
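In the usual notation, with \(\alpha\) the serial fraction and \(N\) workers:

    \[ S(N) = \frac{1}{\alpha + \frac{1 - \alpha}{N}}, \qquad
       \lim_{N \to \infty} S(N) = \frac{1}{\alpha} \]

So at \(\alpha = 0.1\), even sixteen workers only get you \(1/(0.1 + 0.9/16) \approx 6.4\times\), against the ceiling of \(10\times\).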
So one of the key things that comes up with these parallel algorithms is that you've got to structure the algorithm in such a way that you won't have workers starving because they don't have anything to do. So, for example, for a sequential scan, you do not want to break the relation into equal-sized chunks and have one background worker scan each chunk, because then, if one chunk turns out to be fast to scan because it's in memory, and another chunk turns out to be slow to scan because it's out on disk, you'll have some background workers that are done and other background workers that still have lots of work left to do, and you won't really get as much benefit out of parallelism as you could have gotten. You'll probably also have lots of crappy random I/O behavior if you do that. What you probably want to do instead is have the backends work their way through the relation together, so that, first of all, you get sequential I/O behavior, and also so that you don't have one run way ahead of the others and finish first and then not have anything else to do. (My mic died.)
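One simple way to get that behavior, sketched here with C11 atomics rather than anything PostgreSQL actually shipped at the time: keep a single shared next-block counter, and have each worker claim the next block only when it's ready for more, so fast workers naturally take more blocks and the scan stays roughly sequential.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Lives in the shared memory segment; leader initializes next_block to 0. */
    typedef struct
    {
        _Atomic uint32_t next_block;    /* next block no one has claimed yet */
        uint32_t         nblocks;       /* total blocks in the relation */
    } ScanPosition;

    /* Each worker loops: claim one block at a time until none remain. */
    static bool
    claim_next_block(ScanPosition *pos, uint32_t *block)
    {
        uint32_t b = atomic_fetch_add(&pos->next_block, 1);

        if (b >= pos->nblocks)
            return false;           /* scan exhausted; this worker is done */
        *block = b;
        return true;                /* caller scans block b, then loops */
    }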
Yeah, I'm almost done; I've got like two slides left. So, parallel query planning. I said to Noah one time: there are two kinds of PostgreSQL development projects, those that don't touch the query planner and those that are unsuccessful. That's a slight exaggeration, but not that much. This is one of the reasons why I think making maintenance operations parallel might be a good place to start. One of the problems, which someone in the audience asked about earlier,
But enormously more expensive in terms of the expenditure of overall system resources So we're gonna have to think a little bit about what it means to pick the cheapest cost plan How do we factor in the overall cost to the system of spinning up a bunch of workers
And there's other things we'll need to do We'll need to have a notion of the worker startup costs of the IPC costs that are introduced when you parallelize things And That is actually pretty much all I have So I'll stop here and take a couple of questions and then I guess we'll go eat lunch
Thanks, John. John, did you have a question?
Looking at which code? I did read, I did read a lot of papers on memory management. It turns out that I wasn't able to find a paper that describes somebody doing exactly what we're doing; in particular, a lot of the literature about allocators assumes that the allocator can allocate its own metadata by calling the system malloc. If you're allocating from a dynamic shared memory segment, there is no system malloc that knows how to do that. There are other problems as well, but that's one of them. I'll stop here, because everybody wants to leave, but feel free to come up and ask me questions if you want. Thanks.