TIB AV-Portal

Implementing Parallelism in PostgreSQL


Formal Metadata

Implementing Parallelism in PostgreSQL
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Production Place
Ottawa, Canada

Content Metadata

Abstract
Where We Are Today, and What's On The Horizon PostgreSQL's architecture is based heavily on the idea that each connection is served by a single backend process, but CPU core counts are rising much faster than CPU speeds, and large data sets can't be efficiently processed serially. Adding parallelism to PostgreSQL requires significant architectural changes to many areas of the system, including background workers, shared memory, memory allocation, locking, GUC, transactions, snapshots, and more. In this talk, I'll give an overview of the changes made to background workers in PostgreSQL 9.4 and the new dynamic shared memory facility, which I believe will form the foundations of parallelism in PostgreSQL, and discuss some lessons I learned while implementing these features. I'll also discuss what I believe is needed next: easy allocation of dynamic shared memory, state sharing between multiple backends, lock manager improvements, and parallel algorithms; and highlight what I believe to be the key challenges in each area.
All right, it's a couple of minutes to 11 by my watch, though the room clock disagrees, so I'll go ahead and get started. My name is Robert Haas; I'm a major contributor and committer to PostgreSQL, and I work at EnterpriseDB, where I'm Chief Architect for the database server. Today I'm going to talk to you about implementing parallelism in PostgreSQL, which is something I've been working on for about the last year. So the obvious first question is: is it done yet? And the answer is no. Feel free to jump in with questions as I go through this presentation; if you want to have a protracted argument about my sanity, I'd be happy to have that with you after the session.

So I guess the first question about parallelism is: why do we need parallelism? In other words, why am I working on such an impossibly hard project, in which I will probably fail and be publicly humiliated and shamed?
And to answer that question: I really think PostgreSQL needs parallelism to remain relevant. I've had a feeling for a while now that single-threaded CPU performance is just not increasing year over year as quickly as it did when I was growing up, when a new machine every three years would blow the old one completely out of the water. So I went looking for statistics to back that up, and I found a few. The first one is from somebody's blog post: he went and looked at single-threaded SPEC benchmark results, which is a slightly funny concept because there's implicit parallelism going on even there, but he did his best to separate it out. He found that between 1996 and 2004, single-threaded performance increased by more than 50 per cent year over year, which is less than a two-year doubling time; but between 2004 and 2012, it increased pretty steadily at about 21 per cent per year. Now, I personally would be pretty happy if all my things got 20 per cent faster every year, but it is definitely a slowing of the rate we used to see, and I think in real-world tests you actually see smaller gains than that. Another example I found was from the AnandTech website: of the benchmarks they ran when they reviewed a new server, only one was a single-threaded benchmark, which right there tells you something about why we need parallelism. For that single-threaded benchmark I worked out an annualized improvement rate of about 7.7 per cent per year, based on the dates the CPUs were initially released: roughly 39 per cent improvement over roughly four years. So if your data is growing faster than that, and you're stuck in the single-threaded world, you're effectively getting slower with every year that passes.

I think the most compelling evidence I found for this trend is this. We sometimes tell people, or at least I sometimes tell people, when they're speccing out PostgreSQL servers: don't buy more cores, buy faster cores. And that's good advice, up to a point, but there's a growing problem with it. I went to the Dell configuration tool, which I've always found a really great way of putting a machine together and understanding what the components will cost, and on one of their servers you get these two processor options: one at 2.8 GHz and one at 3.4 GHz. At 3.4 GHz you're paying considerably more, but that's OK, because it's faster. What's not so OK is that you get six cores on the more expensive part with the higher clock speed, and 18 cores on the lower-clocked one. That's a situation where a lot of people are either going to make the wrong decision or just say, "that's crazy, you can't expect me to pay that much more for my hardware just to run PostgreSQL well." And really, how far does having a 3.4 GHz processor rather than a 2.8 GHz processor get you out from under the problem? That's just not enough of a difference, especially when you consider that the memory interconnect runs at the same speed either way. I should add here that I'm primarily concerned with things that take minutes to hours to run. If your query runs in a couple of seconds and you want it to run faster still, that's probably a problem worth solving, but it's not really the use case I'm concerned with here, at least not at first.

I think it would also be useful to do a brief overview of what kinds of things we might be able to parallelize in PostgreSQL, and in thinking it over, I see three main categories: two related to queries, and then one other group. So here's a plan; it looks vaguely like something that might come out of EXPLAIN. We've got a sequential scan on one table, and then we do a hash join between it and another table. A simple plan. But suppose the foo table is really, really big, so this hash join is going to take a long time. What can we do to make it faster? Well, one thing we could think about doing is pipelining the execution of these two operations: you've got a hash join and you've got a sequential scan, so why not have one process do the sequential scan and the other process do the hash join, and stream the tuples from one to the other? This is what I'm calling inter-node parallelism, because it's parallelism between one plan node and another: you do some nodes over here and some nodes over there, and they talk to each other. The other way of looking at this same plan tree is that maybe we need more than two processes working on it, and in that case we need to do parallelism within a single node: intra-node parallelism. For the sequential scan, for example, suppose we have a filter condition, the well-known Something Complicated filter
condition, which as you all know takes a lot of CPU time to execute, because it's so complicated. So we might not be I/O-bound doing that sequential scan; we might actually be CPU-bound. And if we are, the database might well fit in memory, which, when you can get a commodity server with a terabyte of memory, doesn't strain credulity too much. So we might like to have multiple backends process the scanned relation: different backends process different blocks, each backend applies the filter condition and the MVCC visibility checks to its own blocks, and passes the results on to the join. And for that matter, maybe we could parallelize the join too. If we've got multiple streams of tuples coming in from somewhere, and we parcel those out to a series of backends that all have access to a shared hash table built over the inner relation, each of those backends can do independent probes into the shared hash table and do part of the hash join separately. We don't even need locking on the hash table, because it's not changing once it's initially built. So there are a lot of interesting things that could be done here, and none of them work yet.

The third category, moving outside the realm of parallel query, is parallel maintenance: DDL commands. I think these may actually be good early candidates for parallelism, for two reasons. One is that it's easier, and this is a really hard project, so anything that's even a little bit easier is always nice. Two is that some of this stuff takes a really long time, and that's the case where you get the most win from parallelizing something. Think about things like CREATE INDEX, where you could perhaps parallelize the heap scan, and perhaps parallelize the sort that puts the data in the right order. Or VACUUM: again, obviously parallelizing the heap scan could be a very useful thing to do, and someone was pointing out to me the other night that once you've finished the first scan through the heap, you have to go and scan all of the indexes, however many there may be, and remove the tuples you've identified as dead from each of them. Right now we do that one index at a time: this index, then the next index, then the next, however many there are. If you've got one index on your primary key, that's one thing, but if you've got ten indexes on the table, which people sometimes do, there's a really obvious opportunity for parallelism there: have different workers scan different indexes. Maybe you could even make it more fine-grained than that and have multiple workers cooperate on a single index, but even just getting to one worker per index would help people who've indexed the heck out of their tables. So those are some of the kinds of things we can probably parallelize. Let me pause briefly here and ask if anyone wants to ask a question before I descend into technical ramblings.

[Audience question.] Right, this is something I was going to allude to later. The question is basically: if all my CPUs are already in use and I start doing parallelism, I might actually be making things worse for the system overall, even if performance for one particular query gets better; so what am I doing about that? And my answer is: I'm not doing anything about that yet, because I haven't gotten that far. But it is a really good question, and I'll come back to it a little later.

All right, back to parallel maintenance. VACUUM FULL and CLUSTER are in some ways very similar to these commands, because once again you have the same opportunities for parallelizing scans and sorts. The issues are a little bit different in every case. But I think
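The intra-node scan idea above — different backends processing different block ranges and applying the filter condition independently — can be sketched in self-contained C using fork() and an anonymous shared mapping. This is only an illustration of the partitioning idea, not PostgreSQL code: the row layout, worker count, and the stand-in `something_complicated` predicate are all invented for the example, and real parallel scans must also handle visibility checks, uneven block counts, and worker failure.

```c
#define _DEFAULT_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define NROWS    1024
#define NWORKERS 4

/* A deliberately "expensive" filter, standing in for the talk's
 * Something Complicated filter condition. */
static int something_complicated(int row) { return row % 7 == 0; }

/* Each worker scans one contiguous range of rows and writes its
 * match count into a per-worker slot in shared memory. */
static void scan_range(const int *rows, int lo, int hi, int *result)
{
    int n = 0;
    for (int i = lo; i < hi; i++)
        if (something_complicated(rows[i]))
            n++;
    *result = n;
}

int parallel_scan_count(void)
{
    size_t bytes = (NROWS + NWORKERS) * sizeof(int);

    /* Anonymous shared mapping: visible to forked children, playing the
     * role of a (very) simplified dynamic shared memory segment. */
    int *shared = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    int *rows = shared;
    int *results = shared + NROWS;
    int per = NROWS / NWORKERS;
    int total = 0;

    for (int i = 0; i < NROWS; i++)
        rows[i] = i;

    for (int w = 0; w < NWORKERS; w++)
    {
        if (fork() == 0)
        {
            /* Child: scan only this worker's block range, then exit. */
            scan_range(rows, w * per, (w + 1) * per, &results[w]);
            _exit(0);
        }
    }
    for (int w = 0; w < NWORKERS; w++)
        wait(NULL);

    /* Parent gathers the partial results, like a final "join" step. */
    for (int w = 0; w < NWORKERS; w++)
        total += results[w];
    munmap(shared, bytes);
    return total;
}
```

Here the parent plays the role of the leader backend: it carves the "table" into equal ranges, forks one worker per range, and sums the partial counts after all workers exit.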
once we get the ability to parallelize some of these basic operations, we'll probably go through a period where we plumb that infrastructure into the various different parts of the system, which will no doubt be a barrel of laughs.

All right, I want to talk a little bit about architecture, and about some of the architectural decisions I made early in the project. I'm obviously not saying I made these decisions for the project, because as much as I might wish I had that power, I totally don't; I'm talking about the decisions I made in terms of what I was going to work on and how I was going to pursue it. If people disagree with these decisions, that would be great to hear about, though maybe after the talk rather than during it, just in the interest of time. I didn't bring any kind of shielding device, so if you feel the urge to throw things at me when you hear some of these go by, I'd sure appreciate it if you could refrain.

The first decision I made is that we're going to use processes, not threads. People often say: it would be so much easier to do parallelism with a thread model than a process model. And they might be right, because there are certainly plenty of challenges doing it this way, but I don't really feel like rewriting the entire backend and having it become totally unstable and not work anymore, so I just can't see that there's any joy there. If you start multiple threads inside an individual PostgreSQL backend, the good news is that the rest of the system, all the stuff in shared memory, doesn't have to know; the bad news is that almost every piece of backend-local code and every backend-local data structure does have to know, and there are actually a lot more of those than there are things in shared memory. We could have a great argument over alcohol
Well, there are a lot of IPC tools available across the platforms we support, but relatively few of them work on every platform we support, which is kind of a pest. Still, pretty much every system has some way of doing shared memory, some way of doing pipes, and of course everybody's got files, although on Windows the APIs are somewhat different. I rejected files pretty quickly. There might be something we want to use files for within this set of things around parallelism, parallel tapes or something, but for the most part: writing to a file is a system call, you might end up doing I/O, and somebody else then has to read the data back in before they can use it, and that just didn't seem like a good way to build a high-performance system. Pipes are a good paradigm for a lot of what we want to do in terms of streaming data from one process to another, and for that kind of streaming a pipe would be a good fit. But shared memory is a lot more flexible, and as evidence of that I offer the fact that you can much more easily use shared memory to emulate a pipe than you can use a pipe to emulate shared memory. Now, it has been pointed out to me that there probably are ways to use pipes to emulate shared memory, but I was a little scared of some of those techniques, and some of them are operating-system dependent. So what I did instead is I wrote a module called shm_mq, which uses shared memory to emulate a pipe. One of the things I think is nice about that is that I've got one set of platform dependencies, around shared memory, instead of two sets, one around shared memory and one around pipes. And the queue model actually provides nice semantics: you can make your pipe as big as you want, or as small as you can get away with, depending on what performance characteristics you need; it's totally platform-independent; and it never splits messages into smaller messages, which the operating system's pipe facilities would probably do on some platforms and not others.
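To make the "shared memory emulating a pipe" idea concrete, here is a toy single-producer/single-consumer byte queue over a plain array. This is not the actual shm_mq implementation — the real module also handles message framing, blocking, and unexpected process death, and all names below are invented — but it shows the core trick: monotonically increasing read/write counters whose difference is the number of bytes in flight.

```c
#include <stdint.h>
#include <string.h>

/* Toy ring buffer sketching a pipe built on shared bytes.  In a real
 * setting this struct would live inside a shared memory segment, with
 * one process calling ring_send() and another calling ring_receive(). */
#define RING_SIZE 64

typedef struct
{
    uint64_t read_pos;          /* total bytes ever consumed  */
    uint64_t write_pos;         /* total bytes ever produced  */
    uint8_t  data[RING_SIZE];
} ring_buffer;

static void ring_init(ring_buffer *rb) { memset(rb, 0, sizeof(*rb)); }

/* Enqueue up to len bytes; returns how many actually fit. */
static size_t ring_send(ring_buffer *rb, const void *src, size_t len)
{
    size_t avail = RING_SIZE - (size_t)(rb->write_pos - rb->read_pos);
    const uint8_t *p = src;

    if (len > avail)
        len = avail;
    for (size_t i = 0; i < len; i++)
        rb->data[(rb->write_pos + i) % RING_SIZE] = p[i];
    rb->write_pos += len;
    return len;
}

/* Dequeue up to len bytes; returns how many were available. */
static size_t ring_receive(ring_buffer *rb, void *dst, size_t len)
{
    size_t avail = (size_t)(rb->write_pos - rb->read_pos);
    uint8_t *p = dst;

    if (len > avail)
        len = avail;
    for (size_t i = 0; i < len; i++)
        p[i] = rb->data[(rb->read_pos + i) % RING_SIZE];
    rb->read_pos += len;
    return len;
}
```

Because the counters only ever grow and each side writes only its own counter, a single producer and a single consumer can use this without locking, which is one reason the queue model maps so naturally onto shared memory.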
And so that seems to be working nicely. I also decided that we needed dynamic shared memory, specifically, rather than the main shared memory segment. The big consideration here is that if you want to do a parallel sort, say a giant quicksort in memory, you might need a terabyte of memory, and you can't pre-reserve a terabyte in the main shared memory segment on any system I'm aware of without users complaining that when you're not doing a parallel sort, you're not using a terabyte of their RAM. So it just didn't seem like there was any way to make that work.

A more painful decision, and this one was really painful, and the pain will no doubt be with us for a long time, is that we would live with the fact that these dynamic shared memory segments can be mapped at different addresses in different cooperating processes. And that kind of sucks, because it means you can't use actual pointers inside these dynamic shared memory segments and have things work the way you want them to, and I hate that. On the other hand, only one hacker has told me he thinks it might be possible to map a segment at the same address in every process; every other hacker I've talked to, including me when I talk to myself, has said that's never going to be reliable, it's not going to work, and you'll die if you try. So I'm trying to deal with that complexity in the individual users of the dynamic shared memory facility, rather than building it into the facility itself. So those are some of the architectural decisions that got made.

[Audience question.] So the question was: what's the difference between dynamic shared memory and regular shared memory? There is a regular shared memory segment that we create at postmaster start, and we actually do manage to map that at the same address in every process. On UNIX we do that through inheritance across fork; on Windows there's a mechanism for it which is not 100 per cent reliable, and when it fails, we retry. Dynamic shared memory segments are different because you can stand one up at any time during the execution of the server. The main shared memory segment is established once at startup, and that's what you've got forever; dynamic shared memory segments you can create on the fly and tear down while the server is running, with no requirement that it happen at backend startup or anything like that. You spin one up when you want one, and tear it down when you don't need it anymore. Other questions on the architecture stuff?

[Audience question.] So the question is about non-uniform memory access. Right now there's no code anywhere in PostgreSQL that is NUMA-aware. There might be a good argument for making some parts of it NUMA-aware; what we lack is benchmark results showing that the lack of NUMA awareness actually hurts. I don't necessarily doubt that it does, but it's one thing to think there might be a problem with something, and another thing to have real, reproducible test cases that demonstrate it. The closest thing I'm aware of is that Kevin had a benchmark that did show a significant overhead on one workload, but I think he was never able to reproduce it reliably. So there may well be smarter things we can do there, but I think we just don't know enough yet to be that smart, and I don't think this particular project is the right place to start being that smart; if there are NUMA smarts to be added, it's probably much more useful to add them to the existing infrastructure that everybody uses all the time.

[Audience question.] All right, so the question is: can we safely assume that dynamic shared memory sidesteps the System V shared memory limits the same way the main shared memory segment does, after the change I made in 9.3? The answer is no, you can't assume that, because it's completely false. The trick we used in that case doesn't work for this case, because the only way of propagating an anonymous shared memory mapping from one process to another is fork, which is not an option here. So what we have for dynamic shared memory is a configuration variable, dynamic_shared_memory_type, that lets you pick from among the shared memory implementations that work on your operating system. Hopefully most people will have POSIX shared memory, which doesn't suffer from the System V limits; but if you're running on a BSD system, you might find that you haven't got POSIX shared memory, in which case you should, number one, ask your BSD kernel guys very nicely if they wouldn't mind adding it, and number two, configure that variable to say System V. It would be nice if that could all be taken care of automatically, but let me move on here, just in the interest of time, because I think people will have questions about the remaining slides as well.
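Since dynamic shared memory segments may be mapped at different addresses in different processes, data structures placed inside them cannot store raw pointers. The standard workaround, sketched here in self-contained C with invented names, is to store offsets relative to the segment base and convert them back to pointers using whatever base address the current process happens to have:

```c
#include <stddef.h>

/* Offset 0 doubles as the "null" link; real data starts past it. */
#define INVALID_OFFSET ((size_t) 0)

/* A linked-list node that stores its successor as an offset from the
 * start of the segment, not as a pointer.  The same bytes therefore
 * mean the same thing no matter where the segment is mapped. */
typedef struct
{
    int    value;
    size_t next;                /* offset of next node, or INVALID_OFFSET */
} rel_node;

/* Rehydrate an offset into a pointer using this process's mapping. */
static rel_node *node_at(char *segment_base, size_t offset)
{
    return offset == INVALID_OFFSET ? NULL
                                    : (rel_node *) (segment_base + offset);
}

/* Walk the list and sum the values; works identically regardless of
 * which address the segment is mapped at. */
static int sum_list(char *segment_base, size_t head_offset)
{
    int sum = 0;

    for (rel_node *n = node_at(segment_base, head_offset); n != NULL;
         n = node_at(segment_base, n->next))
        sum += n->value;
    return sum;
}
```

The test below simulates two processes mapping the segment at different addresses by copying the byte image to a second buffer: the offsets remain valid even though every pointer would have been wrong.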
So the basic question here is: what do we need to build, and how far along are we in building it, in terms of actually making something that works, as opposed to something you can give a conference talk about? I've divided the work into five areas, and the parenthesized comments on the slide give you some idea of where we are with each of them. There are some basic facilities, which I think are pretty much done; these are the very fundamental building blocks of parallelism. Then there's what I'm calling plumbing, which is a bunch of things that are not technically part of those basic facilities but are needed in order to use them effectively. I'll talk a little more about each of these over the coming slides, so don't get too hung up if you don't totally understand what I'm saying now. The next thing is establishing what I'm calling a parallel environment; I'll talk more about what that means, and I've done some internal hacking and have some ideas there, but there's a lot more work to be done. The goal of the parallel environment, built on top of the plumbing and the basic facilities, is to create a space where you can reasonably do things in parallel. Then, of course, you have to write the code that actually does things in parallel, which is the parallel execution piece, and eventually, for parallel query as opposed to maintenance commands, you're going to need parallel planning. The idea is that the first three things on this list are mostly infrastructure you build once, and the last two are things you do over and over again for each new thing you want to parallelize. So getting through that first pile of stuff is sort of a prerequisite for being able to parallelize individual things.

[Audience question.] By parallel planning I mean planning what parallelism to actually do for a particular query. So the question is: why does the planner need to know? The planner needs to know because it has to tell the executor what to do: it has to tell the executor to run this part of the plan in a separate process from that part of the plan, and it has to decide where to put those break points. There's also costing: if executing the query in parallel is cheaper than executing it serially, you want to know that so you can pick the parallel plan, and if the optimal query plan is not parallel, the planner needs to know that too, so it can correctly choose between plans. It also has to tell the executor exactly which things we're parallelizing in this particular query, how many parallel processes we're using, and which one is going to be responsible for what. The planner will have to tell the individual nodes what their jobs are, because otherwise they won't know.

[Audience question.] So the question is: does Postgres-XC, or anything else that has done related work, have APIs we could reuse? I think not, because I think the communication among cooperating backends within a single node is going to be fundamentally different from the communication between the nodes of a cluster. Let me move on to the next couple of slides; I've got a slide or two on each of these areas, and people can ask questions on each slide as we come to the topic, rather than jumping around.

So, the basic facilities we've got are, item one, dynamic background workers, and item two, dynamic shared memory. Dynamic background workers means you can tell the postmaster: please start a process and have it run this function. And behold, the postmaster will start that process and have it run that function. What happens after that is up to you. So it's a really low-level facility. Dynamic shared memory is similar: you can spin up a dynamic shared memory segment whenever you want, and it will eventually go away. The "it will eventually go away" part is actually where most of the work went, because it turns out that those of us who are used to coding inside the PostgreSQL backend are used to the fact that when, for example, your transaction aborts, all of your resources get released. Well, as I found out the hard way, if you add a whole new kind of resource and you don't write any code to make it automatically go away, it doesn't automatically go away, because the transaction abort machinery is not magic. That was the wake-up moment that made me realize I was building really low-level facilities here, and that I had to handle transaction abort, process termination, and cluster shutdown; there are about three different levels of making sure we clean up dynamic shared memory segments at the appropriate time, and if you got rid of any one of them, you would have leaks. So that was the complicated part, but the good news is it's done. In fact, I've already gotten bug reports on these things, and the great thing about that is that it means people are using them. We barely got 9.4 beta1 out the door, and even before it shipped I had bug reports related to these facilities; people don't find bugs in things they're not using. So even though we're a long way from having real parallelism here, people are latching onto these C APIs and doing stuff with them that's significant enough that they actually care about whether it works, which I think is fantastic. Part of my goal here was to build the individual pieces of this project in a modular way, so that they could be used for parallel query but also for other stuff, and I'm trying to make that possible at each layer. So next
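The cleanup problem described above — a new kind of resource only goes away if someone registers code to make it go away — can be illustrated with a toy callback registry. This is only the shape of the idea: PostgreSQL's real mechanisms (for instance on_dsm_detach callbacks and resource owners) are considerably more elaborate, and every name below is invented for the sketch.

```c
#include <stddef.h>

#define MAX_CALLBACKS 8

typedef void (*cleanup_fn)(void *arg);

typedef struct
{
    cleanup_fn fn;
    void      *arg;
} cleanup_entry;

static cleanup_entry callbacks[MAX_CALLBACKS];
static int ncallbacks = 0;

/* Remember a teardown action to run when the "transaction" ends. */
static void register_cleanup(cleanup_fn fn, void *arg)
{
    if (ncallbacks < MAX_CALLBACKS)
    {
        callbacks[ncallbacks].fn = fn;
        callbacks[ncallbacks].arg = arg;
        ncallbacks++;
    }
}

/* Run callbacks in reverse order of registration, so resources
 * acquired later are torn down first. */
static void run_cleanups(void)
{
    while (ncallbacks > 0)
    {
        ncallbacks--;
        callbacks[ncallbacks].fn(callbacks[ncallbacks].arg);
    }
}

/* Demo: record the order in which two fake "segments" are released. */
static int release_order[2];
static int nreleased = 0;
static void release_segment(void *arg)
{
    release_order[nreleased++] = *(int *) arg;
}

int cleanup_demo_first_released(void)
{
    int seg_a = 1, seg_b = 2;

    nreleased = 0;
    register_cleanup(release_segment, &seg_a);
    register_cleanup(release_segment, &seg_b);
    run_cleanups();             /* e.g. the abort path fires */
    return release_order[0];    /* last registered, first released */
}
```

The point of the sketch is simply that without an explicit `register_cleanup` call there is nothing for the abort path to run, which is exactly why a brand-new resource type leaks until someone writes this code.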
is clock rate and this is the stuff where it's like 0 gee I can get a big chunk of undifferentiated bites that's called an amateur memory segment but what the hell do I do with that after that right and so 1 option is to everyone within their your bites have thought but that's not necessarily an easy programming model that you what things so 1 of the things I did 9 . 4 is I created this concept of the DSM Table of contents we have a similar Table of contents although it's more complex for the mentioned memory segment and and the basic idea here is that this provides some dead simple infrastructure up so that when you map a dynamic shared memory segment you can actually do some discovery and figure out what kind of data structures are in there and locate the particular data that you're interested in within the overall direction memory on several people that on list that they work Chih that I got this part right and they might well be corrected but we won't know until we try and I think it's pretty good I think I can see how to use it for a bunch of useful stuff but you know as with any of this fundamental infrastructure and particularly once you get beyond the absolutely most basic stuff that you know you're gonna need there's some room for maneuvering none of them that got under 9 . 
4 is a simple message-queueing facility — shm_mq, which I mentioned earlier. This is intended to answer the question: how do a group of cooperating processes communicate? If you have a background worker that's running and it generates tuples, those tuples need to get to another process so they can be further processed. Or the worker just errors out, and the other backend needs to get that error so it can decide to stop whatever it's doing. There has to be some easy way of doing that, and that's what I invented this for: a way to — hopefully — send tuples, errors, notices, and whatever other control information needs to pass between the cooperating processes. Admittedly, this is still a kind of low-level facility, but I got a bug report on this one already too, so again, somebody is using it.

Error propagation on top of this is sort of what I think the next thing is here. The idea is that we don't want every person who uses background workers and dynamic shared memory and message queues to have to reinvent their own way of sending, say, their state from one process to another. We want to invent some ways of doing that which are somewhat standardized, and people can, if they so choose, just use them — some people may want to roll their own for whatever reason, and that's fine too. That's something I have just begun to work on; I have some ideas that I think are promising, and we'll see how many rotten tomatoes get thrown at me when I post them.

Another thing I'd put in this category of plumbing is a shared memory allocator. This is an idea about which there is considerable skepticism, which is not without some foundation — indeed, I have some skepticism myself, despite the very large amount of time I've spent trying to figure out how to make this work. At the same time, if you look at other
pieces of software out there that do things in parallel but do not have a shared memory allocator — which today we do not — I think you will find probably no such projects, and certainly not many that have been as successful as we are. So there are very legitimate concerns here: how much shared memory allocation does it really make sense to do? How dynamic can you really make this without the behavior of the system becoming complex and difficult to understand? How do you reconcile it with the fact that both the main shared memory segment and these dynamic shared memory segments have a size that is fixed when you create them? On some platforms there actually are ways of growing a dynamic shared memory segment once you've created it, but there are tricky problems there as well. I don't really know what the answers to all the questions in this area are, but I do think there are going to be some things that are really hard to do if you don't have an allocator. So I'm working on it, and we'll see how it turns out.

I think someday we'll want other data structures in shared memory too, with common implementations that can be shared by multiple users — for example, a shared hash table. On the one hand, you can imagine using this to create an in-memory data store extension for PostgreSQL that just makes a big hash table in dynamic shared memory and stores stuff in there. On the other hand, you can also imagine using it for something like the combo CID hash, so that you can actually do writes in multiple backends at the same time and not be restricted the way we are today. But I'm not going to work on that right away, because I think I can get a long way without it. [An audience member mentions hash aggregates.] Yes — a hash aggregate is a good example of where the hash table would need to change as you work your way through, so that would be a good use case for a dynamic allocator. And that particular case of an aggregate is very hard to make work, because if you run out of memory in a shared memory segment, you have no workable fallback strategy. For sorting you do: if you fill up a shared memory segment that is a fixed size, you don't have to worry about trying to grow another one — you can just switch to an external sort. But if you are doing an aggregate, you can't just say we won't read the second half of the data and we'll make up the results based on the tuples we've seen so far. I'm fairly sure someone would complain about that.

OK — so those are the things that I think are useful. Some of them are done, some of them are not, and there is debate about which ones are useful and to what degree. All right: setting up a parallel environment. What I really mean by this is making a background worker look enough like a regular user backend to do useful work — and specifically, making it look enough like the backend it's cooperating with, the one that caused it to be started, for it to do useful work. This is a complex problem, but I think a lot of it can be done by basically copying relevant state — like the user, which database we're connected to, the snapshot — over to the background worker, using dynamic shared memory as the vehicle. Obviously this is going to be somewhat expensive, which is why I say this kind of parallelism is probably going to be mainly suitable for long-running operations. If it takes 100 milliseconds to copy all of your state from one backend to another, and your query or maintenance operation or whatever is long-running, you really don't care about the fact that it took an extra 100 milliseconds to get things started. If you're doing a 100-millisecond query, you will care; if you're doing a five-minute index build, you won't, provided there is at least some speedup. Now, useful work doesn't mean everything. Getting the background worker to be able to do useful work does not mean that everything you could do inside the user's dedicated backend can be done equally well inside the worker, and some operations seem fundamentally unsafe
in a parallel context. For example, if you have a user-defined function that's not a SECURITY DEFINER function and it changes a GUC value, that changed value will persist after the function returns. That is arguably bad behavior anyway, but it's a good example of behavior where, in a parallel context, I'm not really sure what it even means. I can't really make sense in my head of what the semantics of that would be in a parallel context, but they will certainly be different from whatever they are in a single process, so you probably don't want to let people do that. Some operations could theoretically be made safe in a parallel context, but we probably won't bother, at least for now.

A good example of something that I think might not be worth making parallel-safe, even though it could be done, is setseed and random. If you call setseed, you're setting the random seed, and you then expect that you're going to get random numbers in the same order you got them the last time you configured that particular seed. Obviously, if you spin up parallel background workers and they each have their own copy of the random state, they're each going to generate their own series, which will produce a user-visible behavior difference from running in a single process, where you'd get one sequence. Now, you could fix this: you could provide code that allows random to relocate its state into dynamic shared memory, and build all the plumbing so that this works exactly the same way in a parallel context — or enough like it that maybe nobody would care. But I can't imagine we're going to bother. And even if you could make setseed safe — even if we have a way for lots of things to share state between all the cooperating backends — arbitrary user-supplied code can never be proven safe. If you're running some user-defined function that somebody has loaded into the backend, or even some built-in function that some hacker wrote, you can't inspect the code of that function and see whether it uses a backend-private piece of persistent state that would need to be copied from one backend to another. There's just no way of doing it. So I don't see a way around the need to have some kind of labeling of functions to say: these functions are parallel-safe, these functions are not parallel-safe. If you label your function safe when it's really not, that's your problem. And the planner will walk over the query tree and say: I might have considered parallelizing this, but it contains a function that's not safe in a parallel context, so sorry. It would be great if we had a program that could analyze another program and determine whether that program is safe, but Turing had something to say about that. All right, so what do we
actually need to copy? This is my list — actually, more properly, it's Noah's list, since he did a lot of the research on what the main things are here — and I think we can probably build a basic parallel environment that can do interesting stuff by copying these things. Obviously, the background worker needs the same notion of the user ID and the current database as the original backend. It probably needs to have the same values for all the GUCs. It probably needs to know what transaction it's in — maybe you can do some really simple read-only things without too much in this area, but we'll probably need to copy some elements of the transaction state. If it's going to do any database access, it needs the same active snapshot that you have in the original backend. And if you've got a combo CID hash, you have to share that too, because otherwise the worker might not be able to correctly read data that has been previously written by the original backend. (I say this while watching the other hackers in the room writhe in pain.) For a first version, I'm not trying to target anything that involves SELECT FOR UPDATE, or updates, or deletes — those things would obviously be valuable to have eventually, but they don't seem absolutely essential for the first go-around. I have a separate slide on writes: I'm thinking probably not in the initial version, because if you do writes, then the combo CID hash can't just be copied — it has to actually be a shared hash table — and I'm pretty sure there's a bunch of other stuff that would need to be fixed as well. This is hard enough, so I want to get this done and then look at extending it. OK, now you also need to prohibit some things. A lot of these are things that could theoretically be allowed if you had the right state sharing in dynamic shared memory, but probably won't be at first; some of them are things that probably can't ever be made to work.
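The copy-this-state list above can be sketched as a flatten/restore pair. This is just an illustration of the pattern — not PostgreSQL's actual serialization code — and the field names and JSON encoding are my own assumptions:

```python
import json

# Hypothetical sketch of the "copy relevant state" step: flatten the
# backend-private state listed above (user ID, database, GUCs, active
# snapshot) into bytes that fit in a fixed-size dynamic shared memory
# segment, then restore it on the worker side. PostgreSQL uses dedicated
# C serializers for each piece; the JSON encoding here is illustrative.

def serialize_state(user_id, database, gucs, snapshot):
    """Leader side: flatten state, with a length prefix so the worker
    knows how much of the segment is actually in use."""
    payload = json.dumps({
        "user_id": user_id,
        "database": database,
        "gucs": gucs,          # setting name -> value
        "snapshot": snapshot,  # e.g. {"xmin": ..., "xmax": ..., "xip": [...]}
    }).encode()
    return len(payload).to_bytes(4, "little") + payload

def restore_state(segment):
    """Worker side: read the length prefix, then decode the state."""
    n = int.from_bytes(segment[:4], "little")
    return json.loads(segment[4:4 + n].decode())

# A segment is usually bigger than the payload; trailing bytes are ignored.
seg = serialize_state(10, "postgres",
                      {"work_mem": "4MB"},
                      {"xmin": 100, "xmax": 105, "xip": [101]}) + b"\x00" * 64
state = restore_state(seg)
print(state["database"])  # -> postgres
```

The length prefix matters because, as noted earlier, the segment's size is fixed at creation time, so the worker cannot assume the payload fills it.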
Basically, these are a bunch of things — again, this is research by Noah — that I know have backend-local state associated with them. Sequences have a backend-local sequence cache. Cursor operations: the position of a cursor is backend-local. Large objects have cursors too. LISTEN and NOTIFY have associated backend-local state. Temporary buffers are stored in backend-private memory. Prepared statements are stored in backend-private memory. Some of this stuff could be synchronized, but it seems OK to me for a first version to say: well, you can't do those things in a parallel context. We'll figure out later which of those restrictions actually bother people really badly, and see if there's a reasonable way of relaxing them. Invalidation messages seem like something that can probably never work. I think that's really only going to be an issue if you have, say, one user-defined function that's going to create a table, and then another function that gets called later in the same query tries to read from that table, or something like that. If those two operations are actually happening in different backends, then the invalidation messages from the first backend — the one that created the table — would have to somehow be immediately perceived by the other backend in time for it to read the table. It's a mess; I don't want to go there; I don't care about that use case, and you probably don't either. Now, lock
management is a whole other kettle of fish — this is what Andres was asking about. For the most part, the system overall doesn't really need to care that much about the fact that a group of processes is cooperating to do something. For most of the data structures we have in shared memory, you really don't need to worry about it: the processes themselves need to know they're cooperating so they can coordinate with each other, but the system overall doesn't need to know. Locks are an exception. You can think about a couple of ways to try to handle this, and they don't work. The background workers could take no locks at all and just rely on the user backend's locks — but the user backend might die before the background worker finishes doing whatever it's doing that requires that lock, and that would be a disaster. You could try to fix that by saying the user backend is never allowed to die and release its locks until all the other workers are dead, but I have no faith at all that that kind of solution can be made robust. The other obvious option is to say the user backend takes the lock initially and the background workers just also take locks when they need them. The problem with that is deadlock. Suppose a parallel query starts: the user backend locks a relation with AccessShareLock; another process comes along and tries to start a CLUSTER operation on that table, which means no more locks on that table can be granted; then the background worker comes along and tries to acquire its lock on that relation. That background worker is now just going to sit there until the CLUSTER completes — which it won't, because the CLUSTER can't start, because the main backend is already holding a lock on that relation, which it's not going to release until the background worker finishes working, which is not going to happen because it's waiting for its lock. So something obviously needs to be done about this problem. One option is just to sort of have the background workers try to grab all the locks without blocking, and if any of them fail, the workers die and you fall back to non-parallel query or something. But I think that's a pretty unappealing option, and I think it would be as complex to implement as what I believe the real solution is: probably some concept of lock groups inside the lock manager. We could say that a group of backends share a lock state — either they all collectively hold whatever locks they hold, with individual processes acquiring and releasing locks on behalf of the whole group, or maybe there's some way where we just tell the deadlock checker and related machinery that processes in the same lock group are allowed to queue-jump and don't have to wait behind unrelated lock requests. I haven't figured out all the details here yet, but I'm sure it will be exciting. Does that answer your question? All right — I'll get to the additional issues in a moment.
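The lock-group idea can be sketched as a small tweak to the grant check: holders in the requester's own group are ignored, and group members effectively queue-jump past unrelated waiters, which dissolves the undetected deadlock described above. The two-mode conflict table and all names here are my own simplification, not the lock manager's actual code:

```python
# Minimal sketch (my own simplification, not PostgreSQL's lock manager):
# a grant check where lock holders in the requester's own group are
# ignored, so a background worker can acquire AccessShareLock on a
# relation its leader already holds, even with a CLUSTER queued behind it.
# Skipping the wait queue entirely models the "queue-jump" variant.

ACCESS_SHARE = "AccessShareLock"
ACCESS_EXCLUSIVE = "AccessExclusiveLock"

def conflicts(a, b):
    """Two-mode conflict table: exclusive conflicts with everything."""
    return ACCESS_EXCLUSIVE in (a, b)

def can_grant(requested, requester_group, holders):
    """holders: list of (lock_group_id, mode) already granted on the object."""
    for group, mode in holders:
        if group == requester_group:
            continue  # same lock group: never treated as a conflict
        if conflicts(mode, requested):
            return False
    return True

# Leader (group 1) holds AccessShareLock on the table.
holders = [(1, ACCESS_SHARE)]
# CLUSTER (group 2) wants AccessExclusiveLock: blocked, as before.
print(can_grant(ACCESS_EXCLUSIVE, 2, holders))  # -> False
# A worker in group 1 wants AccessShareLock: granted despite the waiting
# CLUSTER, because the only holder is in its own group.
print(can_grant(ACCESS_SHARE, 1, holders))      # -> True
```

A real implementation also has to teach the deadlock checker about groups; this sketch only shows why the grant check itself must distinguish group members from strangers.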
And so, parallel execution. In some sense this is the easy part: once you've got all the plumbing and all of the state-synchronization code, you've prohibited the things that aren't safe, and all of the basic infrastructure is in place, you've got the tools you need to actually code up a parallel algorithm. And the parallel algorithms are pretty well described in the literature. Parallel sorting algorithms, for example, are pretty well understood — you can go read about how a parallel quicksort works. For a parallel sequential scan, you hardly even need to think about what to do; there are a couple of different choices in terms of how you interleave the different workers, but fundamentally I think it's not that complicated.

The major thing that comes up here is Amdahl's law, which says that if alpha is the fraction of the running time of a program that executes serially, the maximum speedup from parallelism, even with infinite workers, is 1/alpha. In other words, any time your background workers spend not being busy really matters: it will rapidly erode all the gains you got from parallelism. If 10 percent of the time is spent executing serially, you can't get more than 10 times faster — and you only get 10 times faster if you squeeze everything else down to zero. So one of the key things with these parallel algorithms is that you've got to structure the algorithm in such a way that you won't have workers starving because they don't have anything to do. For example, in a sequential scan, you do not want to break the relation into equal-sized chunks and have one background worker scan each chunk — because then, if one chunk turns out to be fast to scan because it's in memory and another turns out to be slow to scan because it's out on disk, you'll have some background workers that are done while other background workers still have lots of work left to do, and you won't get as much benefit out of parallelism as you could have. You'd probably also get a lot of crappy random I/O behavior. What you probably want instead is to have the backends work their way through the relation together, so that, first of all, you get sequential I/O behavior, and also so that you don't have one worker run way ahead of the others, finish first, and then have nothing else to do. (Am I out of time? I'm almost done — two slides left.)

So, parallel query planning. As I said to someone one time, there are two kinds of PostgreSQL development projects: those that don't touch the query planner, and those that are unsuccessful. That's a slight exaggeration, but not by much, and it's one of the reasons I think making maintenance operations parallel might be a good place to start. One of the problems someone in the audience asked about earlier is that you might have a plan that's cheaper for you — you wait less time for the results — but enormously more expensive in terms of the expenditure of overall system resources. So we'll have to think a little about what it means to pick the cheapest-cost plan when we factor in the overall cost to the system of spinning up a bunch of workers. There are other things we'll need too, like a notion of the worker-startup and IPC costs that are introduced when you parallelize things. And that is actually pretty much all I have, so I'll stop here and take a couple of questions. [An audience member asks about the memory-management literature.] I did read a lot of papers on memory management. It turns out I wasn't able to find a paper that describes somebody doing exactly what we're doing. In particular, a lot of the
literature about allocators assumes that the allocator can allocate its own metadata by calling the system malloc. If you're allocating from dynamic shared memory, there is no system malloc that knows how to do that — and there are other problems as well. That's all the stuff I have. Everybody wants to leave, but feel free to come up and ask questions if you like. Thanks.
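Two points from the execution discussion lend themselves to a quick sketch: the Amdahl's-law bound on speedup, and the claim-the-next-block scan strategy that keeps workers from starving. The function names and the Python rendering here are illustrative, not PostgreSQL code:

```python
import itertools

def amdahl_speedup(alpha, workers):
    """Speedup with `workers` processes when a fraction `alpha` of the
    running time is inherently serial; as workers grows, this approaches
    the 1/alpha ceiling mentioned in the talk."""
    return 1.0 / (alpha + (1.0 - alpha) / workers)

class BlockClaimer:
    """Instead of pre-assigning equal-sized chunks, workers repeatedly
    claim the next unscanned block: a worker that hits cached blocks
    simply claims more of them, while a worker stuck on disk claims
    fewer, so nobody finishes early and sits idle."""
    def __init__(self, nblocks):
        self._next = itertools.count()
        self._nblocks = nblocks

    def claim(self):
        # In real shared memory this would be an atomic fetch-and-add.
        b = next(self._next)
        return b if b < self._nblocks else None

print(round(amdahl_speedup(0.1, 10**9), 3))  # -> 10.0 (the 1/alpha ceiling)
claimer = BlockClaimer(3)
print([claimer.claim() for _ in range(5)])   # -> [0, 1, 2, None, None]
```

The shared counter also yields roughly sequential I/O, since successive blocks are handed out in order regardless of which worker claims them.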