Parallel Sequential Scan

Video in TIB AV-Portal: Parallel Sequential Scan

Formal Metadata

Parallel Sequential Scan
CC Attribution - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Production Place
Ottawa, Canada

Content Metadata

Unleashing a herd of elephants. Parallel query is close to becoming a reality in PostgreSQL! A year ago, much of the low-level infrastructure needed for parallelism, such as dynamic shared memory and dynamic background workers, had been completed, but no user-visible facilities made use of this infrastructure. Major work on error handling, transaction control, and state sharing has since been completed, and further patches, including a patch for parallel sequential scan, are pending. In this talk, we will discuss parallel sequential scan itself, including performance considerations, the work allocation strategy, and the cost model; the infrastructure that supports parallel sequential scan, including state sharing for GUCs, transaction state, snapshots, and combo CIDs; error handling and transaction management; and the handling of heavyweight locking. Finally, we'll discuss the future of parallelism in PostgreSQL now that the basic infrastructure is (mostly) complete.

OK, it's one o'clock, so I guess we'll get started. Today my colleague Amit Kapila and I will be talking about parallel sequential scan, which unfortunately did not make it into PostgreSQL 9.5. We're hoping to get it into PostgreSQL 9.6 early in the development cycle, so that we have time to shake out all the bugs before it actually gets released. This is certainly the biggest feature I've ever tried to get into PostgreSQL — not necessarily in terms of its impact on users, although maybe that too, but certainly in terms of the amount of development time, effort, and planning that have gone into making it happen. I'd like to thank the people who have been a part of that effort: Amit has done a huge amount of work on this, which I'm very grateful for, others were involved in the design of the feature, and several of my colleagues wrote pieces of code that ended up in some of the patches as well. A lot of hands have touched this and a lot of thought has gone into it, and the review has been invaluable, even if it sometimes made me tear the hair out of my head. I think we're getting very close to having this feature, so today I'd like to talk about what has been done, what remains to be done, and what the status of the patches is, and we'll take all types of questions about it as we go.
The overall status of this feature is that we've been making progress on it release by release. In PostgreSQL 9.4 we got the very basic, fundamental infrastructure facilities that we knew we were going to need in order to build this feature. Álvaro Herrera had added background worker processes to PostgreSQL 9.3, but they all had to be started at the time the server was started, which was obviously not going to work for parallel query, so I made those into dynamic background workers, which can be launched while the server is running and can be started up and shut down pretty quickly. We also had only a fixed-size shared memory segment, which was not going to work for parallel queries that might need to exchange large amounts of data: you can't just nail down that much memory in advance at server startup. If you sort terabytes of data at some point during the server's lifetime, and you have enough memory to do that in memory, that doesn't mean you have a terabyte that the server can permanently reserve for its entire lifetime. So we really needed dynamic shared memory in order to make this a possibility. The other thing we got in 9.4 was basic infrastructure for managing the contents of a dynamic shared memory segment, and the biggest piece of that was shared memory message queues: you spin up a dynamic shared memory segment, two processes attach to a queue stored in that segment, and one can send messages to the other. If you want to talk in both directions, you use one queue going in each direction. We can use that for a variety of purposes: we can move error messages and warning messages between processes, as well as tuples, and we need to do both of those things.

So in the 9.4 cycle, basically what we got in place was a bunch of infrastructure that is targeted toward parallelism but is actually pretty general. Lots of people are excited by dynamic background workers and are using them for other things; a somewhat smaller number of people are excited about dynamic shared memory, but that too is being used for other things; and bugs in the shared memory message queues were reported by people other than me, which means somebody actually took my code and used it for something I wasn't planning to do with it, which is pretty cool. In PostgreSQL 9.5 we made a lot more progress specifically toward parallel computation. We got a series of things for error propagation, so that it's now quite easy for a process to spin up a background worker, and when the background worker throws an error, the error is actually reported by the master process to the client with all the pieces of the error state — not just the error message, but the file and line number where it happened, the detail, all of that gets collected and forwarded — and it works not only for errors but also for notices and warnings and so on. That leaves the actual feature as the thing that isn't done yet. We have working patches; there are some unresolved issues with those patches that still need to be fixed, but it does work, and if you apply the right patches and you don't mind the things that aren't fixed yet, you can try it out and see whether it works for your use case, which is pretty cool.
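The two 9.4 pieces — dynamically launched background workers, plus one shared memory message queue per direction of communication — can be sketched in miniature with ordinary OS processes. This is an illustrative analogy in Python, not PostgreSQL's implementation; the names `scan_worker` and `run_demo` are invented:

```python
import multiprocessing as mp

def scan_worker(inbox, outbox):
    # Worker loop: handle requests from the master and send results
    # (or an exit notice) back on the queue running the other direction.
    while True:
        msg = inbox.get()
        if msg == "shutdown":
            outbox.put(("notice", "worker exiting"))
            return
        outbox.put(("tuple", msg * 2))  # stand-in for "produce a tuple"

def run_demo():
    # Master side: launch a worker dynamically and use one queue per
    # direction, since each queue carries messages one way only.
    ctx = mp.get_context("fork")  # fork keeps the sketch simple on Unix
    to_worker, from_worker = ctx.Queue(), ctx.Queue()
    p = ctx.Process(target=scan_worker, args=(to_worker, from_worker))
    p.start()
    to_worker.put(21)
    to_worker.put("shutdown")
    messages = [from_worker.get(), from_worker.get()]
    p.join()
    return messages
```

Note that errors and warnings travel the same path as tuples here: they are just another kind of message on the worker-to-master queue.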
Just a bit more detail about what we got in 9.5. First of all, there was the message propagation work. We're actually using a slightly modified version of the frontend/backend protocol for the parallel workers to send their errors and warning messages back to the master process. They assemble their log, warning, or error messages in exactly the same way they normally would, they write them into one of the shared memory message queues, and the master receives the message — without our having to write a lot of code or do anything particularly exciting in the master — and the same message pops right back out of the master. It may not be obvious that you need that; it may sound like a boring feature, and it is a boring feature, honestly. But I think one of the hardest things about implementing parallelism is that there are a whole bunch of really boring things that are not exciting or glamorous to work on, but that you have to have in order for the feature to work. If you say, hey, would you like to do a project on error reporting, people say no, I want parallel query, because that's cool — but you can't exactly have parallel query that doesn't report errors; that clearly isn't going to work. So this is work that is now behind us: there may be issues to resolve, but it really works.
The other, probably bigger, thing that we got in 9.5 was the concept of parallel mode and parallel contexts. This basically makes it much easier for the master process to spin up a bunch of workers and initialize their state so that it more or less matches the master. It doesn't exactly match the master, and that's because we're not creating these worker processes with fork(). In PostgreSQL's architecture, every process in the system is a child of the postmaster process, so new worker processes are launched by contacting the postmaster and saying, please launch a worker process, and it does — and that process basically has no state, right? It's just a copy of the postmaster that became a worker. It doesn't have the same snapshot, it's not part of the same transaction, it doesn't know anything about locks; it doesn't know any of the state that pertains to the transaction the master started. And even if we wanted to change the architecture so that not all processes had to be children of the postmaster, that wouldn't get us very far on Windows, because Windows does not support fork(). So we felt we needed to take this approach, and that means we need to worry about how we're going to synchronize all the state to the worker.

With the patch that went into 9.5, we actually synchronize an awful lot of stuff. When the worker process starts executing, it loads all the libraries that the original master process had loaded; it connects to the same database using the same user ID; it sets all of its GUCs to the same values those GUCs have in the original process. It gets the XID for the current transaction and for the top-level transaction, which may be different if there are subtransactions involved at the point where you begin parallelism; it gets the list of XIDs that appear as committed to the original process, which is important for getting the MVCC visibility checks right; and it gets all the combo CID mappings from the original process. It gets the active snapshot, it gets the transaction snapshot, which may be different, and it gets the current user ID and the current security context. The way I came up with this list is that I asked for a list of all the stuff that needs to be the same in these processes, got pretty much this list back, and then we went and implemented it. It turns out this is actually pretty good: it sounds like a long, eclectic list with a bunch of random stuff in it, but it basically boils down to two things. You need to have the same tuples visible in the parallel worker as are visible in the master under all circumstances — even if there are subtransactions, even if all kinds of weird combo CIDs have been generated, even with open cursors — so that you get exactly the same MVCC semantics in the worker that you had in the master; a bunch of the list is related to that. And you need to have the same GUCs, because there are all kinds of pieces of code — input and output functions and all kinds of crazy stuff — that depend on the values of GUCs; if the GUCs are not all the same, it's hard to predict what will go wrong, but probably something will. Almost all of it really comes down to those two things, and then of course you need the same authentication information: we actually have inside and outside user IDs that are maintained in different variables, and they can all be different, and the values all have to match in the parallel worker versus the master. So now this is all taken care of, and it turns out that once you take care of all these things, almost all the backend code that you might want to run in a parallel worker runs just fine. One notable exception is that you cannot allow anybody to do any writes, because as soon as you start doing writes, things like your combo CID mappings can start changing and the processes get out of sync with each other; so that doesn't work.
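The point about sharing XIDs and snapshots so that "the same tuples are visible" can be illustrated with a toy visibility check. This is a deliberately simplified model — it ignores xmax, hint bits, subtransactions, and combo CIDs entirely, and the function name is invented — but it shows why a worker with a different snapshot would see different rows:

```python
def tuple_visible(tuple_xmin, snapshot_xmax, in_progress_xids):
    # A tuple is visible only if the transaction that inserted it
    # (tuple_xmin) began before the snapshot's horizon...
    if tuple_xmin >= snapshot_xmax:
        return False  # inserter began after the snapshot was taken
    # ...and was not still running when the snapshot was taken.
    if tuple_xmin in in_progress_xids:
        return False  # inserter was in progress; its rows are not visible
    return True
```

If the master's snapshot says transaction 7 is in progress but a worker's snapshot does not, the two processes disagree about which rows exist — which is exactly why the snapshot and XID lists are copied to the worker.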
So we have a set of patches for 9.6. Parallel mode and parallel contexts have one unfortunate omission right now, which is that they don't include any heavyweight lock handling. That's important because without it, if you spin up a bunch of parallel workers and any of a number of things happen — things that are admittedly unlikely to happen in real life, but that do happen — then you will just deadlock, possibly without reporting the deadlock, or possibly reporting a deadlock but leaving you confused about why it should have deadlocked at all. It's an incredibly boring and tedious problem, on which I have probably spent a hundred and twenty hours at this point, trying to get it right and convincing people that I've got it right, but it's one of the things that still needs to be nailed down. The other thing is that nearly every system-defined function is safe to run in parallel mode, but many user-defined functions will not be, and neither will a few system-defined functions. So we need to label a lot of functions with whether they do anything that is not safe to do under parallelism. In particular, we need to know whether they might update any data, because if a function updates data and you try to run it in parallel mode, that's not going to work, and you want to fall back to non-parallel mode. If your function, for example, updates data, or switches users, or manipulates subtransactions, that's not going to work in parallel, so that function is parallel-unsafe. Very, very few of the functions we ship out of the box fall into that category, but we do have some. The assess-parallel-safety patch adds all that labeling, and it arranges, when we plan a query, to search the query tree for any unsafe functions or any operations that write data; if we find any, we say, OK, this query will not be parallel. A lot of work, by the way, went into making sure that if you mislabel a function, you'll just get an error rather than a crash or silent bad behavior. We put a lot of energy into making as much of this as possible get figured out by the system automatically. Some user activity is required, because of the halting problem; but if you do make a mistake, fail to solve the halting problem yourself correctly, and therefore mislabel a function as safe when it really isn't, we just want that to fail — we don't want it to fall over in some horribly messy way.
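The planner-side check described here — walk the query tree, and if any function in it is labeled unsafe, fall back to a non-parallel plan — can be sketched as a recursive scan over an expression tree. The `PARALLEL_UNSAFE` set below stands in for the catalog labeling; both it and the nested-tuple tree encoding are invented for illustration:

```python
# Stand-in for the catalog labeling of parallel-unsafe functions.
PARALLEL_UNSAFE = {"nextval", "set_config", "my_writing_udf"}

def query_is_parallel_safe(node):
    # Expressions are encoded as nested tuples: (func_name, arg, arg, ...).
    # The whole query may run in parallel only if no function anywhere
    # in the tree is labeled unsafe.
    func, *args = node
    if func in PARALLEL_UNSAFE:
        return False
    return all(query_is_parallel_safe(a)
               for a in args if isinstance(a, tuple))
```

In the real system the labeling lives in the catalogs and the walk happens once at plan time, so an unsafe function anywhere in the query disables parallelism for that query rather than erroring at runtime.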
The last set of patches for 9.6 is around parallel sequential scan itself, and there are two main parts to that. There's a bunch of work to teach the executor specifically about parallelism — we've taught the transaction system about parallelism, but the executor is a separate thing, and every module in the system needs to know a few things about parallelism that it didn't need to know before. And then we get two new executor nodes, Funnel and Partial Sequential Scan, which are the things that actually power this new parallel sequential scan functionality. With that, I'll turn it over to Amit to talk about that part.
[Amit Kapila] Thanks. For this feature we use two different executor nodes: one is the Funnel node and the other is the Partial Sequential Scan node. The Funnel node runs in the master backend and has one child. The responsibility of the Funnel node is to combine all the tuples coming from the list of parallel workers — and from its own execution of the plan — and send them back up to the backend; if the workers are not producing tuples fast enough, the Funnel node can scan pages itself as well. The Partial Sequential Scan is a new node which actually does the scan of the data: each worker runs a partial sequential scan that is responsible for scanning a portion of the relation, and between them the workers cover the whole relation.
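One plausible way to divide the relation among workers — roughly the shape of the work allocation strategy the talk describes — is a shared next-page counter from which each worker claims the next unscanned block. In PostgreSQL the counter would live in dynamic shared memory; this sketch uses threads and a lock, and all names are invented:

```python
import threading

class BlockAllocator:
    """Hand out heap blocks one at a time from a shared counter, so a
    fast worker simply claims more blocks; no static partitioning."""
    def __init__(self, nblocks):
        self.nblocks = nblocks
        self.next_block = 0
        self.lock = threading.Lock()

    def claim(self):
        with self.lock:
            if self.next_block >= self.nblocks:
                return None  # relation fully scanned
            block = self.next_block
            self.next_block += 1
            return block
```

Because allocation is dynamic, a worker that gets descheduled simply claims fewer blocks; nothing has to rebalance a fixed partitioning afterward.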
So what does a simple plan for a parallel scan look like? At the top, running in the master backend, there is the Funnel node, and beneath it the Partial Sequential Scan that the workers execute. Any filter condition on the scan is pushed down so that the workers apply it while scanning, rather than the master applying it afterward.
Beyond the general parallel infrastructure, where we already share many things, query execution specifically needs some parts of the execution context to be passed to the workers so that they can perform the scan. The important ones are these. First, each worker needs some form of the plan, so the master backend forms a plan statement which is sent to all of the workers. Second, the workers need the bind parameters, and also the PARAM_EXEC parameters that are required for the evaluation of things like subselects, so those also have to be passed to each of the workers. Once this information is available, each worker knows what it has to do to perform its part of the scan, and after performing the scan, each of them sends the tuples it produces back to the master through the tuple queues established between the master and the workers. Finally, we need some statistics information as well: for EXPLAIN ANALYZE statements, each worker's instrumentation information also needs to be shared with the master backend. These are the top-level pieces of information we need to share for the workers to perform the scan correctly.
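The Funnel node's gather side — drain the per-worker tuple queues and return a single stream to the backend — can be sketched as a small polling loop. A real implementation blocks on a latch instead of busy-polling, and the name `funnel` is invented here:

```python
import queue

def funnel(worker_queues):
    # Poll each worker's queue in turn, yielding tuples as they arrive;
    # a worker signals completion by sending None on its queue.
    active = list(worker_queues)
    while active:
        for q in list(active):
            try:
                item = q.get_nowait()
            except queue.Empty:
                continue  # nothing from this worker yet; try the next
            if item is None:
                active.remove(q)  # this worker has finished its scan
            else:
                yield item
```

The loop ends only after every worker has sent its completion marker, which is why worker shutdown has to be coordinated with the master rather than left implicit.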
Another interesting part of this is how the planner decides to generate a parallel plan at all. The first thing is a new GUC for the degree of parallelism — in the current patch it is called something like parallel_seqscan_degree — which indicates how many parallel workers should be used for a particular parallel operation. The planner will generate a plan that uses that many workers.
Of course, the partial sequential scan being performed here is not necessarily at the top of the query tree, and it could have a filter condition applied to it, which will be executed by the workers, not by the master. So if the workers see an expression that they need to evaluate, that expression might refer either to bind parameters or to PARAM_EXEC parameters.
[Robert Haas] The next thing is costing. Starting workers is not free — setting up the dynamic shared memory and launching the workers has a cost — so there is a parallel setup cost charged at the beginning of any parallel operation, and a per-tuple cost for each tuple that has to be passed from a worker back to the master. [Audience: how big is the parallel setup cost relative to other costs?] It's one of the biggest costs; I don't know the exact number off the top of my head, but this is not for scanning a one-megabyte table — if you think one megabyte is a table that's big enough to parallel-sequential-scan, forget it. [Audience: is there a global limit on total resources?] Right now there is no such thing. Optimizing globally — we'd like to run this query with this many workers and this much memory, given this data — is a research topic, and we don't have a solution. Anyone who is expecting this to have perfect performance characteristics in the first version, I have bad news for you: we have a node-level, greedy approach. [Audience: it should probably just be parallel_degree, not parallel_seqscan_degree, since it's really controlling the degree of parallelism of the Funnel node, which is not specific to sequential scan.] Right — we'll rename that after the talk. And yes, we have all the same problems here that we already have with work_mem.
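The shape of the costing argument — a flat setup charge plus a per-tuple transfer charge, set against scan work divided among the workers — can be put into a back-of-the-envelope formula. The parameter names echo the GUCs discussed in the talk, but the formula itself is a simplification invented for illustration, not PostgreSQL's actual cost model:

```python
def parallel_scan_cost(pages, returned_tuples, workers,
                       seq_page_cost=1.0, cpu_tuple_cost=0.01,
                       parallel_setup_cost=1000.0,
                       parallel_tuple_cost=0.1):
    # Scan work is shared by the workers plus the master itself...
    scan = (pages * seq_page_cost +
            returned_tuples * cpu_tuple_cost) / (workers + 1)
    # ...but every returned tuple pays a queue-transfer charge, and the
    # query pays the worker-startup charge exactly once.
    return parallel_setup_cost + scan + returned_tuples * parallel_tuple_cost
```

Plugging in a tiny table shows why a one-megabyte relation should never be scanned in parallel: the setup charge alone dwarfs the serial scan cost. A selective filter helps for the same reason — fewer tuples cross the queue.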
The reason we have a problem with work_mem is that the planner algorithm works by figuring out the cheapest plan for each subtree of the join tree, and it's not clear how to say, well, if we pick this plan here, then some other part of the tree isn't allowed to choose its best plan anymore, because now the overall plan uses too much memory. There could be a general refactoring of the planner that tried to manage total resource usage across the whole plan tree, rather than just considering the resource cost of each portion of the plan tree individually — but one thing we are not doing is adding that project to this project. [Audience: what happens if the workers you planned for aren't available?] At plan time we assume that the number of workers you said we should plan to use will actually be available at execution time. If they're not, the query will still run, just with however many workers we can actually get. So if you leave max_worker_processes, the total system-wide limit, at its default of 8, and then you set the parallel degree to 14, you will be planning on the basis of an assumption that will never, ever work out in practice — but that's your problem, because you set it up that way. [Audience: how long does it take to start the workers?] It only takes one or a couple of milliseconds to spin up the parallel workers, but if you think about what a couple of milliseconds looks like in terms of our cost equations, it's a big number — it could be hundreds of thousands or millions of cost units; you can do a lot of scanning in a couple of milliseconds.
One consequence is that parallel sequential scan figures to work a lot better with a tight filter condition than it does with no filter condition — in fact, until we get some further planner enhancements, with no filter condition it's probably a loser. Yesterday I gave a lightning talk saying that the biggest thing that needed to be worked on to advance foreign data wrappers toward sharding was making the query planner smarter, and the same is going to be true here, which we'll actually get to in a couple of slides. There's a lot of interesting query planner work that can be done here to make this much smarter, and a little bit of executor work that can be done to make this much faster. But yes, there are lots of query planner problems that we are not solving, and in addition to the ones you can think of right now, I have every confidence that there will be additional ones
that none of us can foresee at this point, which will only become apparent once this feature is committed and people really start banging on it. That's unfortunate, because we'd like this to work perfectly in the first version, but it won't. One of the crucial things is that we're going to need feedback from everybody who tries this: in which cases did it work for you, if any, and in which cases did it not work for you? There will definitely be some of those, and then we'll incrementally work on making it better. It will not be perfect out of the gate.

[Audience question, inaudible]

That's how it works now; it could be changed in the future. [inaudible] It might be too expensive to check, for every tuple, whether more workers have become available. I mean, ideally, in the long run, if we've got 50 workers available and 10 queries, we'd like the system to self-balance so that each query gets its fair share of the workers, and then, when one of the queries finishes, we'd like to reallocate those workers on the fly to the other queries. That's not going to be in the first version. It's not going to be in the first version.

[Audience question, inaudible]

Yeah, when we first tried it, we started the workers the first time a tuple was pulled from the node, when it's actually being asked for tuples. We don't have that in the cost budget; I think you could consider it to be part of the startup cost, since either way you're paying it once every time you run the node.

[Audience question, inaudible]

No, I'm not able to guarantee anything there. I mean, if the planner gives you a parallel sequential scan plan, you're going to be running the parallel sequential scan code; you may be running it with no actual
workers, in which case it's just going to be slower than if you were doing the regular sequential scan path. Hopefully not too much slower, but somewhat.

[Audience question, inaudible]

Yeah, we really should add something for that; I mean, you can do that with auto_explain or other tools once we have good auto_explain support.
In the first place, with plain EXPLAIN we can only show the number of workers we planned to use, because we only know how many workers we actually got at the start of execution; [inaudible] you only know what you really got at the end of the execution, so that's something EXPLAIN ANALYZE could report. [inaudible] In some cases we might not be able to start any workers at all, [inaudible] so we have to find something for such cases.

[inaudible] So what actually happens is that each worker is executing the partial sequential scan, [inaudible] and for the tuples that qualify, we do that using the tuple queues: each worker collects its qualifying tuples and sends them back to the leader, [inaudible].

Now, if you're thinking it sounds like a terrible idea to have a parallel sequential scan on the inner side of a nested loop, you're right; the current code would crash if, for some reason, [inaudible].

[Audience question, inaudible]
The question is whether the whole system goes down if a worker is killed. Here's how it is: if any backend in Postgres gets killed at any time by the OOM killer, the whole system resets, and parallel workers don't change anything about that. If somebody gets killed unexpectedly, they could be holding a spinlock at the time they die, and then you're screwed, because the next person who tries to take that spinlock will sit there forever. So if somebody gets killed, you're done. [inaudible] That's about holding a spinlock, which is not the same thing as a critical section; a critical section is something totally different: a critical section is when you really need to promote an ERROR to a PANIC.
[inaudible] So, let me get through a few more slides here, so that we have a chance to finish; there are some interesting things ahead. The next interesting thing is the work allocation strategy: we did some experiments on how to divide the blocks of the table among the workers. One approach is that each worker takes just one block at a time, going back for the next block as soon as it finishes the one it has; the other is that, at the beginning, the master divides the table into fixed chunks, one per worker. And it turned out there is not much difference in performance between the two; we have a lot of graphs of this. But the fixed-chunk model is not as good, because it gives us less dynamism, so the one-block-at-a-time approach seemed better.
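The two allocation strategies just described can be sketched like this; this is illustrative Python, not the PostgreSQL implementation, with threads standing in for background workers and a lock-protected counter standing in for the shared "next block" counter:

```python
# Illustrative sketch of the two block allocation strategies compared in
# the experiments: dynamic (one block at a time from a shared counter)
# versus static (fixed chunks carved up in advance).
import threading

NBLOCKS = 1000

def one_block_at_a_time(nworkers):
    # Dynamic allocation: each worker grabs the next block when ready,
    # so a slow worker naturally ends up scanning fewer blocks.
    counter = {"next": 0}
    lock = threading.Lock()
    assignments = [[] for _ in range(nworkers)]

    def worker(i):
        while True:
            with lock:
                blk = counter["next"]
                if blk >= NBLOCKS:
                    return
                counter["next"] += 1
            assignments[i].append(blk)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(nworkers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return assignments

def fixed_chunks(nworkers):
    # Static allocation: no coordination needed, but no dynamism either.
    chunk = NBLOCKS // nworkers
    return [list(range(i * chunk, (i + 1) * chunk)) for i in range(nworkers)]

# Either way, every block gets scanned exactly once.
dynamic = one_block_at_a_time(4)
assert sorted(b for blocks in dynamic for b in blocks) == list(range(NBLOCKS))
assert sorted(b for blocks in fixed_chunks(4) for b in blocks) == list(range(NBLOCKS))
```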
The last interesting thing is the performance of parallel sequential scan itself, from the tests we have already done. I ran the tests on a table with some 30 million rows. The test query I used is such that it contains some functions in the filter conditions, and each function has some cost assigned as well, [inaudible]. So here are the results of the tests we have run.
There are some results against those expectations. Here you can see that, as the degree of parallelism increases, the time taken keeps going down. [inaudible] This test selects 1 per cent of the rows, [inaudible] and the readings are similar [inaudible]. And what I see is that, after some point, adding more workers doesn't help performance anymore. [inaudible]

[Audience question, inaudible]

OK, that's with the whole table, the 30 million rows, cached; that machine has a huge amount of memory, so there's not a lot of I/O time here.

[Audience question] It seems strange that there's not more of a gap, because you should have a lot more tuples flowing through the queues when 25 per cent of the rows qualify. [Answer, inaudible]

So, in the early versions of this, we just had one new executor node, which was called parallel seq scan.
Eventually we realized that that was kind of a dead end, so we split it into two nodes: a Funnel node and a partial sequential scan node. The idea here is that, if you're running a Funnel node, whatever is under that Funnel node is going to get executed in every worker, a copy of it in each one, and then the Funnel is going to merge those output streams back together. So, having refactored the parallel sequential scan code that way, the next step is to refactor the code so that the Funnel node can run with any other executor tree underneath it. Now, that is obviously going to require more planner work, but I think it's very promising in terms of squeezing more benefit out of this.
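A toy model of the Funnel / partial sequential scan split might look like this; the queue-based gathering loosely mirrors the tuple queues mentioned earlier, but everything else (threads instead of background workers, a fixed division of rows instead of block-at-a-time allocation) is a simplification:

```python
# Toy model of the Funnel / partial sequential scan split: each "worker"
# runs a copy of the subplan on its share of the rows and puts
# qualifying tuples on a queue, and the Funnel merges the streams.
import queue
import threading

def partial_seq_scan(rows, predicate, out):
    # The copy of the plan under the Funnel, as run in one worker.
    for r in rows:
        if predicate(r):
            out.put(r)

def funnel(table, predicate, nworkers):
    out = queue.Queue()
    shares = [table[i::nworkers] for i in range(nworkers)]
    threads = [
        threading.Thread(target=partial_seq_scan, args=(share, predicate, out))
        for share in shares
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Merge the workers' output streams back into a single stream.
    return [out.get() for _ in range(out.qsize())]

rows = list(range(100))
result = funnel(rows, lambda x: x % 10 == 0, nworkers=3)
assert sorted(result) == [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
```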
So here's an example. If you've got this first plan up here at the top, where you have a nested loop with a sequential scan and an index scan, then in the current version of the patch, the best you can do is this plan right here: you replace the sequential scan with a Funnel over a partial sequential scan. But if you think about it, that kind of stinks, because it means that all of the tuples come back to the master, and the master does this iterated index scan for every single one of them, which seems like something it would be great to have a whole bunch of workers doing in parallel; I mean, assuming there isn't too much contention on buffer locks, but that's a separate problem. So here's what
would be a lot better: we push the nested loop down below the Funnel, so that each worker does the partial sequential scan and then does the index scan for the tuples that it gets from its partial sequential scan, and then only after performing that join do you use the Funnel to bring everything back together. There are only a couple of problems here, and one of the sort of nasty problems, from the planner's point of view, is that the Funnel node has a large startup cost, right? That Funnel is expensive, because that's where you're paying the cost of starting up the workers. So you don't necessarily know, until you've generated a lot of parallel paths, which of those are actually going to turn out to be promising. So we're going to have to think a little bit about how the costing part of this needs to work, so that we don't waste time generating a whole bunch of parallel paths which all go away once you add a Funnel on top of each of them, because then they all suck. That's going to require some thought, but I think this kind of transformation, once we have it, and again it's not going to be in the initial version, suddenly turns a feature that may be kind of useful and interesting for certain classes of queries into something where there's a reasonable amount of stuff that can benefit. Now, you could also do this with a hash join rather than a nested loop, but it won't be quite as good, because every backend is going to have to build its own copy of the hash table; fixing that is a whole different project. With a merge join you could consider this strategy, but it's probably hopeless, because it's unlikely to ever be a good idea to re-sort in every backend; although maybe, if the other side of the merge join is an index scan, it could work. [Audience question, inaudible] I think we're going to try pretty hard to make it not do that.
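A sketch of the pushed-down join described above, with a plain dict standing in for the inner index scan; again this is illustrative Python, not executor code:

```python
# Sketch of pushing the nested loop below the Funnel: each worker probes
# the inner side for the tuples from its share of the partial scan, so
# the join itself runs in parallel, and only the joined rows are
# funneled back to the master.
import threading

index = {i: "row-%d" % i for i in range(100)}   # the "inner index"

def worker_join(outer_share, out, lock):
    # The join happens inside the worker, before funneling results back.
    joined = [(o, index[o]) for o in outer_share if o in index]
    with lock:
        out.extend(joined)

outer = list(range(0, 200, 5))                  # the "outer" partial scan
shares = [outer[i::3] for i in range(3)]
out, lock = [], threading.Lock()
threads = [threading.Thread(target=worker_join, args=(s, out, lock)) for s in shares]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert sorted(o for o, _ in out) == list(range(0, 100, 5))
```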
Now, another thing we can do, which I just found out today that Simon is very interested in actually making happen, is speeding up aggregation by doing parallel aggregation. So if you're counting a giant table, or taking the maximum or the average or whatever of a giant table, the way the aggregation system works is that you start stuffing in values, and you build up this thing called the transition state, which contains all the information about what the progress of your aggregation is so far; and then, when you're done, you finalize the transition state, and your answer pops out. That won't quite work here, because you need to build up a set of transition states: we need to build up a transition state in each worker, then somehow ship those states back to the master, combine them, and then finalize that to give the user the answer. But that still feels pretty doable: you need some kind of partial aggregate node that runs under the Funnel in each worker and spits out transition states, and then the Funnel will collect those and combine them. All we need is somebody to write a couple thousand lines of really hard code, so it should be no problem at all.

Also, an Append node has a series of different children that it runs one after another, and a Funnel node has a single child that it runs multiple copies of at the same time. So there are other kinds of parallel nodes that might be useful besides Funnel. You can imagine a parallel Append node that has a list of children and a pool of workers, and it hands out the children to the workers, and every time one of the workers finishes the child that it got handed, you pass that same worker the next child, and you just keep doing this until the pool of workers has exhausted all the children. That would be a great parallel primitive, which would allow us to do all sorts of interesting things, but it doesn't exist yet; it would require building a new thing.
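The transition-state scheme for parallel aggregation described above can be sketched like this, using average as the example; the trans/combine/final naming mirrors the description in the talk, but the code itself is just an illustration, not PostgreSQL's aggregate machinery:

```python
# Parallel aggregation sketch: each worker builds a transition state
# over its rows, the states are combined in the master, and only the
# combined state is finalized to produce the user-visible answer.

def avg_transfn(state, value):
    # Per-row transition function: fold one value into the state.
    count, total = state
    return (count + 1, total + value)

def avg_combinefn(s1, s2):
    # Merge the transition states produced by two workers.
    return (s1[0] + s2[0], s1[1] + s2[1])

def avg_finalfn(state):
    # Finalize: the answer pops out.
    count, total = state
    return total / count

def parallel_avg(worker_chunks):
    # Each chunk stands in for the rows one worker would see: a partial
    # aggregate per worker, then combine and finalize in the master.
    states = []
    for chunk in worker_chunks:
        state = (0, 0)
        for value in chunk:
            state = avg_transfn(state, value)
        states.append(state)
    combined = states[0]
    for s in states[1:]:
        combined = avg_combinefn(combined, s)
    return avg_finalfn(combined)

assert parallel_avg([[1, 2, 3], [4, 5], [6]]) == 3.5
```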
You can also imagine a plain old unmodified Append node, of the kind we have today, with a bunch of children that are all partial sequential scans. So if you have a table with inheritance children, you could have an Append over a whole bunch of partial sequential scans, and as long as those partial sequential scans can all line up correctly between all the workers, so that they know that there are, say, six of them and each one matches up with its correct counterpart, that can behave just like a single partial sequential scan. So to attack that particular problem, you don't actually need any other parallel node besides Funnel. But in the long run, and probably even in the short run, somebody will want both; actually, having both will probably be more of a long-run thing, but we'll see how it goes.

[Audience question, inaudible] Yeah, so the question is about parallelizing sorts. I actually started with the idea of parallelizing sort, because I said, oh, if I do parallel sort, it wouldn't even have to work for general-purpose queries; it could just work for, you know, index builds, because the time it takes to sort the tuples for an index build can be long if the table's really big, and then you wouldn't have to do any of this Funnel planning stuff: you could just have the parallel execution environment, and the number of workers to use could be declared. I still hope to get back to that at some point, but I basically got hung up on the fact that I spent a long time writing an allocator that could be used as infrastructure for the parallel sort, and nobody liked that, and so I decided to opt for the approach of doing something else and coming back to the problem at a later time. In the end, I'm pretty happy with this, because I wasn't very happy with how useful this was going to be when it was just a parallel seq scan node; but now that we've got the partial seq scan node separated from the Funnel, I actually think there's a lot more potential for this to be used for interesting things.
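The parallel Append primitive described a moment ago, a pool of workers drawing children from a shared list, can be sketched like this; it's a toy model with threads standing in for background workers and each callable standing in for one child plan:

```python
# Toy model of the parallel Append primitive: whenever a worker finishes
# one child, it is immediately handed the next unstarted child, until
# the pool of workers has exhausted all the children.
import queue
import threading

def parallel_append(children, nworkers):
    todo = queue.Queue()
    for child in children:
        todo.put(child)
    results = queue.Queue()

    def worker():
        while True:
            try:
                child = todo.get_nowait()   # take the next unstarted child
            except queue.Empty:
                return                      # all children have been handed out
            results.put(child())            # run this child to completion

    threads = [threading.Thread(target=worker) for _ in range(nworkers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [results.get() for _ in range(results.qsize())]

# Four "children", e.g. scans of four partitions, run by two workers.
children = [lambda n=n: list(range(n * 10, n * 10 + 10)) for n in range(4)]
out = parallel_append(children, nworkers=2)
assert sorted(sum(out, [])) == list(range(40))
```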
I still do think things like parallel sort and parallel vacuum are going to be things that people really, really want, because even if your workload is mostly OLTP, you will sometimes have these bulk operations where you need to do a lot of work all at once: you need to vacuum an enormous table, or you normally read one row at a time from an enormous table, but every once in a while you build an index on it, or vacuum it, or something else. So for those kinds of things, parallelism is going to be really useful. I just sort of made a tactical decision that this felt like a more reachable first goal than sort; that may have been a bad decision, and it's possible I would be further along if I'd done something else, but you can only live your life once. [inaudible] It was sort of a strategic decision, again. OK, we're about out of time,
so feel free to grab me with questions afterward.