We're sorry but this page doesn't work properly without JavaScript enabled. Please enable it to continue.
Feedback

Postgresql: Lists and Recursion and Trees (oh my)

00:00

Formal Metadata

Title
Postgresql: Lists and Recursion and Trees (oh my)
Title of Series
Number of Parts
97
Author
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
PostgreSQL 8.4 has radical new capabilities inside the database: Windowing functions and Common Table Expressions. PostgreSQL 8.4 has radical new capabilities inside the database: Windowing functions and Common Table Expressions. You'll learn about each with practical examples to make your querying days more fun. Time permitting, we'll do some that are less practical.
Electronic mailing listRecursionExpert systemDatabaseSelf-organizationSystem callFreewareComputer animationXMLLecture/Conference
Physical systemFunction (mathematics)Process (computing)Electronic mailing listTraffic reportingDatabaseData structureRecursionNetwork topologyBitLecture/Conference
Electric currentFunction (mathematics)Window functionLattice (order)Order (biology)Local GroupWindow functionCountingContext awarenessComputer fileQuicksortResultantNumberStability theoryLevel (video gaming)Function (mathematics)Query languageSound effectDeterminantRow (database)SubsetRankingEntire functionSummierbarkeitOrder (biology)Traffic reportingMereologyElectronic mailing listPhysical systemGroup actionMultiplication signTable (information)Lattice (order)Right angleInheritance (object-oriented programming)Computer forensicsHacker (term)Pointer (computer programming)CASE <Informatik>Standard deviationTriangleLoop (music)MappingClassical physicsAdditionElectronic visual displayQuadratic equationCartesian closed categoryPopulation densityLecture/Conference
RankingFunction (mathematics)Order (biology)Shift operatorExecution unitResultantWindow functionTraffic reportingRow (database)QuicksortOrder (biology)SummierbarkeitRankingSet (mathematics)Function (mathematics)1 (number)Partition (number theory)Frame problemContext awarenessStatisticsComplete metric spaceRevision controlDatabase normalizationGreatest elementNumberProgram slicingPopulation densityRecursionWordRight angleTable (information)Scaling (geometry)TesselationVector spaceDistribution (mathematics)Standard deviationSlide ruleGradientLecture/Conference
RecursionPoint (geometry)Local GroupOrder (biology)Maxima and minimaElectronic visual displayContext awarenessSet (mathematics)View (database)Computer configurationBlock (periodic table)Query languageReal numberDiagramQuicksortProjective planeResultantIterationNumberDimensional analysisTable (information)Point (geometry)RecursionDampingRecurrence relationQuadratic formAffine spaceOrder (biology)Hidden Markov modelElectronic visual displayCartesian coordinate systemMaxima and minimaGoodness of fitGroup actionUniqueness quantificationInitial value problemBuffer overflowRow (database)CASE <Informatik>MathematicsRadical (chemistry)Cartesian productSubject indexingSpacetimeDifferent (Kate Ryan album)Right angleCondition numberStack (abstract data type)MappingNegative numberString (computer science)Pointer (computer programming)Repeating decimalPlanningLattice (order)IntegerLecture/Conference
Table (information)IntegerKey (cryptography)Travelling salesman problemSymmetric matrixComputer programSoftware development kitMathematicsNeuroinformatikQuicksortTable (information)Multiplication signDirection (geometry)DistanceCheat <Computerspiel>Arithmetic meanExpressionData structureTravelling salesman problemNP-hardComputer programmingUniqueness quantificationMereologyRight anglePunktgruppeSemiconductor memoryRadical (chemistry)BitForcing (mathematics)View (database)Revision controlRecursionGrass (card game)Standard deviationQuantum stateResultantGraph (mathematics)Row (database)Set (mathematics)Array data structureLecture/Conference
RecursionTravelling salesman problemComputer programRadical (chemistry)Electronic visual displayDigital filterProof theoryTravelling salesman problemException handlingNumberDimensional analysisTotal S.A.Cycle (graph theory)ProgrammschleifeMereologyElement (mathematics)Gaussian eliminationDistanceEndliche ModelltheorieSpecial unitary groupForcing (mathematics)Query languageFactory (trading post)Disk read-and-write head1 (number)Point (geometry)Limit (category theory)Process (computing)ImplementationMultiplication signQuicksortOrder (biology)2 (number)Internet forumRight angleLecture/Conference
XML
Transcript: English(auto-generated)
Check one two. All righty. Shall we get started? Yes. No, maybe
Maybe okay, let's get started My name is David Federer I Work for an outfit called Postgres QL experts that concludes my marketing spiel for today
I would like to thank very much the FOSDEM organization for inviting me over here and for Very kindly paying for my air ticket. That was that was really above and beyond the call And I'm going to talk about some things which until now
Weren't really easy to do in SQL databases or at least not in the free SQL Databases that anybody had any experience with okay, so that's a lot of qualifiers is anybody using Firebird
Okay, well so there there there there is one free database that actually has this stuff, but I'm going to talk about it in Postgres so one of the things that you find yourself needing to produce is
These yes Somebody might notice that You don't normally have to produce any of these in production, but it's nice to know that you can
And yeah, it's a little bright in here, but that's a yeah Okay No, not so much I don't want to put anybody to sleep Anyhow, so the things that make these that make it possible to do both of these things with your SQL engine are
Windowing functions and Yeah, I don't see a light dimmer here
Yeah, the the the seat ejector is the one I'm scared to touch it's a I Don't know which seat it's gonna eject and whether it's gonna open the roof first or you know Yeah
No, you know if I touch one of these things I'm going to destroy it I better not Okay, so anyway use your imagination a little bit that there is a Mandelbrot set So windowing functions are things that let you deal with
Lists such as the kind you would of lists you would use when you're generating one of those TPS reports Recursion is something you would use to deal with tree like structures Has anybody ever tried to deal with the tree like structure in a database system and enjoyed it?
Okay. Well, we're we're gonna bring the joy Today is today we bring the joy to that to that process So what the what windowing functions let you do that that's the list part They let you see outside of the current row in your result set
which Lets you make those TPS reports better. So for example, you could have a running sum and A running sum is a kind of handy thing to have in a reporting system and in you know maybe a display on a on a
On the a point-of-sale system, let's say So Windowing functions operate on a window and a window is in general a Subset of a query it may not or of a result set it may not be a proper subset
So it may contain the entire result set, but it's some chunk of a result set which you've carved off It returns a value for each row in the window And then it calculates the value from the rows inside the window
Windowing functions let you use the new windowing functions, which all some of which I'll show you as we go along and Existing aggregate functions like sum and count and yes. Oh, okay
By the way interrupt anytime and and if I'm If I'm not paying attention just like scream or throw things that usually gets my
Okay You can also make your own user-defined windowing functions, although those you need to write in C right now and We don't really make any guarantees about the stability of the API So you're kind of you're kind of off the map there if you go to write your own windowing functions, but you can do it
And Of course you can use any user-defined aggregate functions, which has anybody made a user-defined Aggregate function for both. Oh great That's that's more than usually say so What is yours do doesn't remember
the So your aggregate function would ball up a set into or a bag into a set
It would unique a file list. Oh, that's neat. So anyway, you could use it in the windowing context Right away just because of how windowing functions work So basically There we go Here's how aggregates work. They sort of ball They crunch together things from a larger set down to a smaller one
With windowing functions You get sort of more of this effect you you get a kind of cross mapping thing where you Where you get an effect more like this, so
one of the first things you can do or the easiest things to demonstrate with windowing functions why they're helpful is Numbering rows and output now, I know Oracle has this thing called row number and that's kind of handy
It's I don't think it's in It's done in the standard way, but you know Oracle has has got its own standard but Has anybody tried to do this without with inside Postgres before? Get row numbers. Yeah, not not fun. Is it no
So we're gonna add some fun, but here's here's how we used to do row numbers So You have the classical employee table and You join it to the No, this is emp salary So basically what you want to do is is take the employee number department name and salary
count stars row number from Emp salary E1 join emp salary e2. So you're gonna cross or you're gonna join it to itself And then you're gonna it's sort of a cross join so the the So you get sort of the upper part of the triangle of the cross join
and then you group by this and that the other thing and So is this is this query gonna run really fast? anybody Fast yeah, it's quadratic. It's good. It's gonna get a nested loop join, right?
It's it's gonna be it's gonna be super unfun for for for performance and We got lucky right we noticed it's obviously wrong
If you're doing hacks like this In order to get windowing kind of behavior it may not be as obviously wrong as this is and When it is unobviously wrong the people who remind you that it was wrong could be
forensic accountants for example And By the time of forensic accountant is is reminding you of anything You can be in real trouble So you don't want you don't want it to you don't want to go there
This is why the SQL standard has these Windowing functions which allow you to do this Instead of join instead of instead of cross joining the table to itself You just select from the table and then you have this construct here where you have a windowing function in this case
it's called row number and Then it says and then over and here's where you you can define a window. You can also define it elsewhere We'll go over that later You say over order by salary descending nulls last
Okay, so you have an ordering in there and that's that that's how the row numbers are going to come out So if if two employees have the same salary What's going to happen here anybody
First okay. Yeah, the the answer was first one gets first, but since we haven't defined an ordering here It's unspecified and you will not be able to reproduce the result Or you won't be able to guarantee that you can reproduce the result
So when you're defining your windows You need to think carefully about the ordering that's in there. And is it sufficient to make it a deterministic? Output or You know, you could decide that you don't care about determinism at some level of the output But you have to at least sort of think that through
so that's what our that's what our query looks like and Of course our our row numbers have come out, right? In addition to row number which unconditionally increments as rows come out
You can also have rank which Which looks for ties and then and then orders by those so if if your numbers or let's say you have
A C C C F. So what you get there is a rank like one Four four four five. That's what rank would look like Corresponding to a C C C F
And dense rank you'd see one two two, two three In other words dense rank sort of collapses the ranks that you'd get questions so far comments Anybody still awake? Okay. Yes
No, you'd get one two two two three with a dense rank I Think oh, okay my mistake. Thank you
It actually piles the the ranks towards the top instead of the bottom Thanks for thanks for noticing that so here are the Built-in windowing functions that we that got added in eight dot four
Row number rank that's right there. Well, you can read all those here if you really want to I'll demonstrate some of these and and how they work So we've seen row number just bumps it unconditionally over the window This can be handy for you know numbering results
Rank we've seen So there's a gap So I guess they're yeah maybe I was mistaken about being mistaken here because Well because we have you have to hear yeah, I guess we do so we're
Yeah, there's there's there can be gaps in between them and I guess it piles it towards the top instead of the bottom Dense rank just Closes all those gaps in your ranking
Percent rank Scales the you know normalizes it not in the sense of database normalization But like in the same sense of a vector, so it it squeezes it all into the interval from zero to one That's percent rank
Which is the regular rank Normalized to that interval You can also look at cumulative distribution, which is any statistics geeks in here Okay, well good. Um, well then you know more about this than I do. I hope it's useful
Entile so, you know quintiles or quartiles or Or three aisles I forget That sort of divides things into buckets and says which bucket it's in
Let's see lag this one's kind of handy It basically returns the value of the row above or you can Parametrize it to say lag by how much or how many rows? So the first one doesn't really have a row above so the its lag is
No Similarly lead takes the row below and of course the last one in the window Doesn't really have a row below so we have to call that null to
First value takes the yes First value takes the first one in the frame or in the yeah in the frame Similarly last value You have to add some extra
Frippery here, which will be on the slides, which I will Publish as soon as we're done here But basically it takes the last value in the frame nth value I guess the
This was the SQL standards committee's version of humor Because I have not seen anywhere yet where You'd actually need the nth value. Has anybody else seen one? No
Okay, well for completeness I presented here, but it just Yeah, okay, so So windowing can affect aggregates if you make if you want it to One way you can so here's here's where it's not affecting our aggregates we're selecting value and the sum
Over the empty window, which basically means we're not specifying anything about order or partitions or anything like that We're just saying some and so it just acts like the the sum as it usually did
Except that it's Except that instead of bunching it all together. It's repeating the sum at each row There is that fairly? Okay But the sort of more interesting thing that you can do with
With this running with this windowing function over aggregates is that you can change aggregate behavior so What we've said is value sum of value over Order by value descending from table, so What's happening at these first two rows? Is that since you have not distinguished one from the other?
They can't really be separated one from the other So in the first two rows your sum is actually the sum of the first two rows because they tie that fairly Okay, so again you again it emphasizes that you have to be
You have to at least think about what the ordering is going to do to your result set And how how much you specify? So if I'd said order by value and maybe something else which distinguished the rows one from the other Then you would see five ten
Etc. Okay, so yes The other ones which would only work on a partition The question was the other ones which only work on a partition lag and lead
Why is that well You can partition that you can partition your result set into different slices and when you say You don't actually need
You Don't actually need to say you don't need to have a partition by clause in your windowing function But if you do have it lag and lead will only refer to the stuff inside the partition and not the whole window Right. So if you've said partition by ID or partition by last name
It's going to to do the lag and lead only in the context of the partition that you have sliced off Does that answer your question? Right, so if so if there's no partition is there's no partition by specified then the frame is the window and
You're done Thanks Okay, so that's that's enough about the TPS reports. I don't want to put you all to sleep too too much Well You know, it was a large night for some of you last night I imagine
So I'd like to talk about something else. That's that's really new and different in SQL and that's recursion Everybody read that Is that legible back there?
Okay, so we have a Google result set and it I'm looking for recursion and it says did you mean recursion? Yeah, okay So recursion is fun and and and you know, it can actually come up in real context I'm going to show you how we generated that little
Diagram earlier So we introduced a few little chunks of syntax and Here's the first one so we say with recursive X of I okay. So what we're doing here is we're saying
Here we are going to make ourselves a one or more Temporary views or subroutines whichever way you want to look at it Which are good for the context of this query and then we'll vanish
So we say with and then recursive is an optional keyword and that talks about whether the views or or Subroutines can refer to themselves And then we define the name of the the view and the names of its columns
Optionally, so you don't have to name the columns you do have to name the view So You've named the view and then we're going to see what we're going to Define what happens inside the view. So the first thing we do is
Get us a zero value up here. Sorry. Has anybody got a pointer? So that's the first step in the recursion is that when you recurs you have an initial condition And then you have some sort of a recursion recurrence relation
And then if you're smart, you'll also have a termination condition Just in case Yeah, so you say in the recursive view you say initial condition union or union all
What would be the difference between union and union all generally? Anybody? Yeah, the answer was unique strips strips out duplicates and that's that's what's happening so union all would just pile everything together
ignoring duplicates Union would strip them out and will cause a performance You know change because it has to then do that stripping
So union all is usually Faster and if you know as in this case We do that the results are already unique and that's how we wanted them. Anyway, we can just say you know so you start with this
temporary view which just has One row in it and that row is a zero And then we say select i plus one from X that's our It's recurrence relation Where I less than a hundred and one
And that means if we we keep going Except that we have to check that I is less than a hundred and one What happens if I leave off this where clause? yes, it continues counting and well
It continues counting until you overflow integer, I think Yeah Yeah, so be careful when you're doing this so what I've done here is I've generated some number of points in the x-axis, I suppose How many of them?
It's it's it's zero to 101 because it can actually get to 101 before it stops Right, so it's a hundred and two points
It's really easy to get off by one, isn't it? I mean, it's not just see where you can do that It's it's all kinds of places. Okay, so we have this set of points Now we're going to Make some more points which are looking more Cartesian as we go along
So we select I X I Y X float Y float from Let's see Do a little affine transform here on both of the result sets and cross-join them with themselves
And has anybody done here here done a cross-join on purpose before Good for you How about by accident Yeah, it's really easy to do a cross-join by accident you just say from table one comma table two
And you forget to put in a where clause and then You get a Cartesian product Okay, so we've we've picked from this sort of affine transform of a hundred and or ten thousand 400 and
No, anyway, some was around ten thousand points And then we're going to do this recurrence relation. So we have our Our little union all here that signals that we're we're in a recursion And then we're going to sort of you know, do this little quadratic mapping thing and and bound it and
27 hmm wonder what that could be. Well, we'll find out I guess So we're So that's our next subroutine is we've we've taken the set of a hundred points and then we've somehow
Transformed it into this Block of 101 by 101 and then we've iterated over the block for some number of iterations so once we've done that we have this giant result set that we need to slim down because You know, we don't want to be in a hundred and twenty seven dimensions. We want to be in two
and the way we do that is we Project this into the Cartesian plane by just choosing I X I Y and the max I For so we just sort of collapse those results into the biggest one we found
From Z group by I X I Y order by I X I Y Okay, so we've now got a set of 101 by 101 points and we want to be able to Display them. Well in the grand tradition of
Asciimatic We're going to display them by indexing into a into a string, right so we're gonna You know things that are things that have one are gonna map to a space and to map to a dot
all the way out to 27 which maps out to a space or 26, I think Okay, and we say greatest of I comma one which is to say that if we happen to have Yeah, if we happen to have an I that was
Zero or negative. We just call that I a one Just in case because we already choked off I to be less than 27 Right, we just want to make sure it fits in the array bounds because otherwise we could have this we could have some sort of a stack overflow kind of attack and
Who knows what could happen? Right, so we mapped it we've we've grouped it into an aggregation and then collapse the aggregates into a string and voila Okay So that was actually a fairly easy problem as math goes
How about let's try a somewhat harder problem Like maybe yes Yes
Yes, so the Right, so the question was does the
does the Row set actually materialize into memory And I believe the answer is yes Well, you know the nothing comes for free and and
It's unfortunate but there it is. Okay. So back to our slightly harder problem. We're gonna try something that's NP hard Right. So we have a traveling salesman problem This here is a reminder To make sure that before you start embarking on an NP hard problem that you actually have to solve
that problem Everybody read this But assuming that you do have to solve the problem You have to set up data structures and you have to start, you know doing your computation on those data structures
So that's what we'll do I have a very simplified schema It'll have a table it's called pairs We got a from city a to city a distance in between
We're gonna say that from city and to city defines uniqueness on this table and then we're gonna check from city less than to city Why do we do that, I'm sorry, well one answer was from the city to itself, but that's yeah, that's part of it
You just want what you Several people said you just want to go either here to there or there to here, but you don't want to have different What could be different distances?
So we're assuming that the distance between say Bari and Bologna is the same as the distance between Bologna and body and so we just limit it to only only representing one of those Okay, so we have our table
Let's insert a little data in there a bunch of I forget where I got this but Sorry, yeah, well maybe Okay, so now that we have our data structure and then it's populated we need to do some computation on it
and by the way, the computation we're doing is sort of a cheating version of brute force and I'm cheating because I want to make sure that well that that the
Program terminates in a reasonable time and I'll show you how I cheated in a little bit okay, so Having made sure that we can't actually have The paths in both directions stored in our table. We now proceed to double the size of the table
So that we get that all the paths in all directions That yeah, okay So that's what we're gonna do here is we say from city to city and distance we said with recursive And there's a union all down there, but it's not actually a recursion
We just said with recursive means that we can use recursion somewhere in the common table expressions or temporary views or subroutines That we are about to create So
We select the distances in one direction from city to city distance and then we Hello select them in exactly the opposite direction to city from city distance, so we just doubled the size of the table We want all of those We want all of those edges for our graph
Right the the one way and the other way Okay, so we've got that data set and then we Try a little Initializing a path so
We're starting out from Rome and That's going to be our our from city We say to city and that's all the possible places you can get directly to from Rome The distance in between and then we're gonna just put Rome in its own
Postgres array, which I think is fairly standard I'm not sure I haven't checked with a standard on how arrays are handled, but it's a it's a very handy thing to have around And we pick that from the the both ways Result set which we've just created
By the way any Anybody want to guess why we had to pick this as a starting city? Thank you little geek humor. Sorry about that Okay, so now that we have a
Now that we have a the the paths the the the Paths that go from Rome to suck to one other city. We need to keep going until we get back so we've we've got those going outward and then we want to
Choose more so we starting from the the endpoints of our previous Our Previous Hello I'll just drop right here. I hope everybody can read it. So we have
Rome and we just did this To each of the cities which are directly accessible by one hop from Rome So then we're going to start from each of these and go on to other cities and so forth and so on
Everybody is that fairly clear on that more explanation less Faster slower Okay So we joined back to that previous result set that we just created
and We join on head to tail sort of For the For the edges right the the tail of the next edge is the head of the last edge
Then we make sure that The from city is not in the path Except for the first element, right? So we want the Well, okay. So why would we do that?
Sorry We do want to get back but only once So we don't want to have extra loops in there our Salesman is supposed to travel in as short a Time as possible and looping infinitely
It's probably not it So so we only want the one cycle and that's why we have left off the the first element in the array Which is where we're trying to get back to I also
Remember the part where I said I'd cheated about this. Can everybody read this? Okay Says array upper of P dot path comma one. That means the first dimension in the array its Upper bound is six. So basically
We're saying we got to get back there in six hops. And if we don't get back there in six hops It's probably not our shortest one anyway Why would we say only some? Specified number of hops Anybody nope
The question the answer was total number of cities Actually, what it is is we're trying to prevent the Sun from burning out while this query runs And Well, if it's less than the total number of cities we would find that we would hope to find a solution in
Some smaller thing then start at Rome go to every other city in Italy and come back to Rome Which Is the which is what you do if you're that would be the longest path basically
You actually want the shortest path that hits the cities described which I had to describe below because of some limitations in our Implementation so we hit we said start at Rome hit these three cities and come back That's that's what the traveling salesman problem is all about
So I'm living I'm living I'm saying that I have to hit three cities and I hope that In the process of hitting those three cities. I only hit at most six Okay, so This is just to prevent the Sun from burning out and me from getting bored and you from getting bored
While we're doing this exhaustive search, so it's not actually O of n factorial which is pretty good because there was like 60 cities in there and 60 factorials kind of big and long and boring
Okay, so now that we've got these paths We'd like to Find the final one where so we've got all the paths that have Rome is a starting point and are less than
six Long And don't have Rome in them so we have to finish this off and say well of those some of the paths will be one hop away from Rome again, and Those are the ones we want So we're picking
All the paths with Milan Florence and Naples in the path, so those are the cities we wanted to visit I
Still haven't gotten to Milan and Naples. I hope to someday Anyway, so we want to order by distance because the idea of the traveling salesman problem is that you is That you want the minimal distance. So we're saying distance ascending and then Order by path also, why would we why would we do that?
One answer was get the shortest pathway It's actually we already took care of that by distance any other Guess is why we would order by path along with distance
The answer was eliminate duplicates and that's right because if you go If you go one way through a path and then go back the other way through the same path, they will have the same distance according to our model so that's that that's one of them and Another one is if if it happens that to two completely distinct paths or at least two
Partially distinct paths have the same distance. We just want the same answer coming out Each time we run the query because if we have a tie from two different paths We just want we just want to trim it down to one
Okay, so that's our That's Our traveling salesman problem solved in 12 seconds, which isn't too bad for brute force
Well, I was going to go through Some things about posting on a forum, but apparently I have taken too much time. So right now I would like to
I'll post this later and I would like to oh, yes, by the way SQL is Turing complete. So just be really careful about what you let people execute SQL on
and I would like to Open up the floor for brief questions comments and of course the straight jacket. I've just turned questions comments All righty
Thank you so much