Postgresql: Lists and Recursion and Trees (oh my)
This is a modal window.
The media could not be loaded, either because the server or network failed or because the format is not supported.
Formal Metadata
Title |
| |
Title of Series | ||
Number of Parts | 97 | |
Author | ||
License | CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor. | |
Identifiers | 10.5446/45742 (DOI) | |
Publisher | ||
Release Date | ||
Language |
Content Metadata
Subject Area | ||
Genre | ||
Abstract |
|
00:00
Electronic mailing listRecursionExpert systemDatabaseSelf-organizationSystem callFreewareComputer animationXMLLecture/Conference
01:54
Physical systemFunction (mathematics)Process (computing)Electronic mailing listTraffic reportingDatabaseData structureRecursionNetwork topologyBitLecture/Conference
03:51
Electric currentFunction (mathematics)Window functionLattice (order)Order (biology)Local GroupWindow functionCountingContext awarenessComputer fileQuicksortResultantNumberStability theoryLevel (video gaming)Function (mathematics)Query languageSound effectDeterminantRow (database)SubsetRankingEntire functionSummierbarkeitOrder (biology)Traffic reportingMereologyElectronic mailing listPhysical systemGroup actionMultiplication signTable (information)Lattice (order)Right angleInheritance (object-oriented programming)Computer forensicsHacker (term)Pointer (computer programming)CASE <Informatik>Standard deviationTriangleLoop (music)MappingClassical physicsAdditionElectronic visual displayQuadratic equationCartesian closed categoryPopulation densityLecture/Conference
12:38
RankingFunction (mathematics)Order (biology)Shift operatorExecution unitResultantWindow functionTraffic reportingRow (database)QuicksortOrder (biology)SummierbarkeitRankingSet (mathematics)Function (mathematics)1 (number)Partition (number theory)Frame problemContext awarenessStatisticsComplete metric spaceRevision controlDatabase normalizationGreatest elementNumberProgram slicingPopulation densityRecursionWordRight angleTable (information)Scaling (geometry)TesselationVector spaceDistribution (mathematics)Standard deviationSlide ruleGradientLecture/Conference
21:26
RecursionPoint (geometry)Local GroupOrder (biology)Maxima and minimaElectronic visual displayContext awarenessSet (mathematics)View (database)Computer configurationBlock (periodic table)Query languageReal numberDiagramQuicksortProjective planeResultantIterationNumberDimensional analysisTable (information)Point (geometry)RecursionDampingRecurrence relationQuadratic formAffine spaceOrder (biology)Hidden Markov modelElectronic visual displayCartesian coordinate systemMaxima and minimaGoodness of fitGroup actionUniqueness quantificationInitial value problemBuffer overflowRow (database)CASE <Informatik>MathematicsRadical (chemistry)Cartesian productSubject indexingSpacetimeDifferent (Kate Ryan album)Right angleCondition numberStack (abstract data type)MappingNegative numberString (computer science)Pointer (computer programming)Repeating decimalPlanningLattice (order)IntegerLecture/Conference
30:13
Table (information)IntegerKey (cryptography)Travelling salesman problemSymmetric matrixComputer programSoftware development kitMathematicsNeuroinformatikQuicksortTable (information)Multiplication signDirection (geometry)DistanceCheat <Computerspiel>Arithmetic meanExpressionData structureTravelling salesman problemNP-hardComputer programmingUniqueness quantificationMereologyRight anglePunktgruppeSemiconductor memoryRadical (chemistry)BitForcing (mathematics)View (database)Revision controlRecursionGrass (card game)Standard deviationQuantum stateResultantGraph (mathematics)Row (database)Set (mathematics)Array data structureLecture/Conference
38:05
RecursionTravelling salesman problemComputer programRadical (chemistry)Electronic visual displayDigital filterProof theoryTravelling salesman problemException handlingNumberDimensional analysisTotal S.A.Cycle (graph theory)ProgrammschleifeMereologyElement (mathematics)Gaussian eliminationDistanceEndliche ModelltheorieSpecial unitary groupForcing (mathematics)Query languageFactory (trading post)Disk read-and-write head1 (number)Point (geometry)Limit (category theory)Process (computing)ImplementationMultiplication signQuicksortOrder (biology)2 (number)Internet forumRight angleLecture/Conference
45:38
XML
Transcript: English(auto-generated)
00:11
Check one two. All righty. Shall we get started? Yes. No, maybe
00:29
Maybe okay, let's get started My name is David Federer I Work for an outfit called Postgres QL experts that concludes my marketing spiel for today
00:43
I would like to thank very much the FOSDEM organization for inviting me over here and for Very kindly paying for my air ticket. That was that was really above and beyond the call And I'm going to talk about some things which until now
01:07
Weren't really easy to do in SQL databases or at least not in the free SQL Databases that anybody had any experience with okay, so that's a lot of qualifiers is anybody using Firebird
01:23
Okay, well so there there there there is one free database that actually has this stuff, but I'm going to talk about it in Postgres so one of the things that you find yourself needing to produce is
01:44
These yes Somebody might notice that You don't normally have to produce any of these in production, but it's nice to know that you can
02:03
And yeah, it's a little bright in here, but that's a yeah Okay No, not so much I don't want to put anybody to sleep Anyhow, so the things that make these that make it possible to do both of these things with your SQL engine are
02:29
Windowing functions and Yeah, I don't see a light dimmer here
02:46
Yeah, the the the seat ejector is the one I'm scared to touch it's a I Don't know which seat it's gonna eject and whether it's gonna open the roof first or you know Yeah
03:03
No, you know if I touch one of these things I'm going to destroy it I better not Okay, so anyway use your imagination a little bit that there is a Mandelbrot set So windowing functions are things that let you deal with
03:22
Lists such as the kind you would of lists you would use when you're generating one of those TPS reports Recursion is something you would use to deal with tree like structures Has anybody ever tried to deal with the tree like structure in a database system and enjoyed it?
03:41
Okay. Well, we're we're gonna bring the joy Today is today we bring the joy to that to that process So what the what windowing functions let you do that that's the list part They let you see outside of the current row in your result set
04:03
which Lets you make those TPS reports better. So for example, you could have a running sum and A running sum is a kind of handy thing to have in a reporting system and in you know maybe a display on a on a
04:23
On the a point-of-sale system, let's say So Windowing functions operate on a window and a window is in general a Subset of a query it may not or of a result set it may not be a proper subset
04:45
So it may contain the entire result set, but it's some chunk of a result set which you've carved off It returns a value for each row in the window And then it calculates the value from the rows inside the window
05:03
Windowing functions let you use the new windowing functions, which all some of which I'll show you as we go along and Existing aggregate functions like sum and count and yes. Oh, okay
05:26
By the way interrupt anytime and and if I'm If I'm not paying attention just like scream or throw things that usually gets my
05:41
Okay You can also make your own user-defined windowing functions, although those you need to write in C right now and We don't really make any guarantees about the stability of the API So you're kind of you're kind of off the map there if you go to write your own windowing functions, but you can do it
06:04
And Of course you can use any user-defined aggregate functions, which has anybody made a user-defined Aggregate function for both. Oh great That's that's more than usually say so What is yours do doesn't remember
06:25
the So your aggregate function would ball up a set into or a bag into a set
06:41
It would unique a file list. Oh, that's neat. So anyway, you could use it in the windowing context Right away just because of how windowing functions work So basically There we go Here's how aggregates work. They sort of ball They crunch together things from a larger set down to a smaller one
07:08
With windowing functions You get sort of more of this effect you you get a kind of cross mapping thing where you Where you get an effect more like this, so
07:24
one of the first things you can do or the easiest things to demonstrate with windowing functions why they're helpful is Numbering rows and output now, I know Oracle has this thing called row number and that's kind of handy
07:40
It's I don't think it's in It's done in the standard way, but you know Oracle has has got its own standard but Has anybody tried to do this without with inside Postgres before? Get row numbers. Yeah, not not fun. Is it no
08:00
So we're gonna add some fun, but here's here's how we used to do row numbers So You have the classical employee table and You join it to the No, this is emp salary So basically what you want to do is is take the employee number department name and salary
08:22
count stars row number from Emp salary E1 join emp salary e2. So you're gonna cross or you're gonna join it to itself And then you're gonna it's sort of a cross join so the the So you get sort of the upper part of the triangle of the cross join
08:46
and then you group by this and that the other thing and So is this is this query gonna run really fast? anybody Fast yeah, it's quadratic. It's good. It's gonna get a nested loop join, right?
09:06
It's it's gonna be it's gonna be super unfun for for for performance and We got lucky right we noticed it's obviously wrong
09:22
If you're doing hacks like this In order to get windowing kind of behavior it may not be as obviously wrong as this is and When it is unobviously wrong the people who remind you that it was wrong could be
09:42
forensic accountants for example And By the time of forensic accountant is is reminding you of anything You can be in real trouble So you don't want you don't want it to you don't want to go there
10:02
This is why the SQL standard has these Windowing functions which allow you to do this Instead of join instead of instead of cross joining the table to itself You just select from the table and then you have this construct here where you have a windowing function in this case
10:23
it's called row number and Then it says and then over and here's where you you can define a window. You can also define it elsewhere We'll go over that later You say over order by salary descending nulls last
10:41
Okay, so you have an ordering in there and that's that that's how the row numbers are going to come out So if if two employees have the same salary What's going to happen here anybody
11:04
First okay. Yeah, the the answer was first one gets first, but since we haven't defined an ordering here It's unspecified and you will not be able to reproduce the result Or you won't be able to guarantee that you can reproduce the result
11:21
So when you're defining your windows You need to think carefully about the ordering that's in there. And is it sufficient to make it a deterministic? Output or You know, you could decide that you don't care about determinism at some level of the output But you have to at least sort of think that through
11:45
so that's what our that's what our query looks like and Of course our our row numbers have come out, right? In addition to row number which unconditionally increments as rows come out
12:03
You can also have rank which Which looks for ties and then and then orders by those so if if your numbers or let's say you have
12:22
A C C C F. So what you get there is a rank like one Four four four five. That's what rank would look like Corresponding to a C C C F
12:41
And dense rank you'd see one two two, two three In other words dense rank sort of collapses the ranks that you'd get questions so far comments Anybody still awake? Okay. Yes
13:06
No, you'd get one two two two three with a dense rank I Think oh, okay my mistake. Thank you
13:22
It actually piles the the ranks towards the top instead of the bottom Thanks for thanks for noticing that so here are the Built-in windowing functions that we that got added in eight dot four
13:42
Row number rank that's right there. Well, you can read all those here if you really want to I'll demonstrate some of these and and how they work So we've seen row number just bumps it unconditionally over the window This can be handy for you know numbering results
14:02
Rank we've seen So there's a gap So I guess they're yeah maybe I was mistaken about being mistaken here because Well because we have you have to hear yeah, I guess we do so we're
14:22
Yeah, there's there's there can be gaps in between them and I guess it piles it towards the top instead of the bottom Dense rank just Closes all those gaps in your ranking
14:42
Percent rank Scales the you know normalizes it not in the sense of database normalization But like in the same sense of a vector, so it it squeezes it all into the interval from zero to one That's percent rank
15:01
Which is the regular rank Normalized to that interval You can also look at cumulative distribution, which is any statistics geeks in here Okay, well good. Um, well then you know more about this than I do. I hope it's useful
15:25
Entile so, you know quintiles or quartiles or Or three aisles I forget That sort of divides things into buckets and says which bucket it's in
15:42
Let's see lag this one's kind of handy It basically returns the value of the row above or you can Parametrize it to say lag by how much or how many rows? So the first one doesn't really have a row above so the its lag is
16:06
No Similarly lead takes the row below and of course the last one in the window Doesn't really have a row below so we have to call that null to
16:25
First value takes the yes First value takes the first one in the frame or in the yeah in the frame Similarly last value You have to add some extra
16:42
Frippery here, which will be on the slides, which I will Publish as soon as we're done here But basically it takes the last value in the frame nth value I guess the
17:01
This was the SQL standards committee's version of humor Because I have not seen anywhere yet where You'd actually need the nth value. Has anybody else seen one? No
17:20
Okay, well for completeness I presented here, but it just Yeah, okay, so So windowing can affect aggregates if you make if you want it to One way you can so here's here's where it's not affecting our aggregates we're selecting value and the sum
17:46
Over the empty window, which basically means we're not specifying anything about order or partitions or anything like that We're just saying some and so it just acts like the the sum as it usually did
18:01
Except that it's Except that instead of bunching it all together. It's repeating the sum at each row There is that fairly? Okay But the sort of more interesting thing that you can do with
18:21
With this running with this windowing function over aggregates is that you can change aggregate behavior so What we've said is value sum of value over Order by value descending from table, so What's happening at these first two rows? Is that since you have not distinguished one from the other?
18:46
They can't really be separated one from the other So in the first two rows your sum is actually the sum of the first two rows because they tie that fairly Okay, so again you again it emphasizes that you have to be
19:05
You have to at least think about what the ordering is going to do to your result set And how how much you specify? So if I'd said order by value and maybe something else which distinguished the rows one from the other Then you would see five ten
19:24
Etc. Okay, so yes The other ones which would only work on a partition The question was the other ones which only work on a partition lag and lead
19:43
Why is that well You can partition that you can partition your result set into different slices and when you say You don't actually need
20:00
You Don't actually need to say you don't need to have a partition by clause in your windowing function But if you do have it lag and lead will only refer to the stuff inside the partition and not the whole window Right. So if you've said partition by ID or partition by last name
20:21
It's going to to do the lag and lead only in the context of the partition that you have sliced off Does that answer your question? Right, so if so if there's no partition is there's no partition by specified then the frame is the window and
20:42
You're done Thanks Okay, so that's that's enough about the TPS reports. I don't want to put you all to sleep too too much Well You know, it was a large night for some of you last night I imagine
21:00
So I'd like to talk about something else. That's that's really new and different in SQL and that's recursion Everybody read that Is that legible back there?
21:20
Okay, so we have a Google result set and it I'm looking for recursion and it says did you mean recursion? Yeah, okay So recursion is fun and and and you know, it can actually come up in real context I'm going to show you how we generated that little
21:44
Diagram earlier So we introduced a few little chunks of syntax and Here's the first one so we say with recursive X of I okay. So what we're doing here is we're saying
22:05
Here we are going to make ourselves a one or more Temporary views or subroutines whichever way you want to look at it Which are good for the context of this query and then we'll vanish
22:22
So we say with and then recursive is an optional keyword and that talks about whether the views or or Subroutines can refer to themselves And then we define the name of the the view and the names of its columns
22:42
Optionally, so you don't have to name the columns you do have to name the view So You've named the view and then we're going to see what we're going to Define what happens inside the view. So the first thing we do is
23:03
Get us a zero value up here. Sorry. Has anybody got a pointer? So that's the first step in the recursion is that when you recurs you have an initial condition And then you have some sort of a recursion recurrence relation
23:25
And then if you're smart, you'll also have a termination condition Just in case Yeah, so you say in the recursive view you say initial condition union or union all
23:42
What would be the difference between union and union all generally? Anybody? Yeah, the answer was unique strips strips out duplicates and that's that's what's happening so union all would just pile everything together
24:10
ignoring duplicates Union would strip them out and will cause a performance You know change because it has to then do that stripping
24:21
So union all is usually Faster and if you know as in this case We do that the results are already unique and that's how we wanted them. Anyway, we can just say you know so you start with this
24:40
temporary view which just has One row in it and that row is a zero And then we say select i plus one from X that's our It's recurrence relation Where I less than a hundred and one
25:02
And that means if we we keep going Except that we have to check that I is less than a hundred and one What happens if I leave off this where clause? yes, it continues counting and well
25:20
It continues counting until you overflow integer, I think Yeah Yeah, so be careful when you're doing this so what I've done here is I've generated some number of points in the x-axis, I suppose How many of them?
25:50
It's it's it's zero to 101 because it can actually get to 101 before it stops Right, so it's a hundred and two points
26:01
It's really easy to get off by one, isn't it? I mean, it's not just see where you can do that It's it's all kinds of places. Okay, so we have this set of points Now we're going to Make some more points which are looking more Cartesian as we go along
26:20
So we select I X I Y X float Y float from Let's see Do a little affine transform here on both of the result sets and cross-join them with themselves
26:40
And has anybody done here here done a cross-join on purpose before Good for you How about by accident Yeah, it's really easy to do a cross-join by accident you just say from table one comma table two
27:01
And you forget to put in a where clause and then You get a Cartesian product Okay, so we've we've picked from this sort of affine transform of a hundred and or ten thousand 400 and
27:20
No, anyway, some was around ten thousand points And then we're going to do this recurrence relation. So we have our Our little union all here that signals that we're we're in a recursion And then we're going to sort of you know, do this little quadratic mapping thing and and bound it and
27:45
27 hmm wonder what that could be. Well, we'll find out I guess So we're So that's our next subroutine is we've we've taken the set of a hundred points and then we've somehow
28:02
Transformed it into this Block of 101 by 101 and then we've iterated over the block for some number of iterations so once we've done that we have this giant result set that we need to slim down because You know, we don't want to be in a hundred and twenty seven dimensions. We want to be in two
28:25
and the way we do that is we Project this into the Cartesian plane by just choosing I X I Y and the max I For so we just sort of collapse those results into the biggest one we found
28:45
From Z group by I X I Y order by I X I Y Okay, so we've now got a set of 101 by 101 points and we want to be able to Display them. Well in the grand tradition of
29:03
Asciimatic We're going to display them by indexing into a into a string, right so we're gonna You know things that are things that have one are gonna map to a space and to map to a dot
29:21
all the way out to 27 which maps out to a space or 26, I think Okay, and we say greatest of I comma one which is to say that if we happen to have Yeah, if we happen to have an I that was
29:41
Zero or negative. We just call that I a one Just in case because we already choked off I to be less than 27 Right, we just want to make sure it fits in the array bounds because otherwise we could have this we could have some sort of a stack overflow kind of attack and
30:02
Who knows what could happen? Right, so we mapped it we've we've grouped it into an aggregation and then collapse the aggregates into a string and voila Okay So that was actually a fairly easy problem as math goes
30:25
How about let's try a somewhat harder problem Like maybe yes Yes
30:48
Yes, so the Right, so the question was does the
31:03
does the Row set actually materialize into memory And I believe the answer is yes Well, you know the nothing comes for free and and
31:22
It's unfortunate but there it is. Okay. So back to our slightly harder problem. We're gonna try something that's NP hard Right. So we have a traveling salesman problem This here is a reminder To make sure that before you start embarking on an NP hard problem that you actually have to solve
31:47
that problem Everybody read this But assuming that you do have to solve the problem You have to set up data structures and you have to start, you know doing your computation on those data structures
32:04
So that's what we'll do I have a very simplified schema It'll have a table it's called pairs We got a from city a to city a distance in between
32:22
We're gonna say that from city and to city defines uniqueness on this table and then we're gonna check from city less than to city Why do we do that, I'm sorry, well one answer was from the city to itself, but that's yeah, that's part of it
32:50
You just want what you Several people said you just want to go either here to there or there to here, but you don't want to have different What could be different distances?
33:02
So we're assuming that the distance between say Bari and Bologna is the same as the distance between Bologna and body and so we just limit it to only only representing one of those Okay, so we have our table
33:23
Let's insert a little data in there a bunch of I forget where I got this but Sorry, yeah, well maybe Okay, so now that we have our data structure and then it's populated we need to do some computation on it
33:47
and by the way, the computation we're doing is sort of a cheating version of brute force and I'm cheating because I want to make sure that well that that the
34:05
Program terminates in a reasonable time and I'll show you how I cheated in a little bit okay, so Having made sure that we can't actually have The paths in both directions stored in our table. We now proceed to double the size of the table
34:24
So that we get that all the paths in all directions That yeah, okay So that's what we're gonna do here is we say from city to city and distance we said with recursive And there's a union all down there, but it's not actually a recursion
34:43
We just said with recursive means that we can use recursion somewhere in the common table expressions or temporary views or subroutines That we are about to create So
35:00
We select the distances in one direction from city to city distance and then we Hello select them in exactly the opposite direction to city from city distance, so we just doubled the size of the table We want all of those We want all of those edges for our graph
35:22
Right the the one way and the other way Okay, so we've got that data set and then we Try a little Initializing a path so
35:41
We're starting out from Rome and That's going to be our our from city We say to city and that's all the possible places you can get directly to from Rome The distance in between and then we're gonna just put Rome in its own
36:02
Postgres array, which I think is fairly standard I'm not sure I haven't checked with a standard on how arrays are handled, but it's a it's a very handy thing to have around And we pick that from the the both ways Result set which we've just created
36:21
By the way any Anybody want to guess why we had to pick this as a starting city? Thank you little geek humor. Sorry about that Okay, so now that we have a
36:42
Now that we have a the the paths the the the Paths that go from Rome to suck to one other city. We need to keep going until we get back so we've we've got those going outward and then we want to
37:05
Choose more so we starting from the the endpoints of our previous Our Previous Hello I'll just drop right here. I hope everybody can read it. So we have
37:22
Rome and we just did this To each of the cities which are directly accessible by one hop from Rome So then we're going to start from each of these and go on to other cities and so forth and so on
37:44
Everybody is that fairly clear on that more explanation less Faster slower Okay So we joined back to that previous result set that we just created
38:07
and We join on head to tail sort of For the For the edges right the the tail of the next edge is the head of the last edge
38:24
Then we make sure that The from city is not in the path Except for the first element, right? So we want the Well, okay. So why would we do that?
38:43
Sorry We do want to get back but only once So we don't want to have extra loops in there our Salesman is supposed to travel in as short a Time as possible and looping infinitely
39:02
It's probably not it So so we only want the one cycle and that's why we have left off the the first element in the array Which is where we're trying to get back to I also
39:21
Remember the part where I said I'd cheated about this. Can everybody read this? Okay Says array upper of P dot path comma one. That means the first dimension in the array its Upper bound is six. So basically
39:41
We're saying we got to get back there in six hops. And if we don't get back there in six hops It's probably not our shortest one anyway Why would we say only some? Specified number of hops Anybody nope
40:04
The question the answer was total number of cities Actually, what it is is we're trying to prevent the Sun from burning out while this query runs And Well, if it's less than the total number of cities we would find that we would hope to find a solution in
40:26
Some smaller thing then start at Rome go to every other city in Italy and come back to Rome Which Is the which is what you do if you're that would be the longest path basically
40:42
You actually want the shortest path that hits the cities described which I had to describe below because of some limitations in our Implementation so we hit we said start at Rome hit these three cities and come back That's that's what the traveling salesman problem is all about
41:03
So I'm living I'm living I'm saying that I have to hit three cities and I hope that In the process of hitting those three cities. I only hit at most six Okay, so This is just to prevent the Sun from burning out and me from getting bored and you from getting bored
41:26
While we're doing this exhaustive search, so it's not actually O of n factorial which is pretty good because there was like 60 cities in there and 60 factorials kind of big and long and boring
41:41
Okay, so now that we've got these paths We'd like to Find the final one where so we've got all the paths that have Rome is a starting point and are less than
42:01
six Long And don't have Rome in them so we have to finish this off and say well of those some of the paths will be one hop away from Rome again, and Those are the ones we want So we're picking
42:26
All the paths with Milan Florence and Naples in the path, so those are the cities we wanted to visit I
42:40
Still haven't gotten to Milan and Naples. I hope to someday Anyway, so we want to order by distance because the idea of the traveling salesman problem is that you is That you want the minimal distance. So we're saying distance ascending and then Order by path also, why would we why would we do that?
43:03
One answer was get the shortest pathway It's actually we already took care of that by distance any other Guess is why we would order by path along with distance
43:21
The answer was eliminate duplicates and that's right because if you go If you go one way through a path and then go back the other way through the same path, they will have the same distance according to our model so that's that that's one of them and Another one is if if it happens that to two completely distinct paths or at least two
43:45
Partially distinct paths have the same distance. We just want the same answer coming out Each time we run the query because if we have a tie from two different paths We just want we just want to trim it down to one
44:04
Okay, so that's our That's Our traveling salesman problem solved in 12 seconds, which isn't too bad for brute force
44:36
Well, I was going to go through Some things about posting on a forum, but apparently I have taken too much time. So right now I would like to
44:49
I'll post this later and I would like to oh, yes, by the way SQL is Turing complete. So just be really careful about what you let people execute SQL on
45:05
and I would like to Open up the floor for brief questions comments and of course the straight jacket. I've just turned questions comments All righty
45:25
Thank you so much