We're sorry but this page doesn't work properly without JavaScript enabled. Please enable it to continue.
Feedback

Optimising Spatial Data Analysis With PostgreSQL And PostGIS

00:00

Formal Metadata

Title
Optimising Spatial Data Analysis With PostgreSQL And PostGIS
Title of Series
Number of Parts
95
Author
License
CC Attribution - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this
Identifiers
Publisher
Release Date
Language
Production PlaceNottingham

Content Metadata

Subject Area
Genre
Abstract
In this talk we will demonstrate spatial data analysis on the relational database system PostgreSQL (http://www.postgresql.org) equipped with the spatial extension PostGIS (http://www.postgis.org). We will gradually introduce some of the optimisation techniques provided by PostgreSQL, by applying them to the solution of increasingly complex problems belonging to the PostGIS domain. Our aim is to point out as clearly as possible the main ideas behind each example, showing the link in both projects between development of new features and the need to tackle real-world problems. Topics mentioned in this talk include: the special index types GiST and SP-GiST; custom database objects, such as data types, functions and operators; query and workload profiling.
Data analysisStokes' theoremCausalityComputer animation
Database transactionConcurrency (computer science)Set theoryData integrityLogicData storage deviceData modelDatabaseNim-SpielSession Initiation ProtocolFreewareSpeech synthesisFinite element methodVarianceView (database)Operator (mathematics)Replication (computing)Subject indexingField extensionStreaming mediaPerturbation theoryType theoryRange (statistics)Thermal expansionTable (information)Relational databaseConstraint (mathematics)PredictabilityMetropolitan area networkOvalData structureMiniDiscFocus (optics)Fourier seriesBeer steinGeometryStandard deviationData analysisCore dumpBASIC-ASet (mathematics)Flow separationTheoryRelational databaseDatabaseData storage deviceSpeech synthesisPartition (number theory)Different (Kate Ryan album)Ocean currentElectronic mailing listSpacetimeNeuroinformatikExtension (kinesiology)Subject indexingInformationComputer filePhysical systemGeometryEndliche ModelltheorieMappingOperating systemPresentation of a groupPoint (geometry)Set (mathematics)Type theoryOperator (mathematics)Data typePlanningObject (grammar)Client (computing)QuicksortAreaOpen setComputer hardwareMultiplication signConcurrency (computer science)Web-DesignerLogicArithmetic meanTable (information)Query languageCalculationMiniDiscImplementationServer (computing)HypermediaSystem callState of matterHand fanMereologyDemosceneLocal ringNatural numberNumberFunction (mathematics)Adaptive behaviorData structureVotingCodeWhiteboardRenewal theoryWebsiteGame theoryDivisorWater vaporProcess (computing)FrequencyPopulation densityBasis <Mathematik>View (database)Customer relationship managementMathematicsSeries (mathematics)Level (video gaming)DistanceComputer animation
GeometryStandard deviationNim-SpielType theorySubject indexingCore dumpField extensionBASIC-ASet (mathematics)Flow separationPairwise comparisonPoint (geometry)Generic programmingPrice indexAlgorithmSpacetimeObject (grammar)CubeCirclePolygonSubsetAreaScalar fieldOperator (mathematics)Boolean algebraMathematical optimizationUniform resource nameNumberShape (magazine)Loop (music)Run time (program lifecycle phase)Lattice (order)Digital filterThermal expansionPort scannerProgrammschleifeNumberMultiplication signRow (database)NeuroinformatikSubject indexingSymbol tableMereologyReverse engineeringCuboidChemical equationType theoryPoint (geometry)Table (information)Set (mathematics)Query languageFunction (mathematics)2 (number)Network topologyResultantCASE <Informatik>Order (biology)QuicksortShape (magazine)Different (Kate Ryan album)DatabaseWritingVideo gameKeyboard shortcutPhysical systemCircleOnline helpIdeal (ethics)State of matterCausalityLimit (category theory)DemosceneSubsetLiquidDimensional analysisStaff (military)Computer animation
Thermal expansionLoop (music)Run time (program lifecycle phase)Shape (magazine)Memory managementPort scannerSubject indexingPoint (geometry)GeometryDigital filterLattice (order)International Date LineProgrammschleifeNumberAreaCuboidPoint (geometry)Online helpReading (process)MereologyMultiplication signSubject indexingWater vaporShape (magazine)Line (geometry)Computer fileSoftware testingSequenceAutomatic differentiationProgrammer (hardware)Basis <Mathematik>Volume (thermodynamics)ResultantCoefficient of determinationMusical ensemblePatch (Unix)Radical (chemistry)Different (Kate Ryan album)NeuroinformatikRectangleQuery languageRow (database)Variable (mathematics)Function (mathematics)Structural loadPort scanner1 (number)Table (information)Database2 (number)Operator (mathematics)AreaComputer animation
Nim-SpielPoint (geometry)Shape (magazine)NumberCuboidAreaLoop (music)Run time (program lifecycle phase)Thermal expansionMemory managementPort scannerInternational Date LineSubject indexingGeometryDigital filterProgrammschleifeData analysisElectronic program guideQuantum3 (number)Revision controlSubject indexingPoint (geometry)Shape (magazine)CASE <Informatik>Software maintenanceBeat (acoustics)Window1 (number)Type theory2 (number)Multiplication signDecision theoryMathematicsSlide ruleCodeFood energyProcedural programmingIdentity managementPartition (number theory)Schweizerische Physikalische GesellschaftQuicksortCircleRow (database)CuboidAreaLine (geometry)SpacetimeRectangleEquivalence relationComputer animation
Transcript: English(auto-generated)
Thank you for coming all the way from the main conference center to hear this talk. And this is actually a forehand talk with Simon, my colleague, who's here, probably, is traveling. But anyway, I knew I had to speak myself.
I'm talking about PostgreSQL and PostGIS, which I assume you already know broadly. But I'll do a brief introduction anyway. OK. The first question I was to ask you, actually, I want to answer the question, why should I use the database first? Because not everybody uses the database.
And the reason is quite simple. You don't want to reinvent the wheel. It's been 40 years that people ask themselves, how do I store my information on the computer? And then they come up with a list of clever ideas. And the database happens to be very similar to the way we think about problems. So we usually think about transactions.
So we have to do like 10 changes and want all of them to happen. Or if one fails, we want all of them to not happen. We don't want to leave things halfway through. We want things to happen concurrently. In modern times, we've got more than one computer. Tom Watson, the founder of IBM, he estimated the market of 12 computers around the world
when it started. But things have evolved quite differently. So in any system, you have one server and lots of clients that do work at the same time. So you want a system that's able to do that without making a mess, which is called concurrency. And the database is modeled on set theory.
That's where the relational world comes, which is very natural, apart from the fact that it's taught at primary school. But it's the way we think about things. We have a set of things, and then we take one of them, or we take two of them, or we take all the things here that match the things there. So it's not an exotic way of thinking.
It's just how we will think even if we didn't have databases. And the founder of relational databases theory is Edgar Codd, who said that these are the three main features why you would use a database rather than store your information on text files or find other ways of, cumbersome ways of recording them.
You want data integrity, so you want your data to be without inconsistencies. So you want precision in sense. You want to have an exact answer when you ask your database whether you've got one information or not. And you want to have a logical model. So you don't want to think about files, bytes,
and for byte here, two bytes there. You want to think about logical objects like numbers, strings, shapes, points, features. So you think about the logical model. When your system evolves and it provides the same logical model with the more performant storage,
you don't have to change anything because you're not tied to a particular implementation. Now, why PostgreSQL then? Because it's the relational database. It's already a good reason by itself, but it is free both as a beer and as speech. I'm not sure that free beer goes well with free speech,
but surely it goes well with speak freely. And also it's required by PostG. So I'm sure you all want PostG and if you want PostG, you have to have Postgres anyway, because it's like, I want a car and I need the wheels. It's robust and reliable. It's been developed for 27 years now. So it's probably older than somebody in this room.
And it's got a big community, not me unfortunately. It's got a big community. There are lots of Postgres only conferences. So not conferences where they talk about everything and then there's somebody that talks about Postgres, but Postgres thematic conferences this year, I can count these. There are more probably.
I didn't do sort of an inventory. So you won't be left alone if you use Postgres. Now, it's also advanced technology when you use Postgres because there are lots of new releases that are not like the same old stuff, Cook, again, rebranded with a different name. It's just new releases that happen regularly
with new features that weren't available before. And that's just the selected list of new features with releases and dates. So for instance, the latest release is like 9th of September. So it's probably 11 days or 10. What's the date anyway, 10 days. And they've got all these things here
that many of them are standard SQL. Many of them, some of them are Postgres only. Some of them are not Postgres only, but very clever. Some of them are very welcomed by web developers and so on. But this is just a quick overview. It doesn't mean that what we did before September 2010
is rubbish. It's just that this is not a list of feature. It's a way of telling you Postgres is evolving and every year is better. And it's regular, even if it's open source, but it's like clockwork. See, every year you get a new release. So it's not like...
You already know that if you have a new feature that's not here, but it's been developed, if it's ready by December, we know it will be probably September, 2014, because we already know that this is sort of yearly plan laid out. Now, to optimize queries, we use indexes. I'll be moving slightly quickly because I have 20 minutes.
An index is a way to optimize a database. So you do the same thing equivalently, but faster. And it's a persistent structure, which is on disk. So you have your data and then you add an index.
And this provides an optimal access to a particular table for a particular kind of query. So it's not like a generic thing. I create an index and everything is faster. No, I created an index on purpose for a particular way of reading the data and the index helps there. It depends on what data type you're using and what operators are using.
So you can optimize computing the area of the surface, seeing whether a point is included in a particular region or these kinds of things. Not all indexes are covered by this short presentation. We talk about spatial indexes. So indexes that are related to things that go with maps and mapping and all these things.
Currently in Postgres, there are two different kinds of spatial indexes. Gist, which probably is not unknown in this room and SPGist, which is similar to Gist, but it's space partitioned. This is very recent, even the theory is recent
and it's got advantages and disadvantages that we'll discuss. So if you do spatial things with Postgres, you'll be using lots of Gist and some SPGist as well. In Postgres, you've got spatial data types. You've got some legacy spatial data types
that were developed in the 90s. And then you've got POSGist, which is the standard one, the official one. So the difference is that this supports geometry and geography. So you have the idea of globe spatial reference systems and all the things that you know probably better than me.
This supports less index types. SPGist has been developed. I couldn't get in the room at two o'clock. So I don't know if that was mentioned, but it was included. They thought to include these in 2.1 and then they had to postpone it,
but it's been developed. So we can use the basic POSGist data types for simple calculations or use POSGist data types to have professional spatial data. It's an extension. So it can be installed very quickly in two minutes, really. I'm not joking.
It's really two minutes unless you have an exotic operating system with a complicated hardware. And then, oops. So we will use both of them because we want to demonstrate how indexes work. So the goal of this talk is not just to talk about things, but it's to show you explicit examples where indexes help in optimizing these queries.
What is spatial data? You've got two dimensional points, segments, parts, cubes, polygons, all the shapes. But the idea is that you can represent even three dimensional shapes.
I think there was a talk this morning which I couldn't make unfortunately, explaining what 3D features are available in POSG. We won't cover 3D in this talk. We were just covering 2D. And these two index types are overlapping. You can use both of them in some cases.
They have advantages and disadvantages. Gist is more flexible, but it has to be balanced, which means it's more expensive to maintain. So when you write things, the database takes extra work in order to keep the tree balanced. SPGist is not balanced, so it's faster,
but it doesn't support circles or reverse or complicated things. It just supports points and boxes. So if you happen to have simple computations like what points are in this rectangle, then you can use these. But if you have to intersect a council
with another district, you probably need to use Gist. And a database is a system where you type your queries and then they're executed. So you need to be able to specify things in using your keyboard.
So you can't use fancy graphic characters. You need to write everything in ASCII. So there are some symbols that have been put together by the POSGist and POSGist users to represent these notions. So for instance, if you want to say that something is at the left of something else,
you use double less than. If you want to say that something intersects something else, you use the double ampersand and so on. Every symbol has a geometric significance that you can use when you type your query. We'll see actually the query in a few seconds. So you can see there's all the things you need really,
because you can see whether two things are equivalent, equal actually. So if they have the same shape, even if it's written slightly different. So if you start from here or if you start from here, it's the same shape, but it's written differently. Or you can say this thing is a subset.
See this sort of ice cream is actually the best possible attempt to emulate the mathematical symbol of one set is inside another one. They were quite fun. Was creative, I wouldn't have done better.
Okay, so let's go into examples, which is clear. So if you have a large number of 2D shapes, and at one point we want to find all the shapes that contain that particular point. So this is a concrete problem. Large number is, I generated randomly data because what happens is that real life data
is either protected by complications or too small to do meaningful examples because these databases happen to be very quick. So if you want to demonstrate these things, you need to have millions of records, otherwise it's instantaneous. So I generated randomly one million shapes and one million points for the examples.
We obviously can generate more than one million, but this is enough to see the difference significantly. So this is our problem. And this is how you solve it in SQL, in PostgreSQL with PostGIS. So this part here is called the CTE. It's just you create a sort of a table called P,
which contains the result of this query here, which is actually the point that we selected. Remember, we have one point we selected. And then we select all the shapes such that the shape intersects the point.
It's quite readable and it's also efficient. That's a good thing. So if you run a normal query, sorry for the characters, but that's the actual output you get, but I'll point you to the relevant parts. So this is what happens when you ask Postgres
to execute the query and tell you the timing and the number of rows obtained at every step of the computation. So the table of shapes is scanned sequentially. So it starts from first one, second one, third one, one million times. It returns one million rows and it takes 276 milliseconds,
which is fast, but it's not fast in our problem because if you had to do it like 1000 times, it's not fast. Then you have, yes, you have your point. And then what you do here, you have the joint filter,
which puts together the point and the shapes and removes all the points that do not touch that particular shape. Sorry, all the shapes that do not contain that particular point. So see, we start with one million shapes. We throw away 997,000 and we keep all of them. And it takes one second, roughly, to do that.
If you use an index, what happens? It's totally different. The result is the same because it's the same query. Otherwise it would be a bug. But what happens is that when reading the shapes, an index is used. So the index only produces 5,800 rows.
So it throws away all rows, about 5%. And then after having eliminated 99.5% of the candidates, the remaining candidates are examined with a recheck.
And the final result is that 2,700 like before. But the total time is now 53 milliseconds. So it's 25 times faster. It's not magic. It's just that instead of looking at one million things, we throw away almost all of them and we look only at 5,000. That's why it's faster.
Second example, we have loads of points and one single shape. So we do the opposite. We have all these points. We have a shape. What points are inside that particular shape? It looks like a high school problem, but it's most of the problems you face when you do actual geographical work of this kind.
So find all people that have to pay a particular council tax because they live in a particular area or so on. So how would that problem be solved in Postgres and PostGIS? Same thing. We select one shape here.
This is a variable. So it means I select the shape I want and then it picks up one shape from the bag of all shapes and then it selects all the points that intersect that shape. If you get used to it, it's very readable. And that's what happens with Obgist. Again, you have another sequential scan.
So you have 1 million rows that are fetched in 233 milliseconds. Then you have a operation of a filter that removes lots of them. And then at the end, you only have 90 of them. So see, it's really a waste.
You pick 1 million pieces of paper and then you read all of them and then you throw away all of them except 90. No wonder it takes one second. So it takes only 230 milliseconds to read them, but then to examine all of them,
it takes most of the time. When you have an index, what happens is that the index can throw away all the ones that are surely unneeded and then recheck only the ones that might or might not be needed. And so you can see the big difference here. You got the index only selects less than 1,000 and it does so in less than one millisecond.
So the final output after all the computation, the recheck, so you have 700 and then you find the real 90 by a second examination, but in the end, you could do six milliseconds. So you see how relevant are these techniques. And these means that you don't need to know,
you don't need to be clever in analyzing your data. It's already factored inside the database engine. So you don't have to write an algorithm. You just ask for what you want and then the database will find out an efficient way of answering your question. Now, a third example is a bit edgy.
It's the same example, but we pick on purpose in unlucky shape. What does it mean unlucky? I haven't gone into details, but as many of you will probably know, the way indexes work, they replace complicated shapes by a rectangle and they do the computations on the rectangle,
which is easier. Checking if one point is inside a complicated shape is a huge computation. Checking if it's inside the rectangle is very easy. Just go, is X between the Xs and is Y between? Yes, then it's in the rectangle. Now, if your shape is a rectangle, this is perfect.
If your shape is not a rectangle, you will find some points that are not really useful, but they are inside the rectangle, but not inside the shape. And that's why in the recheck procedure, you throw away most of them because they are the ones in the rectangle and these ones are the ones that are in the actual shape you require.
Now, the weak case of these indexes is when the shape has an area which is very small compared to the rectangle. And so we'll show now a difficult case. So this is with gist, we don't do without gist because you already know that indexes are important.
So what happens with the index? The index is used, it fetches 57 rows, but because it's thrown away, 5,300 of them. So you can say that these index will take only one row every, what was that?
2,200, one row every 200 or 1 million, but still most of them have to be eliminated afterwards. So it's sort of a weak case. And I have a picture I'd like to show you, which is already here.
Oh, yeah. The only thing, this is a bit, yeah. Not the best size of the window. Now, so this is our shape, which is thin
in the sense of, if you consider the rectangle is much bigger than the shape. When you do the rectangle stuff, you get all these points here.
See, you get lots of points you don't really have to get. But when you recheck, you select only the points that are actually needed. So this is the 90 and this is the 5,000. So this is a weak point.
And the last slide will show about what happens with the box. Because we can use SPG only with the box, it's sort of linked so far. What you can see is that this is the index that we know gist takes about eight milliseconds and it does its job.
And SPG is to give the same time. So it's equivalent in this particular case. But the thing about SPG is that it's much more efficient because it's unbalanced, so it's low maintenance. So this last slide shows you that this type of index, even if it's still experimental,
because it's not into post-gist yet, it's very powerful. Okay, so. Thank you. Yep. The SPG index, when will be in the post-GIS?
To support only the point against box or to support only line and another feature against box? The SP index is based on the idea that you divide the space in partitions and then the sub-tree is only contained in the partition.
So it cannot support things that are not points. So because it's either here or in the other one, it can't be across them. It's limited to points, but the fact that so far it's a point in a box. And I think it's not possible to generalize to a point in a circle because it has to do the recheck anyway.
Yeah. And I think it will be in the next post-GIS because it was supposed to be and it was postponed a few months ago. Any other questions? Thank you.