Leveraging the Power of Uber H3 Indexing Library in Postgres for Geospatial Data Processing

Formal Metadata

Title
Leveraging the Power of Uber H3 Indexing Library in Postgres for Geospatial Data Processing
Number of Parts
266
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
The Uber H3 library is a powerful geospatial indexing system that offers a versatile and efficient way to index and query geospatial data. It provides a hierarchical indexing scheme that allows for fast and accurate calculations of geospatial distances, as well as easy partitioning of data into regions. In this proposal, we suggest using the Uber H3 indexing library in Postgres for geospatial data processing. Postgres is an open-source relational database management system that provides robust support for geospatial data processing through the PostGIS extension. PostGIS enables the storage, indexing, and querying of geospatial data in Postgres, and it offers a range of geospatial functions to manipulate and analyze geospatial data. However, the performance of PostGIS can be limited when dealing with large datasets or complex queries. This is where the Uber H3 library can be of great use. By integrating Uber H3 indexing with Postgres, we can improve the performance of PostGIS, especially for operations that involve partitioning of data and distance calculations. This presentation will demonstrate the use of Uber H3 indexing library in Postgres for geospatial data processing through a series of examples and benchmarks. It will showcase the benefits of using Uber H3 indexing for geospatial data processing in Postgres, such as improved query performance and better partitioning of data. The potential use cases and applications of this integration, such as location-based services, transportation, and urban planning will be discussed. This talk will be of interest to developers, data scientists, and geospatial analysts who work with geospatial data in Postgres. It will provide a practical guide to integrating Uber H3 indexing with Postgres, and offer insights into the performance gains and applications of this integration.
Transcript: English (auto-generated)
Leveraging the power of Uber H3 indexing in Postgres for geospatial data processing. So, I'm going to talk about Uber H3 indexing, but before that, let's talk about Postgres.
We all know it's an open source database with 35 years of active development. It can handle all kinds of data at scale. And then you have PostGIS as an extension to Postgres, which can handle all of your GIS data, right, the points and polygons.
And then you also have a bunch of spatial functions built in: spatial joins in SQL between points, polygons, and shapes, and a lot of other functionality like generating vector tiles and raster tiles, for lots and lots of different use cases.
You can do almost anything inside PostGIS, right? So what's Uber H3? As you already know, PostGIS can handle all your geospatial data, and it also provides indexing on top of your geospatial data, which is something called SP-GiST.
It definitely improves your performance when you're doing a lot of spatial queries like point in polygon or comparing polygons, right? And accessing those polygons as well, with ST_Intersects and all those kinds of things. So Uber H3 is complementary to your PostGIS stuff, right?
It's something that is also used to index your geospatial data, but it uses a hierarchical grid system of hexagons to do it. The shining part of Uber H3 is when you're aggregating a lot of data.
As we'll also see, it is able to change the way you process or look at geospatial data, from a geospatial approach to something more like plain Postgres, where you have an index, like the B-tree indexes in your Postgres database, which is much more efficient. So we'll also look at how we can achieve those kinds of efficiencies with H3.
And again, Uber H3 was developed by Uber and it has bindings in Python and other languages, and this is one of the bindings that it has, in Postgres as well. Also, PostGIS is a requirement before you install the Uber H3 extension inside it, right? So again, it's complementary to what PostGIS already gives.
So when talking about H3 indexing, this is how you will look at the world from the H3 index angle, right?
You have lots and lots of hexagons. Now, why hexagons? Well, they obviously look pretty, right? But that's not the only reason; we'll look at why. In your standard PostGIS setup, you take a shape and you build a lot of square boxes or blocks around it to index the shape itself.
In H3, you use hexagons. Why hexagons? Again, we'll have a look at that. Okay, before that: when we look at the Earth, it's a sphere, but when processing a lot of data, we're talking about 2D geometries and 2D processing of the data.
In your normal setup, you use different projections like EPSG:4326 and EPSG:3857, which take the sphere and put it on a flat surface, right? Because of those projections, you also get a lot of distortion in your shapes, polygons, points, whatever, right?
Mostly polygons, not points per se, right? And so you see those projections, 4326 and 3857, are rectangles, but the shapes in the north and the south are wildly different from what they actually are when you look at them on the sphere.
So what Uber H3 uses is something called a Dymaxion projection, right, to create the 2D surface before it actually indexes the area of the Earth. Why this particular choice? Because if you look at all the vertices of this shape, all of them are in the ocean; they're not on land.
And when you open it up, this has relatively less spatial distortion compared to your 4326 or 3857, right? So it helps Uber H3 do less spatially distorted indexing.
So again, this is something that is more on the academic side of things, so you might not even need to understand it. This is more like trivia, a fact that is just interesting to know, right?
Finally, we're at the question of why hexagons and why not triangles, rectangles, or any other shape in the world, right? So this is the reason: when you have a triangle and you're moving from one cell to a neighbor, to the center of the next polygon, you will have, you cannot read this, but it says 12 triangles, right? So you'll have 12 different neighbors for each cell.
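The neighbor argument for squares (eight neighbors) and hexagons (six) can be checked with a small sketch. It uses axial coordinates for the hexagon grid, which is a common textbook convention and an assumption here, not how the H3 library stores cells. The point it illustrates is that all six hexagon neighbors sit at one center-to-center distance, while a square's eight neighbors sit at two.

```python
import math

# Axial coordinates (q, r) for a hexagon grid: each cell has exactly six
# neighbors, and every one of them shares an edge with the cell.
HEX_STEPS = [(1, 0), (1, -1), (0, -1), (-1, 0), (-1, 1), (0, 1)]

# A square grid has eight neighbors: four across an edge, four across a corner.
SQUARE_STEPS = [
    (dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)
]


def hex_center(q, r):
    """Center of a pointy-top hexagon with unit circumradius."""
    return (math.sqrt(3) * (q + r / 2), 1.5 * r)


def square_center(x, y):
    """Center of a unit square cell."""
    return (float(x), float(y))


def center_distances(steps, center):
    """Distinct center-to-center distances from the origin cell to its neighbors."""
    origin = center(0, 0)
    return sorted({round(math.dist(origin, center(a, b)), 9) for a, b in steps})


hex_dists = center_distances(HEX_STEPS, hex_center)
square_dists = center_distances(SQUARE_STEPS, square_center)

print(len(HEX_STEPS), hex_dists)        # six neighbors, a single distance
print(len(SQUARE_STEPS), square_dists)  # eight neighbors, two distances
```

With a single neighbor distance and only edge crossings, a "move to neighbor" routine needs one case instead of two.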
And then for squares, you will have eight neighbors, and for hexagons, you have six. So one, a hexagon has relatively fewer neighbors. And the other amazing thing is that for all the neighbors, you move across edges, right?
So this simplifies a lot of your functions for moving from one neighbor to the next. From the coding perspective, you only deal with edges and not with vertices. So that is why H3 chose hexagons, right? And yeah, that's one of the reasons.
And also for aggregation: if you're doing a lot of neighbor-related queries and neighbor-related aggregations, then H3 is more helpful because you know there are fewer neighbors, and moving from one neighbor to the next is relatively cheaper compared to triangles and squares, right?
Right, so indexing a point is easy, because you just have one point, you put a hexagon on top of it, and that's it, you're done. But when talking about shapes, we have two different ways to index the polygon, right? So in this shape, you can't see it, but there are very, very small hexagons here, right? These are not points, and this is not filled with color.
These are very, very small hexagons that you see here. And this is the same thing, but the inside hexagons are bigger, right? So the difference between this polygon and this polygon is that the first takes about 10,000 hexagons to represent the polygon, and the second takes about 901.
Depending on your use case, you might want one or the other. If you want to aggregate data over a fixed set of shapes and you want to represent that polygon with fewer values, then the one with the bigger hexagons would be the choice.
But if you're comparing two different polygons, two different shapes, or if you're comparing a point with a polygon, where you might not know where the point is, then the one with uniformly small hexagons makes much more sense, right?
We're going to be talking about a query on the uniform polygon, but the other one can be used for aggregating, or, specific to the Postgres context, it can be used to aggregate your data and keep data from the same vicinity together, right? So you'll have less access time for that data, because you can store it close together.
Okay, so when talking about H3, you also have to realize that there is something called resolution.
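The trade-off between the two polygon representations can be sketched with a stand-in grid. This is not H3: the hypothetical cells are `(resolution, x, y)` squares in a quadtree, where four siblings merge into one parent (H3 merges seven). It only illustrates the idea: compaction shrinks the cell set, but a lookup then has to walk up the hierarchy instead of doing one equality check.

```python
from collections import defaultdict


def parent(cell):
    """Parent of a (resolution, x, y) cell in the quadtree stand-in grid."""
    res, x, y = cell
    return (res - 1, x // 2, y // 2)


def compact(cells):
    """Replace every complete group of four sibling cells with their parent."""
    cells = set(cells)
    changed = True
    while changed:
        changed = False
        siblings = defaultdict(set)
        for cell in cells:
            if cell[0] > 0:
                siblings[parent(cell)].add(cell)
        for par, kids in siblings.items():
            if len(kids) == 4:
                cells -= kids
                cells.add(par)
                changed = True
    return cells


def covers(cells, cell):
    """True if the cell itself or any of its ancestors is in the set."""
    while True:
        if cell in cells:
            return True
        if cell[0] == 0:
            return False
        cell = parent(cell)


# A shape filled uniformly at resolution 5: a 16 x 16 block plus one extra column.
uniform = {(5, x, y) for x in range(17) for y in range(16)}
mixed = compact(uniform)

print(len(uniform), len(mixed))   # many small cells versus a few mixed-size cells
print((5, 3, 7) in uniform)       # uniform set: one equality lookup
print(covers(mixed, (5, 3, 7)))   # compacted set: walk up through the ancestors
```

The uniform set suits point-versus-polygon comparisons because every lookup is a single equality test; the compacted set suits storage and aggregation over fixed shapes.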
As we saw in the previous Earth diagram, you had bigger hexagons and smaller hexagons, so this is the difference in going from one hexagon to smaller and smaller hexagons, right? So the finest one has an area of zero point lots of zeros 895 square kilometers, which is really, really small for any use case; any satellite image processing that you have, this can handle.
But in the particular use case that I worked on, resolution seven provided a good balance between performance and error rate, right?
Because, again, when you are looking at this diagram and you see a lot of hexagons, when you go to the edge of the polygon, you will have some hexagons sticking out of the shape, and you will have some inaccuracy at the edges, right? So depending on what resolution you choose, it might introduce some error into your geospatial data processing or aggregation.
This is something to take into account. Why I chose this for my particular use case, you will see: resolution seven provided great performance with zero error rate for me. But yes, it can introduce an error rate when you are indexing, or when you're looking at very, very low resolutions, right?
Right, so talk is cheap. We're going to look at some of the things that I actually was able to implement and work with in H3 indexing.
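To put numbers on the resolutions, here is a rough sketch of average cell area per resolution. It assumes the H3 grid has 122 cells at resolution 0 and that each finer resolution subdivides by roughly seven, which gives `2 + 120 * 7**r` cells at resolution `r`. The Earth surface area constant is an approximation, and real cells vary in size around these averages.

```python
EARTH_SURFACE_KM2 = 510_065_621.724  # approximate surface area of the Earth


def cell_count(resolution):
    """Number of cells at a resolution, assuming 122 base cells and aperture 7."""
    if not 0 <= resolution <= 15:
        raise ValueError("H3 resolutions run from 0 to 15")
    return 2 + 120 * 7 ** resolution


def average_area_km2(resolution):
    """Average cell area: Earth's surface divided evenly over all cells."""
    return EARTH_SURFACE_KM2 / cell_count(resolution)


for res in (0, 7, 15):
    area = average_area_km2(res)
    print(f"resolution {res:2d}: {cell_count(res):>18,} cells, {area:.9f} km^2 on average")
```

Under these assumptions resolution 7 averages a little over 5 square kilometers per cell, and resolution 15 comes out just under one square meter, which matches the "lots of zeros, then 895" figure on the slide.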
The GitHub link is there and all the code is there, so you can go ahead, look at it yourself, and try executing all of this, right?
Some of the use cases that we'll look at: point in polygon; representing shapes with H3 indexes, that is, how a polygon can be represented with a set of H3 indexes and how those H3 indexes work; then the second shape that we saw, representing the same polygon with fewer H3 indexes; and then aggregating points to a coarser resolution, how that helps, and how we can have much faster queries in Postgres by aggregating the data points, right?
So for this particular use case, the right side is a screenshot from QGIS. I took OSM data of all of Asia, the PBF file, which I think was roughly around 11 GB, and then I extracted all the trees in Asia, and I also got all the countries in Asia, which is about 165 countries, so 165 polygons, and about six million points, if I remember correctly, but something in the millions range, right?
And then I just want to look at how many trees there are in particular countries. Obviously, this is just for demo purposes. There are many more trees than what is on OSM, but this is just to show how H3 works. This is not data that you should be quoting anywhere. This is just to see, right?
The query is pretty simple. You have trees, and then you have shapes of admin level two, and all you're doing is ST_Within: whatever point is in a shape, just count those trees, right? So, a very simple query, runs really well, takes about 38 seconds to complete, right?
So I had a similar use case where we were doing point in polygon, and every day we had new points coming in, so this is the use case that I wanted to solve. And when I started looking at different indexing options for both points and polygons, what I realized was that the problem was not with the indexing part itself.
The problem was Postgres having to handle all the geospatial functions, right? Postgres is built for indexing your normal tabular data, like strings, integers, and all of that.
Now, the way the same query works in H3 is something like this, okay? You have the shape represented as lots of little hexagons, and you have all the little points represented as hexagons. So when you overlay them, you just have to do an equality query when you're looking at which point lies in which particular hexagon.
Now, this table, this flat table that represents the polygon as H3 indexes, is not indexed.
This is a query without an index. And when you're talking about an H3 index in Postgres, it is a 64-bit int, right? So if you actually index this, the time will go down further.
So when you're looking at the representation of a polygon in Postgres using Uber H3 indexes, you will have an H3 index, and you will have a shape ID next to it, right? And it'll be a flat table. I mean, it is geospatial data, but technically it is your normal join between two different tables, right?
So that is why we have this huge performance benefit, because in ST_Within, we use an entirely GIS algorithm, where a ray is traced out from the point, and if it cuts the polygon an odd number of times, then you know the point is inside the shape. But with this index, all you do is compare the value of the H3 index for your point and your polygon.
And this query is for resolution seven, okay? All the code is there on GitHub. You can go and see it, but for the purposes of the slide, I have made it slightly shorter, just to cut the amount of noise that we have, right? So this takes about three seconds.
And again, in this hexagon, it's not certain that we have just one point. There could be multiple points inside a single hexagon. We'll also see that in the coming slides.
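The contrast between the two approaches can be sketched in a few lines. The ray-casting function is the standard even-odd test; `cell_of` is a hypothetical stand-in that snaps a point to a unit square cell rather than a real H3 hexagon, and the rectangle is deliberately aligned with the grid so both approaches agree exactly. With real shapes, cells on the boundary can introduce the edge error mentioned earlier.

```python
import math
import random
from collections import Counter


def point_in_polygon(x, y, polygon):
    """Ray casting: count edge crossings of a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside  # odd number of crossings means inside
    return inside


def cell_of(x, y):
    """Hypothetical stand-in for a cell id: the unit square containing the point."""
    return (math.floor(x), math.floor(y))


# One "country": a rectangle whose edges line up with the grid.
polygon = [(2, 1), (8, 1), (8, 5), (2, 5)]

# "Flat table" of cell -> shape id: every cell whose center lies inside the shape.
shape_cells = {
    (cx, cy): "shape_1"
    for cx in range(10)
    for cy in range(10)
    if point_in_polygon(cx + 0.5, cy + 0.5, polygon)
}

random.seed(42)
points = [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(10_000)]

# A geometric test per point versus a plain equality lookup per point.
by_geometry = sum(point_in_polygon(x, y, polygon) for x, y in points)
by_cell = Counter(shape_cells.get(cell_of(x, y)) for x, y in points)["shape_1"]

print(by_geometry, by_cell)
```

The lookup side does no geometry at query time: the polygon was converted to cells once, and each point costs one hash lookup, which is the same shape of work as an ordinary equality join between two tables.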
But this makes it really fast, even for the points that have the same H3 index, right? And Postgres will also plan the query much better compared to the ST_Within function, if you want to look at the plan and see what kind of optimization it needs to make to query the table, right?
So again, in this particular section, we are representing the polygon with uniform H3 indexes, right? This is resolution seven. The whole shape has resolution seven indexes, but this particular image, again, is for visual purposes.
In your Postgres table, when you look at it, you will have a flat table, right? And the H3 index value would be a 64-bit int, right?
This is just for visualization purposes. I created it so that I can visualize it in QGIS and see how it looks, right? You have all the countries. China is missing, I don't know why. When I filtered the OSM data, maybe I also accidentally deleted some of the countries, right?
So when you look at a particular shape, and this can be any shape, you can represent it at a much coarser resolution, right? And when you're aggregating the data in that particular shape and not comparing it with other shapes, this is something you can use to put the data in the same vicinity, so you'll have much faster access time, right?
Okay, so this is how the points were initially, and this is how they're separated out in H3, okay? So you have one point representing one H3 hexagon. This is the same image, but zoomed out, just to see where the points are, right?
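Aggregating per-cell counts up to a coarser resolution can be sketched directly on the integer ids. The bit positions below follow the published H3 index layout as I understand it (resolution in bits 52 to 55, one 3-bit digit per resolution, unused digits set to 7); treat that as an assumption of this sketch, and in practice use the library's own parent function rather than bit twiddling. The cell ids are sample resolution-9 values and the counts are made up.

```python
from collections import Counter

DIGIT_BITS = 3   # each resolution step stores one 3-bit digit
MAX_RES = 15
RES_SHIFT = 52   # resolution field occupies bits 52-55


def resolution(cell):
    """Read the resolution field of a cell id."""
    return (cell >> RES_SHIFT) & 0xF


def cell_to_parent(cell, parent_res):
    """Coarsen a cell id: lower the resolution field, mark finer digits unused (7)."""
    res = resolution(cell)
    if not 0 <= parent_res <= res:
        raise ValueError("parent resolution must be between 0 and the cell's own")
    out = (cell & ~(0xF << RES_SHIFT)) | (parent_res << RES_SHIFT)
    for r in range(parent_res + 1, res + 1):
        out |= 0b111 << ((MAX_RES - r) * DIGIT_BITS)
    return out


# Per-cell point counts at resolution 9 (counts invented for illustration).
counts_res9 = Counter({
    0x8928308280FFFFF: 3,
    0x8928308280BFFFF: 5,
    0x89283082807FFFF: 1,
    0x89283082877FFFF: 4,
})

# Roll the counts up to resolution 8, the way a coarser aggregate table would.
counts_res8 = Counter()
for cell, n in counts_res9.items():
    counts_res8[cell_to_parent(cell, 8)] += n

for cell, n in sorted(counts_res8.items()):
    print(format(cell, "x"), n)
```

Each coarser level is a group-by on a derived integer key, so a level can be precomputed once and queried without touching the raw points.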
So you can also create a query which just aggregates the data at a particular level, and rather than joining it with a polygon, you can create your own maps aggregating data at various resolutions, right? This will also give you much faster access time when you're creating aggregated maps over a particular region, right?
And again, in Postgres, I'm also using a lot of materialized views to aggregate this data, and refreshing those materialized views. Creating an H3 index for a particular point is really fast, so you don't have to worry about a lot of locks on the database itself. So you can create these multiple aggregation levels in Postgres using materialized views and then query the data directly from the materialized view, which is, again, much faster.
Right, more things to try. These are only the two examples that I showed. I thought this would be much shorter, but there are only five minutes left, so I'm going to hurry through this, right?
So again, I talked about moving to the nearest neighbor. If your use case involves moving around neighbors or working with neighbors, like a machine learning algorithm doing a neighbor analysis, a KNN analysis, it is really good to store that data in this particular format, right? You can actually use H3-indexed data in your machine learning algorithm to do a sort of spatial analysis.
Another thing is edge functions. When moving from one particular hexagon to the next hexagon, you can also have directed edges, and you always know it's going to pass through an edge. That is something that Uber used when their cabs were moving from one region to the next. They knew which hexagon they were going to enter, so whatever analysis they ran on the data, they already had an understanding of which direction the cab was moving from and to.
It can also be used when you take two H3 indexes and you want to know which hexagons you go through between them.
And another thing, again, is indexing, which I mentioned. H3 is represented as a 64-bit int, so you can use your traditional Postgres indexes on your H3 indexes.
And partitioning is also something I mentioned when talking about data in the same vicinity. Going back to resolution, resolution zero represents the entire Earth in 122 cells. So if you want to partition your spatial data into different tables, then this, again, can be used. And it can also be used for smaller regions, like continent-wise or country-wise, if you want to partition your data.
Applications already using H3: Kepler.gl and pydeck, which I just wanted to mention. If you are familiar with these applications, they already use H3 indexes, right?
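Because an H3 cell id is a single 64-bit integer, both ordinary integer indexes and partitioning by resolution-zero cell come down to reading fields out of that integer. The sketch below decodes the header fields according to the published H3 bit layout as I understand it (mode in bits 59 to 62, resolution in bits 52 to 55, base cell in bits 45 to 51); the `partition_name` scheme and the `points_base` prefix are hypothetical, purely for illustration.

```python
def h3_fields(cell):
    """Split a 64-bit H3 cell id into the header fields of its bit layout."""
    return {
        "mode": (cell >> 59) & 0xF,        # 1 means the id refers to a cell
        "resolution": (cell >> 52) & 0xF,  # 0 (coarsest) to 15 (finest)
        "base_cell": (cell >> 45) & 0x7F,  # which of the 122 resolution-0 cells
    }


def partition_name(cell, prefix="points_base"):
    """Hypothetical naming scheme: one table partition per base cell."""
    return f"{prefix}_{h3_fields(cell)['base_cell']:03d}"


# The text form of an id is hexadecimal; the storage form is a plain integer.
cell = int("8928308280fffff", 16)

print(h3_fields(cell))
print(partition_name(cell))
print(cell < 2 ** 63)  # fits in a signed 64-bit integer column
```

Since every descendant of a base cell shares the same base-cell bits, rows for one region of the Earth always land in the same partition regardless of their resolution.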
These are all the sources I used for creating all the slides and all the data that you see here. And that's it from my side.