Building a Geocoder on top of PostGIS - a Field Report
Formal Metadata

Title | Building a Geocoder on top of PostGIS - a Field Report
Number of Parts | 351
License | CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers | 10.5446/69091 (DOI)
Production Year | 2022
Transcript: English (auto-generated)
00:02
Okay. Ah, now it's working. All right, sorry about that. So this is a talk about Nominatim. If you haven't heard the name, you might have used it: if you go on openstreetmap.org and search for something, this is the software behind it.
00:21
So Nominatim does basically three things. One is the searching I just mentioned. It can also do reverse geocoding, so you get some location information about any location on Earth. And what it can also do is: if you're working with OpenStreetMap data and got your data from somewhere else, so you just have the OSM ID, it can also give you address information
00:41
about the place, if it's in the database, if it's somehow interesting for geocoding. Yeah, you might know Nominatim mostly as a service, because it runs on openstreetmap.org, and under nominatim.openstreetmap.org we also allow others to use it. So, for example, QGIS is also using our service.
01:03
It's only for light users, so we have a very strict policy of one request per second; afterwards, you're just kicked out. That's been running since 2012. We are now up to three servers, serving about 1,000 requests per second. Interestingly, mostly more reverse geocoding than search.
01:24
So all your packages, if you want to know where they are: people are just hitting the servers like mad. But Nominatim is also software you can install yourself. You need Linux; Windows doesn't work yet. And you can really do it on your laptop if you just want to do geocoding for a city or a small country.
01:42
You can have it on a big server if you want to get around our limits. And if you have your own installation, it's highly customizable. So, a little bit about me: I'm Sarah Hoffmann. I've been an active member of the OpenStreetMap community for a long time, since 2008.
02:01
In 2012, I took over the administration of the Nominatim service, and that's when I started to do more development there. So nowadays I'm the maintainer of Nominatim, of osm2pgsql as well, because Nominatim is using it, and of pyosmium, which is a Python frontend for libosmium,
02:20
with which you can read OSM data. And for two years now I've been doing this full-time; before, it was just a hobby. I'm trying to really raise money to do full-time development on Nominatim to move it forward. So let's look under the hood a little bit.
02:41
How do you do geocoding? Much of this is basically general geocoding, a how-does-it-work thing. A geocoder always has two parts: one part which pre-processes the data and puts it into some format, some geocoding tables which you can use, and the query part. Starting with the import: Nominatim
03:01
is using Postgres for the storage. It started as a proof of concept, and it turned out that's working really well, so we've basically never looked back on this. What we're doing is taking OSM data; Nominatim is purely OSM-based. All the data that's in there we can find, everything else less so.
03:22
The import is done by osm2pgsql, which you might have used to create raster tiles from OpenStreetMap data. It handles updates, it handles all the parsing and everything, so we don't need to care about this, which is nice. And then the interesting part starts: from the pure OSM data, create the geocoding tables.
03:44
For a long time this was simply PL/pgSQL, so really SQL transformations with triggers and everything. And last year I started to add a little bit of Python; especially if you want to do pre-processing on names, it's much nicer than writing SQL.
04:02
So what does the database layout look like? You start with a table of places: all the places you want to find, with all the information you want to give back to your users. So we have the object type, the different names, the geometry, extra data, whatever was in there.
04:22
The next thing you need is a table where you can do the search, and that's really the heart of it all. The search table has two important columns. One has all the names in there, so any name you can search for this place under: it's basically a big array, and you have it in there.
04:42
And you have the same for the address terms, the terms you add to your query to say: it's the Florence in Italy, not the Florence in, I'm sure there's one in the US, there always is. And that's the heart of it: basically two arrays of names you want to find. And then you put an index over this,
05:01
and suddenly you have your inverted index. It's the same thing that Elasticsearch and so on are using, so for any name you have, you can find the places it belongs to. In addition, there are extra columns for when you want to do filtering; we have this for geometry, for countries, and a couple of other things.
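The two-array idea can be sketched in a few lines of Python, with plain dicts standing in for the search table and its indexes. All place names, IDs, and column names here are made up for illustration:

```python
# Toy model of the search table: one row per place, with arrays of
# name terms and address terms, plus the inverted indexes a GIN index
# would provide. All data here is hypothetical.
from collections import defaultdict

search_rows = [
    {"place_id": 1, "name": ["florence", "firenze"], "address": ["italy", "tuscany"]},
    {"place_id": 2, "name": ["florence"], "address": ["usa", "south carolina"]},
]

# Inverted indexes: term -> set of place_ids containing that term.
name_index, addr_index = defaultdict(set), defaultdict(set)
for row in search_rows:
    for term in row["name"]:
        name_index[term].add(row["place_id"])
    for term in row["address"]:
        addr_index[term].add(row["place_id"])

# "Florence in Italy": intersect the name hits with the address hits.
hits = name_index["florence"] & addr_index["italy"]
print(hits)  # -> {1}
```

This is exactly the lookup shape a GIN index over two array columns gives you: one probe per term, then a set intersection.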
05:21
The thing is, if you have this GIN index over the arrays, doing this over strings is a little bit inefficient. So what's additionally done is that we have a word table where any string you can search for is converted into an ID, into an integer, and that makes the whole lookup efficient.
05:40
There's an additional advantage to this: the word table also has extra information about the terms. So just by looking into the word table we know: is this a house number, is this a postcode, is it just a word, is it a partial name, or something like this? And this, as we will see later, makes parsing more interesting. Now, I've skipped over these address terms,
06:00
where I said, yeah, you have this additional information you can search by. And this is the hard work of the import, because if you look at the OSM data, it's not there directly at the object. Sometimes you find some address information, but mostly it's not there. So what Nominatim does is find the nearby OSM objects which are interesting for search,
06:22
and put them together. If you know about OSM data: that's mostly place nodes and boundaries which we're using here. And this is the part where Postgres is really, really helping, because this is heavy lifting with geometric functions, finding the places and everything. In addition, we then do some magic treatment
06:43
to put these places we found into a hierarchy, and also to kind of determine what the official address could be, because that's what people want: they want to know the address with the street, with the house number and everything. And that's really where the magic comes in. I'm not going to talk about this.
07:01
If you're interested, there was a talk at State of the Map last year about it. You have to know a lot about OSM to get this right. If you want to know what Nominatim does, on the interface on nominatim.openstreetmap.org you can actually have a look into the data which is in the database.
07:21
Click on Details, and there you see it: there is a field called Address, and you have the table. Okay, these are the OSM objects which make up the address. And the biggest advantage of using the OSM objects themselves for the address is that we get all the additional information. We get all the translations, in every language
07:42
that is in there. So when you do your search, you can mix languages, you can use exotic languages or whatever, and it will work. Okay, so this is roughly the structure of the data and how we import it. Now let's look at the query part.
08:01
The query engine at the moment is a huge blob of PHP, so you can basically serve it via the web. This is something I want to change: I want to go to Python here as well. For one thing, that's the little piece that is still missing, so then it will work on Windows if you want to. We also get a lot of language processing
08:23
libraries, so hopefully the pre-processing of incoming queries gets better. And what I think is really interesting, because we have a lot of batch geocoding going on: you can then just use it from the command line with Python, and there's a library. So yeah, it still needs some funding, but I think this is going to come next.
08:43
Searching, how does it work? Let's take this example: we have "Piazza Novella" in "Firenze". That's the query that comes in. And of course, Nominatim plays rather dumb. It doesn't know what a Piazza is or whatever. It just takes the string and splits it at the spaces
09:02
in all the combinations there are. So you have different combinations, and that's the first step. And then we have our word table, where we can look up whether the strings we have just split up actually exist in the database. That means some of the variants already go away,
09:21
because we don't have a "Novella Firenze". It's not in the database, so yeah, forget about this. From the ones which work out, we kind of create a machine-readable version of the query. So we know the IDs, and we know: okay, this "Piazza", this is a name.
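The split-and-filter step described above can be sketched like this; the word table contents are made up, and the real tokenizer does considerably more normalization:

```python
# Split a query at the spaces in every possible way, then drop the
# variants containing phrases that are not in the (hypothetical)
# word table.
def phrase_variants(tokens):
    """Every way to group the token list into consecutive phrases."""
    n = len(tokens)
    results = []
    # Each subset of the n-1 gaps between tokens is one way to split.
    for mask in range(2 ** (n - 1)):
        variant, start = [], 0
        for gap in range(n - 1):
            if mask & (1 << gap):
                variant.append(" ".join(tokens[start:gap + 1]))
                start = gap + 1
        variant.append(" ".join(tokens[start:]))
        results.append(variant)
    return results

tokens = "piazza novella firenze".split()
variants = phrase_variants(tokens)  # 4 variants for 3 tokens

# Hypothetical word table: only phrases found here can match anything.
known = {"piazza novella", "firenze", "piazza"}
surviving = [v for v in variants if all(p in known for p in v)]
print(surviving)  # -> [['piazza novella', 'firenze']]
```

Because "novella firenze" and the bare "novella" are unknown, only one interpretation survives, and that is the one handed on to the structured-query stage.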
09:42
If it's a name, it might appear in the name or in the address. So we have a structured way of expressing the query. And again here we can do some filtering and some ranking, because we know a little bit about the words. For example, a house number never appears close to a country.
10:02
That's a stupid thing to do, so we can just throw such a variant away or rank it very low. And this gives us a ranked list of interpretations which are basically already SQL; we only need to translate them very dumbly into SQL. So we go to the search table and we say: look for these names, for these address terms,
10:21
and just give me the results. Once we have the results, we go to the place table and ask for the full information about the place. So now we have a little bit more information than we find in the search table. This can again be compared with the original query: is it really the same script, for example, does this match up well?
10:42
Then you do some refiltering and re-ranking, and that's the result the user gets in the end. So this is the basic way the query works. It's also the basic way geocoders in general work.
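The plausibility filtering over structured interpretations can be sketched as a simple rule check. The token kinds and the single rule here are illustrative; the real ranking uses many more signals:

```python
# Toy plausibility check over a structured query interpretation:
# each phrase carries a token kind from the word table.
def plausible(variant):
    """Reject interpretations like a house number right next to a country."""
    kinds = [kind for _, kind in variant]
    for a, b in zip(kinds, kinds[1:]):
        if {a, b} == {"housenumber", "country"}:
            return False
    return True

good = [("10", "housenumber"), ("main street", "name"), ("firenze", "city")]
bad = [("10", "housenumber"), ("italy", "country")]
print(plausible(good), plausible(bad))  # -> True False
```

Interpretations failing such checks are dropped or ranked low before any SQL is ever run, which keeps the number of database queries per request small.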
11:01
The interesting question is: how do you scale this up to the planet? I think we are at 250 million places now in the place table, so that's a lot of data you have to handle, especially for the search table. We're talking really gigabytes of data in the indexes themselves, so how do you scale this up?
11:21
The three basic things I've learned are: if you have to do geometry, keep your indexes small. Try to split things up. Postgres has done a lot of good stuff, and with PostGIS 3 it's getting really, really fast, but it's still the bottleneck. The other thing to avoid is queries where Postgres starts using multiple indexes.
11:42
That is also very expensive, especially when you have very short-running queries, and every query a geocoder does is short-running. The same problem exists with JIT and parallel processing: the query planner of Postgres really, really likes to use these, but they are very good
12:03
for big queries, not for the small queries we have. So in the end we just ended up switching that off in the database as much as possible. Okay, let's look at some shortcuts we've taken to optimize the whole thing. The most important shortcut is house numbers.
12:22
Two-thirds of the search index is basically house numbers, or their places. If you want to have a search table entry for each of the house numbers, it gets really big. So there's a simple shortcut where you say: we don't search for the house number, because every house number is attached to a street.
12:42
We search for the street, and then from the street go to the house number. That's the theory. In practice you realize: oh wait, not every house number has a street. There are house numbers where, in a small village, the houses are just consecutively numbered, and so the house number is attached to the village.
13:00
It's solvable, and we've solved it, but it's something to think about. Or, especially in Italy, there's this interesting thing where, if a street goes along a boundary between two communities, they give the street different names. I have no idea why they're doing that, and I'd rather fix the government than the database, but yeah, that's not possible. So these are the little pitfalls you have to work around.
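The parent relationship, including the village fallback, can be sketched as a simple lookup. All IDs and names here are invented:

```python
# Sketch: every house number hangs off a parent object. Usually that
# parent is a street, but in a village without street names it can be
# the village's place node. All data here is made up.
parents = {
    # house number place_id -> parent place_id
    501: 42,   # "10 Main Street" -> the street "Main Street"
    502: 77,   # "7" in a village with consecutive numbering -> village
}
places = {42: "Main Street (highway)", 77: "Kleindorf (village)"}

def parent_of(housenumber_id):
    """Resolve a house number to the object it is searched through."""
    return places[parents[housenumber_id]]

print(parent_of(502))  # -> Kleindorf (village)
```

The search only ever has to index the parents; the house numbers are reached through them.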
13:25
So then you do the search. As I said, it's really simple. You have something like "10 Main Street" in Crockwright. First search for Main Street in Crockwright and get some results back; in your place table, all the house numbers say: that's the parent street I have. So you just do a join and look in your results
13:42
for where there's a house number 10, and you're done. Easy peasy, unless somebody just searches for "10 Main Street". Now you suddenly have to search for Main Street first, and there are 40,000 objects with "Main Street" in the OSM database. And doing these joins on 40,000 objects
14:02
is really, really slow. That's where covering indexes, indexes with an INCLUDE clause, came to the rescue. They exist since Postgres 11. Basically, you put the house number into the index, and that means you don't have to go to the place table to find the house numbers
14:22
which are attached to a street. So you could basically say this is a dynamic view of which street has which house numbers. And this is working pretty well. Main Street is still difficult, because there are really lots of them, but for less problematic examples it's working very well.
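What the covering index buys can be modelled with a dict whose entries carry the house numbers alongside the street, so no second lookup is needed. The data is invented:

```python
# Toy model of a covering (INCLUDE) index: the index entry for a
# street name carries the house numbers on that street, so matching
# "10 Main Street" never touches the place table. Data is made up.
street_index = {
    # street name -> list of (street_id, house numbers on that street)
    "main street": [(42, {"8", "10", "12"}), (43, {"1", "3"})],
}

def find_house(street_name, number):
    """Streets carrying the wanted house number, straight from the index."""
    return [sid for sid, numbers in street_index.get(street_name, [])
            if number in numbers]

print(find_house("main street", "10"))  # -> [42]
```

With 40,000 candidate streets, skipping the per-street heap fetch is exactly what makes the difference.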
14:47
Then I said: keep your geometry indexes small. I was talking about making the address out of OSM objects; that's the part where we really have to do geometric operations. And that's the part where you have to start
15:01
splitting up your database; you can't just go to the huge place table and find your objects. So there are a couple of things we have done there. First of all, there are extra tables for the objects which are interesting for addresses: of the 250 million, there are, I don't know, 10 million or so which are interesting.
15:21
Put these into extra tables, and your index gets smaller. The second thing we've done is split up the huge geometries. We are cutting them by grids, which really helps the index: you don't have the huge bounding boxes when you have a country which stretches out like this
15:41
and gets in the way. And the third thing we've done is country partitioning, so we have extra tables for each country. If you look into the Nominatim database, you might have seen these 500 tables we are having. This was already done when we started this thing with Postgres 8.4, and that's why we're still doing it
16:02
by hand. I tried to do it with the partitioning in Postgres 11, but it was still too slow for our 250 tables then; it might work now. So yeah, we have this crazy code which basically expands into a huge if statement: if you're in this partition, then do the thing on this table, and so on.
16:21
But it works pretty well; that's been doing okay through all the years. Another thing I wanted to mention is manual query planning. I mentioned you shouldn't use two indexes, and this really hits us with the name column and the address column. Both have a GIN index.
16:41
GIN indexes can also be problematic because the statistics over them are not the best. Generally, what you want to do is: if you have a name which is rather rare, you just want to query the name column and not the address column, or at least not the index of the address column.
17:02
However, if the name is very frequent, if you have something like, I don't know, Starbucks, there are a lot of results, and you really want to include the index on the address to get the results faster. So what we did: we ended up doing manual counting of terms. We have the word table, so that's somewhere we can just put this in.
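The idea can be sketched as a frequency-based switch when building the SQL. The counts, the threshold, and the column names (`name_vector`, `addr_vector`) are all invented for illustration:

```python
# Sketch of manual query planning: look up how frequent the name term
# is (counts stored in the word table), and only bring the address
# condition, and thus its index, into the query for frequent names.
word_counts = {"starbucks": 41000, "piazza novella": 3}  # hypothetical
FREQUENT = 10000  # hypothetical cut-off

def build_where(name_term):
    if word_counts.get(name_term, 0) > FREQUENT:
        # Frequent name: let the planner use both GIN indexes.
        return "name_vector @> %(name)s AND addr_vector @> %(addr)s"
    # Rare name: query only via the name index, filter addresses later.
    return "name_vector @> %(name)s"

print(build_where("starbucks"))
```

In effect, the application does the selectivity estimation that GIN statistics cannot provide, and steers the planner by shaping the query text.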
17:23
Basically, we look in the word table to see how often the term we are looking for appears in the database, and then change the query for the address part so that the query planner either uses the index or doesn't, which is, yeah, basically a workaround. So if I had one wish for Postgres,
17:41
it would really be: can I say which index to use, please? That's about optimizing. And I want to say some final words on a question I get really frequently: why don't you use Elasticsearch? That's a text search engine, and geocoding is text search.
18:03
And the reason is: geocoding is not the same as searching in normal text. First of all, we're exclusively dealing with proper names, and proper names are complicated. They are spelled in the wrong way, they sometimes use 300-year-old spelling, they have no grammar which fits the language.
18:24
The other thing is, we're dealing with mixed languages. I don't know what language is coming in, so you can't really use the per-language optimizations like stemming or something like that. Yeah, and the spelling rules I already mentioned. So it's not like you can use Elasticsearch
18:40
out of the box here, with the tokenizers they have and everything. The other thing is the grammar of geocoding: a geocoding request is really special. First of all, we get very, very different queries. Sometimes people just put in a name and expect that, yeah, we know what it is about.
19:02
Sometimes you have queries which are almost like a Google search: oh, give me a supermarket, next to the station, please. And then we have the people who just dump lists of addresses in there in batches, which again have a very different structure. So the thing is, Elasticsearch is cool,
19:22
but the important part is how you parse your query. And if you want to get it right with Elasticsearch, it's the same thing: you have to do the parsing yourself, you have to do it manually. And I'm speaking from experience here. There's a second geocoding product called Photon, which is basically an Elasticsearch frontend
19:43
for Nominatim. It lets Nominatim do the heavy lifting of processing the addresses and finding out all about that, and then dumps everything into an Elasticsearch database. The nice thing you get from this, where Elasticsearch really is better,
20:00
is fuzzy searching and the spelling suggestions for places. I think it's possible to do this with Postgres as well, especially spelling correction. We do have the word table, so we have a base where we can look up which words could be alternatives.
20:21
The problem is a little bit that we're dealing with terms in all languages, so if you do a Levenshtein distance with just one editing error, it's really easy to end up in a completely different place and a completely different language. That's the difficult part there.
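The one-edit problem is easy to demonstrate with a small Levenshtein implementation; the word pair below is just one illustration of how close distinct place names can be:

```python
# Why naive fuzzy matching is risky for place names: a single edit is
# enough to jump to a different, equally valid name. Toy Levenshtein
# distance via the standard dynamic-programming recurrence.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

# "Vienna" (Austria) and "Vienne" (a town in France) are one edit apart.
print(levenshtein("vienna", "vienne"))  # -> 1
```

A correction engine restricted to edit distance one would happily turn a typo for one of these into the other, on a different side of Europe.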
20:41
With suggestions, the problem is that here we are hitting the limits of GIN indexes: they're really not fast enough for the very quick lookups you have to do for suggestions. And that's it from me. If you're interested, there's a website, nominatim.org. I also want to thank OpenCage, GraphHopper, and Komoot.
21:04
They've been supporting me over the last two years in continuing the development. And if you're also interested in helping Nominatim get funded and developed, yeah, talk to me, please. And now there are questions. Thank you.