
Automatic Improvement of Point-of-Interest Tags For Openstreetmap Data


Formal Metadata

Title
Automatic Improvement of Point-of-Interest Tags For Openstreetmap Data
Title of Series
Number of Parts
183
Author
License
CC Attribution - NonCommercial - ShareAlike 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this
Identifiers
Publisher
Release Date
Language
Producer
Production Year
2015
Production Place
Seoul, South Korea

Content Metadata

Subject Area
Genre
Abstract
Geo-search engines and location-based services allow users to query for points of interest (POIs) in a certain region or near the user's current location. Search queries often ask for classes ('hotels New York', 'supermarket Berlin', 'Italian restaurant London') rather than single points ('Hotel Belvedere New York'). In OpenStreetMap (OSM), one can specify the basic class along with every POI, e.g. via the amenity tag (amenity=fast_food), via direct tags (shop=supermarket) or via several other specialized tags, such as the cuisine tag for restaurants. These tags are mandatory for a POI to show up among the search results for a class-based query. Moreover, they are useful for categorizing search results, e.g. searching for 'Venice beach' should inform the user that there are beaches, hotels, fitness studios and clothing stores with that name. Unfortunately, in OSM there are plenty of POIs for which the class is not provided. But many of those POIs have a name tag ('Sunset Hotel', 'Wal Mart') which already contains some information about the respective class. In this paper, we investigate methods for automatic extrapolation of class, amenity and specialized tags based solely on POI names. For example, 'Pizzaria Bella Italia' almost certainly indicates an Italian restaurant, while 'Tapas Bar' indicates Spanish food. We use machine learning tools to extract, for many amenities, typical words and phrases that occur in the associated name tags, and learn respective POI classifiers. For example, learning indicators for 'shop=hairdresser' on German OSM tags led to high scores for 'fris', 'cut', 'hair' and 'haar'. While 'studio' and 'design' also appeared in many name tags, they are not suitable for distinguishing between 'shop=hairdresser' and 'shop=beauty', with the latter including nail spas. For other kinds of POIs, such as supermarkets or gas stations, names of large chains ('ALDI', 'Aral') showed up as typical indicators.
We show empirically that, with the help of our learned classifiers, tags for POIs with unknown class can be extrapolated with high accuracy. For example, amongst all hairdressers, 8% were untagged but could be identified by our approach.
Transcript: English (auto-generated)
Okay, so our talk is about automatic improvement of point-of-interest tags for OpenStreetMap data. A point of interest in OpenStreetMap data is more or less just a node, normally with geo-information like latitude and longitude; I just omitted it here. And then you have something like the name. I won't pronounce it right now.
And then there are many additional tags you could add. For example, here we have amenity=restaurant. These tags are really important if you have something like a search engine and you query 'restaurants in Seoul': if this tag is not present, you won't find the place. Okay, and there are other things like the cuisine tag. This is Korean food, I guess.
And so if you search 'Korean restaurants in Seoul' and the cuisine tag is not present, the place will not show up. Okay, so it's really important to have all of those tags. But unfortunately, many people just don't create those tags. So people go around and see, okay, there's a restaurant, it's not in the OpenStreetMap data, so I want to add it right now.
But very often just a name is provided, or some arbitrary information like the address and maybe the opening hours, and very often the amenity or the cuisine or some other really important tags are missing. Our basic idea, or basic question, was: if you just have the name, maybe we already have some information about these other tags, and maybe we can find them automatically once we look at the name.
And to illustrate it a little more, we play a little tag-guessing game. I give you the name of a restaurant, so I already tell you it's a restaurant, and you just guess the cuisine, okay? For example, if you have Izumi Sushi Bar, everybody just knows, well, it seems to be sushi and Japanese.
So for the cuisine tag, we have kind of two types of cuisine. We want the type of food, like sushi, pancakes, and so on. And we also have the ethnicity of the food, like Japanese, German, Korean, and so on and so on. Okay, so these are the typical cuisine tags. They can have both, they can have only one.
So, next, a tapas bar, I guess, okay? The food will be tapas, and the ethnicity will most likely be Spanish. Okay, then a very easy one: this seems to be pizza and Italian. Up to now, the name of the food was always contained in the name; now it is not. Take a name like this, 50's Diner.
What do we think is served there? Well, it's burgers and it's American. I think everybody knows that, because the name carries some information even if the name of the food is not really in there. Likewise Subway: everybody knows the brand, so we know it's sandwiches and it's American, even though the name doesn't tell us anything about the food, right? And we have Chow's Wok.
Okay, we don't know it either, but we know, okay, this sounds really Chinese, and 'wok' sounds really Chinese too, so that's that. Then Fresh Fish Buns: no matter where it is, it seems to serve seafood. And then, yeah, Mykonos Restaurant. Well, this is Greek. Why? Because we know Mykonos is in Greece, right? And the same thing here with the Taj Mahal, okay?
It's really unlikely that some German guy names his restaurant Taj Mahal. Okay, so that's the idea. What we hope is that these clues we get from the name can be identified automatically with a machine learning approach, so that we don't have to do it manually for our list but can do it for thousands of points of interest at the same time.
Okay, so the question is, what kind of tags can we hope to extrapolate? We always use this cuisine thing as an example here, but it works for many other kinds of tags too; it's just an illustrating example, okay? So we have the OSM wiki, which tells us what reasonable tags are,
within the parameters, of course there might be new kinds of special food, and there might be new tags, and of course it's a dynamic thing, but these are, so this is just a cut out of course, but these are suggestions what you should use, and then we had a look at a data set of Germany, so we only looked at Germany, and we considered tags which occurred at least 200 times,
because if you want to learn something, you have to have some information already there, so if some tags appear only once or twice or very few times, we don't think we can learn from that, and interestingly there are about 1,500 cuisine classes in the Germany data set, and most of them were not really in the suggestions,
so we only used 25 of them, we didn't consider for example homemade cake, this is much too specific, you should use cake or dessert there, and there were sometimes cuisine tags, a page long or something like that, so people really didn't use that, and we have people who want to be really cool, and do something like German bohemian, which is not an ethnicity and not a food type,
and we have lots of people putting in German names, and I think the same thing was when I looked at a data of South Korea, a lot of Korean names, not English, and not names but cuisine, so this is Brugel, you should use regional or German for that, then you have some really strange things like cuisine music,
don't know what it's supposed to be, but it doesn't appear too often, and we have a lot of spelling errors in there, I think one could take care of that, but we didn't put the effort in because normally we had enough of the other tags there, and some were really not clear what it smells like, okay this one we cook, we solve for lots of products now, and then there are some which is really rare,
so it's Italian food, there are not too many places in Germany, so we just omitted that. Okay, so now the question is, how did we learn, so what should the features be? What you see here is just a list of Indian places, so Indian restaurants in Germany, not all of them, just the first ones,
and just looking at this list you might find some indicator words or some indicator phrases. Here, for example, let's see, 'palace' appears at least twice, and we have of course 'Indian' or 'India', which looks kind of good, like here we have India House, India Mango, and here we have Maharani, Maharaja, Maharani,
these all start with this 'Mahara' thing, so maybe just by looking at this list we could come up with indicator phrases like this, okay? More or less. But on the other hand, it's really painful to extract them by hand, and we do not know if they're really good, and of course there is not only this list, but there are thousands of names, so looking at them manually is not a good idea,
and the question is how we can extract them automatically. The first thing we did is, for each of those names, we constructed the k-grams of size 3 up to 10. A k-gram is just a consecutive subsequence of the letters. So if I take Taj Mahal as the example here, we also use the space, so this is a four-letter k-gram,
and these are the remaining ones, and we do this for all of the sizes. Then we count the frequency in this list, of course in the complete list, not only in the part shown there. That gives the information we have here: for example, 'Taj ' appears two times and 'Maha' appears nine times. And then we say, okay, a k-gram is significant if it's contained in at least 2% of the names,
so remember that our list is at least 200 names long; for the short list here this won't make much sense, but if the list is really long, this works fine. Then we make one modification. If you look at this, you have a part of the name that appears 753 times, and you have McDonald's, which appears the same number of times,
and this just means that the shorter phrase is not really significant on its own: it contains no additional information, it is simply always a substring of the longer one. So we throw it away, because otherwise it would just distort the result later on. For example, if you have Burger King, 'burger' will appear more often than 'Burger King', so we keep it, but in this case here, we throw out the substring.
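The extraction steps described above (k-grams of size 3 to 10, the 2% significance threshold, and dropping a substring that occurs exactly as often as a longer phrase) could be sketched roughly as follows. This is only an illustration, not the speakers' code: the function names, the lowercasing of names, and counting each k-gram once per name are assumptions.

```python
from collections import Counter


def kgrams(name, k_min=3, k_max=10):
    """Return all consecutive substrings of length k_min..k_max (spaces included)."""
    text = name.lower()  # assumption: case is ignored
    return {
        text[i:i + k]
        for k in range(k_min, k_max + 1)
        for i in range(len(text) - k + 1)
    }


def significant_kgrams(names, min_share=0.02):
    """Map each significant k-gram to the share of names that contain it."""
    counts = Counter()
    for name in names:
        counts.update(kgrams(name))  # a set, so each name counts once per k-gram

    threshold = min_share * len(names)
    frequent = {g: c for g, c in counts.items() if c >= threshold}

    # Drop a k-gram if a longer frequent k-gram contains it with the same count.
    kept = {
        g: c
        for g, c in frequent.items()
        if not any(g != h and g in h and c == frequent[h] for h in frequent)
    }
    return {g: c / len(names) for g, c in kept.items()}
```

The pairwise substring check is quadratic in the number of frequent k-grams, which is fine for a sketch but would probably need an index for the full data set.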
Okay, so now we have this information: for each class, like Indian, Greek, Italian, Chinese, and so on, everything that's reasonable according to the OSM wiki, we have certain indicator phrases. These are just examples; in our calculations we had thousands of them, not only three or four. And then we have the percentages,
so in how many of the names in the list they appeared. Okay, now we want to use this to learn, and therefore we construct a feature vector for every name. Feature vector construction works as follows: if the phrase is contained in the name, then we use the phrase length, so this would be six here,
and multiply it with the percentage to get this entry, and if it's not contained, like 'Maha' does not appear in Indian Mango, the entry is zero. Okay, so this is very easy. And we see, for example, that for Chinese this k-gram is pretty popular, in about 8% of the names, and it also appears here in the middle of 'Mango', so we also have entries like that.
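As a minimal sketch of the feature vector construction just described: each entry is the phrase length times the phrase's percentage if the phrase occurs in the name, and zero otherwise. The function name, the lowercasing, and the indicator phrases with their percentages are made up for illustration.

```python
def feature_vector(name, indicators):
    """Build one feature per indicator phrase.

    indicators: list of (phrase, share) pairs pooled over all classes,
    kept in a fixed order so every name gets a vector of the same layout.
    """
    text = name.lower()  # assumption: matching ignores case
    return [
        len(phrase) * share if phrase in text else 0.0
        for phrase, share in indicators
    ]


# Hypothetical indicator phrases and shares, not values from the talk's data.
indicators = [("indian", 0.12), ("maha", 0.05), ("ang", 0.08), ("pizzeria", 0.30)]
vector = feature_vector("Indian Mango", indicators)
```

Here 'indian' and 'ang' both match, so the name gets non-zero entries for an Indian and a Chinese indicator, which is the kind of overlap the talk mentions.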
And you can see, if there are thousands of them, you will have entries here and there, but we hope that, like here for this pizzeria thing, most of the entries will show up in the right place. Then we use a random forest to learn, so just a standard machine learning approach; we use the scikit-learn package and apply that,
and we use it in a way that we don't get a classification right away, but a probability distribution over all of the classes. So for example for Indian Mango, it will say, well, maybe 90% sure that it's an Indian place and 10% sure it's a Chinese place, and for the pizzeria maybe it's 100% sure, because nothing else appeared.
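The talk names scikit-learn and a random forest but gives no settings, so the following is only a sketch of how a per-class probability distribution can be obtained with `RandomForestClassifier.predict_proba`. The toy feature vectors, labels, and hyperparameters are assumptions for illustration.

```python
from sklearn.ensemble import RandomForestClassifier

# Toy feature vectors (hypothetical values), one row per restaurant name.
X_train = [
    [0.72, 0.20, 0.00, 0.00],
    [0.72, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.24, 0.00],
    [0.00, 0.00, 0.24, 0.00],
    [0.00, 0.00, 0.00, 2.40],
    [0.00, 0.00, 0.00, 2.40],
]
y_train = ["indian", "indian", "chinese", "chinese", "italian", "italian"]

# n_estimators and random_state are illustrative choices, not from the talk.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)


def class_distribution(features):
    """Return a dict mapping each class to its predicted probability."""
    probabilities = forest.predict_proba([features])[0]
    return dict(zip(forest.classes_, probabilities))


# A name with both an Indian and a Chinese indicator.
distribution = class_distribution([0.72, 0.00, 0.24, 0.00])
```

The probabilities for one name always sum to 1, which is what makes the later thresholding step possible.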
Okay, but we get such a distribution over all of them, which sums up to 100%. And now we evaluated it in two steps. At first we evaluated on the data where we already know the results, so we know the cuisine. Here you see an example, with pizza, sandwich, burger and so on, and you see the correctness numbers,
so this is precision and this is recall, and you see it works pretty well, but there are some typical mix-ups: pizza and ice cream get mixed up a lot, and I think coffee and ice cream, so ice cream is evil, and I think kebab and burger also, because there are a lot of these barbecue places which serve both,
but you see the correctness numbers are fine, not really perfect, but this is because the classifier has to decide for one of those classes, and not every place we put in really was one of those, and there are of course names which don't tell anything. So maybe for some the probability distribution was really close for many of those classes,
but we have to decide for one, and this makes it a bit worse than it could be. So what we did on data where we really did not know the cuisine tag is, we assigned a tag only if the classifier was sure enough, so we used this probability distribution to decide if we can assign a tag or not,
and we said, okay, for an ethnicity tag, we already assign it if the classifier is 75% sure. This is because we think the ethnicity list we use is kind of exhaustive, so all of the places we have are, we think, one of those. But for the food type that's not the case: we have places which don't belong to one of the food types we use,
and so we say, okay, then the probability has to be 100% to assign them. This resulted in over 90,000 new ethnicity cuisine tags, which is quite a lot if you compare it to the number of restaurants without cuisine tags, and we have about 1,500 new food type cuisine tags.
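The assignment rule described here can be written down in a few lines. The two thresholds (75% for ethnicity, 100% for food type) are taken from the talk; the function and constant names are hypothetical.

```python
ETHNICITY_THRESHOLD = 0.75  # from the talk: ethnicity list assumed exhaustive
FOOD_TYPE_THRESHOLD = 1.0   # from the talk: food type list is not exhaustive


def assign_tag(distribution, threshold):
    """Return the most probable class if it reaches the threshold, else None."""
    best_class = max(distribution, key=distribution.get)
    if distribution[best_class] >= threshold:
        return best_class
    return None


# 90% Indian is enough for an ethnicity tag.
ethnicity = assign_tag({"indian": 0.9, "chinese": 0.1}, ETHNICITY_THRESHOLD)
# 90% burger is not enough for a food type tag.
food_type = assign_tag({"burger": 0.9, "kebab": 0.1}, FOOD_TYPE_THRESHOLD)
```

Returning `None` for uncertain names trades coverage for accuracy, which matches the higher accuracy reported for the thresholded assignments.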
And then we googled the restaurant name, or maybe the website was already linked in the OSM data, since there's a tag where you can just put in the address. Then you just look at what kind of food they serve, and we saw it worked pretty well: we have an accuracy of 98%, which is much higher than in our analysis of the data where we already know
the right information, because we only use the predictions if they are above the threshold. Okay, now here you see some examples which were automatically classified by our approach. Some of the names are in German so you won't understand them, but for example this 50's Diner that we had in the beginning was correctly classified as burger,
okay, because I guess there are a lot of places which had 'diner' in them and were tagged with burger and with American and so on, so it really received the correct tag, and there are a lot of other examples. I just want to draw your attention to three examples that did not work. The funny thing is, in German an old castle is called a 'Burg',
and in a certain form it contains the word 'burger', even if no burger is meant, like 'Hamburger'. This is a very common thing and a real problem, because a lot of these places were classified as burger, and they serve German food. And there's another thing: this is called 'Speisekammer', which is just the place in Germany where you store your food, and it contains 'eis', the German word for ice cream, which is also pretty bad,
so maybe for those things you just have to use something else, and I think if you go to other countries and analyse their data, you will have other kinds of typical mix-ups. Maybe one can take care of this with some extra rules or something like that. But especially for this ethnicity thing it worked really, really well,
we did not really find examples that was wrong, so yeah, I can see here, so some are really easy, like Meitai is probably okay, but then you have socialitosus mexican, or this example that we had, the three conoces three, so there seem to be enough places which already have this information contained. Okay, so as I mentioned in the beginning,
we did not only do this for cuisine tags, but for a lot of other tags, like other food-related tags, that is, classifying a place as amenity restaurant, bar, beer garden or cafe, and we assigned 461 new tags. We had to do some tricks here, because these classes are really close together, there's a lot of overlap,
so we first had to learn a classifier between those and those, and then do another step for them. And then we assigned over 4000 new amenity and shop tags, like supermarket, bakery and so on, and about 3400 new tourism and leisure tags. The overall accuracy is about 85%,
which is nice, but not enough that we should do it in an automated way and just put the tags into the OpenStreetMap data, because this might not be the accuracy we aim for; we want to have everything correct, of course. So we think that one useful thing we could do with our approach is to integrate it in a dialogue system, and as soon as somebody creates a new tag or a new node,
we could just use this thing to make useful suggestions. If somebody puts in the name Walmart, our approach would say, oh well, this is a supermarket, would you please add this tag, and if it's not, people would just say no, this is not a supermarket, this is my restaurant, and I just named it Walmart. And then it would work fine.
And then we could do a lot of things: because we have this probability distribution, maybe two classes would be really likely, and then we could just suggest both, and the user selects one. It would be much better, because often people don't even know which tags would be necessary to have complete information for a certain point.
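A suggestion step of the kind proposed here might look like the following sketch, which offers the most likely classes from the probability distribution instead of committing to one. The function name, the cut-off of 0.3, and the limit of two suggestions are assumptions; the talk proposes the idea but gives no parameters.

```python
def suggest_tags(distribution, min_probability=0.3, max_suggestions=2):
    """Return up to max_suggestions tags, most likely first, above a cut-off."""
    ranked = sorted(distribution.items(), key=lambda item: item[1], reverse=True)
    return [tag for tag, p in ranked[:max_suggestions] if p >= min_probability]


# Hypothetical distribution for a newly created node with an ambiguous name.
suggestions = suggest_tags({
    "shop=supermarket": 0.55,
    "amenity=restaurant": 0.40,
    "amenity=bar": 0.05,
})
```

The user would then confirm one suggestion or reject both, so a wrong guess never reaches the map data without a human decision.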
If you have certain suggestions, we think there's a stronger incentive for people to put information there. Okay, so conclusion: we see that a large portion of the POI names already contain information, about 60 to 80%, which is quite a lot, and I would just assume it works for other countries as well,
but of course we should check. To improve the quality, we could consider additional tags besides the name tag. Opening hours, for example, could be really good to distinguish restaurants and bars, because bars tend to be open later and so on. And then there's the brand tag, which might be useful. And also there are these free text tags,
the note or description tags, where people put all kinds of information. I guess it's much harder to parse them to get interesting information, but it might be worth it to distinguish better between certain classes. And, yeah, well, that's what I wanted to tell you, so thank you very much for your attention. And if you ever see places like this,
tell the owner it's not cool. Thank you, Sabina. Are there any questions? We wanted a microphone. I wanted to ask you something.
You said something about new tags. Did you add them somewhere, or are these just new tags that you discovered? Yes, so we did not add them in an automated way. I mean, there are these guidelines on how you should do automated edits of OpenStreetMap data, but it's really not recommended, so we think it would be better to have it in some,
even in an overlay or, yeah, some additional thing, because we don't want to add wrong data. I mean, the error percentage for these cuisine tags was really, really small. I think the normal error rate might be even larger, but still, we don't want to upload wrong data, so we just have it on the side. But, yeah, I think if the community would check, I mean, manually checking all of those things, of course, is not fun,
but if the community could just check it, so we would just have an overlay with all the suggestions, and people could say, oh, yes, this is correct. I think this would improve the data quality. Have you thought about adding a human computation step in to just weed out those wrong ones?
Yeah, well, so, yeah, somebody should look at them, definitely, so. I like, there's just one paper that springs to mind that covers something that seems like it has quite a nice symmetry to it, but, I can help you out. At the conference? No, no, just a bit over here. I can just, yeah, you have the list out there, and then you can all go and have a look.
Have you tried emailing to the German OSM list? No. Maybe you could make some kind of game or something like that, and people score points. Yeah, I think there could be nice incentives to do that, but, yeah,
yeah, do it on the exercise sheets, and give points for that, yeah. Yeah, possibly in Europe, we did a game where we were looking at satellite classification. Oh, okay. Yeah, so then you take them to say whether the classification, the two, which the two classifications was right. Huh. And it got quite competitive, but you couldn't believe it's able.
Okay, we'll think about that. Any other questions? Then I would like to, okay, well, sorry. So the search class, the search terms, the minimum of three characters, right? Yes. And some of them were four. Did you decide on this number,
or was it, how is it based on three or more characters? Like for instance, if you like Yes. So, yeah, so it depends on how often it occurs in the list, so if 'piz' occurs more often than 'pizz',
then 'piz' and 'pizz' will both be in the list. If there's a substring, and it occurs exactly the same number of times as some longer string, we just dismiss the shorter one. But otherwise, we just keep all of them. So if 'pizz' appears more often than 'pizza', the whole word,
we just keep both. So we keep all of them to have the full information. Because this is a common thing, that you have just, yeah, the same prefix, but then the end is completely different, and so, yeah, we just keep all of them. No, we just filter to the maximum,
but only if they have the same frequency. So then we filter to the longest string which contains the information. Oh, is it not what you're asking? No, it's not. It's not the number of letters. Ah, okay. You choose the numbers. No, I mean, for a certain indicator phrase, we have a fixed length, namely the number of letters.
So for 'pizz', it's four. I mean, there's nothing to choose. But we just use it as a multiplier, because we think it's important if a phrase occurs very often in the list, and it's also important if it's a long match. So, yeah, this is the idea, so this is why we multiply. Is your question about some of them looking like they're three? If you go back,
I think it's because of that space you explained. Ah, okay. Yeah, I think this was also in the paper, and people got confused by that. So this thing here, because you think it's three letters? No, it's four letters because there's this space, so, yeah, so,
yeah, they're four characters, right? Why did you choose three as your lowest count? If you use two, I mean, you get a lot of, I was thinking maybe four or five would be. Yeah, but there were some places where you really have, like, this Taj thing or so. This is like one. So we really saw, especially for those Asian places, they have really short names, and if you start with four or so,
some places do not even show up; we have places which are just named Lee, or something like that. Going even lower is difficult. And then with this filter method, for the ones that really have long, significant phrases, normally the short ones were thrown out. But, yeah, we did a lot of testing with different trade-offs, and then we came up with this.
But, yeah, there might be better ways. Maybe you should learn it also, you may have to start from.