
Mapping Words and Phrases from Geographic Knowledge on the Web


Automated media analysis

Sorry, I guess we'll get started now. My name is Ben Adams, I'm from the Centre for eResearch at the University of Auckland, and today I'm going to talk about mapping words and phrases from geographic knowledge on the web. I'd like to start with this quote that I have at the bottom of the screen here. This is by a geographer, Nicholas Entrikin, who has done a lot of work on the study of place, particularly looking at the intersection between how we can represent a place in a computing system and, in comparison to that, how we as humans think about places. He talks about the different quantitative ways in which we can represent places, but then he also makes the point that, in the context of places, narrative understanding is, quote, "a distinct form of knowing that derives from the description of experience in terms of a synthesis of heterogeneous phenomena". What I'm going to be talking about today is looking at these kinds of narratives, the things people write about places, and trying to find a way we can represent that and interact with information about it. So, to motivate
this a little bit more concretely: when we represent places in a GIS, generally speaking we have different ways we could do that. We can represent a place's spatial footprint, or we can represent various features of the place. So when it comes to asking queries about places, like, say, "Are the cities of Zagreb and Portland similar?", we could look at various attributes those cities have. I happened to pick some attributes that make these places seem very similar. But if you ask this question to somebody on the street who had maybe been to these places, they're not necessarily going to look at it in these terms. They might look at it in terms
of the fact that Zagreb has a medieval history, is predominantly Catholic, and is the capital city and the center of Croatian industry. Meanwhile, Portland is a very different place: it has craft breweries and some mountains, and it's a [unintelligible] known for indie rock and hipsters. Actually, a funny thing: when I was looking for pictures of Portland on Google Images, it suggested "hipster" as an additional search term. So the question then
is: can we in some way operationalize place knowledge so that we can interact with it in a GIS or on some web mapping platform, and do that from the natural language descriptions that are out there? These could be any kind of thing: newspaper articles, travel blogs, encyclopedia entries. And can we then create some interactive maps to explore them?
Another way we can think about this problem is this: because of our experience of being in lots of places in the world and of reading about places, we can read a description of a place, based just on what's there, and have a pretty good sense of where that place is. So the question is whether a computer could do the same sort of thing. I won't read this whole text, but this is a travel blog entry for a place, where someone talks about going to the hills around town and passing lots of local houses and shops full of interesting fruit. Further on down they talk about rummaging around, finding a large machete and chopping some bananas, and also about drinking rum under a tree and playing dominoes. What I want to get at is to be able to have a computer look at this. There's no place name in here. It turns out this is actually Montego Bay in
Jamaica. So we create a prediction of where this text is about, and we can actually do that if we have enough descriptions out there, because there are regularities in the language that people use about places. This is a real result that we got: the spike is not actually in Jamaica, but it's close.
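To make the idea of predicting a location from word regularities concrete, here is a minimal sketch. It is not the method used in the talk: it assumes a simple multinomial naive Bayes classifier over a handful of coarse regions, and the training sentences, region names, and function names are all made up for illustration.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split on non-letters; a deliberately crude tokenizer."""
    return "".join(c.lower() if c.isalpha() else " " for c in text).split()

def train(labelled_texts):
    """Count words per region from (region, text) pairs."""
    counts = defaultdict(Counter)
    for region, text in labelled_texts:
        counts[region].update(tokenize(text))
    return counts

def predict(counts, text, alpha=1.0):
    """Rank regions by smoothed log-likelihood of the text (uniform prior)."""
    vocab = {w for c in counts.values() for w in c}
    words = [w for w in tokenize(text) if w in vocab]
    scores = {}
    for region, c in counts.items():
        total = sum(c.values()) + alpha * len(vocab)
        scores[region] = sum(math.log((c[w] + alpha) / total) for w in words)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical training descriptions, invented for this example.
TRAINING = [
    ("Caribbean", "we drank rum on the beach and ate bananas and mango under palm trees"),
    ("Caribbean", "reggae music, dominoes in the shade, fresh fruit and rum punch"),
    ("Alps", "we hiked to the summit through snow and stayed in a mountain hut"),
    ("Alps", "skiing, glacier views, cheese fondue and a cable car to the summit"),
]

model = train(TRAINING)
ranking = predict(model, "chopped some bananas, then drank rum under a tree playing dominoes")
print(ranking)
```

With real data the regions would be many small cells rather than two labels, and the output would be a surface of scores over the Earth rather than a ranked list.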
For the rest of the talk, the basic idea is essentially that we want to move beyond the highly structured data that we have in a GIS and try to explore this heterogeneous geographic information that we have on the web. In particular I want to focus on natural language descriptions. You could look at other kinds of information, like images; there are people in computer vision already doing really interesting work on that, where they try to find characteristics of places based on just photographs. But, tying back to that quote from Nicholas Entrikin, I want to build these representations from these human syntheses of experience, and in particular from many different syntheses of experience. I want to focus on two applications. The first is an application for finding similar places, so this is the original problem: is Zagreb like Portland? The other one, which ties in a little bit more with the technologies talked about at this conference, is an interactive thematic map application for ad hoc search. Ad hoc search is just a kind of keyword search that is completely unconstrained, like a Google search, and I'll show how we create an interactive map platform for that. So, going into the
details a little bit: for the first problem, how we might find similar places, we've done some work on mining topics out of lots and lots of descriptions of places. The basic idea is as follows. Just as we saw with that travel blog entry, different words and phrases are going to be indicative of certain places and certain topics. You could define "topic" in different ways; in the interpretation I'm using, topics are just collections of words that are semantically related to one another. For example, in italics below there's "industry, factory, company, manufacturing", so you get a general gist of it. These kinds of topics are going to be spatially heterogeneous. What does this mean? It just means that they're going to cluster in certain areas and not be present in others; they're not uniform spatially around the world. Of course, how these topics manifest over space depends on the documents you are looking at. So individual topics are going to be represented differently around the world, and an individual place will have what I call a characteristic topic signature. This just means that there are certain kinds of topics that are characteristic for a particular place. The map below shows, based on some travel blog entries, this topic about industry. I'll talk in a moment about how we come up with that topic and how we get it. If you map this topic out based on the georeferences of the documents, it looks something like this, and you see that it highlights certain areas where there's more industrial language. So how do
we get these topics? In the work we've done we've used latent Dirichlet allocation, which is a type of text mining that takes large numbers of documents and, in an unsupervised manner, finds groupings of words that tend to co-occur in a document. This is just a word cloud representation of what those topics look like. You will have topics that have to do with, for example, deserts, and other ones about hiking and climbing, or about wine and bottles. These topics are not set up ahead of time; the topic modeling derives the topics and finds interesting topics out of the documents. Furthermore, it will characterize each one of those documents as some kind of distribution over the topics. This is a really well-established technique in text mining, and there are lots of open source tools for doing it. The one that I've used most is MALLET, out of the University of Massachusetts, but there's also the Stanford NLP toolbox.
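For readers who have not seen latent Dirichlet allocation in code, the following is a small illustrative sketch of a collapsed Gibbs sampler. It is not MALLET and not the configuration used in the talk; the function name `lda_gibbs`, the hyperparameter values, and the toy documents are assumptions chosen only to show the two outputs mentioned above: topics as groupings of words, and each document as a distribution over topics.

```python
import random

def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized docs (lists of words).

    Returns (doc_topic, topic_word): per-document topic distributions and
    per-topic word counts.
    """
    rng = random.Random(seed)
    vocab_size = len({w for d in docs for w in d})
    ndk = [[0] * n_topics for _ in docs]      # document-topic counts
    nkw = [dict() for _ in range(n_topics)]   # topic-word counts
    nk = [0] * n_topics                       # tokens assigned to each topic
    z = []                                    # topic assignment of every token
    for d, doc in enumerate(docs):
        assignments = []
        for w in doc:
            k = rng.randrange(n_topics)       # random initial assignment
            assignments.append(k)
            ndk[d][k] += 1
            nkw[k][w] = nkw[k].get(w, 0) + 1
            nk[k] += 1
        z.append(assignments)
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the token's current assignment from the counts.
                ndk[d][k] -= 1
                nkw[k][w] -= 1
                nk[k] -= 1
                # Conditional probability of each topic for this token.
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t].get(w, 0) + beta)
                    / (nk[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1
                nkw[k][w] = nkw[k].get(w, 0) + 1
                nk[k] += 1
    doc_topic = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for counts, doc in zip(ndk, docs)
    ]
    return doc_topic, nkw

# Toy documents, invented for this example.
DOCS = [
    "factory industry company manufacturing factory".split(),
    "industry company steel factory".split(),
    "hiking climbing mountain trail hiking".split(),
    "mountain trail climbing summit".split(),
]
doc_topic, topic_word = lda_gibbs(DOCS, n_topics=2)
for dist in doc_topic:
    print([round(p, 2) for p in dist])
```

Production tools add stop-word handling, hyperparameter optimization, and far more efficient sampling; the sketch only shows the shape of the inputs and outputs.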
In my case, with the industry example: once we have topics, and we know where documents are located, where they're georeferenced in some way, we can map out places associated with the topics, so we can create these kinds of static maps. What's interesting is that with the topic model we can also get different senses of words. For example, out of the topic modeling on the travel blogs we had two war topics. In the first one, the associated words related to war had to do more with history, like cemeteries, museums and memorials, whereas the other one is less rooted in history; it just has other words related to war in terms of military, guns and soldiers. It's not one you would necessarily associate with places that have more history, but it could be about present-day conflicts. When we map that out, it's actually borne out in the maps. It might be a little bit hard to see, but the topic on the right is showing up in travel blog entries from Iraq and Afghanistan and places like that, whereas for the other one you don't see it there at all. So these are the individual topics and the mapping out of them.
But if we have a number of documents that are associated with a particular place, we can combine the topics that are talked about in each of those documents into an overall signature, or histogram of topics, for that place. For example, you would get something like this, which is showing five topics with a probability value associated with each one of them. This would be a place where people tend to talk about mountains and hilltops, and there's also talk about festivals, but there's not so much talk about a small town. So that would be an example of maybe a city that has mountains and is known for a festival. How this is represented is just as a vector of values, a probability vector, so you can measure the similarity of two places as some measure of similarity of those vectors. Because it's a probability vector you could use different kinds of measures, such as a cosine distance measure, but generally speaking something based on a relative entropy measure is a good way to do that. And so we
built that into a tool where we train a system on travel blog entries, and you can start querying for a place and say "give me similar places" based on this topic space associated with the places. Just as an example of what you can get out of this: if you do a search for Baghdad on this data from the travel blog entries, you find out, in terms of a ranking of similar places, that Potsdam, Germany, is actually very similar. This was most interesting to me because I didn't see the connection, but it turns out that Potsdam, around World War II, was where the Allied forces met to talk about the reconstruction of Germany after the war, and they also met at a palace that's famous there. So there are these war topics, but there are also palace topics going on. Baghdad also has a palace, and of course people are talking about war. So it's an interesting way to start exploring places and finding similar places.
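The place-similarity step can be sketched in a few lines. The talk only says that a relative-entropy-based measure works well; the choice of Jensen-Shannon divergence here is an assumption, as are the function names, the four-topic layout, and the placeholder places and numbers.

```python
import math

def place_signature(doc_topic_dists):
    """Average the topic distributions of all documents about one place."""
    n = len(doc_topic_dists)
    return [sum(col) / n for col in zip(*doc_topic_dists)]

def kl(p, q):
    """Relative entropy (Kullback-Leibler divergence) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: a symmetric, bounded variant of relative entropy."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def most_similar(query, signatures):
    """Other places ordered from most to least similar to the query place."""
    return sorted(
        (name for name in signatures if name != query),
        key=lambda name: js_divergence(signatures[query], signatures[name]),
    )

# Hypothetical topic order: [war, palace, beach, wine]; all values are invented.
SIGNATURES = {
    "place_a": place_signature([[0.6, 0.3, 0.05, 0.05], [0.5, 0.4, 0.05, 0.05]]),
    "place_b": place_signature([[0.4, 0.5, 0.05, 0.05]]),
    "place_c": place_signature([[0.05, 0.05, 0.8, 0.1]]),
}
print(most_similar("place_a", SIGNATURES))
```

Unlike plain relative entropy, this variant is symmetric and stays finite when one place has zero weight on a topic, which is convenient for ranking.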
I'll spend the rest of the time on the next application, going a little bit more into the details of how I do it, and that is this idea of doing an interactive thematic map for ad hoc search. This is a general search system: you can put in keywords, and it will help you find documents by mapping out where there are places where that topic is being talked about. The basic idea is this. If you think about something like Google search, you get the top 10 documents, and that works well, and it has become the dominant paradigm since Google came around. In the early 2000s, if you remember, there used to be things like Yahoo, where they would have these faceted searches where things were put into categories and you could work your way through that way. What I would like to see is a little bit more of that faceting coming back, but not with such a detailed way of categorizing documents; just look at the geographic facet. That means just looking at geographic space, and time, as a way of ordering documents. The reason I think this is a generally usable way of going about it is that space and time are such common ordering principles in how we talk about things, and there is a massive number of documents that have some kind of spatial or temporal scope associated with them. On the flip side, you can create an interactive map interface very easily, because there's all of this open source mapping software, and the data-driven interaction is great, as we've
seen. Before I show the system and how it works, I want to highlight all the different open source components that go into it. The data source I'm looking at is the English Wikipedia data dump. I used a piece of software called DGGRID, which is out of Southern Oregon University, as a way of mapping out where different documents are located at different resolution levels. It creates a discrete global grid, which is one of the logos at the upper right: a discrete global grid over the Earth at different resolution levels, with either hexagons or triangles. Then for the preprocessing I'm using Postgres and PostGIS, and finally I have a Lucene index that actually handles the queries. On the front end I'm using Leaflet with Stamen maps, then D3 over that, and jQuery as well. So the basic idea is
like this: you have an interface that is very simple, where you can put in searches, and on the back end what you have is
something like this. The discrete global grid creates a grid over the Earth. It looks strange like this, since those cells are supposed to be equal-area, but that's because of the projection. So you have this discrete global grid at different resolution levels, and what I've done on the back end is map all of these text documents to it, by parsing place names or associating locations with documents using various open source feature datasets, such as GeoNames or Natural Earth, depending on what the things are, and collecting all these different sources together. Once you've done that, say I have some polygon, some spatial area, and a text that maps to that spatial area. Then I find all the grid cells that intersect with that area, and I map the text into those grid cells. Then I build a Lucene index with the text that's associated with the grid cells. So the documents, so to speak, in the Lucene index are not the individual documents; they are these grid-cell documents, the aggregations of all that text together. I also spatially index that, so that as you're interacting with the system you can clip it very quickly based on what your viewing window is.
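The grid-cell aggregation described above can be sketched as follows. This is a simplified stand-in rather than the actual pipeline: it assumes square latitude/longitude cells instead of an equal-area hexagonal discrete global grid, bounding boxes instead of polygons, and a plain dictionary with a basic TF-IDF weight instead of a Lucene index. All names and sample texts are hypothetical.

```python
import math
from collections import Counter, defaultdict

CELL_DEG = 5.0  # cell size in degrees; stands in for one grid resolution level

def cells_for_bbox(min_lon, min_lat, max_lon, max_lat):
    """All square cells (col, row) that a bounding box intersects."""
    c0, c1 = math.floor(min_lon / CELL_DEG), math.floor(max_lon / CELL_DEG)
    r0, r1 = math.floor(min_lat / CELL_DEG), math.floor(max_lat / CELL_DEG)
    return [(c, r) for c in range(c0, c1 + 1) for r in range(r0, r1 + 1)]

def build_cell_documents(georeferenced_texts):
    """Aggregate the words of every text into each cell its footprint touches."""
    cell_docs = defaultdict(Counter)
    for bbox, text in georeferenced_texts:
        words = text.lower().split()
        for cell in cells_for_bbox(*bbox):
            cell_docs[cell].update(words)
    return cell_docs

def search(cell_docs, term):
    """Score each cell for a single term with a basic TF-IDF weight."""
    term = term.lower()
    df = sum(1 for words in cell_docs.values() if term in words)
    if df == 0:
        return {}
    idf = math.log(1 + len(cell_docs) / df)
    return {cell: words[term] * idf for cell, words in cell_docs.items() if term in words}

# (min_lon, min_lat, max_lon, max_lat) footprints with invented texts.
TEXTS = [
    ((-78.0, 18.0, -77.0, 18.6), "rum bananas beach dominoes"),
    ((15.9, 45.7, 16.1, 45.9), "medieval cathedral capital"),
    ((19.0, 47.0, 22.0, 48.5), "wine cellar vineyard wine"),
]
cell_docs = build_cell_documents(TEXTS)
hits = search(cell_docs, "wine")
print(sorted(hits))
```

Note that the third footprint spans two cells, so its text is copied into both; that is the "grid-cell document" idea, where one cell's document is the aggregation of every text that touches it.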
So you can start doing a query like, say, "civil war" in this case. What it will do is represent a bunch of locations, in this case in a choropleth representation, of the places where there are georeferenced documents related to this topic of the Civil War, shaded to different degrees based on the score coming from the Lucene index. This overlay part, the actual hexagons from the grid, is being rendered with D3 on top of a Leaflet map, so you can interact with it. You can click on a hexagon and pin it, so to speak, and then you get the list of the documents that match the query within that hexagon. What's cool about this is that you get not just some top 10 documents; you can actually move the mouse around and find, in different locations, what the documents associated with the query are. Now, this grid is nice, but I actually thought it was a little bit too precise for the kind of information you're getting, so I
reworked it into more of a heat map representation. I'm still using the grid for the interaction as you're moving your mouse around and so on, but it's transparent, and the heat map is the representation. It interacts the same way, and this is using a heat map plug-in for Leaflet. There are different kinds of queries you can do. I have about five minutes, so I just want to go through some examples of the kinds of things it can be used for. This is a search for "Maori" in New Zealand; Maori are the people who were there before the Europeans. You can do very interesting kinds of research with this. I found out all kinds of information about places in the northern part of the North Island related to Maori history just by moving my mouse around the screen. You could do
comparison searches. At the top is a search for "Confederate" versus "Union Army", and you get slightly different maps, and you can try to find out, by going to the documents, what the difference is. Or, for something
like travel: if you're interested in wine, you could search for "wine" over Europe. I found out there's a region in eastern Hungary that is a wine region that I never knew about. The key point for me, and why I think it's a neat way to explore ad hoc information, is that you have some background knowledge about geography; you know something about places. So you can use that as added information on top of whatever topic you're interested in, try to find the intersection of the two, and that gives you added value. I think this would be cool to use for
digital humanities type work. There are a lot of great datasets out there. I'm working with the Wikipedia dataset now, but, for example, the HathiTrust at the University of Michigan has this collection of about 300 thousand public domain books that I hope to index. In terms of the
representation of the heat map, there is an issue that I came up against. I had this equal-area grid, and I went to all that trouble, and then I go and represent this in spherical Mercator and do the heat map over that. I found that it looked terrible when you get up to the northern latitudes. So I made some modifications so that as the latitude goes higher I increase the radius of the heat map. There's a problem with that too, because then your eye gets drawn to the northern part of the screen and not so much the southern. Ideally I would be doing this with more interesting projections to begin with.
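The talk does not say which function was used to grow the heat map radius with latitude. One plausible choice, shown here purely as an assumption, is to scale the radius by the linear scale factor of spherical Mercator, 1 / cos(latitude), so that a spot covers roughly the same ground area wherever it is drawn. The function names and the clamp at 85 degrees are illustrative.

```python
import math

def mercator_scale(lat_deg):
    """Linear scale factor of spherical Mercator at a latitude: 1 / cos(lat)."""
    return 1.0 / math.cos(math.radians(lat_deg))

def heat_radius_px(base_radius_px, lat_deg, max_lat=85.0):
    """Pixel radius for a heat map point, enlarged towards the poles.

    The latitude is clamped because the scale factor grows without bound
    as the latitude approaches 90 degrees.
    """
    lat = max(-max_lat, min(max_lat, lat_deg))
    return base_radius_px * mercator_scale(lat)

for lat in (0, 30, 60):
    print(lat, round(heat_radius_px(20, lat), 1))
```

This also illustrates the drawback mentioned above: at 60 degrees the radius is about twice the equatorial one, so northern spots dominate the screen.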
I thought this would be an interesting slide just for people who work with Leaflet: we actually put this up on a multiscreen display in our center, and it works fine. That says something more about Leaflet being a lightweight system than anything to do with my implementation. So,
what next? I actually set this up on a web page, so you can access it if you want to play around with it: frankenplace.com. That name actually comes from Brian, sitting in the back of the room. The basic idea is that the places are "Frankenplaces" because they're made up of all these different descriptions and different experiences of people; there's not really one place. There are lots of things I'd like to do with this: get better interaction; combine the similar place search with this more recent thing that I've built; and make it more scalable. Geoparsing is a big deal. There are some open source tools for doing geoparsing, that is, identifying place names in text, but they really don't work well in many situations, so there's a lot of interesting work to be done there. I'd also like to build it so that people can contribute their own data into it, so that you're not just searching on some datasets out there but actually getting people involved more. And of course temporal faceting would be great; I think scoping the time associated with documents would make it even better. So I'll end there and take questions if we have time; otherwise you can go to lunch.

Audience question: Is there any reason you chose Lucene over Elasticsearch and its geospatial capabilities?

Answer: No, actually I'm porting it to Elasticsearch now. Elasticsearch is Lucene underneath, right? It's just because when I was building it I knew Lucene, so I did it that way. That's part of what I'm planning to do, because right now I don't have it built in such a way that it can scale very well, so that would be perfect.

Audience question: Excellent talk. Have you thought about doing multiple languages, some kind of matching of ontologies between languages? Obviously
this is English-centered, so some of the results would then be skewed towards that.

Answer: Yes, sure, I would love to. I think ideally it would be set up in such a way that the workflow involved could work with any collection of documents, but I haven't got it together yet in a way that people could easily push whatever they want into it. I think it would actually be really great with different languages, because you could also see how search terms are used differently in different places. The more data the better. And I think you could do some interesting kinds of geographic research, where you, say, use the German version, "Berg", and "mountain", and compare where they are. That's great.

Audience question: [partly unintelligible] Could you say a word about how you do the geoparsing of unstructured text?

Answer: Right, so there's
kind of two levels at which I've done it. I've been working with documents that are already georeferenced to begin with, so there's a global reference for the document. More recently, what I've been doing with Wikipedia is that you've got an added benefit: I could take the DBpedia dump, find all the place articles, and then find references to those in the text and use that. It's not really geoparsing; it's kind of leveraging what Wikipedia has to offer. So that's what I started with. I've played around with some geoparsing tools, but they were really messy and they don't work so well. So I feel like, by utilizing some of this information from DBpedia and so on about places, you could build something a little bit better. That's kind of where I am with that.

Audience question: [partly unintelligible] What struck me in the maps shown so far is that it is difficult to make sense of the results. For wine, for instance, London stood out. [...] And this is not the geoparsing, but the entries which were already georeferenced. So what to make of those markers?

Answer: That's a really interesting question. There are different levels at which I've thought about this. First of all, you're right, because in a way the question is what this map is saying about the places that are represented, and that's hard, because of course the indexing mechanism is a bit opaque to the viewer. You don't understand the scoring mechanism behind what's going on. My answer to that, I guess, is that it is interactive. The point is that you move your mouse over it and it will pop up the documents
associated with it. So the way to understand why a term is related to a place is by going back and seeing the source material. Here's the web page, and I'll do the "wine" search. Not a lot of Internet.
Well, OK, I could show it to you offline. But the point is that you can move the mouse over and it'll show the documents in the box on the left, so that's your way of understanding why wine is being talked about here. But I also think it is actually dangerous to try to do comparative analysis based on that. I wouldn't recommend it, because I think it's better as a means to get to the documents, utilizing the geographic context that makes sense. So it's about the ad hoc search to the documents more so than what it's telling you about the places. But people will interpret it that way, so it's a good question that I don't have a good answer to.

Audience question: Maybe using these clusters of words, could you use that for disambiguation when you do geoparsing? Because presumably documents about Paris, Texas have different words in them than those about Paris, France.

Answer: Yeah, that's a really good idea, and actually that ties into some other work that I've done, just trying to find characteristic topics associated with places. You could use that to give a confidence level that this is in fact Paris, Texas. That's good.

Audience question: If you have more documents in one particular area with certain keywords and fewer documents in another, does that skew the results?

Answer: Yes, that's the London effect. I'm aware of this with Wikipedia, and I've been trying to work with it. It has to do largely with the scoring mechanism in Lucene. The way it works, essentially, is that if you have some grid cell over London with lots of documents, there's a lot of text, so it's a huge document, whereas you have other places where you don't have so much text. The default is just TF-IDF or something like that. I have been playing around with scoring mechanisms, and I found some information-based measures
for scoring documents that tend to normalize based on the number of words in the document, and those work better. But I haven't quite figured that out yet, so you do get some of that skewing. People say, "Why is the UK showing up so much?" It has to do with the fact that there are just so many documents and so many words. So you do get some of that, but I think you have a good point.
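The skew described in the last answer can be illustrated with a small comparison of a raw term-frequency score against a length-normalized one. The normalized score below uses the well-known BM25 formula as a stand-in; it is not the information-based measure mentioned in the answer, and the cell names, lengths, and counts are invented.

```python
def raw_tf_score(tf, idf):
    """Unnormalized score: long documents win simply by containing more words."""
    return tf * idf

def bm25_score(tf, idf, doc_len, avg_len, k1=1.2, b=0.75):
    """BM25 term score: term frequency saturates and is scaled by document length."""
    norm = k1 * (1 - b + b * doc_len / avg_len)
    return idf * tf * (k1 + 1) / (tf + norm)

# Two hypothetical grid-cell documents: a huge city cell and a small wine region cell.
CELLS = {
    "big_city_cell": {"length": 100000, "tf_wine": 50},
    "wine_region_cell": {"length": 2000, "tf_wine": 20},
}
IDF = 1.0  # the same term is scored in both cells, so the IDF is a shared constant
AVG_LEN = sum(c["length"] for c in CELLS.values()) / len(CELLS)

raw = {name: raw_tf_score(c["tf_wine"], IDF) for name, c in CELLS.items()}
normalized = {
    name: bm25_score(c["tf_wine"], IDF, c["length"], AVG_LEN)
    for name, c in CELLS.items()
}
print("raw:", max(raw, key=raw.get))
print("length-normalized:", max(normalized, key=normalized.get))
```

With the raw score the large city cell ranks first; once length is taken into account, the small cell where the term is proportionally more frequent moves ahead.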


Formal metadata

Title Mapping Words and Phrases from Geographic Knowledge on the Web
Series title FOSS4G 2014 Portland
Author Adams, Benjamin
License CC Attribution 3.0 Germany:
You may use and modify the work or its content for any legal purpose and reproduce, distribute, and make it publicly available in unchanged or modified form, provided that you credit the author/rights holder in the manner they specify.
DOI 10.5446/31681
Publisher FOSS4G, Open Source Geospatial Foundation (OSGeo)
Release year 2014
Language English
Producer FOSS4G
Open Source Geospatial Foundation (OSGeo)
Production year 2014
Production place Portland, Oregon, United States of America

Content metadata

Subject area Computer Science
Abstract Extremely rich and diverse knowledge about places across the world is available online in a variety of forms, including structured data, image, and natural language description. Map-based exploration of this knowledge has potential to aid a number of applications from education to marketing. In this presentation we describe a system to map geographic regions associated with arbitrary keywords, phrases, and texts by computing topic surfaces over the Earth from unstructured natural language text. Our methodology combines natural language processing and geostatistics and is built using freely available open source tools. We train our system on Wikipedia and travel blog entries and demonstrate it with a general-purpose geographic knowledge exploration tool.
Keywords geographic knowledge discovery
geographic information retrieval
text mining
