
Extracting geographic data from Wikipedia

Speech transcript
Thanks, everybody, for coming. My name is Hal Mueller. I write software, based in Seattle, and most of what I have been working on the last few years has been iOS software, but I keep one foot in the open-source world. What I'm talking about today is some geographic data production that I went through for an application of mine: an iOS mapping app with a particular focus on history. It's a portable version of the National Register of Historic Places, which is a data set of about 100 thousand properties and districts. It's a database administered by the National Park Service that contains all kinds of cool stuff: old buildings, factories, historic districts, commercial buildings, buildings associated with a particular person (maybe their birthplace, or the place they died), places connected with the Civil War, maybe a notable piece of architecture. If you wander around just in Portland with this application you can see lots of really cool commercial buildings. There's a Minuteman II missile silo in the National Register, and there's also the house where "Home on the Range" was composed. So it's a pretty diverse dataset, but in my experience it has not been a real friendly dataset to work with, because it originated back in the sixties, before we were really doing GIS, before they were really doing geographic information.
My original idea when I started working on the second version of the app was to link to Wikipedia in order to get some richer information. If you look at the entry for the Broadway Bridge, which is just up the way from here, the original nomination information, the original history, is not available at the National Park Service website; it's a pile of paper that hasn't been scanned. But I was able to link to Wikipedia, so you can find out a little bit about the Broadway Bridge. When we get over to the geographic data side, we have some problems: the dataset is inherently noisy, and it's outdated. I can go home to Seattle and find that the USS Missouri is supposedly moored right across the water; that's where the Missouri was when she was entered into the National Register, but she is now over in Honolulu and has been there for some time. This is not a dataset that the Park Service is funded to maintain, so it isn't maintained very often. Sometimes the information was just entered wrong, sometimes there was a typographical error, and some of these descriptions are just inherently really, really hard to geocode. The last one there, "Road 326 between Delaware 12 and Road 83, Duck Creek Hundred": if I had a map, I still couldn't find it. Good luck with that.
What I have found is that the Wikipedia people really care deeply about the articles they are writing, and so what I have been trying to do is pull the relevant geographic data out of the Wikipedia article. Standing here in the Convention Center we can find out information about the Portland, the steamer moored down in the river, and go straight to the Wikipedia article. I mentioned the Portland, which again is a ship; the location in the original database for the Portland is not correct. But if we look at this Wikipedia
article closely (and I don't know how many normal people actually look closely at Wikipedia articles; it was not something I had ever done before I started this project), there is quite a bit of pretty rigorous structure. In addition to the plain text there are numerous infoboxes. An infobox in Wikipedia is software-driven: you put in certain keywords with values, and Wikipedia then renders the presented information based on the requirements of the infobox template and on the key-value pairs supplied in the infobox. If you look here, we have in fact a standard infobox for National Register properties, which gives us the location, often the coordinates, the significant years, the governing body, when it was admitted, and this catalog number, 97000847, which is its unique identifier within the National Register. So I'd like to do a couple of things. First, I want to link the relevant Wikipedia articles to the appropriate National Register properties; once I have done that much, I have already enriched my product, because I've got a nice Wikipedia description for the property. Then I also would like to pull out the coordinates, if they are available, the coordinates that are embedded in this Wikipedia markup. It looks like it ought to be pretty straightforward to pull them out. It wasn't always straightforward, but it worked out pretty well.
Wikipedia URLs have several forms. There's the normal Wikipedia URL, the one most of us would type in or reach from the home-page search box or a Google search; that first URL is the normal top-level reference to an article. There is also a more specific reference, which is that article URL with a revision ID appended to it; with that I have identified the article on the steam tug Portland and a particular revision of it. So if you all go out now, tour the Portland, find interesting stuff, and everybody goes and edits the Wikipedia article, that old revision ID will fall out of date, but the original URL is not going to change. Then there's also a numeric page ID, which keeps pointing to the article even if the article gets renamed, or if we decide not to have an article on this ship anymore and instead have one article on all the steam tugs up and down the river; that page ID number stays the same. The numeric ID is kind of the gold standard; that's the number you want to be referring to if you want to pull a Wikipedia article rigorously. And finally there's an API which allows you to iterate through a series of Wikipedia articles. That API is your starting point if you want to pull out all the article titles that reference, in this case, the National Register of Historic Places; this is where you would start if you want to pull out some subset of Wikipedia articles.
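The exact call was on the slides; as a rough sketch of that kind of enumeration in Python (the endpoint is the standard English-Wikipedia API, but the choice of a backlink query against the "National Register of Historic Places" article is only my illustrative assumption):

import requests

API = "https://en.wikipedia.org/w/api.php"

def nrhp_article_titles():
    # Page through every article that links to the National Register article.
    params = {
        "action": "query",
        "list": "backlinks",
        "bltitle": "National Register of Historic Places",
        "bllimit": "500",
        "format": "json",
    }
    while True:
        data = requests.get(API, params=params).json()
        for page in data["query"]["backlinks"]:
            yield page["pageid"], page["title"]   # numeric page ID plus title
        if "continue" not in data:                # no further result pages
            return
        params.update(data["continue"])           # carry the continuation token forward

for pageid, title in nrhp_article_titles():
    print(pageid, title)

(Swapping list=backlinks for list=embeddedin with a template title would instead enumerate the pages that transclude a given infobox.)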
When we pull an article like this one, you have the option of pulling down the XML. The XML is going to contain the revision info, who edited the page, what the date was, what the Internet address was, plus the plain ASCII markup. Wikipedia is not going to send you the rendered markup, and it's not going to give you a plain-text version of the article either; you get the raw markup.
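A minimal sketch of that fetch, using the standard prop=revisions query; the article title here is only a placeholder:

import requests

API = "https://en.wikipedia.org/w/api.php"

def fetch_wikitext(title):
    # Ask for the latest revision: its id, timestamp, editor, and the raw wiki markup.
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "ids|timestamp|user|content",
        "titles": title,
        "format": "json",
    }
    data = requests.get(API, params=params).json()
    page = next(iter(data["query"]["pages"].values()))
    rev = page["revisions"][0]
    # rev["*"] is unrendered wiki markup, not HTML and not plain text.
    return rev["revid"], rev["timestamp"], rev["user"], rev["*"]

revid, when, who, wikitext = fetch_wikitext("Broadway Bridge (Portland, Oregon)")  # placeholder title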
Alright: I've got a math degree from a reputable school, I've taken a lot of computer science classes, so I'll find myself a parser, or figure one out, and just parse the markup. It turns out that writing a parser for Wikipedia markup, MediaWiki markup, is a pretty popular idea. There are a lot of projects that have tried to do this, and several of them have achieved some success. So I thought, well, maybe I don't really need a parser. This is just text, right? This is the sort of thing I'm looking for, so maybe I can just write a regular-expression-based thing to pull these numbers out. And the reason that famous quote about regular expressions keeps coming up (I can give you the computer-science rationale) is that a regular expression is a really elegant way to say something, but it's devilishly difficult to get it right. I should have listened to Jamie Zawinski. The fundamental problem is that what I need to be able to do is extract balanced delimiters: if I see the opening braces, I want to find the matching closing braces and pull all that information out, and with a regular expression, with the richness of that particular engine, it's impossible for me to tell the difference between the first close brace and the final close brace. In Wikipedia markup we can get a real mess: we have nested infoboxes, and we get maybe multiple historic properties rolled into one article, so the coordinates might be in a different infobox. We see this especially with battlefields, where the main infobox is the battlefield infobox that has the coordinates, and then, oh by the way, this is also in the National Register. We also get badly formed input: you're not going to go to Wikipedia, hit a page, and have it say "sorry, there's a syntax error in the markup"; Wikipedia is always going to display something. So what I ended up doing, plan B, was going for what I call a good-enough regular expression. Instead of trying to write something perfect that understands every nuance of the article, I just throw away the stuff I don't care about: throw away everything but the NRHP infobox, and then, from within that, throw away everything but the catalog number and the location. That has been working pretty well. When I ran that API call I showed you a few slides ago, I retrieved 64 thousand articles that referred to the National Register. After running those through this good-enough processing, it handled about three-quarters of them; I ended up with 16 thousand-some articles I couldn't match to a particular historic property, and that's going to be an artificially high number, because many of those are articles like "National Register properties in King County, Washington" or "historic ships on the West Coast". I could maybe do some more data mining and exploit those articles, but they are not the one-to-one matches I'm looking for. So this is pretty good; I was pretty happy with that, and that's where I was as of submitting the abstract.
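A tiny illustration of the good-enough idea; the sample wikitext, field names, and reference number below are invented for the example, and the pattern deliberately ignores nesting, which is exactly the limitation described above:

import re

sample = """
Lead paragraph about the property...
{{Infobox NRHP
| name = Example Property
| coordinates = {{coord|45.5|N|122.6|W}}
| refnum = 97000847
}}
More article text we do not care about.
"""

# Throw away everything except the NRHP infobox block (non-greedy, single level only).
infobox = re.search(r"\{\{Infobox NRHP(.*?)\n\}\}", sample, re.DOTALL)
if infobox:
    body = infobox.group(1)
    refnum = re.search(r"refnum\s*=\s*(\d+)", body)
    coord = re.search(r"\{\{coord\|([\d.]+)\|([NS])\|([\d.]+)\|([EW])", body)
    print(refnum.group(1) if refnum else None,
          coord.groups() if coord else None)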
I could stop here, but I've got eight minutes to go. A while back, a couple of months ago, I discovered DBpedia, and I have learned a lot more about DBpedia in the last two days than I knew when I walked into this building. DBpedia is a project based on the notion of the Semantic Web: it parses all of Wikipedia and creates RDF, facts in a very formal specification, records of the facts that are in Wikipedia articles. This project has been going on for, I want to say, about ten years; it's quite mature, and the release I have been working with is 3.9. All of the extraction comes in the form of triples: a subject, a predicate, and an object. So the tug Portland is a ship; the Portland has a latitude of 45.3 north. DBpedia also has a SPARQL endpoint; SPARQL is an SQL-ish looking query language with which you do these semantic-web queries. I'll post the slides, and I want to give you a couple of quick links to get back to DBpedia, because there is a lot of other information in there. The two big ways to get at it are either through their bulk downloads or through the SPARQL endpoint, and all of it is keyed on that numeric Wikipedia page ID.
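For the SPARQL route, a sketch against the public endpoint might look like this; the WGS84 geo:lat/geo:long properties and dbo:wikiPageID are the predicates I would expect DBpedia to expose, and the page ID is a placeholder:

import requests

QUERY = """
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?resource ?lat ?long WHERE {
  ?resource dbo:wikiPageID 1234567 .   # placeholder Wikipedia page ID
  ?resource geo:lat  ?lat ;
            geo:long ?long .
}
"""

resp = requests.get(
    "http://dbpedia.org/sparql",
    params={"query": QUERY, "format": "application/sparql-results+json"},
)
for row in resp.json()["results"]["bindings"]:
    print(row["resource"]["value"], row["lat"]["value"], row["long"]["value"])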
They have also got some canned datasets, all the mountain peaks, all the large cities, various themed tables, about 500 of them, and unfortunately none of the 500 were of direct use to me. Let me make just a quick aside: if you're interested in this semantic-web stuff at all, there were two great talks here. One of them was yesterday, on using the Semantic Web for humanitarian assistance, and the other was this morning, on pulling words and phrases out of text. If what I'm saying now interests you, you'll also be interested in going back and watching those two talks on video.
The Portland reference in DBpedia is pretty easy to follow. Here at the top is the original Wikipedia article and its page ID, and from that they derive a URI at dbpedia.org. That's not really something you can open in a web browser, but it is the identifier they use for searching and the key they use for all of the triples. We also have a page link from the Wikipedia page over to a different project, Wikidata, which I'll talk about a little later. So here is a nicely formatted version of all the facts DBpedia was able to extract from the Portland Wikipedia article, first in tabular form and then displayed in the general form. Here's what some of the triples end up looking like: the subject is always going to be dbpedia.org/resource/ followed by the name of the ship, the second element is going to be the predicate, and then finally there is the value for that predicate. I chose to work with the download files because I didn't want to have to deal with lots and lots of network queries to the SPARQL endpoint. So I'm loading all of the raw downloads into an Objective-C program and querying them based on page IDs: the page ID gives me the DBpedia ID, and that's my key into the other tables to extract the information I care about. The information I care about, in addition to the coordinates, is the abstract of the article, the short abstract, and some media links.
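The program described in the talk was Objective-C; as a rough Python sketch of the same join (the file names are placeholders for the DBpedia page-ID and geo-coordinates N-Triples downloads):

import re

TRIPLE = re.compile(r'<([^>]+)> <([^>]+)> (.+) \.\s*$')

def triples(path):
    # Yield (subject, predicate, object) from a DBpedia N-Triples dump.
    with open(path, encoding="utf-8") as f:
        for line in f:
            m = TRIPLE.match(line)
            if m:
                yield m.groups()

# 1. Build resource -> Wikipedia numeric page ID from the page-ID dump.
page_id = {}
for s, p, o in triples("page_ids_en.nt"):          # placeholder file name
    if p.endswith("wikiPageID"):
        page_id[s] = int(re.search(r'"(\d+)"', o).group(1))

# 2. Pull latitude/longitude for resources whose page ID was matched to the NRHP earlier.
wanted = {1234567}                                  # placeholder set of matched page IDs
for s, p, o in triples("geo_coordinates_en.nt"):    # placeholder file name
    if page_id.get(s) in wanted and p.endswith(("#lat", "#long")):
        print(s, p.rsplit("#", 1)[1], o)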
So how are you going to do this for your own project? You start with an idea of the theme you're going to explore. You find either a template or a common article somewhere, pull out all of the relevant Wikipedia articles, get the numeric page IDs from those, look them up in DBpedia, and grab their relevant facts: the key ones are going to be the abstract (the text version of the title and the text version of the abstract) and any other properties you care about. Some numbers here: I ended up with about 36 thousand DBpedia articles that were referenced back to the National Register. There are two different downloads in DBpedia, one which they call the geo coordinates and another which they call the mapping-based properties, keyed bits of information. There was about an 80% overlap between the geo-coordinates file and the coordinates that were in that second, keyed file. I haven't figured out what the reason for that is; the bottom line is that you need to look at both of them. The net change for me was another 1,500 points, 1,500 properties richer. Some of the gaps I was able to fix were articles that had a location but that I couldn't match to a property, or where I could match the property but didn't know the location. So this is a decent improvement. I also wanted to point you to a similar project, which I haven't worked with at all but am aware of, called Wikidata. It's another Semantic Web project, and their idea is two-way communication with Wikipedia: DBpedia reads Wikipedia and then creates a product, while the notion of Wikidata is that you can edit the database and that automatically updates the infobox and the rest of the Wikipedia article. I haven't found an API, but I have found good downloads. Here's what the Wikidata entry on the Portland looks like; they were in fact able to identify the ship, the property, and they were able to identify the location.
So: Wikipedia has some pretty well-structured information, if you know how to find it and if you follow the rules. You start with a list of Wikipedia articles, generated somehow, maybe manually, maybe by following the rules of the form. Then you've got a choice: you can extract the data with your own parser, or you can use DBpedia (either the SPARQL endpoint or the DBpedia downloads), or maybe something in Wikidata. Keep in mind that DBpedia is not human-readable, the updates are pretty tedious, and the downloads only come out occasionally; the one I've been working with came out last August. Wikidata appears to be pretty well funded, I think with some Google money and some other pretty heavy hitters behind it, so we will see; for now Wikidata is not very far along. But here is the challenge I want to leave you with: at this conference we tend to spend a lot of time thinking about geometry, but geospatial is not just about geometry, not just about projecting rasters and points and vectors. It's about getting information into the hands of users, and this is a pretty handy way to get at information that is already organized geographically. A couple of questions?
Q: Thank you. Have you looked at Wikimapia to help you find these places?

A: I have not looked at Wikimapia in detail, no. It might be that you could use it the other way around, to find what wiki articles are around you if you already have coordinates. Here you have to start with a known article and get the coordinates of that particular article; from what I've seen so far there is no geographic searching in DBpedia or Wikidata. There might be geographic search capability through the SPARQL endpoint, but I have not dug into the SPARQL endpoint.

Q: I'm curious: you got about 1,500 extra results using the DBpedia approach. Were there any results that you found using the regular-expression search but not with DBpedia?

A: I think there are some results that my regular expression found and DBpedia did not, but I'm not sure what that means, because I'm working with DBpedia results from a year ago, a capture of Wikipedia from a year ago, and the Wikipedia articles have been updated since then. So I am seeing some results from Wikipedia that DBpedia is not finding, but I don't have an explanation for it.

OK, thank you.

Metadata

Formal metadata

Title Extracting geographic data from Wikipedia
Series title FOSS4G 2014 Portland
Author Mueller, Hal
License CC Attribution 3.0 Germany:
You may use, modify, and reproduce the work or its content for any legal purpose, in unchanged or changed form, and distribute it and make it publicly available, provided you credit the author/rights holder in the manner they have specified.
DOI 10.5446/31691
Publisher FOSS4G, Open Source Geospatial Foundation (OSGeo)
Publication year 2014
Language English
Producer Foss4G
Open Source Geospatial Foundation (OSGeo)
Production year 2014
Production location Portland, Oregon, United States of America

Content metadata

Subject area Computer Science
Abstract A large fraction of Wikipedia's millions of articles include geographic references. This makes Wikipedia a potentially rich source for themed, curated geographic datasets. But the free form nature of Wikipedia's markup language presents some technical challenges. I'll walk through the Wikipedia API, show how to get to the various places where spatial info might be found, and show some blind alleys I've followed. Examples are from a project that uses Wikipedia to enhance a map-based iOS app of some US National Park Service data.
Keywords Wikipedia
data processing
mashups
National Register of Historic Places
