Who's On First - because sometimes geo is not spatial

Automated media analysis

Speech transcript
[unclear] Hi everyone, my name's Aaron Cope. It's really nice to be back at FOSS4G. I first attended in 2007 in Victoria, and life and circumstances prevented me from coming back since then, so it's great to be here.

One of the things that this talk provoked was the realization inside the company that nobody actually knows what my title is, so we're trying out "editor at large". We'll see how that goes. This is where I can be reached. I'm going to talk today about an ongoing project we're doing, which is a gazetteer. Before I get into that, I want to crash through the last 10 years of my professional life in three slides, to try and give some context to the work that I'll be talking about.

About a million years ago I worked for a small photo-sharing website called Flickr, and saw a lot of pictures of cats there. One of the things that I worked on while I was there was the geo project, to allow people to put their photos on maps. One of the by-products of all of that was what we called the alpha shapes, which are these crazy shapes you see in this slide. Those were all shapefiles derived from nothing but geotagged photos, and we released them as a public domain dataset. It was a global geographic dataset for neighborhoods, countries, regions, that sort of thing. We had our own gazetteer at Flickr, but it only returned bounding boxes, which made doing geocoding really hard, so this is one of the things that we ended up doing.

A little after that I worked for a design studio in San Francisco called Stamen. We did a lot of mapping work; we loved maps. We did a lot of work with OpenStreetMap data, and a lot of the work that we did, sometimes for clients and sometimes just for ourselves, was really about having an opportunity to revel in an abundance and availability of data that had never been there before. This is a project that I did called prettymaps, which was basically taking all of the Flickr shapefiles, quite a lot of data from OpenStreetMap, and some data from Natural Earth, and just pushing it all together.

Then just recently I did a bit of a detour into the museum world for about three years. We were closed for three years, and as part of the reopening we built custom hardware, which is a whole other story, and a huge infrastructure that allowed visitors to take this custom [unclear], which was a pen with an NFC tag in it, and walk around the museum and collect objects. Your visit would then be saved, basically forever, on the museum website. So every object had a permalink, and every visit had a permalink.
What all these projects share in common is this idea of a network of documents. This is not a new idea. It's basically the Web, and it's the Web from 20 years ago. The Web has sort of evolved to a weird state right now, where it is evolving to become television, in all the good and bad ways. But it is worth remembering that 25 years ago what the Web was was the ability for people to recall documents at the time and pace of their own choosing. That's a big deal. We had never been able to do that before: everybody had to gather at the same time, in the same place, to watch the same television show. What the Web allows you to do is to go back and look at something when the urge strikes. So that idea of access and recall is very much a part of the Web, and it always has been. The Web was sort of a liberating force in that way.
So fast forward to now. I'm working at Mapzen and we are building a gazetteer. One way to think about a gazetteer is that it's just a big list of places, and each place has a unique ID and a series of properties associated with it. So it's a huge list of stuff, but that's really important: that ability to refer to something with a shorthand. The subtitle of the talk was "sometimes geo is not spatial", which is a deliberately provocative subtitle, but I actually do believe that sometimes geo isn't about spatial queries, because you don't have the data, or because the data is too big, or because the infrastructure burden to work with that data is prohibitive. Sometimes it's nice to simply be able to say "California" and refer to it by a shorthand ID.

So we're building a big dataset for the entire world. We are not the first gazetteer, and we're not the first open gazetteer, but what we have endeavored to do is to take about half a dozen projects that already existed, merge them where it's appropriate, and do coverage all the way down. That means continents, countries, regions, localities, neighborhoods and venues. We're starting with Quattroshapes. We have merged in Natural Earth where we can. We're taking Yahoo's GeoPlanet dataset and incorporating their names; they have much better names than anyone else in terms of quality and coverage. And we're also taking smaller projects like David Blackman's Zetashapes, which forms our neighborhoods in the US.

I mentioned venues, which are sort of the holy grail of open data. The sad, sad truth is that there is no venue data. It doesn't exist, with the one exception of the work that SimpleGeo did five or six years ago to release a 20 million record dataset of business listings. We are importing that and incorporating it into our gazetteer. We're about four and a half million records into a 20 million record dataset, and every one of those venues will have a hierarchy associated with it, for countries and localities.

As much as anything, we're building a scaffolding for place for all the other services that we offer. The example that I've been using lately is this: it's one thing to be able to geocode "Denver" and have that disambiguated down to a locality in Colorado. But let's say you then want to ask for the weather in Denver. It is insane to then ask the weather service to perform that same disambiguation request. If the geocoder can return a stable, permanent ID for Denver, all you should need to do is hand that ID to the weather service, or any other service.

A gazetteer is a big topic. I'm not going to cover all of it in 20 minutes, but one of the things that I do want to talk about is some of the first principles that we're starting from.

One of them is this: the data is not in the database. This is really important. We are not optimizing for any one database. Databases come and go, and one of the things that we are, not concerned about, but mindful of, is that we want the work that we produce, either as software or as data packages, to exist beyond our own endeavor. We're here to succeed but, you know, life happens, and the reality of open-source geo is that we're not the first people to do this. Lots of other attempts have failed and people just get burned. So the most important thing about the data is portability, longevity, durability.

So we've chosen to standardize on text files. Every single computer everywhere can deal with text. You should be able to look at the data in the text editor of your choosing. You should be able to edit this data in Microsoft Word if that's what you need to do. We have standardized on GeoJSON, only because it is the least amount of formatting out there at the moment for structured data. There's nothing special about it; it's just the minimal model for us, and importantly there are lots of other tools for converting it to other things. The reality is that this makes the data difficult to work with in the short term, especially when you're dealing with huge amounts of data. That's part of the work, and part of the work going forward will be building tools to marshal all that data into different databases and to make it easier for people. But at the end of the day these are text files. These could be printed out and put in a book somewhere if we needed to.

The other one is stable, permanent IDs. These are numeric IDs, 64-bit integers. There's nothing terribly special about them, except that it means that things are uniquely identifiable, and it makes it really easy to generate URLs. Here's just a quick example.
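The two principles just described (records as plain GeoJSON text, and integer IDs that map directly onto URLs) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the talk: the property names, the example ID values, the base URL `https://example.org/data`, and the layout that splits an ID into three-digit directory chunks.

```python
import json

# A record is just text. The property names and ID values below are
# illustrative assumptions, not the project's documented schema.
RECORD_TEXT = """
{
  "type": "Feature",
  "id": 85688637,
  "properties": {"name": "California", "placetype": "region", "parent_id": 85633793},
  "geometry": {"type": "Point", "coordinates": [-119.4, 37.2]}
}
"""

BASE_URL = "https://example.org/data"  # hypothetical host


def id_to_path(record_id: int) -> str:
    """Derive a relative file path from a numeric ID.

    Assumed layout: split the decimal digits into chunks of three and use
    them as nested directories, so no single directory grows too large.
    """
    digits = str(record_id)
    chunks = [digits[i:i + 3] for i in range(0, len(digits), 3)]
    return "/".join(chunks) + f"/{digits}.geojson"


def id_to_url(record_id: int) -> str:
    """Build the full URL for a record from nothing but its ID."""
    return f"{BASE_URL}/{id_to_path(record_id)}"


record = json.loads(RECORD_TEXT)

# IDs are meant to fit in a signed 64-bit integer.
assert 0 < record["id"] < 2**63

print(id_to_url(record["id"]))
print(id_to_url(record["properties"]["parent_id"]))
```

The point of the sketch is that a client holding only an ID can compute where the corresponding document lives, without querying a database first.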
This is the shape data for California, and behind it is the shape data for the United States. The relevant bit, and I'll zoom in if you can't see that, is this. What's happening is that the data for California is being sent down to the browser, and that data includes a number of pointers to all of its parent records, including the United States. Then on the client, in JavaScript, we're turning around and saying, "please give me the document for the United States", and we're downloading it, extracting the data, and rendering the shape on the client side. This is not necessarily the most efficient way to do this, but one of the things that we're trying to do in this project is to kick the tires every step of the way, to make sure that every one of those documents, every one of those resources, is retrievable online and that you can do something with it.

Apologies, this font size is probably completely illegible from like the second row on. Another example of this is a tool for browsing the data. On the left-hand side are search results, and on the right-hand side are facets, or aggregations, of that data. The second set of results there is all of the unique regions that match this search. What we return is IDs, because that's what's stored in the database, but then client-side we are again looping through each one of those IDs, turning around, asking the network for the corresponding document, and updating the name in place. Again, this can be super, super inefficient. We know this. This tool, which I'll talk about in a moment, is designed in part to just poke the infrastructure with a stick over and over and over again, so that we can figure out what works and what doesn't.

One of the advantages to having a document-based network with permanent IDs is that we have a convenient place for putting all of the names and all the spelling mistakes. We don't want to get into an argument with people about what something should be called or how it's spelled. We have room to put all the names.

Likewise all the geometries. One of the things that we have decided on is that any given record will have what we're referring to as a consensus geometry. "Consensus" is not the right term, but we haven't found a better one yet. This will essentially be the default geometry for a place, but we also have a corresponding file for all the alternate geometries. Not everyone agrees about the boundaries for places, and our job is not to make those decisions for people but simply to be a place to reflect those discussions. Likewise, some geometries are better for certain functionality: you don't necessarily need a highly detailed coastline to do a point-in-polygon query for reverse geocoding.

Concordances: we have concordances with GeoNames, GeoPlanet and Quattroshapes. Eventually we'll have concordances with the recently released Getty Thesaurus of Geographic Names. We will hold hands with pretty much anybody, and there's lots of room.

Likewise, every record has a hierarchy, and in fact some places have multiple hierarchies. Not everybody agrees on the relationships of a given place, so again we're not trying to make those decisions for people. We're trying to push as many of those decisions as possible to the edges, and simply reflect what people are saying about a place.

One of the things that I have argued for for a long time in gazetteers is the notion that every record has two properties: "supersedes" and "superseded by". What that means is that we have a mechanism, a framework, that allows a place to change. Now, there is a larger philosophical question, which has nothing to do with geo per se, which asks: when is something simply updated, versus when does it fundamentally change? Is the caterpillar the butterfly? The answer to that question is, I think, that we should all go get a drink and talk about it. Again, we're not trying to answer this question, but we are trying to provide breadcrumbs, so that when Yugoslavia stops being Yugoslavia, the nation that people knew until 1992, that record still exists. You can still point to it. It's still a durable, reliable endpoint for people to do something with.

So again, just to repeat: we're trying to reflect the debate. We're not trying to decide for people.
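The pointer-following described above (resolving IDs to names one document at a time, and walking "superseded by" links from an old record to whatever replaced it) can be sketched as follows. This is a minimal illustration under stated assumptions: the in-memory `RECORDS` table stands in for fetching documents over the network, and the IDs, successor names, and property keys are all invented for the example.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]

# Hypothetical stand-in for "asking the network for the document".
# IDs, names, and property keys are invented for illustration.
RECORDS: Dict[int, Record] = {
    1: {"name": "Yugoslavia", "supersedes": [], "superseded_by": [2, 3]},
    2: {"name": "Successor A", "supersedes": [1], "superseded_by": []},
    3: {"name": "Successor B", "supersedes": [1], "superseded_by": [4]},
    4: {"name": "Successor C", "supersedes": [3], "superseded_by": []},
}


def fetch(record_id: int) -> Record:
    """Return the document for an ID (here, from the in-memory table)."""
    return RECORDS[record_id]


def names_for(ids: List[int], fetch_record: Callable[[int], Record]) -> List[str]:
    """Resolve a list of IDs to names, one document lookup per ID."""
    return [fetch_record(record_id)["name"] for record_id in ids]


def current_records(start_id: int, fetch_record: Callable[[int], Record]) -> List[int]:
    """Follow superseded_by links breadth-first until reaching records
    that nothing has superseded. The old records are never deleted, so
    the starting ID stays valid as an entry point."""
    seen = set()
    current: List[int] = []
    queue = [start_id]
    while queue:
        record_id = queue.pop(0)
        if record_id in seen:
            continue  # guard against accidental cycles in the data
        seen.add(record_id)
        successors = fetch_record(record_id)["superseded_by"]
        if successors:
            queue.extend(successors)
        else:
            current.append(record_id)
    return current


print(names_for(current_records(1, fetch), fetch))
```

A record may have several successors, which is why the walk returns a list rather than a single ID.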
So there are about 400 thousand administrative places at the moment, and four and a half million business listings. That's a lot of data, and it's pretty hard to wrap your head around, and it's pretty hard to remember where anything is. So we have started building tools for use internally, but everything is open source, so the source code for this is publicly available to use to explore the data. Not everyone may know what the term "spelunker" refers to. It's a term used in cave exploration for essentially feeling your way around an unknown cave in the dark: exploring by touch and intuition.
So this is what it looks like. It's pretty straightforward. At the moment it indexes data in Elasticsearch only. Actually, that's not true: we index in Elasticsearch and we also index in PostGIS. This is part of the attempt to put all the data in all the databases, to figure out what works and what doesn't, and to rinse and repeat. As of today the Spelunker doesn't do spatial queries. All it does is index the properties in the GeoJSON file, but you can do some pretty amazing stuff.

One of the things that this starts to demonstrate is one of the lessons we learned when I was working at the museum. The dirty little secret about museums is that everybody's metadata is terrible. Nobody likes to say that out loud in public, but it's true. One of the things the museum did before I got there was to simply CC0 all the metadata, so the toothpaste is out of the tube and there is no putting it back. But that was actually OK, because one of the things we learned working with all that data and building the collections website was that the value of the aggregate data vastly outweighs the sum total of a perfect subset of it. So although we know that quite a lot of the data in the Mapzen gazetteer is incomplete or sometimes incorrect, there's some pretty amazing stuff that you can find.

This is just a screenshot of the 11 localities in Korea that have been flagged as megacities. This is all of the descendants of South Korea that are localities, so this is just a paginated view. And these are all the descendants of the neighborhood that I had to create when I lived in New York, called Gowanus Heights. I show this because the official neighborhood around there is called Gowanus.

One of the things that we will do shortly is that every record will contain a list of pointers to other records of the same place type whose geometry breaches that record's geometry. For example, and it may be a little hard to see, there's a yellow dotted line that overlaps the pink shape. The pink shape is Gowanus and the yellow dotted line is Gowanus Heights. We will do this for every record. We'll do this, one, so that we have something to start with in terms of doing editing and data quality work, just a flag to say "these two things intersect and maybe they shouldn't", but also to reflect the reality that nobody agrees about neighborhoods, ever.

Likewise, you may be able to see the two centroids shown for Gowanus. One of them is the mathematical centroid, just the actual center of the polygon, and the other one is what we're referring to as a label centroid, which is derived from Matt Bloch's mapshaper, and which is pretty great because sometimes the label needs to go in a different place.
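The "breaches" idea described above (each record listing other records of the same place type whose geometry intersects its own) can be sketched roughly as below. Note the deliberate simplification: this sketch compares bounding boxes only, which is a coarse first pass and not a true polygon intersection test; a real implementation would confirm candidates with a geometry library. The record layout, IDs, and coordinates are invented for illustration.

```python
from typing import Any, Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def bboxes_overlap(a: BBox, b: BBox) -> bool:
    """True when two boxes share interior area (touching edges do not count)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def find_breaches(records: List[Dict[str, Any]]) -> Dict[int, List[int]]:
    """For every record, list the IDs of same-placetype records whose
    bounding box overlaps it. Bounding boxes are a stand-in for full
    geometries here, so results are candidates, not confirmed breaches."""
    breaches: Dict[int, List[int]] = {r["id"]: [] for r in records}
    for i, a in enumerate(records):
        for b in records[i + 1:]:
            if a["placetype"] != b["placetype"]:
                continue  # only compare like with like
            if bboxes_overlap(a["bbox"], b["bbox"]):
                breaches[a["id"]].append(b["id"])
                breaches[b["id"]].append(a["id"])
    return breaches


# Invented example data: two overlapping neighbourhoods, one distant
# neighbourhood, and a locality that contains them all.
EXAMPLE = [
    {"id": 101, "placetype": "neighbourhood", "bbox": (0.0, 0.0, 2.0, 2.0)},
    {"id": 102, "placetype": "neighbourhood", "bbox": (1.0, 1.0, 3.0, 3.0)},
    {"id": 103, "placetype": "neighbourhood", "bbox": (5.0, 5.0, 6.0, 6.0)},
    {"id": 104, "placetype": "locality", "bbox": (0.0, 0.0, 10.0, 10.0)},
]

print(find_breaches(EXAMPLE))
```

The locality is not flagged even though it covers every neighbourhood, because the flag is only meant to surface disagreement between records of the same place type.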
Venues. This is an example from the SimpleGeo dataset. The Gowanus Canal, which is actually a toxic waste site in New York, is listed as a venue. It's also a feature that people have warm, fuzzy feelings about. It is parented by [unclear], which is also incorrect, but there's actually a Korean taco place at the corner of Ninth and Smith. This is one of the things that the Spelunker allows us to see: that there's data that needs to be fixed.

But there's also the notion of ground truth. This is a map that two artists in San Francisco made about microhoods in the Tenderloin neighborhood of San Francisco. They have a lot of fun names, so we imported them into the gazetteer and we parented them all by the Tenderloin.
This is what it looks like. Our record for the Tenderloin doesn't actually reflect what people in San Francisco think. That's a useful thing for us to see. This is just another example of the linkage: [unclear] and the parent of it is there.

Likewise, this is just the raw properties contained in the GeoJSON file. What's nice about this is that it's just an arbitrary bag of data; we can put whatever we want in it. One of the things we've started doing on the client side is prettifying that data as it comes down, and you can toggle this back and forth. Part of the reason we're doing this is because the next step is to think about how we build editing tools. What does an editing interface for this look like? We don't know yet.
Right, so I think I'm out of time. I'm going to put up two links. One is a 5,000 word blog post about this subject. There's at least another 5,000 words to say about it, but not today. The other may or may not be live as I'm speaking; if it's not live right now it will be later today or tomorrow. This is a public version of the Spelunker, so you can poke around the data and have a look at it. And then there are links to the source code and, most importantly, all the data itself. Thanks.

[Audience] I just wondered how you generate your identifiers and how you keep them coherent. So if I've got a California record and another one comes in [unclear].

[Speaker] So there are two questions there. One is how we generate the IDs: it's just a ticket server. It's MySQL.

[Audience] Are they randomly generated or auto-incrementing?

[Speaker] Auto-incrementing.

[Audience] And the updates? Presumably you're using secondary data as well, not solely primary. [unclear] So that's California, that's the United States; how do you bring them together?

[Speaker] Some of that is the next step. We've been able to rely on data that has strong concordances right now. I mentioned that we have concordances with GeoPlanet, and one of the things that we want to do is identify which records in GeoPlanet we don't have a concordance for, and then figure out which of those records we already have copies of. So essentially it becomes a geocoding problem, but it becomes a geocoding problem that we can then scope to a locality or region or country, because we already have a hierarchy of places.

[Audience] OK, thanks.

[Audience] Another question, about the centroids. [unclear] With the polygons in some cases, how do you assign the centroids? Is it point-on-surface? [unclear]

[Speaker] So the geometric centroid is done using Shapely. And the other one is Matt Bloch's mapshaper, so it's whatever Matt thinks. The idea is that there is room for lots of centroids, and we're overloading the term "centroid" to not necessarily mean a mathematical point but a point of focus.

[Audience] [unclear] I have lots of questions about how to facilitate collaboration across people, teams and countries, but I suppose the low-hanging fruit is this. You mentioned you can store different names [unclear]. Have you thought about ways of storing localization of names, more formal names in other languages, rather than just things like typos, the actual formal localized names?

[Speaker] Sure. At least to start, and we'll see how it plays out, we have adopted what GeoPlanet did, which is a pretty basic convention of a three-letter language code followed by a suffix, which is either the preferred name in that language, colloquials or variants, and then we just include all of them. The next step would be to identify the languages spoken in those places and then include that information.

[Audience] In the future, do you imagine there being tools for people to add to these places around the world, to add terms like that?

[Speaker] Absolutely. Community editing is absolutely something we'd like to do. It's desirable, it's good. It's also a hard problem, and we're trying to do this one step at a time. One of the things that we talked about in the blog post was that all the data is available on GitHub, and what we said was: please don't get too attached to GitHub, or Git per se, because it's probably not the right place for data of this volume. But until we figure out an alternative, what GitHub does is demonstrate the goals that we have for the project, which is that you should be able to fork it, you should be able to download a copy, you should be able to submit a pull request. After that it's a lot of detail work that we will figure out as we go.

[Audience] Do you have a plan for getting people to use it, beyond "if we build it, they will come"? [unclear]

[Speaker] So, beyond "if you build it, they will come"? I mean, we're talking to people. Do we have a fully formed strategy? No. I think part of it is that some of the services that we need internally depend on having this kind of data available, so that's the first step: to use an expression, eat your own dog food and demonstrate what's possible. There was an expression that a former colleague of mine at Flickr had, which was to create a generous community, a spirit of generosity and openness. We think that place is a problem that everyone has. [unclear] My experience has been that this is a problem that just keeps coming up, every single time you start a geo project. That's one of the things that Mapzen is here to do. Imagine if you didn't have to reinvent every part of the wheel. [unclear] Thank you.
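The naming convention mentioned in the Q&A (a three-letter language code followed by a suffix marking a name as preferred, colloquial, or variant, with all of them kept) can be sketched as below. The exact key format `lang_kind`, the suffix spellings, and the sample names are assumptions for illustration; the talk only states the general shape of the convention.

```python
from typing import Dict, List, Sequence, Tuple

# Assumed key format: "<three-letter language code>_<kind>".
# Misspellings are kept on purpose, as one more variant to match against.
NAMES: Dict[str, List[str]] = {
    "eng_preferred": ["California"],
    "eng_variant": ["Calif.", "Califronia"],
    "spa_preferred": ["California"],
    "fra_preferred": ["Californie"],
}


def split_name_key(key: str) -> Tuple[str, str]:
    """Split a key such as 'eng_variant' into ('eng', 'variant')."""
    lang, sep, kind = key.partition("_")
    if len(lang) != 3 or not sep or not kind:
        raise ValueError(f"unexpected name key: {key!r}")
    return lang, kind


def names_for_language(
    names: Dict[str, List[str]],
    lang: str,
    kinds: Sequence[str] = ("preferred", "colloquial", "variant"),
) -> List[str]:
    """Return every name recorded for a language, preferred names first."""
    found: List[str] = []
    for wanted in kinds:
        for key, values in names.items():
            key_lang, key_kind = split_name_key(key)
            if key_lang == lang and key_kind == wanted:
                found.extend(values)
    return found


print(names_for_language(NAMES, "eng"))
print(names_for_language(NAMES, "fra"))
```

Because nothing is discarded, a caller can choose the first entry for display and use the full list for matching user input.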

Metadata

Formal metadata

Title Who's On First - because sometimes geo is not spatial
Series title FOSS4G Seoul 2015
Author Cope, Aaron
License CC Attribution - NonCommercial - ShareAlike 3.0 Germany:
You may use and modify the work or its content for any legal and non-commercial purpose, and reproduce, distribute and make it publicly available in unchanged or modified form, provided that you credit the author/rights holder in the manner they specify and pass on the work or its content, including in modified form, only under the terms of this license.
DOI 10.5446/32053
Publisher FOSS4G
Release year 2015
Language English
Producer FOSS4G KOREA
Production year 2015
Production place Seoul, South Korea

Content metadata

Subject area Computer science