Who's On First - because sometimes geo is not spatial
Formal Metadata
Title | Who's On First - because sometimes geo is not spatial
Number of Parts | 183
License | CC Attribution - NonCommercial - ShareAlike 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this
Identifiers | 10.5446/32053 (DOI)
Production Year | 2015
Production Place | Seoul, South Korea
Transcript: English (auto-generated)
00:03
Some of the slides will be off the side of the screen. Hi everyone, my name is Aaron Cope. It's really nice to be back at FOSS4G. I first attended in 2007 in Victoria, and life and circumstance have prevented me
00:20
from coming back since then, so it's great to be here. One of the things that this talk provoked was the realization inside the company that nobody actually knows what my title is. So we're trying out Editor at Large. We'll see how that goes. This is where I can be reached.
00:42
I'm going to talk today about an ongoing project we're doing, which is a gazetteer. And before I get into that, I want to sort of do a crash through the last 10 years of my professional life in three slides to try and give some context to the work that I'll be talking about.
01:02
So about a million years ago, I worked for a small photo sharing website called Flickr. Some people uploaded some pictures of cats there. And one of the things that I worked on while I was there was the geotagging project, so to allow people to put their photos on a map. And one of the byproducts of all of that work
01:21
were what we called the alpha shapes, which are these crazy shapes you see in the slide. Those were all shapefiles that were derived from nothing but geotagged photos. And we released that as a public domain data set. And it was global geographic data sets
01:40
for neighborhoods, countries, regions, that sort of thing. We had our own gazetteer at Flickr, but it only returned bounding boxes, which made doing geocoding really hard. So this is one of the things that we ended up doing.
02:01
A little bit after that, I worked for a design studio in San Francisco called Stamen. We did a lot of mapping work. We loved maps, and we did a lot of work with OpenStreetMap and data. And a lot of the work that we did, sometimes for clients and sometimes just for ourselves, was really about an opportunity to revel in an abundance
02:23
and an availability of data that had never been there. So this is a project that I did called prettymaps, which was basically taking all of the Flickr shapefiles, quite a lot of data from OpenStreetMap, some data from Natural Earth, and just pushing it all together. And then just recently, I did a bit of a detour
02:42
into the museum world for about three years, where we were closed for three years. And as part of the reopening, we built custom hardware, which is a whole other story, and a huge infrastructure that allowed visitors to take this custom piece of hardware,
03:01
it was an electronic pen with an NFC tag in it, and you could walk around the museum and collect objects. And then your visit would be saved basically forever on the museum website. So every object had a permalink. Every visit had a permalink. And what all of these projects share in common
03:22
is this idea of a network of documents. This is not a new idea, it's basically the web. It's the web from 20 years ago. The web has sort of evolved to, it's in a weird state right now, where the web is sort of evolving to become television in all the good and bad ways.
03:41
But it is worth remembering that 25 years ago, what the web was was an ability for people to recall documents at the luxury and pace of their own choosing. That's a big deal. We had never been able to do that before.
04:01
Everybody had to gather at the same time in the same place to watch the same television show. And what the web allows you to do is to go back and look at something when the urge strikes. And so that idea that access to recall is very much a power dynamic, and it always has been,
04:23
and the web was sort of a liberating force in that way. So fast forward to now, and I'm working at Mapzen, and we are building a gazetteer. And one way to think about a gazetteer is that it's just a big list of places, and each place has a unique ID
04:42
and a series of properties associated with it. That's it. I mean, it's a huge list of stuff. But that's really important. That ability to refer to something with a shorthand. The subtitle of the talk was
05:01
Sometimes Geo is Not Spatial, which was a deliberately provocative subtitle. But I actually do believe that. Sometimes geo isn't about spatial queries because you don't have the data or because the data is too big or because the infrastructure burden to work with that data is prohibitive.
05:21
Sometimes it's nice to simply be able to say California and refer to it by a shorthand, by an ID. And so we're building a big data set for the entire world. We are not the first gazetteer. We are not the first open gazetteer. But what we have endeavored to do
05:42
is to take about a half a dozen projects that have already existed and merge them where it's appropriate and to do coverage all the way down so that means continents, countries, regions, localities, neighborhoods, and venues.
06:02
We are starting with Quattroshapes. We have merged in Natural Earth. Where we can, we are taking Yahoo's GeoPlanet data set and incorporating their names. They have much better names than anyone else in terms of quality and coverage.
06:20
We're also taking smaller projects like David Blackman's Zetashapes, which are neighborhoods in the US. And I mentioned venues, which is sort of the holy grail of open data. And the sad, sad truth is there is no open venue data. It doesn't exist.
06:40
With the one exception of the work that SimpleGeo did five or six years ago to release a 20 million point data set of business listings. And we are importing that and we are incorporating it into our gazetteer. And so when we're done, we're about four and a half million records
07:01
into a 20 million record data set. Every one of those venues will have a hierarchy associated with them for countries and localities. And we are building, as much as anything, we are building a scaffolding for place for all the other services that we offer. So the example that I've been using lately is
07:22
it's one thing to be able to geocode Denver and to have that disambiguated down to a locality in Colorado. But let's say you want to then ask for the weather in Denver. And it is insane to have to ask the weather service to perform that same disambiguation request.
07:43
If the geocoder can return a stable permanent ID for Denver, then all you should need to do is hand that off to the weather service or any other service that follows. So a gazetteer is a big topic.
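The Denver example can be sketched in a few lines of Python. Everything here is hypothetical: the `geocode` and `weather_for` functions, the place ID, and the in-memory tables stand in for real services and are not taken from the talk. The only point is that the second service receives the ID and never has to parse the string "Denver" again.

```python
# Hypothetical sketch: two services sharing a stable place ID.
# The ID, names, and forecast below are made up for illustration.

DENVER_ID = 1001  # placeholder, not a real gazetteer ID

# What a geocoder might know: free text resolved to one place ID.
GEOCODER_INDEX = {
    ("denver", "colorado"): DENVER_ID,
}

# What a weather service might know: forecasts keyed by the same ID.
FORECASTS = {
    DENVER_ID: "sunny",
}


def geocode(name: str, region: str) -> int:
    """Disambiguate a name once and return a stable place ID."""
    return GEOCODER_INDEX[(name.lower(), region.lower())]


def weather_for(place_id: int) -> str:
    """Look up weather by ID; no text disambiguation happens here."""
    return FORECASTS[place_id]


place_id = geocode("Denver", "Colorado")
print(weather_for(place_id))
```

Any further service (transit, events, and so on) could be keyed the same way, so disambiguation happens once at the geocoder.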
08:00
I'm not gonna cover all of it in 20 minutes, but one of the things that I do want to talk about is some of the first principles that we've been starting from. And one of them is this, that the data is not the database. And this is really important. We are not optimizing for any one database. Databases come and go,
08:21
and one of the things that we are not concerned about, but that we're mindful of, is that we want our work, the work that we produce, either as software or data packages, to exist beyond our endeavor. We're in it to succeed, but life happens.
08:45
And the reality of Open Source Geo is that we're not the first people to do this, and lots of other attempts have failed, and people just get burned by it. So the most important thing about the data is portability.
09:01
Portability, longevity, durability. And so we have chosen to standardize on text files. Every single computer everywhere can deal with text. You should be able to look at the data in a text editor of your choosing. You should be able to edit this data in Microsoft Word
09:22
if that's what you need to do. And we have standardized on GeoJSON only because it is the least amount of formatting out there at the moment for structured data. There's nothing special about it, it's just the minimal amount of fuss.
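As a rough illustration of "one record, one text file", here is a minimal sketch of a place written out as a GeoJSON Feature. The ID, the property names (`name`, `placetype`, `parent_id`), and the coordinates are assumptions made up for this example, not the project's actual schema.

```python
import json

# A hypothetical gazetteer record as a GeoJSON Feature.
# The ID, property names, and coordinates are illustrative only.
record = {
    "type": "Feature",
    "id": 1002,
    "properties": {
        "name": "Example Place",
        "placetype": "region",
        "parent_id": 1001,
    },
    "geometry": {
        "type": "Point",
        "coordinates": [-120.0, 37.0],  # GeoJSON order: longitude, latitude
    },
}

# Serialize to plain text, which any text editor can open.
text = json.dumps(record, indent=2, sort_keys=True)
print(text)

# Reading it back needs nothing more than a JSON parser.
roundtrip = json.loads(text)
```

Because the properties are an open dictionary, extra names, alternate IDs, or flags can be added without changing the file format.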
09:40
And importantly, there's lots of other tools for converting it to other things. So the reality is that this makes the data difficult to work with in the short term, especially when you're dealing with huge amounts of data. So that's part of the work. Part of the work going forward will be building tools to marshal all of that data into the different databases
10:02
and to make it easier for people. But at the end of the day, this is text files, or these are text files, rather. These could be printed out and put in a book somewhere if we needed to. And the other one is stable permanent IDs. These are numeric IDs, they're 64-bit integers,
10:22
there's nothing terribly special about them, except that it means that things are identifiable uniquely and it makes it really easy to generate URLs. Here's just a quick example. This is the shape data for California.
10:45
And behind it is the shape data for the United States. The relevant bit, and I'll zoom in, is this. If you can't see that, what's happening is the data is being sent down to the browser
11:00
for California, and that data includes a number of pointers to all of its parent records, including the United States. And then in the client, in JavaScript, what we're doing is we're turning around and we're saying, may we please have the document for the United States? And we are downloading it, extracting the data, rendering the shapefile in the client side.
11:20
This is not necessarily the most efficient way to do this, but one of the things that we are trying to do in this project is to kick the tires at every step of the way. To make sure that every one of those documents, every one of those resources is fetchable online, and that you can be able to do something with it.
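The fetch-the-parent pattern described above is done in JavaScript in the browser; the sketch below shows the same idea in Python. The `parent_id` property, the IDs, and the in-memory `DOCUMENTS` table (which stands in for one network request per record) are all assumptions for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical in-memory "network": place ID -> document.
# IDs and property names are assumptions for illustration.
DOCUMENTS: Dict[int, dict] = {
    1: {"id": 1, "name": "United States", "parent_id": None},
    2: {"id": 2, "name": "California", "parent_id": 1},
}


def fetch_document(place_id: int) -> dict:
    """Stand-in for an HTTP GET of one record by its ID."""
    return DOCUMENTS[place_id]


def ancestors(place_id: int, fetch: Callable[[int], dict]) -> List[dict]:
    """Follow parent pointers upward, fetching one document per hop."""
    chain = []
    seen = {place_id}  # guards against accidental pointer cycles
    parent_id = fetch(place_id)["parent_id"]
    while parent_id is not None and parent_id not in seen:
        seen.add(parent_id)
        parent = fetch(parent_id)
        chain.append(parent)
        parent_id = parent["parent_id"]
    return chain


print([doc["name"] for doc in ancestors(2, fetch_document)])
```

One request per hop is deliberately naive, which matches the spirit of the talk: it exercises every document as an independently fetchable resource.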
11:41
Apologies, this screenshot is probably completely illegible from the second row on. Another example of this is, this is a tool for browsing the data, and on the left-hand side are search results, and on the right-hand side are facets or aggregations of that data. And so the second set of results there is
12:03
all of the unique regions that match this search result. And what we do is we return IDs, because that's what's stored in the database. But then what we're doing on the client side is, again, looping through each one of those IDs, turning around, asking the network
12:20
for the corresponding document, and updating the name in place. Again, this can be super, super inefficient. We know this. This tool is what I'll talk about in a moment. And part of what this tool is designed to do is to just beat the infrastructure with a stick
12:43
over and over and over again so that we can figure out what works and what doesn't. One of the advantages to having a document-based network with permanent IDs is that we have a convenient place for putting all of the names and all of the spelling mistakes.
13:02
We don't want to get into an argument with people about what something should be called or how it's spelled. We have room to put all the names. Likewise, all the geometries. One of the things that we have decided on
13:21
is that any given record will have what we are referring to as a consensus geometry. Consensus is not the right term, but we haven't found a better one yet. But this will be essentially the default geometry for a place. But we also have a corresponding file for all the alternate geometries.
13:41
Not everyone agrees about the boundaries for places. And our issue is not to make those decisions for people but simply to be a place to reflect those discussions. Likewise, some geometries are better for certain functionality. You don't necessarily need a hyper-detailed coastline
14:00
to do a geocoding query or reverse geocoding. Concordances. We have concordances with GeoNames, GeoPlanet, Quattroshapes. Eventually we'll have concordances with the recently released Getty Thesaurus of Geographic Names. We will hold hands with pretty much anybody.
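A concordance is just a mapping between one gazetteer's ID for a place and another's. The following is a minimal sketch, assuming each record stores a `concordances` dictionary keyed by source name; that property name and every ID shown are placeholders, not real identifiers.

```python
from typing import Dict, Optional, Tuple

# Hypothetical records: each carries IDs from other gazetteers.
# Source names and all IDs are placeholders.
RECORDS: Dict[int, dict] = {
    10: {"name": "Place A", "concordances": {"geonames": 501, "geoplanet": 901}},
    11: {"name": "Place B", "concordances": {"geoplanet": 902}},
}


def build_reverse_index(records: Dict[int, dict]) -> Dict[Tuple[str, int], int]:
    """Map (source, foreign ID) -> local ID for lookups from the other side."""
    index = {}
    for local_id, record in records.items():
        for source, foreign_id in record["concordances"].items():
            index[(source, foreign_id)] = local_id
    return index


def resolve(index: Dict[Tuple[str, int], int], source: str, foreign_id: int) -> Optional[int]:
    """Return the local ID for a foreign ID, or None if no concordance exists."""
    return index.get((source, foreign_id))


INDEX = build_reverse_index(RECORDS)
print(resolve(INDEX, "geoplanet", 902))
print(resolve(INDEX, "geonames", 999))
```

A lookup that returns `None` marks a foreign record with no known match, which is where matching work would still be needed.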
14:22
There's lots of room. Likewise, every record has a hierarchy. And in fact, some places have multiple hierarchies. Not everybody agrees on the relationship of a given place.
14:41
And so, again, we are not trying to make those decisions for people. We are trying to leave as many of those decisions to the edge, to the edges rather, and simply reflect what people are saying about a place. One of the things that I have argued for
15:03
for a long time in gazetteers is the notion that every record has two properties, supersedes or superseded by. And what that means is that we have a mechanism, we have a framework that allows a place to change. Now, there is a large philosophical question
15:22
that has nothing to do with geography per se, which asks when is something simply updated versus when does it fundamentally change? Right, is the caterpillar the butterfly? The answer to that question is, you wanna go get a drink and talk about it?
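The supersedes and superseded-by pair can be pictured as forward and backward pointers between records. Here is a small sketch, assuming each record stores both as lists of IDs; the property names `supersedes` and `superseded_by`, the place names, and the IDs are hypothetical. It walks forward from a retired record to whatever currently stands in its place, while the retired record itself stays in the data.

```python
from typing import Dict, List

# Hypothetical records. A retired place keeps its record and points forward.
RECORDS: Dict[int, dict] = {
    1: {"name": "Old Place", "supersedes": [], "superseded_by": [2, 3]},
    2: {"name": "Successor A", "supersedes": [1], "superseded_by": []},
    3: {"name": "Successor B", "supersedes": [1], "superseded_by": [4]},
    4: {"name": "Successor C", "supersedes": [3], "superseded_by": []},
}


def current_records(place_id: int, records: Dict[int, dict]) -> List[int]:
    """Follow superseded_by pointers until reaching records nothing replaced."""
    result: List[int] = []
    stack = [place_id]
    seen = set()  # avoids loops if the pointers ever form a cycle
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        successors = records[current]["superseded_by"]
        if not successors:
            result.append(current)  # nothing replaced it: still current
        else:
            stack.extend(successors)
    return sorted(result)


print(current_records(1, RECORDS))
```

Nothing is deleted in this model: record 1 remains a valid endpoint, and the pointers are only breadcrumbs between old and new.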
15:42
Again, we are not trying to answer this question, but we are trying to provide breadcrumbs so that when Yugoslavia stops being Yugoslavia, the nation that people knew until 1992, that record still exists. There is still a pointer to it. It's still a durable, reliable endpoint
16:01
for people to do something with. So again, just to repeat, we are trying to reflect the debate. We are not trying to decide it for people. So there are about 400,000 administrative places
16:20
at the moment and four and a half million business listings. So that's a lot of data, and it's pretty hard to sort of wrap your head around, and it's pretty hard just to remember where anything is. So we have started building tools for use internally, but everything is open source, so the source code for this is publicly available to use.
16:44
To explore the data. Not everyone may know what the term spelunking refers to. It is a term used in cave explorations for essentially feeling your way around an unknown cave in the dark for exploring it by touch and intuition.
17:02
And so this is what it looks like. It's pretty straightforward. At the moment, it indexes data in Elasticsearch only. Actually, that's not true. We index it in Elasticsearch, and we also index it in PostGIS. This is part of the attempt to put all the data in all the databases and to figure out
17:22
what works and what doesn't and to rinse and repeat. As of today, the spelunker doesn't do spatial queries. All this does is index on the properties in the GeoJSON file. But you can do some pretty amazing stuff. And one of the things that this starts to demonstrate
17:40
is one of the lessons we learned when I was working at the museum. And the dirty little secret about museums is that everybody's metadata is terrible. Nobody likes to say that out loud in public, but it's true. And one of the things the museum did before I got there was we simply CC-zeroed all of the metadata.
18:01
So the toothpaste was out of the tube. There was no putting it back. But that was actually okay, because one of the things we learned working with all of that data and building the collections website was that the value of the aggregate data vastly outweighs the sum total of a perfect subset. And so even though we know that quite a lot of the data
18:22
in the Mapzen gazetteer is incomplete or sometimes incorrect, there's some pretty amazing stuff that you can find. So this is just a screenshot of 11 localities in Korea that have been flagged as megacities.
18:40
This is all of the descendants of South Korea that are localities. So this is just a paginated view. These are all the descendants of a neighborhood that I had to create when I lived in New York called Gowanus Heights.
19:01
And I show this because the official neighborhood around here is called Gowanus. And one of the things that we will do shortly, we just haven't done the work, is every record will contain a list of pointers of other records of the same place type
19:21
whose geometry breaches that record's geometry. So for example, it may be a little hard to see, but there's a yellow dotted line that overlaps the pink shape. So the pink shape is Gowanus. The yellow dotted line is Gowanus Heights. And we will do this for every record. And we will do this, one, so that we have something to start with
19:41
in terms of doing editing and data quality work, just a flag to say these two things intersect and maybe they shouldn't, but also to reflect the reality that nobody agrees about neighborhoods, ever. Likewise, you may be able to see there are two centroids in the polygon for Gowanus.
20:05
One of them is the arithmetic, geometric centroid, just the actual center of the polygon. And the other one is, are we done? Oh, all right, well, I'm almost done. And the other one is what we're referring to
20:20
as a label centroid, which is derived from Matt Bloch's Mapshaper, which is pretty great because sometimes the label needs to go in a different place. Venues, this is an example from the SimpleGeo dataset. The Gowanus Canal, which is actually a toxic waste site in New York,
20:40
is listed as a venue. It's also a feature that people have warm, fuzzy feelings about. Parented by Gowanus. Also, incorrect. There's actually a Korean taco place at the corner of Ninth and Smith, but this is one of the things that the Spelunker allows us to see,
21:01
that there's data that needs to be fixed. Also, the notion of ground truth. This is a map that two artists in San Francisco made about micro-hoods in the Tenderloin neighborhood of San Francisco. They have lots of funny names.
21:20
So we imported them into the gazetteer and we parented them all by the Tenderloin. And this is what it looks like. So our record for the Tenderloin doesn't actually reflect what people in San Francisco think. That's a useful thing for us to see. This is just another example of the loin pit.
21:40
There's a micro-hood. And the parent of it is over there. Likewise, this is just the raw properties dump that comes out of the GeoJSON file. What's nice about this is it's just an arbitrary bag of data. We can put whatever we want into it. I'm just gonna wait for him to take a screenshot.
22:01
And one of the things we started doing on the client side is prettifying that data as it comes down. So you can toggle this view back and forth. And part of the reason we're doing this is because the next step is to think about how we build editing tools. What does an editing interface for this look like?
22:20
We don't know yet. Right. So I think I'm out of time. I'm gonna put up two links. One is, there's a 5,000 word blog post about this subject. There's at least another 5,000 words to talk about it, but not today. And then, this may or may not be live as I'm speaking.
22:41
If it's not live right now, it will be live later today or tomorrow. This is a public version of the Spelunker. So you can poke around the data, have a look at it. And there are links to the source code and most importantly, all the data itself. Thanks.
23:11
Yeah. Should we wait for the microphone?
23:24
Yeah, I just wondered how you generate your identifiers and how you keep them permanent. So you've got a California record and then another one comes in. Is it all done manually or how do you do it? So there's two questions there. One is how we generate the IDs.
23:41
It's just a ticket server. It's MySQL. Yeah, so just randomly generated. Just auto-incrementing. Oh, okay, auto-incrementing, okay. And how then do you update? So if you've got, presumably you're using secondary data as well, it's not all primary. So how do you do deduping and how do you say that's California, that's California and then update it?
24:02
How do we do deduping? Some of that is the next step. We've been able to rely on data that has strong concordances right now. I mentioned that we have concordances for GeoPlanet. One of the things that we want to do soon
24:20
is identify which records in GeoPlanet we don't have an ID concordance for and then figure out which of those records we already have copies of. And so essentially it becomes a geocoding problem. But it becomes a geocoding problem that we can then scope to a locality or a region or a country because we already
24:41
have the hierarchies for both of those places. Okay, okay, thanks. And the other, or the third question I guess then is for your centroid, so you said obviously you start off with polygons in some cases. How do you assign your centroids? Do you use PostGIS, is it point on surface? Or how do you do it? So the geometric centroid is done using Shapely.
25:05
So whatever Sean thinks. And the other one is Matt Bloch's Mapshaper. So it's whatever Matt thinks. And the idea is that there's room for lots of centroids. And we're sort of overloading the term.
25:22
For centroid to not necessarily mean a mathematical point, but a point of focus. Okay, that's great, thank you. Hi again. Hey.
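For readers unfamiliar with the distinction in the centroid answer above, here is a small worked sketch. The speaker says the geometric centroid comes from Shapely; the code below instead uses the standard shoelace formula in plain Python so that it runs without dependencies, and it is not the project's code. The U-shaped polygon is made up to show that a geometric centroid can fall outside the shape, which is one reason a separate label point can be useful.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def polygon_centroid(ring: List[Point]) -> Point:
    """Area-weighted centroid of a simple polygon (shoelace formula).

    `ring` lists the vertices in order, without repeating the first one.
    """
    twice_area = 0.0
    cx = 0.0
    cy = 0.0
    n = len(ring)
    for i in range(n):
        x0, y0 = ring[i]
        x1, y1 = ring[(i + 1) % n]  # wrap around to close the ring
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if twice_area == 0:
        raise ValueError("degenerate polygon with zero area")
    # Signed area is twice_area / 2, so the divisor 6 * area is 3 * twice_area.
    return (cx / (3.0 * twice_area), cy / (3.0 * twice_area))


# A 3x3 square with a notch cut down from the top edge (a "U" shape).
u_shape = [(0, 0), (3, 0), (3, 3), (2, 3), (2, 1), (1, 1), (1, 3), (0, 3)]
centroid = polygon_centroid(u_shape)
print(centroid)
```

The notch covers x between 1 and 2 and y between 1 and 3, and the computed centroid lands inside it, so a label placed there would sit in empty space rather than on the polygon.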
25:41
I have lots of questions about how to facilitate collaboration across people, teams, countries, everything. But I suppose a low-hanging fruit would be, have you considered, you mentioned you can store a bunch of different names and whatnot as text
26:00
because of the format chosen. And what about ways to facilitate localization of names, like more formal name structures rather than just having to deal with things like typos between names, but actual formal localized names and whatnot? Sure, at least to start, and we'll see how it plays
26:21
out, but to start with, we have adopted what GeoPlanet did, which is a pretty basic convention of three-letter language code followed by a suffix. And the suffix is either the preferred name in that language, colloquials, or variants. And then we just include all of them,
26:42
and the next step would be to identify the languages spoken in those places, and include that information. So would, in the future, do you imagine there being tools for people to add to these places around the world as new terms might be needed?
27:03
Absolutely. Community editing is absolutely something we'd like to do. It's desirable, it's good. It's also a hard problem. We're trying to do this sort of one step at a time.
27:21
One of the things that we talked about in the blog post was all of the data is available on GitHub, and what we said was, please don't get too attached to GitHub, or Git per se, because it's probably not the right place for data of this volume. But until we figure out an alternative, what GitHub does
27:42
is it demonstrates the goals that we have for the project, which is that you should be able to fork it. You should be able to download a copy. You should be able to submit a pull request. And after that, it's a lot of detailed work that we will figure out as we go.
28:04
Any other questions? Do you have a plan for getting people to use it beyond if we build it? I mean, the relevance of a data set has very much to do with the population of people who are using it, and in turn, think it's relevant, and then feed back in
28:22
and start to add their changes. Because if no one uses it, then it will die as soon as you stop growing it. So is there anything beyond if you build it, they will come in your plans? I mean, we're talking to people. Do we have a fully formed strategy? No, I think part of it is some of the services
28:44
that we need internally depend on having this kind of data available, and so that's the first step. And to overuse an expression, we eat our own dog food, and we demonstrate what's possible, and we,
29:04
there was an expression that a former colleague of mine had at Flickr, which was to create a community, a generous community, a spirit of generousness and openness, and we think that it's a problem that everyone has around place.
29:24
So, I haven't entirely answered your questions, but. Sure. Yeah, I mean, my experience has been that this is a problem that just keeps coming up every single time
29:40
you start a geo project. That's one of the things that Mapzen is here to be able to do, which is, you want to start a project, imagine if you didn't have to reinvent every part of the wheel, and that includes the data. Alright, thank you.