Data Science with OpenStreetMap and Wikidata

Video in TIB AV-Portal: Data Science with OpenStreetMap and Wikidata

Formal Metadata

CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Subject Area
This talk will be about how to use OpenStreetMap and Wikidata in common data science questions using Python. We will go through the similarities and differences between OpenStreetMap and Wikidata, explore the structure of both data sets and go through some key figures and statistics. The goal is to provide a bird's-eye perspective including a practical outlook. Some results will be presented to show various ways the data sets could be utilized to fuel further data analysis.
Keywords General

Good afternoon, and thank you all very much for coming. I hope you have had your coffee; we are going to have a very exciting afternoon. So, ladies and gentlemen, I would like to give you [name unclear].
Thank you. So my talk is going to be about data science with OpenStreetMap and Wikidata. Just a quick show of hands: who has worked with Wikidata? OK, about a third of you.
So first of all, the outline: what am I going to talk about? First, I am going to introduce OpenStreetMap and Wikidata to you as data sources for data analysis. I will show you what the differences are between those two projects and how you can connect both projects together. The second part will be the data science part. I won't have any AI or blockchain or these kinds of things; it will be classical data science. I will show you which tools I used, and it will be like an exhibition of the projects I tried to do with them. So let's get started.
Most of you probably know the elements in OpenStreetMap. You have three types of elements: nodes, ways and relations. You also have areas, but an area is also part of the ways or relations. A node is like a GPS point in OpenStreetMap. A way is a collection of nodes connected as a line string. And then you have relations, which can be a collection of nodes or a collection of line strings, and also relations of relations and so on.
Each of these elements can have its own metadata, and this is usually stored as key-value pairs. For example, you have the key amenity, which is pretty common, with values like bar, barbecue, beer garden, restaurant and so on.
If we have a look at the data in an example, here for my home city, you can see the various key-value pairs that you have: you have name, you have admin_level and so on. And right at the bottom you will see the wikidata tag, which is this weird number with the Q in front of it. But what is Wikidata? Wikidata is basically a knowledge graph.
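To make the element types and key-value tags described above concrete, here is a minimal Python sketch. It is only an illustration: the class names, the helper `wikidata_id`, and all sample IDs, coordinates and tag values are made up for this example and do not come from the talk or from any OpenStreetMap library.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    """A single point with latitude and longitude."""
    id: int
    lat: float
    lon: float
    tags: Dict[str, str] = field(default_factory=dict)


@dataclass
class Way:
    """An ordered list of node ids forming a line string (or a closed area)."""
    id: int
    node_ids: List[int]
    tags: Dict[str, str] = field(default_factory=dict)


@dataclass
class Relation:
    """A collection of members: nodes, ways or other relations."""
    id: int
    members: List[Tuple[str, int]]  # (element type, element id)
    tags: Dict[str, str] = field(default_factory=dict)


def wikidata_id(element) -> Optional[str]:
    """Return the value of the `wikidata` tag (a Q number), if present."""
    return element.tags.get("wikidata")


# Hypothetical sample data, for illustration only.
inn = Node(id=1, lat=47.8, lon=13.04,
           tags={"amenity": "restaurant", "name": "Example Inn"})
city = Relation(
    id=2,
    members=[("way", 10), ("way", 11)],
    # "Q00000" is a placeholder, not a real Wikidata item.
    tags={"name": "Example City", "admin_level": "6", "wikidata": "Q00000"},
)
```

With these definitions, `wikidata_id(city)` gives back the placeholder Q number, while `wikidata_id(inn)` gives `None` because that element carries no `wikidata` tag.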
So Wikidata is basically a project from Wikipedia, and their idea is mainly to take all the infoboxes and structure the data from the infoboxes in such a way that when something gets updated, like the mayor of a city, all of the other articles in other languages get updated too. As a side project, they wanted to build the biggest knowledge graph in the world. There you have all these different kinds of connections. For example, on the Wikipedia page for Douglas Adams you will see in the sidebar the link to the Wikidata item. It looks like this: you have the item with the ID, which we saw before, and each of these items is structured in such a way that you have statements, which are basically properties pointing to some other entity or some other item, and here you can see that
Douglas Adams was educated at St John's College, and St John's College is again a Wikidata item. You can have these connections going out in all directions, and you can ask very interesting questions with that. But how do you ask questions in Wikidata? Knowledge graphs usually use the SPARQL query language, which is a bit like SQL. Wikidata also offers a query service, a bit like Overpass Turbo for OpenStreetMap, where you can ask these queries. Here is an example for all the cats in Wikidata.
In the area below you have the table of what you get back. One question I wanted to ask was: OK, where are all the windmills in Wikidata? The query looks like this: you have this SELECT statement, which we saw before, and then you have the item and the item label.
Then come the triples, which point to other entities: you have an item, and you want it to have the property "instance of" windmill. Then you can have optional parts, where you want to have the image, the location and the country. The cryptic part below is there so that you get the labels for the items back. Surprisingly, most of the windmills are located in the Netherlands. This screenshot is also from the query service we saw before, and you can directly visualize the data as a map, as a table, as a bar chart and everything else. You can click on the points and see the data we collected before, like the country and the name of the windmill and so on.
OK, I talked about both projects, but how are they connected, and how can we connect them? Sorry, this is the wrong slide. First just a quick overview of the numbers. OpenStreetMap started in 2004 and has a bunch of data: over five million users, seven billion GPS points and so on. Wikidata is a much more recent project and has only twenty thousand active users. I make that distinction because when you work with Wikidata, you use the same login as you would use for Wikipedia, so the total would actually be a lot more; this is only the active part of them. The other thing you notice is that the number of items there is quite big for a fairly recent project. This is because Wikidata is, first of all, public domain, since they believe facts are public domain, and the other part is that Wikidata supports bots and people that do automatic inserts. So that's one difference. Now I want to talk about the connection between those two.
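The query structure described above (a SELECT with an "instance of" triple, OPTIONAL properties and the label service) could be assembled from Python roughly as follows. This is a sketch under assumptions, not the query shown in the talk: the function names and the User-Agent string are invented for this example, and the class QID is left as a parameter because the talk does not state it. The property IDs used are P31 (instance of), P625 (coordinate location) and P17 (country).

```python
import json
import urllib.parse
import urllib.request

# Public SPARQL endpoint of the Wikidata Query Service.
ENDPOINT = "https://query.wikidata.org/sparql"


def build_instance_query(class_qid: str, limit: int = 100) -> str:
    """Build a query for items that are an instance of (P31) a class.

    Coordinates (P625) and country (P17) are OPTIONAL, so items without
    them are still returned. The SERVICE block fetches readable labels.
    """
    return f"""
SELECT ?item ?itemLabel ?coord ?countryLabel WHERE {{
  ?item wdt:P31 wd:{class_qid} .
  OPTIONAL {{ ?item wdt:P625 ?coord . }}
  OPTIONAL {{ ?item wdt:P17 ?country . }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
LIMIT {limit}
""".strip()


def run_query(query: str) -> dict:
    """Send the query to the endpoint and return the parsed JSON response."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        ENDPOINT + "?" + params,
        headers={"User-Agent": "osm-wikidata-sketch/0.1"},  # made-up name
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def bindings_to_rows(result: dict) -> list:
    """Flatten the SPARQL JSON result format into a list of plain dicts."""
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in result["results"]["bindings"]
    ]
```

For the cats example mentioned earlier, `bindings_to_rows(run_query(build_instance_query("Q146", limit=5)))` would be one way to call it, since Q146 is the Wikidata item for the house cat. For windmills, the QID of the windmill class would have to be looked up on wikidata.org first.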
As we saw before, in OpenStreetMap we have the Wikidata item in the wikidata tag. This is stable because of the Wikidata ID: even if the item gets merged with another item or gets deleted (well, not when it gets deleted, but when it gets merged with another item), you get a redirect. So the tag is actually stable; it stays the same. It's the other way around with OpenStreetMap. In Wikidata you have the so-called OSM relation ID, which is used on almost one hundred thousand entities in total. But as you know, in OpenStreetMap the IDs can change: a node can become a way, a way can become a relation, and even relations can become other things. So it's actually not so stable, and it's especially important that you don't use nodes, ways or areas in this field. There has been a proposal for a permanent ID, but it is still being developed. Another interesting thing is that there is another tag in Wikidata with which you can map the key-value pairs we saw before to a property in Wikidata, which means you can do matching and things like that.
OK, let's go on with the data science. I saw a really interesting paper about data science called "50 Years of Data Science". It explains why there is a need for data science and whether it should just be called statistics. It's a pretty long read, but it's very interesting. The main difference is that statistics is usually focused on theory and inferential statistics, which means finding patterns in small sample sets, while data science is more directed towards prediction and pattern recognition for large data sets. Also, statistics is more theory-based and data science is more practice-based. That's an overgeneralization, but that's more or less the gist of it.
And the tools I used: I mainly used Jupyter, with which I also created this presentation. It's a notebook environment where you can have text and code together. Then I used PostGIS, which is the spatial database, the spatial extension for PostgreSQL. And finally I used [tool name unclear] for the visualizations, and all of this was made with Python. Just a quick rundown: in data science you mostly work with NumPy, pandas and matplotlib. Shapely and GeoPandas are wonderful libraries for geospatial analysis and working with geospatial data. PySAL is a pretty good library for spatial analysis, like spatial regression and so on. And the last one is Datashader, a visualization library for large-scale data.
So let's have a look at the OpenStreetMap elements with the wikidata tag. This is a map of Europe with all the OpenStreetMap items that link to Wikidata. As we saw yesterday, I didn't want to use a heat map, so I used a choropleth. It doesn't look much better; you can't really see the interesting patterns here. But when you do a hotspot analysis with a method called LISA, you can see hotspots, areas where you have more OpenStreetMap items with the wikidata tag: hot spots, cold spots and so on.
In Wikidata, one of the most common properties is the so-called "instance of"; it says which class something is an instance of. Here is a quick visualization of all the Wikidata items with a location. From them I took the most common "instance of", so you can see that there are some interesting regional differences. For example, in France you most commonly have church building. That mostly reflects the way people have inserted the data, which data sets were used, or what is interesting for the different people that use it. You can also see differences like this: in Britain you have what is called "building", and in another area you have a different class [term unclear]. [Unclear.]
An interesting thing is that when you have this "instance of", you can drill down into the data: you can filter the data so that you have only companies. Then you can visualize: OK, which companies do we have here? And then you can see that Germany and Austria have Gasthaus, which is basically a restaurant, then you have brewery, and in England you have a lot of pubs. Surprisingly. Also interestingly, when you select Britain itself, you will see that the upper part of Scotland is known for whisky distilleries. I'm not sure if there's a bias.
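The "most common instance of per region" map and the drill-down to a group of classes can be thought of as a grouped count. The following Python sketch shows that idea only; the function name, the record layout and the sample records are assumptions made for this illustration and are not the speaker's code or data.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional, Set, Tuple

# One record per Wikidata item: (region name, "instance of" label).
Record = Tuple[str, str]


def most_common_class(records: Iterable[Record],
                      only: Optional[Set[str]] = None) -> Dict[str, str]:
    """Return the most frequent class per region.

    If `only` is given, records whose class is not in that set are
    skipped, which corresponds to drilling down into one group of
    classes (for example businesses).
    """
    counts = defaultdict(Counter)
    for region, cls in records:
        if only is not None and cls not in only:
            continue
        counts[region][cls] += 1
    return {region: counter.most_common(1)[0][0]
            for region, counter in counts.items()}


# Made-up sample records, for illustration only.
records = [
    ("Region A", "church building"), ("Region A", "church building"),
    ("Region A", "pub"),
    ("Region B", "pub"), ("Region B", "pub"), ("Region B", "brewery"),
]
overall = most_common_class(records)
businesses = most_common_class(records, only={"pub", "brewery"})
```

Here `overall` holds the dominant class of every region, and `businesses` holds the dominant class after filtering to the business-like classes. Joining the result to region geometries for a map is left out of the sketch.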
But it is interesting; you can see these interesting relationships in the data.
What I wanted to see next: when you look at these businesses, many of them have a website associated with them. I recently saw a tweet where some guy on the internet said that apparently seventy percent of all websites run jQuery, and a lot of web developers hate jQuery. I wanted to see: OK, is it right what he says? I have the data, so I can check it and see how it looks regionally. So he was right: the median is around 0.69. This is the distribution of regions by their share of jQuery usage. We have some regions with a jQuery share of around 0.3, they probably use React, and then you have people that still use jQuery a lot. When you visualize that regionally, one big disclaimer is that a lot of the dark parts here are also there because of missing data; there is not enough data to see a proper pattern. It's more like fun to see what it looks like. And again the hotspot analysis: some parts [unclear] still like to use jQuery, and Sweden apparently doesn't like it at all. Sorry, there is a bias in the data, so don't believe the data just like that. We will come to that point shortly if we have time.
Another project I did with OpenStreetMap: I used the regions there and counted the various amenities in each region, for example restaurants, bars and things like that. From the counts per region I looked at the signature of a region, and then used the signature of the region to classify countries.
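The amenity "signature" of a region and its use for classifying countries can be sketched in plain Python. The talk goes on to name logistic regression as the classifier but does not show an implementation; in practice a library would likely be used, and the hand-written gradient descent below only keeps the example self-contained. The amenity list, the function names and the training data are all hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

# Hypothetical feature set; the talk does not list the amenities used.
AMENITIES = ["restaurant", "bar", "pub", "cafe"]


def signature(amenities: Sequence[str]) -> List[float]:
    """Share of each amenity type among a region's counted amenities."""
    counts = Counter(amenities)
    total = sum(counts[name] for name in AMENITIES) or 1
    return [counts[name] / total for name in AMENITIES]


def _score(weights: List[float], bias: float, features: List[float]) -> float:
    return sum(w * x for w, x in zip(weights, features)) + bias


def _sigmoid(value: float) -> float:
    return 1.0 / (1.0 + math.exp(-value))


def train(X: List[List[float]], y: List[int],
          learning_rate: float = 0.5,
          epochs: int = 2000) -> Tuple[List[float], float]:
    """Fit a logistic regression with stochastic gradient descent."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(X, y):
            # Gradient of the log loss for one sample is (p - label) * x.
            error = _sigmoid(_score(weights, bias, features)) - label
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


def predict(weights: List[float], bias: float, features: List[float]) -> int:
    """Return 1 if the predicted probability is at least 0.5, else 0."""
    return 1 if _score(weights, bias, features) >= 0.0 else 0


# Made-up regions of two hypothetical countries, labelled 0 and 1.
regions = [
    (["restaurant"] * 8 + ["bar"] * 2, 0),
    (["restaurant"] * 6 + ["cafe"] * 4, 0),
    (["pub"] * 7 + ["bar"] * 3, 1),
    (["pub"] * 5 + ["restaurant"] * 1 + ["cafe"] * 4, 1),
]
X = [signature(amenities) for amenities, _ in regions]
y = [label for _, label in regions]
weights, bias = train(X, y)
```

Normalizing the counts to shares is a design choice of this sketch, so that large and small regions become comparable; whether the original analysis used raw counts or shares is not stated in the talk.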
So here I classified with logistic regression to compare Germany and France. Again I want to say that the data is biased. I didn't prove it or check it properly; it was just a try to see if that works and how it looks. There are a lot of regional differences in the amount of data, and since Europe is pretty big, you also have very big variations between the countries in the coverage of the data.
So that's one part. Another project, which Stefan Keller from Switzerland showed me, is the Castle Dossier Map of Switzerland. It's a thematic map where you have data from OpenStreetMap, Wikidata and Wikimedia Commons, where the images come from, and Wikipedia for additional information. You can select a region and see where you have a castle nearby, and then you can see a little bit about its history. That was quite an interesting project as well that combines both data sets.
In conclusion: naming things is hard, and meaningfully categorizing things
is even harder. That is especially what we see in Wikidata: you can see these regional differences, where one region uses one class a lot because it has been established in that country, and in another country they use something completely different, like "building" in one and [term unclear] in the other, even though it's the same thing. So it's quite difficult. But what I like about Wikidata is that they try to build the whole thing bottom-up. They just see how it develops, and we will see whether the project turns out well in this sense.
As I said before, as a disclaimer: these hypotheses have not been tested, and no p-values were used for them.
Another thing which I find interesting, especially in OpenStreetMap, is this idea of Wittgenstein's ruler: when you use a ruler to measure the table, you also use the table to measure the ruler. So basically, if you have a data set, it can tell you as much about the people that created the data and the work behind the data as it can tell you about the region itself. That's something you can really see a lot in Wikidata and OpenStreetMap: where the focus of the communities is and which projects are working. For example, in Wikidata you have a lot of projects, the WikiProjects, where they assemble people to work on a certain region or a certain topic, like Nobel Prizes or something like that, and they collect data for that project.
And the last part is: know thy data. It's really important when you do classifications and things like that, and data analysis: you really need to know where the data comes from and whether it's complete. That's another big topic, but we don't have time for that. I left some
information on that at the end. Yes, just some information about the completeness; we actually do have some time left, OK. In OpenStreetMap you have a lot of research about data completeness. One really interesting piece of research found that the world's user-generated road map is more than eighty percent complete, according to this paper. There are also a lot of other projects that use different
measures to see how complete the data set is and how complete certain attributes are. In Wikidata this has also been worked on recently. There is an interesting article about how to estimate the completeness of classes in Wikidata. They basically used the time when somebody edited something and the time it gets edited again as a way to measure how complete the classes are. They saw that some administrative regions are fairly complete, while something like mountains or hills is not so complete, because you can still see a lot of change going on there. There is also another paper, more generally for knowledge bases like Wikidata, where they assessed the completeness of entities, and there is also an interesting tool where you can see the completeness for various attributes.
So, we made it. Any questions?
[Audience question, inaudible.]
So the question was about the jQuery data, where it comes from, right? Basically, I used the companies data which we saw before.
Let me check where it was.
Yes, so here, for this map, I collected all the businesses in Wikidata that have a location associated with them and that have a website. I think the figure was like twenty, one hundred thousand or something like that. I thought I would not make the time, because when I tested it yesterday it took like forty minutes.
OK, another question.
[Audience question, inaudible.]
So the question was whether the Wikipedia article is related to the Wikidata item. Yes, some of the information is loaded into the infoboxes, but the text is not modified by that information, unless it's in a language that is not updated and is using the Wikidata information to generate the text, as far as I understood. Also, [unclear] are manually inserted, or the Wikipedia article had the information before it was loaded into the Wikidata item from the Wikipedia page.
Yes, when I have cleaned the notebooks, I will publish the notebooks for that. So the question was whether the data is available somewhere, and the queries. Yes.
The next question was whether the data is extremely skewed or localized. As I said, I did not test whether this is the case. You could probably test it fairly quickly, whether there are large differences between regions or between countries. I think that's easy to test, but I haven't tested for that. Oh yeah, you could check on a much larger scale too.
[Audience question, inaudible.]
So the question was how to access the OpenStreetMap data. I used the Overpass API, but I also used the dumps; the whole Europe data set is too large for querying the Overpass API. I tried to truncate it and everything, but in the end it was easier to just download the whole thing. Another project I didn't mention is OSMnx. It's a wonderful project by Geoff Boeing where you can query OpenStreetMap fairly easily and get the result as a data frame.
Thank you so much.
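For smaller areas, the Overpass API access mentioned in the last answer could look roughly like the sketch below. It is an illustration under assumptions: the function names and the sample elements are invented here, the endpoint shown is one public Overpass instance, and nothing in it reflects the queries actually used for the talk. For a whole continent the speaker's remark applies, and a data dump is the more practical route.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict, List, Tuple

# One public Overpass API instance; other instances exist.
OVERPASS_URL = "https://overpass-api.de/api/interpreter"


def build_overpass_query(key: str, value: str,
                         bbox: Tuple[float, float, float, float],
                         timeout: int = 60) -> str:
    """Build an Overpass QL query for nodes, ways and relations that
    carry the given tag inside a (south, west, north, east) bounding box."""
    south, west, north, east = bbox
    return (
        f"[out:json][timeout:{timeout}];"
        f'nwr["{key}"="{value}"]({south},{west},{north},{east});'
        "out center;"
    )


def fetch_elements(query: str) -> List[dict]:
    """POST the query to the Overpass API and return the element list."""
    data = urllib.parse.urlencode({"data": query}).encode("utf-8")
    with urllib.request.urlopen(OVERPASS_URL, data=data) as response:
        return json.load(response)["elements"]


def wikidata_links(elements: List[dict]) -> Dict[Tuple[str, int], str]:
    """Map (element type, element id) to the Q number in the wikidata tag."""
    return {
        (element["type"], element["id"]): element["tags"]["wikidata"]
        for element in elements
        if "wikidata" in element.get("tags", {})
    }


# Made-up elements in the shape of an Overpass JSON response.
sample_elements = [
    {"type": "node", "id": 1, "lat": 47.8, "lon": 13.0,
     "tags": {"amenity": "restaurant", "wikidata": "Q00000"}},
    {"type": "way", "id": 2, "tags": {"amenity": "restaurant"}},
    {"type": "node", "id": 3, "lat": 47.9, "lon": 13.1},
]
links = wikidata_links(sample_elements)
```

Calling `fetch_elements(build_overpass_query("amenity", "restaurant", (47.7, 12.9, 47.9, 13.1)))` would send the request; the bounding box here is arbitrary.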