
Harmonizing pan-European datasets – experiences from GISCO

Formal Metadata

Title
Harmonizing pan-European datasets – experiences from GISCO
Title of Series
Number of Parts
57
Author
Contributors
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language
Producer
Production Place
Wageningen

Content Metadata

Subject Area
Genre
Abstract
Hannes Reuter, Statistical Officer - EUROSTAT, outlined the ‘Geographical Information System of the COmmission’ (GISCO), a permanent service of Eurostat that fulfils the requirements of both Eurostat and the European Commission for geographic information and related services at European Union (EU), Member State and regional levels. These services are also provided to European citizens at large. GISCO’s goal is to promote and stimulate the use of geographic information within the European Statistical System and the European Commission.
Transcript: English (auto-generated)
My name is Hannes Reuter. I'm speaking here on behalf of the GISCO team, and today I will present on harmonizing pan-European datasets and share a little bit of what we are doing in GISCO.
And you might wonder what I'm doing here and why I'm presenting. I will go into a little bit of a different detail, because this morning you have already heard the political view from Daniele Rizzi and from the colleague from, what was his name, sorry I forgot, Matt from Cinea.
So I'm now turning to GISCO, where we're really doing implementation work, and you might wonder why I'm speaking here on behalf of Eurostat.
Because I'm working in the GISCO team, and for legacy reasons we have been for some 20 to 25 years in the statistical office of the European Union, located in the beautiful town of Luxembourg in the state of Luxembourg. So if you ever come by, you can drop by our nice building, which you see here in the bottom right, and visit us.
And look at what we are doing for statistics, but GISCO has less to do with statistics. We publish just one dataset, which is the total surface area and total land area for the whole European Union, but the rest is really more geospatial data.
And you might wonder what GISCO stands for. A little joke aside, the Commission is known for its acronyms. GISCO stands for the Geographic Information System of the Commission or, as I sometimes joke, of coordination.
And what we're doing in there is the whole stream of work. What do you do in GIS? We're localizing, analyzing and visualizing datasets. And I hope by the end of my talk today you will have a bit of an understanding of what we're doing and why we're actually also doing these data harmonization efforts.
As you see here, we run quite a number of things, like data procurement from the member states, from EuroGeographics, from commercial sources, and from OpenStreetMap, which is not data procurement but data optimization.
We do things with them, we analyze them, and in the end we spit them out as visualization products, and you will see that during the talk. So we really have a triple role. We're a service provider for Eurostat.
So if you look at Eurostat visualizations like Regions in Europe or the Regional Yearbook, we do the visualizations for them. We're a service provider for the European Commission. We compile and harmonize pan-European datasets, and this is what I want to share here today.
For example, the location of healthcare services. And then we also coordinate, even in partnership with member states, and we give out grants where we try to bring together statistical and geo information to produce added value, similar to what you're trying to do with Geo-harmonizer.
So, I hope you can all see the slides here. About services: if I try to tell my mom what I'm doing, and explain what I've just explained to you, then sometimes I'm getting, Hannes, what are you doing?
And my standard sentence now is that I'm telling them, guys, we're the Google Maps of the European Commission. And then it's, I see, you're doing the maps, and I can type and I can route. Yes, yes, we do that.
But I'm not Google Maps, you know, we are not Google Maps. No, no, but we do that for the European Commission. So we have around 5 million users a day and we are spread across all the European institutions. I'll just mention a couple here and show you a couple of examples. So, for example, Daniele mentioned earlier data.europa.eu, which receives funding from DG CONNECT and the Publications Office.
And you see here, for example, the map on the left side. The overview map is coming from GISCO and is quite popular. Then, if any one of you made an Erasmus application to study in a country abroad, you needed to calculate the distance.
So there's a distance calculator in the Erasmus application program, not developed by us. This is developed by the respective business units in the Erasmus program.
But we are providing the back end to that one. If you apply for a visa, you can also see in the EU immigration portal a map whose back end is again provided by GISCO.
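To make the distance calculator idea concrete, here is a minimal sketch of a straight-line (great-circle) distance between two points using the haversine formula. The talk does not say how the Erasmus calculator or the GISCO back end actually computes distances, so the formula, the Earth radius value and the approximate city coordinates are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate city-centre coordinates (latitude, longitude), for illustration only.
luxembourg = (49.61, 6.13)
wageningen = (51.97, 5.67)
distance = haversine_km(*luxembourg, *wageningen)
```

A straight-line distance like this ignores roads entirely, which is why a routing service gives different answers for travel distance or travel time.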
Last but not least, you can even see the building outlines. Here is my entry in our Who is who, which is like the telephone book of the European Commission, and there you can see where we are located.
And we have not only a European focus, but also a global focus. That's the reason why we also rely heavily on OpenStreetMap, which then needs to be modified for the political realities of our political leaders. So if you want to see where in the world our colleagues from the External Action Service or NIA or ECHO are working, we're providing maps for that as well.
So that was just to give you a general overview of what we're doing. In that respect, we provide corporate-level services. We have our central database, and from that one we're making our background maps.
We provide geocoding, reverse geocoding and routing services. We have an ID service, and I really liked the talk by Ingo just now, where he spoke about the simplification with the OGC API. This ID service has proved quite popular for us, because suddenly people are using it in places where you would never have expected it to be used.
You just put in some geometry, some coordinates, and it gives you back codes or geometries for the NUTS regions or for the countries. And it sometimes gets hit with 9 million requests over 12 hours, something like that.
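What such an ID service does, coordinates in and a region code out, can be sketched with a plain point-in-polygon test. This is only an illustration of the idea: the region codes and rectangles below are hypothetical placeholders, not real NUTS codes or boundaries, and the talk does not describe how the real service is implemented.

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test: True if (x, y) lies inside the polygon given by ring."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            # x coordinate at which this edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical codes and simplified rectangles standing in for real boundaries.
REGIONS = {
    "AA01": [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)],
    "AA02": [(2.0, 0.0), (5.0, 0.0), (5.0, 2.0), (2.0, 2.0)],
}


def lookup_region(lon, lat):
    """Return the code of the first region containing the point, or None."""
    for code, ring in REGIONS.items():
        if point_in_polygon(lon, lat, ring):
            return code
    return None
```

With millions of requests, a linear scan over all regions would be too slow; a production service would presumably put a spatial index in front of the polygon test.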
And we do something simple, like making a quick map for our policy officers with IMAGE, or we disseminate the files and even provide a metadata portal internally. Why did I spend a bit of time on the introduction here? Because I now want to put it into the bigger framework of the United Nations.
If we're talking about the Global Statistical Geospatial Framework, you're coming into the same play here. And if you look at the key elements, you see the use of what we call fundamental geospatial infrastructure and geocoding, which is the base for having geocoded unit record data in a data management environment.
So for someone who works on Earth observation, that's not really so important. It only becomes important if you want to have your ground truth data.
How do you geocode an address? How do you locate a building? Then this becomes important. And that's the reason why we are working on creating pan-European datasets, to really bring in all this fundamental data.
And you will see that this is a key document, and also a key vision, for what we are working on and what drives us. Because if we're talking about interoperability, we heard about the OGC API, we know all this, we know common geometries, be it a country or a statistical unit.
But if we really go down here to the base, then it gets really interesting and also becomes quite challenging technically. From our side, we have set out a couple of themes, which are the geospatial data requirements that we want to have.
This is in line with the United Nations UN-GGIM core reference data and in line with INSPIRE. So here we have a link back to all the legal talks which we heard earlier from Daniele Rizzi.
Here on the right side, for example, you see an artistic representation of postal codes, which we have already been disseminating from the GISCO site for a couple of years. So now I'm coming to the part where I want to show you what we are compiling. On pan-European datasets, compilation issues and all these kinds of issues, I will comment in my lessons learned section.
So here, for example, you see healthcare and education, and especially with COVID, healthcare was of quite high interest in the user community: how to use it, what is the current status, what is available, and what needs to be done.
Utility services. So here is an example from healthcare. As one type of analysis, we averaged the travel time to the nearest three hospitals at the country level.
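The indicator just described, the average travel time to the nearest three hospitals, aggregated per country, can be sketched as follows. The country codes and travel times are made-up illustration values, and the simple unweighted mean per country is an assumption; the talk does not spell out the exact aggregation.

```python
from collections import defaultdict
from statistics import mean


def mean_nearest(travel_minutes, k=3):
    """Average of the k smallest travel times (or of all of them if fewer than k)."""
    return mean(sorted(travel_minutes)[:k])


# Hypothetical records: (country code, travel times in minutes to reachable hospitals)
locations = [
    ("XA", [12, 40, 25, 90]),
    ("XA", [5, 8, 11]),
    ("XB", [60, 75, 45, 120]),
]

# Step 1: one value per location. Step 2: average those values per country.
per_country = defaultdict(list)
for country, times in locations:
    per_country[country].append(mean_nearest(times))

country_average = {country: mean(values) for country, values in per_country.items()}
```

The same two steps work for any grouping key, so replacing the country code with a NUTS or local administrative unit code gives the finer-grained maps.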
And this is what you sometimes get at a political level, this kind of statistical discussion. I put the URL down here: if you look at ec.europa.eu, slash eurostat, web, GISCO, geodata, reference data, healthcare, you can get the datasets for downloading.
Not this one, because this is still under validation, but the underlying point information, which I will show in the last slide. So here it is for the country level.
And then, as we actually have record-level data thanks to our geocoded infrastructure, we can also do it at the NUTS level. Sorry if not everybody is aware: NUTS are statistical units which allow comparison across all the member states of the European Union. There's a legal act behind it, just to mention that. So you can do that at the NUTS level.
Or you can do it at the local administrative unit level, which you see here, and then you look at it and you say, oh, we have an issue. For example, if you simply look at the LAU level in Spain or, for example, here in the Carpathian Mountains,
or even in a certain area in Sweden, you say, oops, wow, we see quite a lot of brown. So we have long travel times to the nearest three hospitals. But is that the reality? Because luckily, what we also have now is a one-kilometer population grid from the census. The next census is currently in execution, so in the next two to three years we will have another one-kilometer grid based on a one-kilometer data set.
And then suddenly you see, oh, we don't have an issue here in Spain, because in most of the countryside no one is living there, and the same in the Alps or here in the Carpathian Mountains. But in other areas, like in Sweden or in the eastern part of Poland, there might be an issue.
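The effect of bringing in the population grid can be shown with a small population-weighted average. This is a sketch under assumptions: the grid cells, populations and travel times are invented, and weighting by population is one plausible way to combine the two layers, not necessarily the method GISCO uses.

```python
def population_weighted_mean(cells):
    """cells: list of (population, travel_minutes) pairs, one per grid cell."""
    total_pop = sum(pop for pop, _ in cells)
    if total_pop == 0:
        return None  # nobody lives here, so no one is affected by the travel time
    return sum(pop * minutes for pop, minutes in cells) / total_pop


# Hypothetical region: mostly empty cells far from hospitals, one populated cell close by
sparse_region = [(0, 95.0), (0, 110.0), (4000, 15.0), (20, 80.0)]

unweighted = sum(minutes for _, minutes in sparse_region) / len(sparse_region)
weighted = population_weighted_mean(sparse_region)
```

Here the plain average is 75 minutes, which would paint the region brown on the map, while the population-weighted value is about 15 minutes, because almost everyone lives in the one well-served cell.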
And this is where it really becomes analytically interesting from our side. And again, the healthcare locations which we have used for this one are part of the geospatial infrastructure of the GSGF, which I mentioned earlier.
A second example which I would like to present to you today is addresses, where we're again compiling data from member states and bringing it together.
And I just want to show you a small front end to that one, because this is something we actually learned the hard way: an API is not enough. You always need to put a human-friendly interface on top of it. This is actually not from me.
This is from UK colleagues, who have figured that one out. I'm quoting them, and we're always trying to do that. We're trying to provide the download package, we're trying to provide the API, and we're trying to put a human-friendly interface on top of that. And here, for example, you see a house number, a street and even the Open Location Code, and the URL for how to do that.
And we're trying to complement that one. And you might wonder, why are we actually doing this? Because we receive geocoding requests to our infrastructure, and we see that people are entering what we call dirty information.
The data which are entered are not sanitized. They are not cross-checked at data entry. And there's this famous 100% rule of the total cost of quality.
It costs you 1% to fix it at data entry. It costs you 10% of your cost to fix it once it is in the database. But it costs you 100% of your cost if you cannot make the business decision, the political decision, or the sale to a client whom you want to reach.
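As a quick worked example of the 1, 10, 100 figures just quoted, the sketch below treats them as relative cost factors per error. The counts of errors per stage and the unit cost are invented for illustration; only the three factors come from the talk.

```python
# Relative cost factors per stage, taken from the percentages quoted above
COST_FACTOR = {"at_entry": 1, "in_database": 10, "at_decision": 100}


def total_cost(errors_by_stage, unit_cost=1.0):
    """Sum the cost of errors, weighted by the stage at which each one is handled."""
    return sum(COST_FACTOR[stage] * count * unit_cost
               for stage, count in errors_by_stage.items())


# Hypothetical scenario: the same 100 bad records, caught at different stages
caught_early = total_cost({"at_entry": 100})
caught_late = total_cost({"at_entry": 90, "in_database": 9, "at_decision": 1})
```

Under these assumptions, letting just ten of the hundred errors slip past data entry nearly triples the total cost (280 against 100).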
And that's the reason why we're trying to get a handle on that one, to allow our colleagues in the European Commission, be it the Erasmus colleagues or RTD, to use this address API in the future to validate addresses at the data entry point.
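Validation at the data entry point could look roughly like the sketch below: normalise what the user typed, check that the required parts are present, and compare against reference addresses. Everything here is hypothetical, including the function names, the field layout and the one-entry reference set; a real form would call the address API instead of a local lookup, and the talk does not describe that API's interface.

```python
import re

# Hypothetical reference data; a real check would query an address service instead
REFERENCE = {
    ("example street", "1", "0000"),
}


def normalise(text):
    """Trim, lower-case and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())


def validate_address(street, house_number, postcode):
    """Return (ok, problems) for one address typed into a form."""
    problems = []
    if not street.strip():
        problems.append("street is empty")
    if not house_number.strip():
        problems.append("house number is empty")
    if not postcode.strip():
        problems.append("postcode is empty")
    key = (normalise(street), normalise(house_number), normalise(postcode))
    # Only look the address up when all required parts were supplied.
    if not problems and key not in REFERENCE:
        problems.append("address not found in reference data")
    return (not problems, problems)
```

Returning the list of problems, rather than only a yes or no, lets the form tell the user what to fix while they are still typing, which is the cheap stage for fixing it.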
And you already see here one of the issues which I want to highlight. For some of the countries, we have data. For some countries, like Ireland, the UK, Sweden and Greece, we are in progress. And for some countries, apparently address data does not exist, which in 2021 is sometimes really surprising.
Last but not least, Daniele also mentioned the European Green Deal this morning. And for that one, we are currently also looking into what is available on building units.
Because that's interesting in terms of energy efficiency. And I just want to give you a little flavor of how we actually do the work, because what we're showing you today is really just the tip of the iceberg of the whole process.
So first of all, we go data hunting. We go to the INSPIRE Geoportal run by our colleagues at the JRC. Panos will present some soil data later on. But here in the INSPIRE Geoportal, we should see all the data sets from the member states.
Listen, we should. It's not always the case. If this doesn't help, we go to data.europa.eu, which harvests differently. If that doesn't help, we go to the national geoportals. If that doesn't help, we actually need to contact our local contacts in the national mapping agencies or statistical agencies to identify the data sets.
Once we have that, we make a map like this one, which allows us to clarify for which countries we have identified data and for which we have ingested it. And then, for example, here are all the building units from our colleagues in Slovakia, which is really nice, really cool.
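The data-hunting order described a moment ago (INSPIRE Geoportal first, then data.europa.eu, then the national geoportals, then personal contacts) is essentially a fallback chain. The sketch below shows that pattern only; the search functions are hypothetical stubs returning made-up dataset identifiers, and none of them calls a real catalogue.

```python
def find_dataset(theme, country, sources):
    """Try each source in order; return (source name, hit) for the first success."""
    for name, search in sources:
        hit = search(theme, country)
        if hit is not None:
            return name, hit
    return None, None


# Hypothetical stand-ins for catalogue searches; each returns a dataset id or None
def search_inspire(theme, country):
    return {("buildings", "SK"): "inspire-sk-buildings"}.get((theme, country))


def search_open_data_portal(theme, country):
    return {("addresses", "ES"): "portal-es-addresses"}.get((theme, country))


def search_national_geoportal(theme, country):
    return None


def ask_local_contact(theme, country):
    return None  # placeholder: in practice this is an email or a phone call


SOURCES = [
    ("INSPIRE Geoportal", search_inspire),
    ("data.europa.eu", search_open_data_portal),
    ("national geoportal", search_national_geoportal),
    ("local contact", ask_local_contact),
]
```

Recording which source produced each hit, as the returned name does here, is what makes it possible to draw the overview map of identified versus ingested countries.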
And here we even have the building heights for that one, just to give you a flavor. But then we look into it, especially with energy efficiency in mind. The INSPIRE data model would allow all this information to be shared, but suddenly we see that the only thing we have in there is building height, and everything else is missing, or is what is called voidable in INSPIRE.
So we have set it aside for the moment. But what we're actually interested in, and here, Tommy, I would love to be in Wageningen with you, is what the national energy atlas from the Dutch does.
Because there, for example, you see the energy efficiency for every single building in the Netherlands. And then this becomes interesting, because with this you can make policy decisions, and it can be aggregated and passed on to someone else.
So, coming to lessons learned. What did we learn? You need to have creative people, with geo and also IT knowledge.
They need to be motivated, and then you can achieve a lot. If you do the kind of work which we are doing, expect really long times to change things at the corporate level, because these people might be motivated, but they need to allocate the resources, they need to get the resources.
And you should not get frustrated. I know most of the people here in the room are in the research community, and there you just do it like that. Yeah, cool. Having worked at the European Commission for eight years, and being a former researcher, I can say it takes a long time and we must not get frustrated.
What I also learned the hard way: various tools for different things. A mixed environment is always good, be it Esri, QGIS, whatever, and also the human environment. And if I look at what we're doing in the daily business: what happens if you do it with five people or with 5,000 people, and in a concurrent environment?
Because that might completely change how you do certain things. This lesson I learned especially the hard way, coming from a research perspective, but I had good trainers.
Lessons learned on the data set harmonization side. I just want to mention a couple for you here. What we see is that we receive temporally outdated data.
Sometimes data sets are hidden or not documented, so they're really difficult to find. We find only generalized data, or data sets that are not aligned, so the building data set does not match the addresses or vice versa. We see different approaches in the member states of the European Union.
Certain countries have a centralized endpoint, and certain member states say, yeah, you need to go to every commune, and there is no central point. Others, for example the Spanish, are really good: every commune is responsible for its addresses,
but they have a central point where you can download over 5,000 files, everything according to standard and code. While another member state has a central data set, but there's no quality control on it. Which leads to what I mention here as the requirement for quality control, because we see varying quality across themes and countries.
And believe me, we have seen everything by now, from really good data sets, to people who are really engaged and want to fix things, to something like, oh, we cannot do that, we don't want to do that.
And sometimes, and this really worries me, no official data exists, but a commercial data set exists for the same area. And for me, as a European citizen, this is something really challenging, I must say.
What we also learned is that our member states, if we tell them what we discover in our feedback loops, are always really interested in what we're doing.
They are eager to fix it, and they are eager to allocate the resources to move forward with it. Last but not least, in our unit we also run the LUCAS survey, and I know you are heavy users of that one.
And I just want to mention that one. This is a data set which we are not obtaining from a member state organization. This is a data set which Eurostat produces itself. We have a sampling strategy where we do a first-level stratification with orthophotos, and then we run a survey where people go out into the field.
And our colleagues at the JRC, here's a paper linked down there from Raphaël d'Andrimont, have actually brought together all the different years, from 2006 to 2018, in one big database.
There's an R script there, you have the survey geometries, and you have all the 5.4 million images, which we host on the GISCO server, for the 1.3 million points, which you can access with the R script to really get the pictures.
And I'm really looking forward to people like you here in the room, who are much more involved in Earth observation data sets, using it for your daily work.
I'm really looking forward to these images, of which you see a couple here from the survey, being used for all kinds of work. Just to give you a little flavor of what you can expect if you go out surveying: for example, we require that everyone takes a picture on the ground, and then the cows are coming, or a bear is coming, or a snake, we have seen all of that.
These are a couple of survey experiences in the forest, where people are going up into the mountains, or here some roads and tracks on which people go to the survey plots.
And in the last one, for example, there were 500 teams out doing the ground survey, just to mention that. And with that, I'm finishing up for today. Just a little animation from our Copernicus colleagues at the EEA, who have just released the High Resolution Vegetation Phenology and Productivity product, which goes down to a 10-meter resolution and might be of interest for you to work with in the future.
Thank you very much. I'm open for any questions.
Thank you so much, Hannes, for this talk. We have some time for questions, so let's start.
I see the question from Marcio Martins. Maybe I'll take that one. This is a little bit of a challenging one, or not a challenging one. We're debating whether to open source it or to open it up to everyone outside.
First of all, you need to understand that this geocoding API is for use by the European institutions, because some of the data sets which we have requested for it are for EC policy purposes only and not for everyone.
So, yes, this is the point. If you are a public administration, you can access it. We have already made that case a couple of times. I don't know which organization you are from.
If you are a public administration, you email ESTAT-GISCO and we will discuss it. If you're something else, a private organization, apologies. Nope.
Okay, we're looking for more questions. Yes, I want to say something. I really like that example with the distance to the hospital. Basically, what you showed is that if you don't have the right data, if you don't have enough data, like the population density and everything, you don't see the real picture. You only see it when you go and standardize by population density.
And then you also showed that in the Netherlands they have data on how energy efficient the buildings are. That would be super interesting to see for the whole of Europe, to see where the biggest gaps are. So this is really amazing. I mean, if your department made these views of Europe, we could see where the critical problems are,
and the only way to see this is by getting the best data and then preparing the data so people can directly make decisions. So my question to you is: you are an official office of the European Commission.
How do you deal with uncertainty? In this case of the distance to the hospital, for sure there is also uncertainty, and as you said, the data has variable quality from different countries. How do you deal with uncertainty in this very important information?
Do you also visualize it? Do you have a standard for that? I'm sorry to ask you that.
I like interesting questions, Tommy. We know each other.
I wish we had enough resources to work on uncertainties. We are aware of these, but we are rather resource limited.
And the question is where we start interacting with the member states. For example, if we're talking about hospitals and travel time to hospitals,
it already starts with: where is the data entry? Where's the door? Because we know there are six different ways to geocode the location of an address, and each country does it slightly differently.
The best thing we can do for the moment is to document what we observe in our metadata files. And with that, maybe in the future we can also try to work on uncertainties.
But frankly speaking, we don't have the resources, and we are aware of that. For example, and apologies, I seem to have skipped the slide where I showed what we're doing for transport networks,
we also see varying quality in transport networks. We obtain a commercial data set from one of the biggest providers, then we have OpenStreetMap, and then we have the national mapping agencies.
My colleague Julien Gaffuri has, for example, made an analysis where he used all three of these data sources, with their varying quality, to make these distance calculations. And the outputs are quite different, but we are not there yet in making that visually pleasing.
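One simple way to put a number on how much the three network sources disagree is the spread of their results for the same route. This is only an illustrative sketch with invented distances; it is not the analysis referred to above, and the relative spread is just one possible summary measure.

```python
def source_spread(distances_km):
    """distances_km: mapping of source name -> distance computed for the same route."""
    values = list(distances_km.values())
    lowest, highest = min(values), max(values)
    average = sum(values) / len(values)
    return {
        "min": lowest,
        "max": highest,
        "mean": average,
        # Range relative to the mean: 0 means all sources agree exactly
        "relative_spread": (highest - lowest) / average,
    }


# Hypothetical distances for one origin and destination from three network data sets
example = source_spread({"commercial": 41.0, "openstreetmap": 44.0, "national": 47.0})
```

A value like this could be stored next to each result or mapped directly, which would be a first step towards showing uncertainty rather than only documenting it in the metadata.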
Okay, that's something you show here, I assume, in this slide, no? Because I see there's uncertainty around the lines. No! No, unfortunately, Tom, that's not uncertainty. This is just Topi Tjukanov's, I hope, Topi, apologies if I mispronounce your name now, pencil-ish QGIS style, which I really like. And that's the reason why I said it's an artistic representation.