Pan-European open building footprints: analysis and comparison in selected countries

Formal Metadata

Title
Pan-European open building footprints: analysis and comparison in selected countries
Title of Series
Number of Parts
156
Author
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Building footprints (hereinafter buildings) represent key geospatial datasets for several applications, including city planning, demographic analyses, modelling energy production and consumption, disaster preparedness and response, and digital twins. Traditionally, buildings are produced by governmental organisations as part of their cartographic databases, with coverage ranging from local to national and licensing conditions being heterogeneous and not always open. This makes it challenging to derive open building datasets with a continental or global scale. Over the last decade, however, the unparalleled developments in the resolution of satellite imagery, artificial intelligence techniques and citizen engagement in geospatial data collection have enabled the birth of several building datasets available at least at a continental scale under open licenses. In this work, we analyse four such open building datasets. The first is the building dataset extracted from the well-known OpenStreetMap (OSM) crowdsourcing project, which creates and maintains a database of the whole world released under the Open Database License (ODbL). OSM buildings are typically derived from the digitisation of high-resolution satellite imagery, and in some cases from the import of other databases with ODbL-compatible licenses. The second dataset is EUBUCCO, a pan-European building database produced by a research team at the Technical University Berlin by merging different input sources: governmental datasets when available and open, and OSM otherwise [1]. EUBUCCO is mostly licensed under the ODbL, with exceptions only for two regions in Italy and the Czech Republic. The third dataset is Microsoft Open Building Footprints (MS), extracted through the application of machine learning technology from high-resolution Bing Maps satellite imagery between 2014 and 2023, available at the global scale and also licensed under the ODbL.
The fourth dataset, called Digital Building Stock Model (DBSM), was produced by the Joint Research Centre (JRC) of the European Commission to support studies on energy-related purposes. It is an ODbL-licensed pan-European dataset produced from the hierarchical conflation of three input datasets: OSM, MS and the European Settlement Map [2]. The objective of this work is to compare the four datasets - which derive from different approaches following heterogeneous processing steps and governance rules - in terms of their geometry (i.e. attributes are out of scope) in order to draw conclusions on their similarity and differences. It is known from literature that building completeness in OSM (which plays a key role in three out of the four datasets - OSM itself, EUBUCCO and DBSM) varies with the degree of urbanisation [3] and that machine learning applied to satellite imagery (used in MS) may have different performance depending on the urban or rural context [4]. In light of this, we analyse the building datasets according to the degree of urbanisation of their location using the administrative boundaries provided by Eurostat, which classifies each European province as urban, semi-urban or rural. We chose five European Union (EU) countries for the analysis: Malta (MT), Greece (EL), Belgium (BE), Denmark (DK) and Sweden (SE). The choice was motivated by the needs to: i) select countries of different size and geographical location, which ensure that their national OSM communities are substantially different; ii) select countries having different portions of urban, semi-urban and rural areas; and iii) select two sets of countries for which the input source for EUBUCCO buildings was a governmental dataset (BE, DK) and OSM (MT, EL, SE) to detect possibly different behaviours. 
From the methodological point of view, for each country and degree of urbanisation we first calculated and compared the total number and total area of buildings in all datasets and we examined their statistics through box plots. This was followed by the calculation, for each couple of datasets and degree of urbanisation, of the building area of intersection and its fraction of the total building area of each of the two datasets. Finally, we intersected all the four datasets and calculated the fraction of the area of each dataset that this intersection represents. Results show that in urban areas, while the datasets are overall similar in terms of total area of buildings, the total number of buildings is typically higher in EUBUCCO for DK and BE, where the information comes from governmental datasets. This suggests that such datasets outperform OSM in modelling the footprints of individual buildings in the most urbanised areas. In contrast, in semi-urban and rural areas, where OSM traditionally lacks completeness, MS (and as a consequence DBSM, which is also based on MS) captures more buildings. This is especially evident in SE, where 94% of the country area is not urban. When calculating the intersection between building areas for each couple of datasets in all countries and urban areas, the area of OSM buildings scores the lowest percentages of intersection when compared to the building areas of the other datasets. The lowest such percentages, equal to 25%, are scored when compared to MS in non-urban areas. EUBUCCO represents an obvious exception for the countries (MT, EL and SE) where it uses OSM. Finally, the dataset for which the area of intersection between the buildings of all the four datasets represents the largest percentage of the area is OSM, with values even higher than 80% for urban areas. This proves that EUBUCCO and even more DBSM can be considered a sort of ‘OSM extension’ improving its completeness. 
Instead, the lowest values are scored by MS and result from its radically different generation process compared to the other datasets.
Transcript: English (auto-generated)
Thanks, Daniele. It's also nice not to have to introduce myself again. So we stay with open data, but not from the public sector anymore. Here, the idea is to provide you with a comparison of some pan-European open building footprints, with an analysis in some EU countries.
So what is the context here? You know very well that building footprints, which from now on I will just call buildings, are, of course, key data sets for several applications: disaster management and response, urban planning, energy-related applications, demographic studies, et cetera.
Historically, they have always been produced, updated and curated by the public sector, right? National mapping agencies, cadastral agencies. But today, mainly thanks to technological advancements, there are other players that can be valuable producers of such data sets as well. And I'm speaking about the private sector, research or academia, and also crowdsourcing, or citizen-generated data initiatives.
Some examples here. From crowdsourcing initiatives, you all know OpenStreetMap. From private initiatives, we have Microsoft and we have Google, each producing their own open building data set. And we have the Overture Maps Foundation.
For those of you who are not familiar, this is a foundation established at the end of 2022 by four companies, Microsoft, Meta, Amazon, and TomTom, with the promise to produce global, open, quality data sets for a range of applications.
But we also have academia, and two products here, EUBUCCO and the Digital Building Stock Model, which I will cover a bit later on. So what did we do within this context? The idea was to focus on some of these building data sets, again open data from non-governmental organizations.
We focused on the European Union, and we downloaded them in January this year. The objective was to assess how similar or how different they are, only in terms of geometry so far, so we didn't look at the attributes. We did that in a limited number of EU countries, five, and we considered not just the countries as a whole, but also the degree of urbanization, to see whether things are different in urban areas or in rural areas.
Which are these four data sets? The first is OpenStreetMap. Again, I think there's no need to talk much about that. It's a crowdsourcing project started in 2004, currently with more than two million contributors.
The whole database is available under the ODbL. Buildings typically come from the digitization of satellite imagery, sometimes also from imports from third-party organizations and data sets with a license compatible with the ODbL. It's a global database, of course, updated on a continuous basis. We used Geofabrik to extract the buildings using the building key.
EUBUCCO is a data set produced by a research team in Berlin, at the Mercator Research Institute on Global Commons and Climate Change and the Technical University Berlin. It's mostly ODbL licensed. There are a couple of exceptions for two areas in Italy and the Czech Republic.
How is this produced? So they basically use the governmental data for countries where they found open governmental data. So this is, again, government. For the countries where they didn't find governmental data, they just use OpenStreetMap. That's important to remember in the following. The coverage is the EU plus Switzerland.
And this was released in 2022, also important to remember for the following. Then we used Microsoft Global ML Building Footprints. I will call it Microsoft from now on. Of course, this is a private initiative, open data under the ODbL. In this case, the buildings were derived with machine learning methods from Bing Maps high-resolution imagery.
It's a global data set, regularly updated. On the GitHub repository, you find all the releases. Microsoft publishes releases very frequently. The final product is the Digital Building Stock Model, or DBSM. This is produced by some colleagues of mine at the JRC, mainly for energy-related purposes.
It's, again, licensed under the ODbL. The production process is different again: it's a hierarchical conflation. They started from OpenStreetMap. Then they added Microsoft: where OpenStreetMap is not available, they looked at Microsoft. And then they also looked at the vectorized version of the European Settlement Map, which is a binary map with information on built and non-built areas, basically. It's available for the EU and was released in 2023. They are also planning a second release to happen soon.
As I said, we also looked at the degree of urbanization, using the NUTS classification. NUTS is the Nomenclature of Territorial Units for Statistics in Europe. You may be familiar.
We looked at the NUTS 3 areas, so the smallest administrative areas, roughly corresponding to municipalities or counties, with a population of 150,000 to 800,000 inhabitants. These are classified into three classes: urban, semi-urban, and rural. Five countries, as I said before: Belgium, Denmark, Greece, Malta, and Sweden.
Why those countries? Well, we wanted to choose countries of different sizes, geographically far away from each other, to make sure that their OpenStreetMap communities were different and not really influencing each other. They also needed to have different fractions of urban, semi-urban, and rural areas. You see the colors in the maps.
And they also needed to have EUBUCCO coming from different sources. As I said before, for some countries, in this case Belgium, Denmark, and Malta, EUBUCCO makes use of governmental data. For some other countries, Greece and Sweden, it's OpenStreetMap. Now, what did we do?
First, we looked at the data sets, and we calculated the total number of buildings and the total area of buildings for each of the data sets in each of the countries. We also plotted those two variables in a two-dimensional plot. Let's have a look at what this actually tells us, starting from EUBUCCO.
If we look at the countries where EUBUCCO is based on governmental data, you see very clearly that EUBUCCO stands at the top right. So it's the data set with the highest number of buildings and the highest total area of buildings. But if you look at the countries where EUBUCCO comes from OpenStreetMap, it's on the other side. So it's the data set with the fewest buildings and the lowest total area.
Looking at Microsoft, this is more heterogeneous. Sometimes, look at Greece, Sweden, and Malta, it looks quite good in terms of the total number of buildings, so quite complete.
In some other cases, it's more, let's say, shifted to the left, so basically few buildings. You see that for Denmark it's the data set with the lowest number of buildings, while in terms of area it looks like one of the highest in general.
OpenStreetMap, again, is typically not the data set with the highest number of buildings and the highest total area. But you can already see that in Denmark, for example, OpenStreetMap looks quite good, at least in terms of total area. We can already see that in Denmark, and also in Belgium, the OpenStreetMap communities are quite active.
This is not the case for Malta and for Greece, where you see very clearly that OpenStreetMap is really at the bottom left here. If we look at the comparison between OpenStreetMap and EUBUCCO, again, for the countries where EUBUCCO is based on governmental data, we clearly see that OpenStreetMap is much less complete than EUBUCCO.
If we look at the other countries, OpenStreetMap is a bit better than EUBUCCO. Why? Because EUBUCCO is using OpenStreetMap, but from 2021 or 2022, when they released the data set. And, of course, OpenStreetMap has improved in the meantime.
OpenStreetMap against Microsoft is quite interesting because, in some cases, look at Greece, Sweden, and Malta, where the OpenStreetMap community is probably not super active, Microsoft is much more complete, or at least looks much more complete, in terms of total area and total number of buildings. In other cases, the situation is the opposite.
Look at Denmark, where OpenStreetMap is more complete in terms of both total area and total number of buildings. Belgium is a bit strange because OpenStreetMap has more buildings, but a lower total area. Of course, it also depends on how buildings are mapped. You may have, in one data set, a set of adjacent buildings mapped as just one building, and in another data set, the very same group of buildings mapped as individual buildings. So you need to take all of these things into account.
Finally, the DBSM, the Digital Building Stock Model, is usually one of the data sets with the highest area of buildings.
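To illustrate the two summary variables discussed above (total number and total area of buildings, per data set and degree of urbanization), here is a minimal Python sketch. It is not the code from the talk's GitHub repository. The data set labels `A` and `B`, the region code `XX`, and all coordinates are invented toy values, and footprints are simplified to axis-aligned rectangles instead of real polygons.

```python
from collections import defaultdict

# Hypothetical toy records: (dataset, country, urbanization class, footprint).
# Footprints are axis-aligned rectangles (xmin, ymin, xmax, ymax) in metres,
# an assumption made only to keep the sketch free of geometry libraries.
RECORDS = [
    ("A", "XX", "urban", (0, 0, 30, 10)),    # a terrace mapped as one polygon
    ("B", "XX", "urban", (0, 0, 10, 10)),    # the same terrace mapped as
    ("B", "XX", "urban", (10, 0, 20, 10)),   # three individual buildings
    ("B", "XX", "urban", (20, 0, 30, 10)),
    ("A", "XX", "rural", (100, 100, 110, 108)),
    ("B", "XX", "rural", (100, 100, 110, 108)),
    ("B", "XX", "rural", (200, 200, 212, 210)),
]


def rect_area(rect):
    """Area of a rectangle given as (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = rect
    return float((xmax - xmin) * (ymax - ymin))


def summarise(records):
    """Return {(dataset, country, urban_class): (count, total_area)}."""
    counts = defaultdict(int)
    areas = defaultdict(float)
    for dataset, country, urban_class, rect in records:
        key = (dataset, country, urban_class)
        counts[key] += 1
        areas[key] += rect_area(rect)
    return {key: (counts[key], areas[key]) for key in counts}


summary = summarise(RECORDS)
for key in sorted(summary):
    count, area = summary[key]
    print(key, "buildings:", count, "total area:", area)
```

In the urban rows, the two toy data sets have the same total area but a different number of buildings, which is the effect of mapping adjacent buildings as one polygon versus individual ones.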
The DBSM's high total area comes from its hierarchical conflation approach, where multiple data sets are considered together. Now, after that, we wanted to really look at the similarity between the data sets. We started by deriving this table, where basically you see percentages. Each number is the percentage of the area of the data set you see in the column that is represented by the area of intersection between the four data sets.
What does it mean? Basically, if the percentage is high, it means that that data set in that country is very similar to the other three data sets. If the percentage is low, it is actually very dissimilar to the other data sets.
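The percentage just described can be written as 100 × area(A ∩ B ∩ C ∩ D) / area(X) for each data set X. The following Python sketch shows one way to compute it under strong simplifying assumptions: the four data sets are given generic labels `A` to `D`, footprints are invented axis-aligned rectangles, and footprints within one data set are assumed not to overlap each other. It is an illustration, not the implementation used in the study.

```python
from itertools import product

# Hypothetical toy data: four data sets with rectangle footprints
# (xmin, ymin, xmax, ymax). Rectangles inside one data set are assumed
# to be disjoint, so common areas can simply be summed.
DATASETS = {
    "A": [(0, 0, 10, 10)],
    "B": [(0, 0, 10, 10), (20, 0, 30, 10)],
    "C": [(2, 0, 12, 10), (20, 0, 30, 10), (40, 0, 50, 10)],
    "D": [(0, 0, 12, 10), (20, 0, 30, 10), (40, 0, 50, 10)],
}


def rect_area(rect):
    """Area of a rectangle given as (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = rect
    return float((xmax - xmin) * (ymax - ymin))


def intersect(a, b):
    """Intersection of two rectangles, or None if they do not overlap."""
    xmin, ymin = max(a[0], b[0]), max(a[1], b[1])
    xmax, ymax = min(a[2], b[2]), min(a[3], b[3])
    if xmin >= xmax or ymin >= ymax:
        return None
    return (xmin, ymin, xmax, ymax)


def common_area(datasets):
    """Area covered by a footprint in every data set at the same time."""
    total = 0.0
    # Try every combination of one footprint per data set.
    for combo in product(*datasets.values()):
        common = combo[0]
        for rect in combo[1:]:
            common = intersect(common, rect)
            if common is None:
                break
        if common is not None:
            total += rect_area(common)
    return total


def shares_of_common_area(datasets):
    """Percentage of each data set's total area that is common to all."""
    common = common_area(datasets)
    return {
        name: 100.0 * common / sum(rect_area(r) for r in rects)
        for name, rects in datasets.items()
    }


shares = shares_of_common_area(DATASETS)
for name, share in shares.items():
    print(name, round(share, 1))
```

With these toy values the smallest data set gets the highest percentage, because the common area cannot exceed its own area while the larger data sets contain many footprints that the others lack.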
So, we can look at this table from the country perspective. You see that in Belgium and Denmark, the data sets look quite similar, with percentages around 60%, 70%, 80%. Sweden, a bit lower numbers. Greece and Malta, the lowest numbers. Let's try to understand why.
Microsoft and the DBSM are those where we actually see the lowest numbers, especially for Greece and Malta. These are due to different things, but mainly to the fact that OpenStreetMap is less developed, less complete, as we also saw before, in Greece and Malta. If we also look at EUBUCCO and OpenStreetMap, EUBUCCO in Malta stands out with a very low number, because in Malta EUBUCCO is based on governmental data, and when we compare it with other data sets like OpenStreetMap, which is, as I said, very incomplete there, then of course we get this very low value of intersection.
If we do the same considering only rural areas, only urban areas and only semi-urban areas, which is not in the table, we get a very clear indication that in rural areas the data sets are even more diverse, okay? The minimum percentage becomes 7%. The highest percentage for urban areas becomes 79%, which says that, again, in urban areas the data sets are more similar. This also somehow confirms the literature on OpenStreetMap itself, which basically tells us that in urban areas OpenStreetMap is usually more complete, because there are more people living there, more mappers actually interested in mapping things there, et cetera.
Then we also computed the same, but for each pair of building data sets, and we created this table.
Again, in this table, for each country, each number is the percentage of the area of the data set in the row that is represented by the area of intersection between the data set in the row and the data set in the column. So it's the very same story as before, but applied to each pair of data sets. Again, I will guide you through the interpretation of this table.
Let's look at the three countries where EUBUCCO is based on governmental data and again compare OpenStreetMap and EUBUCCO. You see that if you look at the OpenStreetMap rows, the numbers are quite high. If you look at the EUBUCCO rows, the numbers are high with the exception of Malta, where we get this 27%, which again derives from the fact that in Malta the development, the completeness of OpenStreetMap is not very high, so we get this huge difference between the two data sets.
EUBUCCO against Microsoft and the DBSM. The numbers are pretty high here, lower than OpenStreetMap. Again, in Malta, Microsoft shows the lowest value.
EUBUCCO against OpenStreetMap, but in the other countries, where EUBUCCO is based on OpenStreetMap. Here we have this 99% and 97% when comparing the intersection to EUBUCCO, which is expected: it would have been 100% if we had considered OpenStreetMap as of two years ago, but OpenStreetMap, as we said, has evolved since, so we get slightly less than 100%.
OpenStreetMap versus Microsoft is a very interesting comparison, because they are the basic building blocks from which the other data sets are also derived, right? Numbers are pretty heterogeneous here. I just want to point you to this 24% and 25%, again in Greece and Malta, due to, again, the poor completeness of OpenStreetMap. But Sweden is also an interesting case, because both numbers are close to 50%, which basically tells us that both data sets have roughly half of the area of their buildings that does not actually intersect with the area of the buildings of the other data set, in both directions, which is pretty strange, actually.
Again, if we extended that to urban areas and rural areas, we would see basically the same story: the similarity increases in urban areas, these are just two numbers for Denmark, and it decreases in rural areas. Again, for several reasons. For OpenStreetMap, we know. For the other data sets, it could be a consequence of OpenStreetMap, or it could be related to the fact that the Microsoft algorithms might work less well in rural areas, or there may be different reasons still to be understood.
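The pairwise table discussed above is not symmetric: the cell for row X and column Y is 100 × area(X ∩ Y) / area(X), so swapping row and column changes the denominator. A minimal Python sketch of such a table follows. The labels `A`, `B`, `C`, the rectangle footprints and all numbers are invented for illustration, footprints within one data set are assumed not to overlap, and this is not the code published by the authors.

```python
# Hypothetical toy data: three data sets with rectangle footprints
# (xmin, ymin, xmax, ymax); rectangles inside one data set are disjoint.
DATASETS = {
    "A": [(0, 0, 30, 10)],                                   # one big polygon
    "B": [(0, 0, 8, 10), (10, 0, 18, 10), (20, 0, 28, 10)],  # single buildings
    "C": [(5, 5, 13, 15)],                                   # a shifted footprint
}


def rect_area(rect):
    """Area of a rectangle given as (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = rect
    return float((xmax - xmin) * (ymax - ymin))


def overlap_area(a, b):
    """Area of the intersection of two rectangles (0.0 if disjoint)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    if width <= 0 or height <= 0:
        return 0.0
    return float(width * height)


def pair_area(rects_a, rects_b):
    """Total intersection area between two data sets."""
    return sum(overlap_area(a, b) for a in rects_a for b in rects_b)


def pairwise_table(datasets):
    """Return {(row, column): percentage of the row's area in the intersection}."""
    table = {}
    for row, row_rects in datasets.items():
        row_total = sum(rect_area(r) for r in row_rects)
        for col, col_rects in datasets.items():
            if row != col:
                shared = pair_area(row_rects, col_rects)
                table[(row, col)] = 100.0 * shared / row_total
    return table


table = pairwise_table(DATASETS)
for (row, col), value in sorted(table.items()):
    print(row, "vs", col, round(value, 1))
```

In this toy case the intersection covers all of `B` but only part of `A`, so the two cells for the same pair differ, which is why both directions need to be read.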
The code we used is Python code and it's on GitHub. We parallelized part of the process to increase efficiency. We provide Jupyter notebooks. The license is the European Union Public Licence, which is fully open source, so feel free to take a look and, of course, provide comments, reuse, et cetera. Some conclusions here. I showed you a lot of numbers, but what can we, let's say, conclude from all of that?
So that was, at least to our knowledge, the first comparison of some of the available non-governmental building data sets. I want to stress the fact that we didn't do a quality assessment. I never said this data set is better than this other one. I mean, it could be better, but relative to a specific region or context, let's say.
So the term we like to use is comparison: we really evaluate the similarity, the difference. Let's always keep in mind that these data sets derive from different sources, again the private sector, the research sector, crowdsourcing, and they derive from completely different processes.
I would say also that the purpose of producing the data sets is itself very different. And also, let's take into account that there are variations, not only between countries, but also within countries. These are just two examples, both coming from Malta. This is Microsoft against OpenStreetMap.
You see on the left a case where we could say that the buildings of both data sets should correspond to the same real-world buildings. Most probably, they derive from the use of different satellite imagery. The result is, if you think back to what we did before, we calculated the intersection, which is the portion that is colored, and we calculated the fraction of the area of each of the two represented by the intersection. You see the intersection here is a small portion of the area of either of those.
The case on the right is different: in OpenStreetMap, we only have one big polygon, as if it were a single building. Of course, this does not correspond to reality. In Microsoft, we have all the buildings. So the intersection here is almost equal to the Microsoft area. So, different cases, even in the same country, and Malta is a very small country.
So what can we say at the end? We can provide some recommendations to users who would like to use one of those data sets. For OpenStreetMap, we showed that the quality very much depends on the actual presence of a community there. Imports matter, of course: if there have been imports from governmental data sets, the quality is usually better. The completeness, also confirming the literature, increases when moving from rural areas to urban areas, and using the latest version of OpenStreetMap is always recommended. So if you have in your local file system an OpenStreetMap database from some months or some years ago, and you need to use it, don't use it: download the latest version.
EUBUCCO, when based on governmental data, is of course a reliable data set, but it may be outdated. Consider that EUBUCCO was released in 2022.
So the governmental data set present in EUBUCCO might even have been released before that year. So, you know, good data, of course, because it's authoritative data, but it might be outdated. When EUBUCCO uses OSM, we have seen that OSM, of course, changes continuously, and this is not captured.
Microsoft is the data set showing really the most heterogeneous results. It looks like a complete data set, if you remember the initial graphs, because the number of buildings and the total area are usually among the highest ones, but we found a lot of very low percentages when looking at the comparison between pairs of data sets. So positional accuracy might be questionable. Of course, additional work would be needed to better understand that.
DBSM is a nice approach, because they combine different data sets, and this somehow overcomes the limitations of each of those data sets, which is actually good for maximizing completeness.
So if you have an application requiring a high level of completeness, like disaster response, for example, where you want to know where buildings are, where people are, this might be a good data set to use. Final point, on future work. We would first like to extend the study to validate what we have found, although I'm pretty sure that, even if we only looked at five countries, we already captured some trends and patterns that we would also find if we extended the area, for example, to the whole European Union.
We can, of course, plug in new data sets, like Google, although Google is not available in Europe, and Overture Maps data. We can even extend the work to attributes, in principle.
We didn't consider attributes, as I said, so that would need to be done from scratch. Of course, if anyone is interested, let's talk. And last but not least, we would like to complement what I showed you with a qualitative analysis of the data sets. What do I mean? I showed you a lot of numbers, intersections, geometries, but what is behind those data sets?
So, what are the licensing conditions? I briefly mentioned them at the beginning. Accessibility: is it something easy to download, easy to access? What is the encoding? Are there APIs available? Are there quality assurance and quality control mechanisms for the production of the data set? What is the granularity?
What is the scale? What is the coverage? I mentioned the coverage. Final point, on governance. So what is behind? Who takes decisions on what is included and what is not in the data sets? Is it just one company? Is it a community, like in the case of OpenStreetMap? And what is the sustainability of the project? Some projects already ensure that they will stay there for years.
For some others, this is questionable, like EUBUCCO, which was released in 2022. It's a small research team at the universities: is this sustainable? So all those questions are equally important to really better understand this landscape. That's it. Thank you very much.
Thank you, Marco, for your presentation. Very nice. Any questions for Marco? Thank you very much, Marco. I think that, for us as practitioners too, it's really interesting to have this sort of overview of those building footprint data sets.
Some are pushed much more, in terms of popularity, on Twitter and other social media. But it's good to know about the other ones too. And it's a curious comparison in Europe, this type of thing. Now, for us, if we want to do analysis or services around buildings and building footprints, I don't know, heat islands in cities, you know, those types of applications, what would you recommend us to use?
I missed which application?
No, different applications. For example, heat islands.
Heat islands, right.
Heat islands, or I don't know, wind corridors, or, you know, those types of things. Which data sets in Europe would you recommend us to use?
Well, that's exactly the most challenging point: it's not possible to tell you, for this application, use that data set. Coming back to this slide, you may understand, and this is what we try to provide here, the pros and cons of each data set. But then, I think it's up to the user, knowing that, and knowing the specific application, to have a look at the data sets. For example, if it's an urban application, I would say OpenStreetMap usually works well in urban areas. If it is a rural or country-wide application, well, let's take a look at, for example, how active the OpenStreetMap community is.
There are automated tools available to just check how many users are active on a daily basis. So, you know, this is, I think, the main conclusion of the study: it's not possible to tell which data set is better for what, but it really depends on the use case, the application, the specific context.
Great presentation. One question from my side regarding the governmental data. So, is EUBUCCO already considering cadastral data from every country, or is cadastral data a separate data set? Because if it's not considered here, I think that would be a really good source,
and one against which you could also compare the results from all the sources. Well, so, as far as I know, the EUBUCCO researchers looked at open data portals, so they really went to the national level for each EU country, and they accessed and used a data set whenever they found it as open data.
So, that could be from national mapping agencies, or even cadastral data, I guess, for some countries, but if you check the metadata for each country on the EUBUCCO website, you will find all the details. And the download happens country by country, so you can easily check all the metadata and where each data set comes from. Again, it's governmental data,
but it's data from 2022, or even earlier. So, it may be there, or it may not be there? If you find it in EUBUCCO, it's there. The provenance is governmental; I'm not sure whether it's cadastral data or from a national mapping agency, so you need to check on a country-by-country basis.
Thank you. I don't know if this is a question or more of a comment, but one thing that I know about governmental data sets is that in those, not everything that might look like a building to an AI is actually defined to be a building.
Like, for example, in the Finnish building data set, which I happen to know, there are buildings and then there are constructions. So, might this perhaps be one of the reasons why the Microsoft data set had so many buildings because it was made by algorithms, right?
So, it might have categorized things that are not officially buildings as buildings. Yeah, no, no, absolutely, absolutely. We didn't do this, let's say, semantic mapping between what is a building in each of the data sets. Microsoft derives from machine learning, so I think this is pretty straightforward. They just map as a building whatever the algorithm says
is a building. So, it could be not only construction works, but even something that is not a construction at all, that the algorithm interprets as a building, absolutely. The same goes for OpenStreetMap. Something may be mapped as a building while in reality it's not a building at all. So, absolutely, these are all things
to be taken into account when looking at the numbers. I think we still get an idea of the main trends and patterns and how they compare with each other. They are all released as building data sets. So, that's why we just took what is there and made a comparison, but these points are totally valid, thanks.
Yeah, sure, no, no.
As I said, we had to start from somewhere, and we just started from the easiest thing, which is just to compare. John, no, no, wait, but these are building data sets, so we did not take just any data sets. We took building data sets. Then, what each considers as a building, or whether any polygon is or is not a building,
we didn't, of course, check for each single polygon, but these are building data sets, so they are released as buildings. So, we assume that they actually represent buildings. As we just commented, this might not be the case in the real world, when we go and say, okay, this OpenStreetMap building is actually something else that the mapper saw as a building,
interpreted as a building. The same goes for machine learning. So, the reason not to look at attributes was just to do a first-level analysis, let's say. Looking at attributes would be much more complex. Of course, I didn't say that before, but some of those data sets include important information like the classification of the building,
number of floors, the age of a building, some others do not. That is also an important element to take into account for the use case and the application, of course. Consider this just as a very first analysis, only limited to the geometry, which already tells a lot, I would say.
Thank you. Thank you, Marco, and very good approach. I want to ask about the possibility of comparing not one source with another,
but maybe with in-situ data. I know that it's difficult, because you cannot go to every country to check each building, but maybe you could combine some digital twins or BIM models with these sources,
to compare areas with buildings that you know are topographically well evaluated against these sources. Maybe it's a good approach. So, if I understand correctly,
you are speaking about validation, somehow? Yes, validation with some buildings that you know are good, which come from digital twins or from BIM. No, no, absolutely. Another way could be to just have a look at some recent satellite imagery, maybe more than one image, and just pick out something that we know very clearly from the imagery is a building.
That, of course, would be very useful for, let's say, manual validation of the results. But, of course, it's very difficult to scale. Yes, maybe in tech project, we will provide some digital twin of Lisboa. So, maybe this is a good approach. Looking for something like that? Yeah, I mean, this is somehow also the approach
that was used at the very beginning of the literature on assessing the quality of OpenStreetMap. So, they took the governmental data as the ground truth to measure the quality of OpenStreetMap. So, where there was authoritative data, then this was the ground truth. It works, yes, but again,
authoritative data may not be fully updated. So, once again, say there is a new building built today. It's in OpenStreetMap tomorrow, but it will be in the governmental database in three years, maybe. So, that's always the kind of thing for which I think the approaches based on so-called extrinsic quality assessment, where we take something, the governmental data,
as the ground truth, work, but only up to a certain point, in my opinion. There is a lot of literature on that. We can discuss later. I've done a lot of research on that in the past. Sometimes, comparing the data sets, instead of just assessing the quality of one by taking something else as a reference, is maybe the best approach.
Thank you. Thanks.
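The extrinsic quality assessment mentioned in the last answer, where one data set is taken as the ground truth and another is measured against it, can be illustrated with a minimal sketch. This is not the method used in the study. It assumes, purely for illustration, that footprints are axis-aligned rectangles, that two footprints "match" when their intersection over union reaches a hypothetical threshold `min_iou`, and that the toy coordinates below stand in for a governmental reference and a candidate such as OpenStreetMap. Real footprints are polygons and would need a geometry library.

```python
from typing import List, Tuple

# Simplifying assumption: a footprint is an axis-aligned rectangle
# (min_x, min_y, max_x, max_y). Real footprints are arbitrary polygons.
Box = Tuple[float, float, float, float]


def area(b: Box) -> float:
    """Area of a rectangle; degenerate rectangles have zero area."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def intersection_area(a: Box, b: Box) -> float:
    """Area shared by two rectangles, or 0.0 if they do not overlap."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0.0


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two rectangles."""
    inter = intersection_area(a, b)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def extrinsic_scores(reference: List[Box], candidate: List[Box],
                     min_iou: float = 0.5) -> Tuple[float, float]:
    """Return (completeness, correctness) of candidate against reference.

    completeness: share of reference footprints matched by a candidate one.
    correctness:  share of candidate footprints matched by a reference one.
    The 0.5 threshold is an illustrative assumption, not a standard value.
    """
    matched_ref = sum(
        1 for r in reference if any(iou(r, c) >= min_iou for c in candidate)
    )
    matched_cand = sum(
        1 for c in candidate if any(iou(c, r) >= min_iou for r in reference)
    )
    completeness = matched_ref / len(reference) if reference else 0.0
    correctness = matched_cand / len(candidate) if candidate else 0.0
    return completeness, correctness


# Hypothetical toy data: three reference buildings, four candidate buildings.
reference = [(0, 0, 10, 10), (20, 0, 30, 10), (40, 0, 50, 10)]
candidate = [(1, 0, 10, 10), (20, 0, 30, 10), (60, 0, 70, 10), (80, 0, 90, 10)]

completeness, correctness = extrinsic_scores(reference, candidate)
print(f"completeness={completeness:.2f} correctness={correctness:.2f}")
```

The two unmatched candidate footprints lower the correctness score, yet they could just as well be new buildings that the reference has not recorded yet. That is the limitation raised in the answer: the score is only as good as the currency of the data set chosen as ground truth.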