
Introduction to Spatial Data Outputs Platform - OpenStreetMap Galaxy


Formal Metadata

Title
Introduction to Spatial Data Outputs Platform - OpenStreetMap Galaxy
Title of Series
Number of Parts
351
Author
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language
Production Year
2022

Content Metadata

Subject Area
Genre
Abstract
OpenStreetMap (OSM) Galaxy is a project that the HOT Tech Team launched in mid-April 2021 to optimise and improve availability and accessibility of OSM data outputs for different user groups within the ecosystem. Through this project, we strive to address all the OSM data needs under one umbrella and ensure OSM data is available, accessible and ready to use for all kinds of users. We are trying to solve the high dependency on different data sources and uncontrolled platforms while focusing on fast queries and process optimisation by accessing data from a HOT administered and controlled environment. As a one-liner, the vision for OSM Galaxy is to provide a single platform to address all OSM data needs. In the OSM context, a data need is a broad term covering a variety of topics: raw data exports; analysing completeness of data; checking the data quality in your neighbourhood; understanding your contribution to a mapathon, to name a few. Through this project we strive to: bring together all the data needs under one umbrella; ensure OSM data is available, accessible and ready to use for all kinds of users.
Keywords
Transcript: English (auto-generated)
Welcome to the session today. I'm Ramya. I'm a product lead with the Humanitarian OpenStreetMap Team, in short, HOT OSM. HOT is a nonprofit focused on humanitarian response through open mapping efforts. And all the humanitarian response that we do,
all the mapping efforts, be it crowdsourcing, like mobilizing contributors, or forging partnerships, or capacity building, or any products that we build around the humanitarian response, everything is centered around the idea of OpenStreetMap. So OSM, if you're not familiar with this,
it's called the Wikipedia of Maps. And it has over 8 million registered users. And the number just steadily keeps increasing on a daily basis since the inception in 2004. And there are a wide variety of users registered on OSM. They could be just casual mappers who make the map better,
always make sure the map is up to date. They could be business users, corporates, nonprofits, governmental organizations, just the tech community, and so on. So with Galaxy, we want to ensure OSM data output is
more available, accessible, and in a ready-to-use format for all these different kinds of users. In the context of OSM, the term data, or data output, itself refers to a variety of things. It could simply mean raw map data or pre-processed information that could be derived from these underlying map features.
So with Galaxy, we want to cover all these data output types, all these different data types from OSM, make them more accessible and increase the ease of use for all these different kinds of users.
So that's the why for Galaxy. Now, how do we want to do this? How do we want to increase the accessibility or increase the ease of use for OSM data output? We want to reduce the dependency on multiple data platforms. There are different tools. OSM is a thriving ecosystem. There are lots of tools to derive different kinds of data outputs that any user wants.
We would want to reduce the dependency on different sources and instead make all of this data accessible, in an easy-to-use format, through Galaxy. Just a little bit about how people rely on different tools within the OSM ecosystem
to derive any outputs that they want. So there's something called Mapathon that is very common in the OSM space where people gather together. It's an organized editing process. People either gather in person or online and contribute to a specific cause. They map for a specific cause.
It could be that during the Mapathon or at the end of the Mapathon, the contributors or the organizers would want to understand the impact they have created as a group. So maybe a casual mapper, a contributor to the Mapathon, would want to refer to a tool like a leaderboard. The Missing Maps leaderboard is a very popular tool
within the OSM space. So they might want to refer to their statistics, to understand how many features they have created: how many buildings did I map today? How many kilometers of highway did I map today? Those are some common statistics for a Mapathon. So they would refer to the leaderboard. But for an organizer of the Mapathon,
this would be much more, not just the leaderboard. They'll have to combine data from different data sources. So they would derive the data from the leaderboard, and they would also go to a tool like Tasking Manager, using which these Mapathons are organized, to derive statistics about the time that each user has spent mapping a particular project or particular area.
And after that they would also go to tools like Overpass or Planet and Geofabrik extracts to download the data, do any visualizations on top of it and present it as: hey, this was the impact we created as a group. So there are multiple other tools existing. These are some of the popular tools that I mentioned. Maybe I've missed a few tools as well.
So with Galaxy, we want to cover all these different outputs in one place. Instead of having to go to different tools, people should be able to derive all of this through Galaxy. Otherwise, with each tool, there is a learning curve, and there are certain limitations and advantages.
With one tool in place, we increase the ease of use and the accessibility is smooth. That's how we plan to reduce the dependency on different tools and have just one tool to derive all of the data outputs. Okay, so what are the set of tools
that we have for Galaxy? With Galaxy, when we say we want to increase the accessibility and ease of use, the first step would be to build our own data sources. So, as part of Galaxy, we are building a data source called Underpass, which is focused on the current snapshot of OSM,
and then a historical database called Insights. Then there is a backend piece which talks to these data sources. And finally, there is a website to visualize all the data that we have been capturing in the different data sources. So I'll go through each of these components in detail, starting with Underpass.
So Underpass is one of the data sources that we maintain, and it's focused on the current snapshot, the current data that is available in the OpenStreetMap database. Underpass has a C++ processing script, which in turn ingests minutely files from planet sources, from OSM sources.
These minutely planet files capture what goes into the OpenStreetMap database every minute. So through this, the script will understand what change was done: what features were added, what was modified or what was deleted. And as part of this processing,
the script will update three different tables. One captures all the raw data that's already available on OSM and keeps it up to date. Then there are statistics that can be derived from this raw data. And finally, data quality issues.
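To make the "added, modified or deleted" step concrete, here is a minimal Python sketch that counts changes in a small osmChange document, the XML format OSM uses for replication diffs, which groups elements under create, modify and delete blocks. The sample data is hand-written for illustration, and the function name `summarize_changes` is hypothetical; Underpass itself is written in C++ and its actual processing is not shown in the talk.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny hand-written osmChange document (illustrative data, not a real diff).
SAMPLE_OSC = """<osmChange version="0.6">
  <create>
    <node id="1" version="1" lat="0.0" lon="0.0"/>
    <way id="10" version="1"><tag k="building" v="yes"/></way>
  </create>
  <modify>
    <way id="11" version="2"><tag k="highway" v="residential"/></way>
  </modify>
  <delete>
    <node id="2" version="3"/>
  </delete>
</osmChange>
"""

ACTIONS = {"create", "modify", "delete"}


def summarize_changes(osc_xml: str) -> Counter:
    """Count (action, element type) pairs in an osmChange document."""
    counts = Counter()
    root = ET.fromstring(osc_xml)
    for action in root:  # <create>, <modify> or <delete> blocks
        if action.tag not in ACTIONS:
            continue
        for element in action:  # <node>, <way> or <relation>
            counts[(action.tag, element.tag)] += 1
    return counts


summary = summarize_changes(SAMPLE_OSC)
for (action, kind), n in sorted(summary.items()):
    print(f"{action:7s} {kind:9s} {n}")
```

A real pipeline would read the compressed minutely files rather than an inline string, and would write the results to the raw data, statistics and validation tables instead of printing them.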
So there are three different focus areas for Underpass: raw data, statistics and data quality. Validation is just a common term that we use when we refer to the health of the data, so we just call it the validation database. So that's the focus of Underpass. And here I've also included Tasking Manager
to refresh the data source, because the focus for Galaxy is to ingest different data into one place: not just the minutely changes that go into the OpenStreetMap database, but also the statistics from Tasking Manager specific to mapathons, like the time the user has spent mapping different projects
and the number of tasks or projects that they contributed to. So this could be combined for a consolidated report at the end of a mapathon. The other data source that we have been building as part of Galaxy is the Insights data source. Unlike Underpass, which is focused on the snapshot,
Insights captures the entire history of the OSM database. All the features that you see on OpenStreetMap have a certain history attached to them. The one that you see on the map is the most recent version of that particular feature. But internally, the database would have multiple versions
of the same feature. So I go and create a particular feature, maybe I add a building, then someone else adds some more details to the building, like the name or the address of the building, and that becomes version two. All of this becomes the history of that feature within the OSM database. So with Insights, we capture the history
of all the features. When I say all the features, I mean all the buildings, all the highways, all the amenities that are available on OSM. All of this is captured within the Insights database. Similar to Underpass, Insights has a processing script, in this case in Python. Again, it ingests files from Planet,
the minutely files, the changeset files, the change files, and the initial loading of data is done through Geofabrik extracts. At the end of it, the processing script updates one database, which is all the history of OSM, unlike Underpass, which has three separate stores:
the raw data, the statistics and the validation capture. So here we see that the data dumps from Underpass could serve as one key product for anyone who is interested in running their own analysis on top of the existing data dump. Underpass has raw data, stats
and data quality issues captured. All of this would be available as a data dump for anyone to download and run their own queries on top of it. The Underpass database is about 1.5 terabytes of data. We cannot provide data dumps for Insights because it's historical and it's going to be huge
to download all of that data. Instead, we are just starting with Underpass. So you could run your own scripts, like a Python script in a Jupyter notebook, run your own analysis and generate CSV, XML or SQL outputs from the underlying data dump.
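As a rough sketch of what such a script could look like, the Python below queries a database and writes the result as CSV. Everything about the schema is an assumption made for illustration: the table `raw_features`, its columns, and the use of an in-memory SQLite database as a stand-in for a restored dump. The talk does not describe the dump's actual layout or database engine.

```python
import csv
import io
import sqlite3

# Stand-in for a restored data dump: an in-memory SQLite database with a
# hypothetical table. The real dump's engine, tables and columns may differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_features (osm_id INTEGER, kind TEXT, mapper TEXT)")
conn.executemany(
    "INSERT INTO raw_features VALUES (?, ?, ?)",
    [
        (1, "building", "alice"),
        (2, "building", "alice"),
        (3, "highway", "bob"),
        (4, "building", "bob"),
    ],
)


def buildings_per_mapper(connection):
    """Return (mapper, building count) rows, ordered by mapper name."""
    query = """
        SELECT mapper, COUNT(*) AS buildings
        FROM raw_features
        WHERE kind = 'building'
        GROUP BY mapper
        ORDER BY mapper
    """
    return connection.execute(query).fetchall()


def to_csv(rows, header):
    """Serialize query rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


rows = buildings_per_mapper(conn)
csv_text = to_csv(rows, ["mapper", "buildings"])
print(csv_text)
```

The same pattern, a query you frame yourself followed by a serialization step, would apply to XML or SQL outputs; only `to_csv` would change.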
So this is aimed more at technical users or data analysts who want to extract the raw data or see some more statistics that are not already available in Underpass, because the raw data exists there and you can frame the query the way you want to extract the data. So, some key progress on Underpass.
The QR code is for the GitHub repository, so you can just scan it and see what's going on. All the code is open source. Underpass right now captures the raw data and data quality issues from OpenStreetMap. We are still validating the statistics that we are generating as part of Underpass.
It also has statistics from Tasking Manager integrated into the same data source. So if you're using the Underpass data dump, you will have access to Tasking Manager statistics plus the raw data from OSM. And the OSM raw data is updated on a minutely basis.
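Having both kinds of figures in one source is what makes a per-user consolidated report straightforward. The Python sketch below merges two per-user dictionaries into one report; the field names (`buildings`, `highway_km`, `minutes_mapped`, `tasks`) and the sample numbers are hypothetical and only stand in for whatever the data dump actually exposes.

```python
# Hypothetical per-user figures; field names and values are illustrative only.
osm_stats = {
    "alice": {"buildings": 120, "highway_km": 3.5},
    "bob": {"buildings": 45, "highway_km": 0.0},
}
tasking_manager_stats = {
    "alice": {"minutes_mapped": 95, "tasks": 12},
    "carol": {"minutes_mapped": 30, "tasks": 4},
}

EMPTY_ROW = {"buildings": 0, "highway_km": 0.0, "minutes_mapped": 0, "tasks": 0}


def consolidate(osm, tasking_manager):
    """Merge both sources per user, filling gaps with zeros."""
    report = {}
    for user in sorted(set(osm) | set(tasking_manager)):
        row = dict(EMPTY_ROW)  # fresh copy so users never share a row
        row.update(osm.get(user, {}))
        row.update(tasking_manager.get(user, {}))
        report[user] = row
    return report


report = consolidate(osm_stats, tasking_manager_stats)
for user, row in report.items():
    print(user, row)
```

Taking the union of users, rather than the intersection, keeps people who validated tasks without adding features, or who mapped outside a Tasking Manager project, in the report.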
For now, we don't have this dump available online, but very soon, in about another month, we should have it available for everyone on a weekly basis. So every week this should be accessible for people to download. We are also planning to put this on the Amazon Open Data Registry
and have people run queries against this database and extract data. On Insights, we have historical data for 45 countries loaded into the database, and this data is updated once every five minutes. So there is a lag of about five minutes
to extract any statistics from Insights. We have tested the statistics from Insights against multiple other data sources, and they have been more or less consistent, and we have also run pilots against a few mapathons. The next piece would be the Galaxy API,
which is the data retrieval part for the different data sources, Underpass and Insights. So Galaxy API is a Python script, and it has a few endpoints to supply statistics and raw data.
The Galaxy API could be used directly to build a custom front end, or it could be used in other processes to access the data and run any analysis on top of it. At present, there are many projects that are using the Galaxy API.
For example, this MSF internal dashboard depends on the hashtag statistics that are supplied from Galaxy API. You can see the statistics, the new buildings, the feature count added for each hashtag listed in the bottom right corner. We are also using Galaxy API as a source for Export Tool.
Export Tool is one of the tools from the HOT Tech Team. Here we provide a wrapper around the Overpass API, where people can download OSM data outputs in different data formats.
Last month, we made a new release of Export Tool, wherein it depends on the Galaxy API to extract GeoJSON and shapefile formats. The rest of the formats are still dependent on the Overpass API. The plan is to switch the rest of the formats to depend on Galaxy API as well. And we have seen a marked difference
when we made the switch to Galaxy API for GeoJSON and shapefile output formats. We tried downloads of Indonesia buildings: earlier, it used to take us around 20 minutes to generate all of the data outputs, and with Galaxy API we were able to do it in eight to 10 minutes.
So there is a marked difference with the new data source, and we plan to do something similar for the rest of the data output formats too. Galaxy API also serves as the data source for the Galaxy website, which is at galaxy.hotosm.org.
Here, it supplies the Mapathon statistics for a 24-hour time frame, and also user statistics for a max period of one month. For Mapathon statistics, any OSM hashtag could be supplied, or people could directly supply a Tasking Manager project ID. So it's not tied to a specific instance
of Tasking Manager. If you know the hashtags that are used for a specific project on a different Tasking Manager instance, those could be used to generate Mapathon statistics. So it would be a consolidated report wherein you know the base map features contributed by each user,
and also the statistics from Tasking Manager combined for each user in one place. There is also something called data quality issues available for each mapper. It flags wrong or invalid edits. In this case, we are just capturing bad geometry
and bad tag values. We are just starting with something basic there for data quality. So this is flagged at the user level. The plan is to also have data quality flagged at the hashtag level and at the Tasking Manager project level.
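The limits mentioned for the website (a 24-hour window for Mapathon statistics, at most one month for user statistics, and hashtags or Tasking Manager project IDs as input) can be sketched as a small request check in Python. The class `StatsRequest`, its field names, and the reading of "one month" as 31 days are all assumptions for illustration; this is not the Galaxy API's actual request schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# Limits as described in the talk; "one month" is assumed to mean 31 days.
MAX_WINDOW = {
    "mapathon": timedelta(hours=24),
    "user": timedelta(days=31),
}


@dataclass
class StatsRequest:
    """Hypothetical request shape, not the real Galaxy API schema."""
    kind: str  # "mapathon" or "user"
    start: datetime
    end: datetime
    hashtags: list = field(default_factory=list)
    project_ids: list = field(default_factory=list)

    def validate(self) -> None:
        if self.kind not in MAX_WINDOW:
            raise ValueError(f"unknown statistics kind: {self.kind!r}")
        if self.end <= self.start:
            raise ValueError("end must be after start")
        if self.end - self.start > MAX_WINDOW[self.kind]:
            raise ValueError(f"time window too long for {self.kind} statistics")
        # Mapathon statistics need something to filter on.
        if self.kind == "mapathon" and not (self.hashtags or self.project_ids):
            raise ValueError("supply at least one hashtag or project ID")


def is_valid(request: StatsRequest) -> bool:
    """Return True if the request passes validation, False otherwise."""
    try:
        request.validate()
    except ValueError:
        return False
    return True


one_day = StatsRequest(
    "mapathon", datetime(2022, 8, 1, 9), datetime(2022, 8, 1, 17),
    hashtags=["missingmaps"],
)
two_days = StatsRequest(
    "mapathon", datetime(2022, 8, 1), datetime(2022, 8, 3),
    project_ids=[1234],
)
print(is_valid(one_day), is_valid(two_days))
```

Checking the window on the client side avoids sending a request that the service would reject anyway, assuming the service enforces the same limits.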
So the API documentation can be accessed from here, where you can also play around with a few endpoints. You can switch between the two data sources. For the Galaxy website, all the data comes through the Insights data source,
because we are still validating the statistics in the Underpass data source. But here in the API documentation, you can directly switch between the two data sources and see how it works. And the website is accessible at galaxy.hotosm.org.
And that's all the products put together. We have the data sources, which are Underpass and Insights, and the retrieval is done through Galaxy API. Finally, we have the Galaxy website for near real-time data download, statistics, and validation.
We are also looking for more support from the community on different aspects of Galaxy development. We need help with code development from people who are interested in software development. They could do C++ for Underpass,
or Python for Insights and the Galaxy API. The website is more like a proof of concept. It's in the React framework. So for the website, help with design aspects would be useful. If you're interested in sketching out wireframes for different data representations
or holding user interviews to better understand user needs, that is the kind of help we could use for the website. But for the rest of the components, Underpass, Insights, and Galaxy API, we need help with the software development. We'll also need help with translating the user stories into functional requirements or functional requirements docs,
and identifying the endpoints, like how to design these endpoints for the Galaxy API. Or if you're just interested in testing aspects, you could validate the outputs that are generated from these components against various data sources. Or if data science or data analysis interests you,
OSM's a huge pile of data. You could help us identify insightful user stories that we could communicate through the website. We are also running a working group meeting every month, where we share with the community the progress we are making with Galaxy. If you're interested in participating in the working group meetings,
there's a bit.ly link to register your interest. You can also mention which areas you're interested in, like software development, testing, design or documentation. You can also write to us at tech@hotosm.org for further queries.
We are also participating in the code sprint on Saturday and Sunday. I think it's in a different venue. So we'll have the developers available there if you'd like help with the development setup.