
A crawler for spatial (meta)data as a base for Mapserver configuration


Formal Metadata

Title
A crawler for spatial (meta)data as a base for Mapserver configuration
Number of Parts: 351
License: CC Attribution 3.0 Unported. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Production Year: 2022

Content Metadata

Abstract
At our institute we manage a lot of input data and model outcomes of soil data to be shared online. We experienced that updating service configurations and metadata records can be quite a challenge, when managed manually at various locations. We've been working on tooling to help us automate the publication processes. These days data publications are set up as CI-CD processes on Gitlab/Kubernetes. These efforts resulted in a series of tools which we call the Python Data Crawler. The crawler spiders a folder of files, extracts and creates metadata records for the spatial files, as well as generates a Mapserver configuration for the data to be published as OGC services. Underneath we're building on the tools provided by the amazing FOSS4G community, such as GDAL, Mapserver, pygeometa, owslib, mappyfile, rasterio and fiona. A typical use case for this software is with many organizations maintaining a file structure of project files. The crawler would index all the (spatial) data files, register the metadata records in a catalogue and users would query the catalogue from QGIS Metasearch to find and load relevant data. We will present our findings around the project at the conference and hope to talk to institutes with similar challenges, to see if we can create an open source software project around the Python Geodata Crawler.
Transcript: English (auto-generated)
Hello, Florence. I have five minutes, so I'll be quick. I'm from ISRIC, World Soil Information Centre. We maintain a reference collection of all the soils in the world; our scientists go to all kinds of places to collect them, we show that in the museum, and we have it in store.
We also maintain a data repository: we fetch any data we can get on soils from all over the world, and our statisticians build a global soil model which we then make available as open access. At the conference you may see these people running around; that's us, the team at ISRIC. George is unfortunately not here, but you may know him. This is roughly the stack that we have: we run a DevOps environment on Kubernetes, and we use a lot of open source software. So the problem setting I want to bring today is that data dissemination is just too difficult.
It's only required incidentally, at the end of a project, and you really have to think: what do I have to do, where do I have to click? It involves multiple tasks in multiple environments, and the process is not reproducible.
This is where we think DevOps brings in good conventions: the deployment of data can be done with Git workflows, the content itself is versioned in Git, and metadata equals data (or the other way around). This results in transparent data management, and our soil community can suggest improvements via Git issue management. Let me first introduce the sidecar concept: we have a GeoPackage or shapefile somewhere on our system, and there should always be a metadata file right next to it.
Esri introduced that concept in the 90s, or at least made it popular, and we continue with it. I'm going to present the set of tools we have built around this concept to support data DevOps. First, PyGeoDataCrawler runs on a folder of files. When you start a new project and get some data from a customer, you run the tool just to see what's in that folder. If there's metadata, it imports it; if there's no metadata, it uses GDAL to extract metadata from the file, and users can then suggest additional metadata via Git pull requests.
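The crawling step described above can be sketched as follows. This is not the actual PyGeoDataCrawler code: it is a minimal illustration of the sidecar idea, where the file extensions, the `.yml` sidecar convention, and the stub content are assumptions, and a real implementation would fill the sidecar from GDAL/rasterio/fiona instead of writing a stub.

```python
from pathlib import Path

# Illustrative set of spatial file extensions to index.
SPATIAL_EXTS = {".gpkg", ".shp", ".tif", ".geojson"}

def crawl(folder: str) -> list[Path]:
    """Walk a folder; for every spatial file that has no metadata
    sidecar yet, write a minimal YAML stub next to it. Real tooling
    would extract title, extent, CRS etc. via GDAL and emit a
    pygeometa MCF document here."""
    created = []
    for path in Path(folder).rglob("*"):
        if path.suffix.lower() not in SPATIAL_EXTS:
            continue
        sidecar = path.with_suffix(".yml")
        if sidecar.exists():
            continue  # metadata already present: import it instead
        sidecar.write_text(f"identification:\n  title: {path.stem}\n")
        created.append(sidecar)
    return created
```

Running the tool twice is then safe: the second run finds the sidecars and creates nothing, which is what makes the step reproducible in a CI/CD pipeline.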
Then there's PyMapFileCrawler. We use the sidecar metadata to automatically generate a Mapserver map file. The URL to access that map file is then written back into the metadata, so if the metadata ends up in a catalogue, you can directly access the WMS and WFS. The initial style is a default, but you can override it with a style sidecar file. We're looking forward to OGC API Maps landing in pygeoapi, so we can also support the pygeoapi tool chain. The next step is community-driven data harmonization.
Of course, soil data coming in from all over the world arrives in thousands of formats. Luis, our ETL guy, spends a lot of effort harmonizing it to a common model, but here we want to ask the community to help out. We set up an initial transformation script, and people in the community can suggest improvements to it: rename table X to Y, for example. Each change to the transformation script triggers a CI/CD process which runs the harmonization again, and in the logs of the CI/CD you can see if things fail.
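The community-editable transformation script can be sketched as a versioned rename mapping applied to every incoming record. The column names below are invented for illustration; the point is that a contributor only edits the mapping, and CI/CD reruns the harmonization on every change.

```python
# Community-editable mapping: source column -> harmonized column.
# In the real setup this lives in a Git-versioned transformation
# script, so each rename arrives as a reviewable pull request.
RENAMES = {"ph_h2o": "phaq", "org_c": "soc"}

def harmonize(record: dict) -> dict:
    """Apply the rename mapping to one input record, keeping
    unmapped columns as-is. A CI/CD job reruns this over all data
    whenever RENAMES changes and surfaces failures in its logs."""
    return {RENAMES.get(key, key): value for key, value in record.items()}
```

Because the transformation is pure and rerun from scratch on every commit, a bad suggestion simply shows up as a failed pipeline, which is what makes the process transparent.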
So that's very transparent. Then, to my surprise, there was already such a thing: Data2Services, by the University of Maastricht, from the medical RDF domain. That rather confirmed to me that this is an interesting approach. This is a project under active development; we're actually using the tooling in our own work processes, but the code itself is still very beta. So this is where I want to exchange ideas with you: maybe you already have tooling available that we can use in our workflows, or new ideas. I really appreciated the lightning talk before this one, because I think that's also a very good use case.