
SDI maintenance DevOps style


Formal Metadata

Title: SDI maintenance DevOps style
Number of Parts: 156
License: CC Attribution 3.0 Unported. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose, as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract:
At ISRIC - World Soil Information we increasingly maintain our data services through CI/CD pipelines configured via Git, both from the service and the content perspective. The starting point is the metadata records of our datasets, stored in Git. With every change to a record, the relevant catalogues (pycsw) get updated, as are any relevant web services (MapServer). These pipelines are reproducible, and there are never inconsistencies between catalogue content and the services. On top of that, our users can directly report issues (or even improvement suggestions) through Git. The stack is built on proven OSGeo components. A tool, pyGeoDataCrawler, brings the power of GDAL and pygeometa to CI/CD scripting. It crawls files in a folder and extracts relevant metadata, then prepares a MapServer configuration for that folder, while updating the metadata with the relevant service URLs. Typical use cases for this stack are: a search interface to any file-based data repository, or a participatory data catalogue for a project. At the conference we hope to hear from you whether any of these components could be relevant to your cases, or whether there are similar initiatives we can contribute to or benefit from. What's next? At ISRIC we receive and ingest a lot of soil data from partners. Harmonizing this data is a huge effort. Via automated pipelines and interaction with the submitters via Git comments, we hope to improve this aspect of the data management cycle as well.
Transcript: English (auto-generated)
Soil data is typically collected as samples from the fields. And we take a sample here and we take a sample there. And then it's up to our statisticians to predict kind of the distribution of that soil between this point and that point.
So we apply a lot of spatial statistics and machine learning to get that done. And then we collect a lot of data from partners, national soil institutes, research projects, and we try to combine all that data into a global model of the soil.
And that's called SoilGrids, and that's quite intensively used. My team maintains the spatial data infrastructure that ingests all that data and advertises those data products to the wider world.
Yeah, so on top of that, we go to national soil institutes to improve their SDI skills so they can more effectively work with soil data. So overall, the whole soil data will benefit. Some pictures, this is the soil museum that we have in Wageningen.
It presents a representative collection of soil profiles, like the first top meter of the soil. Every Wednesday afternoon you can come and watch, and we have soil-study groups from schools across Europe coming here to look at those profiles. But this is just a subset of our collection. We aim to have a representative sample of every soil type in the world.
And these are stored like this: there are thousands of these profiles down in our basement. And it's not just something that's nice to look at; as a researcher, you can also request a sample.
So from each of these soils, we have representative samples from different layers. And researchers can do a request to get a sample if they want to compare different clay types over the world, for example. So that's what we do. And then let's go to the data thing, because that's what we're here for.
Metadata is an important aspect of any SDI. If we have data in our file repository, it helps us to understand what the data is, how I can use it, and who to contact if this isn't clear. It supports you, your colleagues, the outside world, but also the search engine crawlers, and even machine learning models, to understand the data in a better way. And unfortunately, despite the acknowledged benefits, metadata availability is still low, or of poor quality. So this is one of the big drivers of my work.
Specific to soil science is that soil data takes a long route: you dig a hole here, it goes to the lab, it's analyzed for certain properties, the properties are sent back to the SDI, then go into statistical processing, and then there's the final data.
So it's a long route. We need to trace that; we need to understand how that data was produced. One of the aspects that is relevant here is which procedure was used to calculate the pH. Because if you calculate pH in water, it will give you a different value
than if you estimate the pH in dry air. So we need to understand how the pH is measured to understand what the value means. And this is often taken for granted because we always do it like this. Well, you do, but maybe 20 years ago they didn't. And maybe in the next country they don't.
So if we want to combine that data between your department and the next department, we need to know what methods you used. And we don't want to dig through PDF reports to find out that on page 22 there was a small section about this. No, this should be in the metadata.
So this is another aspect that kind of recently came up, that there's a growing hesitance to share location data on soil. It's related to liability, like okay, we found this impurity. Who's to blame for that?
So this gets more sensitive to share this type of location data. So we need to carefully keep in our metadata what are the usage constraints of this data set. And this is also another thing. These machine learning models, they eat data.
We literally feed it thousands of data sets, coming from Google Earth Engine or various sources, Copernicus. And the model decides which one is relevant to make the best prediction. And then, so then of those thousand, hundreds stay in the model.
So we need to have good metadata on those thousand data sets to understand, later, the traceability of these predictions. That's why metadata is especially important for us. So let me now come to the idea here: we use metadata as a recipe to guide the data life cycle.
Unfortunately, my colleagues, researchers, don't put a lot of metadata, like many of us. At most, they put a readme.txt in the file folder. And the goal of my efforts is to lure them to generate more metadata
while they process all these data sets downloaded from Google Earth Engine. And so this will, yeah, prevent the duplicate effort.
And I try to show them the benefits of putting in that metadata, with the goal to finally use this metadata as a recipe for the life cycle of that data: when is it created, when is it reviewed, when is it published, and when will it be archived.
And for that, we need DevOps principles, because those principles really fit this purpose: when and how is software deployed, when is it archived, when is it good to go, when can it be released? So our approach starts with a Git repository.
So we put the metadata files generated by the scientists in a Git repository, and then use CI/CD pipelines to process them. If we detect missing metadata for a data set, we try to generate it from the file, and we notify the scientists:
please go back to that file and add your thoughts. And then we crawl the metadata from that project repository. We use the pygeometa library, developed in the GeoPython community, with a YAML-encoded metadata model, which makes it really easy to use, both by machines and by users, and works very well with Git. So we built this GeoData crawler, which is also an open source project. It's available on pip, so you can easily try it.
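As an illustration, a minimal pygeometa MCF record might look roughly like this. All values here are made up, and the exact field set should be checked against the pygeometa MCF reference:

```yaml
# Hypothetical minimal MCF (Metadata Control File) record for pygeometa.
# Field names follow the MCF convention; values are illustrative only.
mcf:
  version: 1.0

metadata:
  identifier: example-soil-ph       # made-up dataset identifier
  language: en
  charset: utf8

identification:
  title: Example soil pH dataset
  abstract: Illustrative record for a predicted soil pH layer
  dates:
    creation: 2023-01-01
  keywords:
    default:
      keywords:
        - soil
        - pH

contact:
  pointOfContact:
    organization: ISRIC - World Soil Information
```

Because such a record is plain YAML, it diffs cleanly in Git, which is what makes pull requests and issues on individual metadata records practical.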
It typically crawls a file repository for metadata files, gathers them into any standard that you need, and we can then load that content into a searchable catalog.
And here we've selected pycsw, but geOrchestra, as just shown, is also a very good option. pycsw is a Python library which is very aware of the OGC standards, so it supports a lot of the OGC standards, which is very important for us, considering our diverse user community. But it is, for example, also used as a module in GeoNode and CKAN (ckanext-spatial). With the upcoming pycsw 3.0, we will have OGC API - Records support, which also gives us an HTML interface, making pycsw able to run standalone. And that HTML is quite easy to adapt to the needs of the organization using Jinja HTML templates.
So some screenshots. This one is from one of our catalogs, and the interesting aspect here, what has been added, is the "edit me on Git" button. Because this metadata is stored on Git, we can easily invite people to go to the Git repository and look at that MCF file, which is the YAML convention from the pygeometa approach. And people can create a pull request or create an issue about this metadata record: hey, the address of that organization is wrong. Or even suggest the proper value.
For those that don't like the YAML and the Git, we have this metadata editor which, if you click the save button, will generate the YAML file. So that's for those that feel uncomfortable. And then we run CI/CD pipelines. So when a change to the metadata is detected, the metadata is published to the catalog. And that's a nice thing about pipelines. This is GitLab, so you have these failed pipelines, some bad metadata or something went wrong, and at some point you're happy as a DevOps because it has passed, so we can continue our daily activities. From that we also publish map services. So we use the metadata, the title, the abstract, to create map services in MapServer. The crawler tool generates these map files
which are used to set up these services, and it also puts back a link in the metadata pointing to those map services. An SLD can be provided to style the layer. So if you put an SLD file on your file repository as well, then it will use that style to draw the layer, and the metadata is updated, like I said. In QGIS, you can typically export your style as SLD.
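A pipeline along the lines described could be sketched in GitLab CI roughly as follows. This is a hypothetical .gitlab-ci.yml: the base image, the job names, and the crawler invocation are assumptions, and the actual pygeometa and pyGeoDataCrawler CLIs should be checked against their documentation:

```yaml
# Hypothetical sketch of a metadata CI/CD pipeline (not the actual ISRIC config)
stages:
  - validate
  - publish

validate-metadata:
  stage: validate
  image: python:3.11              # assumed base image
  script:
    - pip install pygeometa
    # Validate every MCF record; a bad record fails the pipeline here.
    # (The exact pygeometa CLI subcommand may differ between versions.)
    - |
      for f in $(find records -name '*.yml'); do
        pygeometa metadata validate "$f" || exit 1
      done

publish-services:
  stage: publish
  image: python:3.11
  script:
    - pip install geodatacrawler
    # Hypothetical invocation: update records and regenerate the mapfile.
    - crawl-metadata --mode=update --dir=records
  rules:
    - if: '$CI_COMMIT_BRANCH == "main"'   # only publish from the main branch
```

The failed/passed pipeline view mentioned above then doubles as an audit log: every metadata change that reached the catalogue corresponds to a green pipeline run.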
So that's quite easy to use. Some screenshots here. Oh yeah, so we use TerriaJS, which is a client-side library, like a web GIS component; it's a nice open source project from Australia, built on React, Cesium, and Leaflet.
This one has a CSW search, so from here I can query the catalog to get datasets. It shows a nice legend, and we also have an about page which links back to the catalog. That's the link from the WMS capabilities back to the catalog. So that is all managed through this pipeline. And then some pictures of the front end of a catalog. We actually run quite a lot of catalogs, because a lot of these Horizon Europe research projects
say: oh, let's start with a catalog. Then I deploy a catalog for them, and they put in records. Nice. So after experimenting a bit with these approaches, we noticed that a data community was growing around these metadata repositories.
We have put all these metadata and CI/CD scripts on Git, so people can comment on or get inspiration from any of these pipelines, and create issues or pull requests to improve aspects of them.
Myself, I often run into a broken link in a catalog, and then I want to give feedback to the owners of that catalog or the service behind it, and I sometimes get quite frustrated that I can't find a good contact point, or never get a response. And now, with this approach, I'm really happy, because I can just create a pull request
to fix my own service, in this case, but if others adopt this, I can also support others. So, what's next? At ISRIC we also run a couple of Horizon Europe projects; one is called SoilWise. And here the aim is to harmonize soil data
from different member states in Europe, and during this project, which will run another three years, we will extend this workflow approach with harmonization options. Because, you notice, different member states store their data in different formats, so for us, but also for the European Union, it's really hard to combine data from the different member states, and with this initiative we hope to involve the community in describing the data, so the harmonization will be easier.
So, for example, one organization uploads a dataset, another organization shares its ETL configuration to work with that data, and a third organization says: hey, I want to do the same thing, let me use that ETL configuration to get started and be ready sooner.
So, that was the presentation. Some takeaways: Git is a very interesting platform to facilitate a catalog of datasets and how to maintain it. CI/CD pipelines are very useful to validate metadata, to share metadata, and even to harvest metadata; in some of our projects we harvest metadata from various catalogs using CI/CD processes, because CI/CD gives me these pipelines which failed and succeeded, and the complete log of the harvest. That's really nice. And the orchestration of this type of microservices really facilitates maintainability and flexibility. And this is all inspired and facilitated by you, the open source community. Thank you very much.
Any questions, anyone? Okay, I have a question. So, which strategies did you use to convince your colleagues to actually contribute metadata?
Did you talk to them one-to-one, did you produce documentation, or did you do something more top-down, like saying: you need to create metadata for the datasets? So, one of the approaches we used was: here is an Excel sheet, start populating the fields, and I import the Excel sheet, and at least we have some common ground, a starting point. From there I put them on Git, and then, when you say, okay, I want to improve that, well, go to Git, you can improve it there. The other approach we've used is: everywhere I saw a readme.txt file, I said, please open the file and put some structure in it. Say: this is the title, this is the abstract, this is the date,
and then I can parse it; right now I can't. So that's another approach. And we use a very interesting mechanism of inheritance: any subfolder inherits from the folder above. So if you put your metadata in a higher folder, it applies to all the levels further down the tree.
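The inheritance mechanism just described can be sketched in a few lines of Python. This is a simplified illustration of the merge logic, not the actual crawler code, and the dictionaries and field names are made up:

```python
def merge_metadata(parent: dict, child: dict) -> dict:
    """Child records inherit the parent's fields; child values win on conflict."""
    merged = dict(parent)
    for key, value in child.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_metadata(merged[key], value)  # merge nested sections
        else:
            merged[key] = value
    return merged


# Hypothetical folder-level record and dataset-level record
folder_md = {
    "contact": {"organization": "ISRIC - World Soil Information"},
    "identification": {"language": "en"},
}
dataset_md = {
    "identification": {"title": "Soil pH 0-5 cm"},
}

record = merge_metadata(folder_md, dataset_md)
print(record["contact"]["organization"])   # inherited from the folder record
print(record["identification"])            # language inherited, title from the dataset
```

Applied recursively from the repository root down, every dataset ends up with at least the folder-level metadata, which is exactly the "at least some metadata" effect described above.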
So at least there's some metadata. And a lot of talking, of course, and asking: hey, what is this dataset? Why is it not described?

When you push a YAML file to Git, it will run a pipeline to generate a MapServer file; will it then, via a Git bot or something like this, make a pull request to the same Git repo, or where does the map file go?
Ah, that's a good question. We have different approaches, because the map file is automatically generated. I usually just put it on a WebDAV repository, but we could commit it to Git again.
Any more questions? So, one thing: I really like this mechanism where you import the files and create a searchable catalog from the readme files that are in the repository. But I was thinking, as you have some data already there and you have a folder structure, did you consider doing a crawlable catalog or a static catalog, just as a low-hanging fruit, to make the data available? Yes, that's an interesting question, also because right at the start of pygeoapi and STAC there was a STAC plugin in pygeoapi, and I actually used some of that code; I said, hey, that's a good idea, I need that. And then later I rewrote the whole thing, because I wanted it to be more traceable. I don't want the thing to be in memory, because memory I can't touch. I want to have it written down in Git, so I know who to blame when there's something wrong. Makes sense. Actually, yeah, let's talk over dinner. Okay, if there are no more questions, thank you again, Paul,
for this presentation.