A fun way to do spatial cataloguing and publishing using pygeometa and mdme
Formal Metadata

Title of Series: FOSS4G Europe 2024 Tartu (part 7 of 156)
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifier: 10.5446/68424 (DOI)
Transcript: English (auto-generated)
00:00
So welcome, good afternoon everybody. So a bit about us: I'm Paul van Genuchten, I work at ISRIC - World Soil Information in Wageningen in the Netherlands. And over there we maintain some ten catalogues, mostly project-oriented. I'm Tom Kralidis, a senior geospatial architect with the Meteorological Service of Canada,
00:24
a long-time contributor to open source and open standards, and currently serving on the board of directors. So that's us. What if? So this is the typical record you find in a lot of catalogues. And then you're reading it and you say: hey, wow, that name, that's not correct.
00:43
That guy went out of service a while ago. How can I notify anybody of this change in the metadata? That's a thing I notice a lot: for example, a service not being available anymore, and I want to notify somebody. It would be really nice if there were this "edit me on Git" link,
01:05
which you typically find on these MkDocs-type documentation websites. And you would click that and you would go into Git, and you would be able to make a pull request or create an issue about the thing.
01:20
And somebody would go in and change it and review it and it would be published to the catalogue again. It would also give you this nice overview, like who changed what in my catalogue over the last two years, over the last 20 years, because a lot of these datasets stay in that catalogue for a long time.
01:43
So that was kind of the original goal of this exercise, to have a workflow like this. So let's start with the tools that we're using here. So, pygeometa. Tom, step in.
02:01
pygeometa is a Python package to generate and manage metadata for geospatial datasets. We all know geospatial metadata is hard and complex, and pygeometa is "metadata for the rest of us", as we call it, which allows you to do metadata in sort of a configuration-based way, and you can generate whatever formats you wish from that.
02:23
And it's driven by this metadata control file, or MCF, which is basically a YAML grammar to be able to document your dataset. So I asked Tom to step in here because he started an initiative in 2008,
02:44
when was that, 2009, was that 15, 17 years ago? And that lived silently somewhere on the internet and then recently got a lot of attention. Because this YAML format, we all know it,
03:00
it versions really well in Git, as opposed to the JSONs and the XMLs. So if we store a metadata file in Git, YAML is a perfect format. So how do we get from that MCF to the catalog? So we use this pygeometa library to convert that MCF file to ISO
03:21
or whatever other format, because pygeometa has a number of output schemas, such as DCAT and some American ones, quite a range of output schemas. And then we load it into the catalog. It can be GeoNetwork like we just saw. It can also be CKAN or any of the other ones.
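As a rough sketch of that conversion step, assuming a local MCF file, the pygeometa Python API can be used like this:

    from pygeometa.core import read_mcf
    from pygeometa.schemas.iso19139 import ISO19139OutputSchema

    # parse the YAML metadata control file into a Python dict
    mcf_dict = read_mcf('metadata/my-dataset.yml')

    # render the MCF as an ISO 19139 XML record, ready to load into the catalog
    iso_xml = ISO19139OutputSchema().write(mcf_dict)

    with open('my-dataset.xml', 'w') as fh:
        fh.write(iso_xml)

The same conversion is also available on the command line, for example: pygeometa metadata generate metadata/my-dataset.yml --schema=iso19139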
03:42
We chose pycsw because for us it's really easy to maintain and very flexible, and it has a strong focus on OGC standards, which is really important for us. And through its OGC API - Records API,
04:01
it also supports an HTML front end because you can request every record in an HTML format. And having this identifier for every record is also very useful for your search engines because the search engines require one web page per record.
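To give an idea of that behaviour, here is a sketch of fetching a single record from a pycsw OGC API - Records endpoint; the URL and identifier are hypothetical, and metadata:main is pycsw's default collection:

    import requests

    # one stable URL per record
    url = 'https://demo.example.org/pycsw/collections/metadata:main/items/my-dataset-id'

    # machine-readable GeoJSON for clients...
    record = requests.get(url, headers={'Accept': 'application/geo+json'}).json()
    print(record['properties']['title'])

    # ...and the same record as HTML for browsers and search engines
    html = requests.get(url, headers={'Accept': 'text/html'}).text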
04:24
So do my colleagues like this MCF? YAML is quite a challenge for some, especially if they're used to fancy web forms and LinkedIn and Facebook. So at first the answer is no. They're a bit hesitant to create these YAML files in Git.
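For reference, the kind of file they are asked to write looks roughly like this (an abridged, single-language sketch; the MCF reference in pygeometa documents the full grammar):

    mcf:
        version: 1.0

    metadata:
        identifier: my-soil-dataset
        language: en

    identification:
        title: Soil organic carbon map
        abstract: Predicted soil organic carbon content, 0-20 cm depth.
        keywords:
            default:
                keywords:
                    - soil
                    - carbon

    contact:
        pointOfContact:
            organization: Example Institute
            email: metadata@example.org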
04:46
And that's where we came up with this model-driven metadata editor, mdme, which is a standalone metadata editor that lives out there on GitHub. It's a fully client-side thing, so you have a web form based on the JSON schema of the MCF,
05:05
and you populate the fields, and at some point you say save, save as MCF file, and you then upload that to Git or send by email to the administrator. Do my tech colleagues like the MCF?
05:23
Yes, they do, very much. It's optimal for Git version control and offers a fully traceable catalog: what changed and what didn't. And file-based is actually quite easy to modify: you just do a search and replace, and you have the whole catalog updated.
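That kind of bulk change can also be a few lines of Python; a minimal sketch, assuming the MCFs live in a metadata/ folder and an outdated contact email needs fixing:

    from pathlib import Path

    import yaml

    # walk all MCFs in the repository and fix an outdated contact email
    for path in Path('metadata').glob('*.yml'):
        mcf = yaml.safe_load(path.read_text())
        contact = mcf.get('contact', {}).get('pointOfContact', {})
        if contact.get('email') == 'old@example.org':
            contact['email'] = 'new@example.org'
            path.write_text(yaml.safe_dump(mcf, sort_keys=False))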
05:46
And then you can also do a lot of things in memory using Python scripts, so you don't always need to go changing things file by file. So now I want to show the benefits of this workflow in two use cases.
06:06
And let me first go to this one. This Land Soil Crop Hubs is a European-funded research project that we run in some countries in East Africa around land, soil and crop data. And we try to spark that community with this GitHub repository
06:26
where they can contribute their datasets. Initially they provided metadata in Excel, where each column is a property of a dataset and each row is a dataset. So we have an import function that can import this Excel into the MCF format,
06:44
and we share that then on GitHub, and then they can start contributing. So you have an initial population of the catalog. But they also learn to register issues: hey, record number five is not correct because it changed.
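The Excel-to-MCF import boils down to mapping spreadsheet columns onto MCF fields. A minimal sketch with openpyxl, assuming identifier, title and abstract in the first three columns (the project's actual importer handles many more properties):

    import yaml
    from openpyxl import load_workbook

    # one dataset per row: identifier, title, abstract in columns A-C
    sheet = load_workbook('datasets.xlsx').active
    for identifier, title, abstract in sheet.iter_rows(min_row=2, max_col=3, values_only=True):
        mcf = {
            'mcf': {'version': 1.0},
            'metadata': {'identifier': identifier},
            'identification': {'title': title, 'abstract': abstract},
        }
        with open(f'metadata/{identifier}.yml', 'w') as fh:
            yaml.safe_dump(mcf, fh, sort_keys=False)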
07:02
And then we use this metadata to create a MapServer configuration. So we also have OGC services on top of the TIFFs and shapefiles that were described. So we generate a MapServer mapfile based on the metadata, using the title and the abstract elements from the metadata in the MapServer configuration.
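That generation step is essentially template rendering. A sketch with Jinja2, heavily abridged (real mapfiles also carry projection, data paths and layer definitions), assuming a single-language MCF like the one sketched earlier:

    import yaml
    from jinja2 import Template

    MAPFILE = Template('''MAP
      NAME "{{ id }}"
      WEB
        METADATA
          "wms_title" "{{ title }}"
          "wms_abstract" "{{ abstract }}"
          "wms_enable_request" "*"
        END
      END
    END
    ''')

    with open('metadata/my-dataset.yml') as fh:
        mcf = yaml.safe_load(fh)

    print(MAPFILE.render(id=mcf['metadata']['identifier'],
                         title=mcf['identification']['title'],
                         abstract=mcf['identification']['abstract']))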
07:21
That WMS endpoint, which is then generated, is pushed back into the metadata in Git so users can also find the service from the metadata. And so this keeps the metadata in the WMS capabilities
07:43
and always aligned with the metadata in the catalog. And then on top of that we have the TerriaJS viewer framework, which is a client-side open source project from Australia. It's a really nice, full-featured web-based GIS system,
08:03
which has CSW search embedded. So you can do a CSW search from TerriaJS to find those records, and vice versa, you can go back to the catalog via the link which exists in the WMS capabilities.
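The search TerriaJS performs can be reproduced from Python with OWSLib (the endpoint is hypothetical):

    from owslib.csw import CatalogueServiceWeb
    from owslib.fes import PropertyIsLike

    # query the catalog the way a CSW client such as TerriaJS would
    csw = CatalogueServiceWeb('https://demo.example.org/pycsw/csw')
    query = PropertyIsLike('csw:AnyText', '%soil%')
    csw.getrecords2(constraints=[query], maxrecords=10)

    for record in csw.records.values():
        print(record.title)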
08:21
And then we added an extra integration in pycsw to say: okay, load this layer that you find in the catalog in TerriaJS. So that's the linkage from Terria to the catalog. So how does that look? For example, this is TerriaJS. So that's the dataset that was found in the catalog
08:43
and then opened in TerriaJS. And you have this whole range of options and a lot of the tools that you find in typical web GIS viewers. So let me now hand over to Tom for the other use case. You want this one? Great, thanks. Thanks, Paul. So up until now we've gone over basically using pygeometa
09:08
in support of a simple configuration for geospatial metadata. We've talked about pycsw as a cataloguing capability to publish these metadata records to, although it could be any standards-based catalog. We've also talked about this mdme user interface web application,
09:24
which is basically driven by a schema. So the technology underneath is that you can configure it with a JSON schema from a standard, and it will automatically populate a web form based on that schema. And Paul gave his use case using all those three tools
09:41
and those interactions and workflows. I'm going to do the same thing on my side with regard to the WMO Information System version 2, WIS2. WIS2 is a next-generation data exchange platform from WMO. It's 194 countries wide, and it's based on open standards
10:01
and a lot of open source tooling there as well. But basically it's the exchange of earth system data: weather, climate, water data. It could be real-time data, it could be archive data, and anything in between. There's a big focus on event-driven workflows in WIS2, which means PubSub: publish and subscribe.
10:21
The MQTT specification is a requirement of WIS2, so basically all of the data publications and all of the metadata publications are done with PubSub. There is no harvesting, as we may have seen in previous and current attempts; there's no CSW harvesting or catalog harvesting.
10:43
Everything is pushed. So you don't have to poll a catalog server every time to look for new updates; you subscribe to that catalog server and it tells you when it pushes an update to you. So it's very different from the traditional polling case. So we apply that same principle that we do for data to metadata as well.
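A sketch of what subscribing to those metadata notifications looks like with paho-mqtt; the broker host is a placeholder and the exact topic is an assumption (WIS2 defines a standard topic hierarchy):

    import json

    import paho.mqtt.client as mqtt

    def on_message(client, userdata, msg):
        # each notification is a small JSON message announcing new data or metadata
        notification = json.loads(msg.payload)
        print(msg.topic, notification.get('id'))

    client = mqtt.Client()  # paho-mqtt 1.x style constructor
    client.on_message = on_message
    client.connect('globalbroker.example.org', 1883)

    # metadata notifications under the WIS2 topic hierarchy
    client.subscribe('origin/a/wis2/+/metadata/#')
    client.loop_forever()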
11:04
So in that spirit we've also developed a number of tools to help weather agencies think about their metadata. Because as much as we do data in weather, climate and water in WMO, all of that needs to be backed by dataset metadata.
11:22
So we need to define what a dataset is, provide a geospatial metadata record that goes into a catalog and that catalog allows somebody to discover that dataset, obviously, and find the appropriate linkages so that you can subscribe to that dataset. So now users need to be able to create and manage metadata
11:41
and be able to publish it. So one tool that we created, a prototype tool, is basically no-code. The way it works is that it helps you manage, verify and publish metadata, using GitHub as a content management platform.
12:02
So no catalog, no metadata editor, nothing like that. That doesn't mean those things are not useful, but for this case we use this approach. So basically we're using GitHub to manage the metadata.
12:22
We send out GitHub links to the metadata stewards or custodians and data providers, and they edit the actual YAML right on GitHub. The records are managed in this MCF, or metadata control file, format. What happens after that is that as soon as that metadata is saved on GitHub,
12:43
we also take advantage of GitHub Actions. So once we save some metadata, the GitHub Action triggers, and what it does is take that MCF format and convert it to the metadata format that we require in WMO, because we've defined something called the WMO Core Metadata Profile,
13:05
which is based on the OGC API - Records core record model, and it creates this WCMP2 metadata record on the fly. And then what it does is take that record, and the GitHub Action is actually connected to an MQTT broker.
13:21
So it sends that metadata record to a broker as a new message. And that's how everything propagates throughout the system. So there are no transactions, no pushing, pulling or harvesting; there's an MQTT event-driven publication. So as soon as it gets onto a broker,
13:41
this circulates throughout the WIS2 ecosystem, and there's a catalog somewhere in that pipeline which takes that MQTT message of the metadata which was pushed through a notification, validates it, does quality assessment and provides a validation report.
14:01
It does some KPI, key performance indicator, checks, all these kinds of things, and then it sends back a report through, you guessed it, PubSub, in an event-driven way. And then the WIS2 catalog itself, our Global Discovery Catalogue, is based on OGC API - Records, and you can use something like MetaSearch to connect to it and discover these metadata.
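Condensed, the conversion-and-publish step performed by the GitHub Action could look roughly like this. This is a sketch: the schema module name, topic and broker are assumptions, and in WIS2 the broker actually carries a notification message that references the record:

    import paho.mqtt.publish as publish
    from pygeometa.core import read_mcf
    # module/class name is an assumption; pygeometa's CLI exposes this schema as 'wmo-wcmp2'
    from pygeometa.schemas.wmo_wcmp2 import WMOWCMP2OutputSchema

    # MCF edited on GitHub -> WCMP2 record (a GeoJSON-based record model)
    record_json = WMOWCMP2OutputSchema().write(read_mcf('metadata/my-dataset.yml'))

    # announce it via PubSub rather than waiting to be harvested
    publish.single('origin/a/wis2/my-centre/metadata',
                   payload=record_json, hostname='broker.example.org')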
14:22
So the basic workflow... Oops, let me go back. I keep going forward. Okay, let's rewind. And we have a broken link.
14:41
That's weird. Does that work? Maybe not. No problem. So the basic idea here is that we have... That would have been a diagram of everything I just described
15:00
and if you go online, you'll find that diagram. But it's a bunch of boxes and arrows, okay? So having said that, what is the end game here? The end game is simplifying the management of the content, and we've stripped it down
15:21
right down to basically editing YAML files. And from there, we let all this machinery and all these GitHub Actions capabilities take over. And at no point are we actually writing a file anywhere. This is all in memory, and this is using all these pipelines
15:41
because all of these tools can be used standalone, on the command line, or as libraries for you to glue inside your application and your pipeline. So this is a super powerful and super flexible approach. Again, these are powerful approaches for these use cases.
16:06
We're not downplaying the need for full-blown metadata editors. They're obviously useful and needed. But in terms of having metadata as a composable and a reproducible pipeline, I think these are really strong examples of trying to move this stuff forward.
16:22
Paul, over to you. Thanks. You see two very motivated people here who work with metadata every day. It seems a boring job, but actually these tools make it fun. So, some takeaways. This MCF format is actually a really interesting format
16:45
for managing metadata in local repositories and then bringing it to the outside. Git storage and CI/CD workflows, the GitHub Actions, are a very traceable, reproducible and participatory approach to metadata management.
17:02
And OGC API - Records implementations, such as pycsw, offer a very clean, machine-readable and human-friendly interface to metadata. And then I have some references for if you're checking this online. Some of these libraries are to be released very soon.
17:21
So MapServer 8.2 has OGC API - Features, and SLD support is coming soon, but we already depend on it very heavily. For pycsw, we're waiting for OGC API - Records to be approved by the OGC, and then the 3.0 release will come.
17:41
But already there's large adoption of the 3.0 version. pyGeoDataCrawler is the Python tool that we have developed that stands on the shoulders of all these great libraries, but does the crawling of the drive. So it scans a drive of datasets for any metadata
18:00
to be extracted to go into the catalog. And yeah, there's a lot going on in this environment. So, any questions? I hope so, so we can continue the discussion. Thank you. Any questions?
18:27
Thanks. Very interesting. I saw in the readme that it also supports STAC as an export. Could you talk a bit about how STAC could be involved? Sure, it's basically another format output.
18:42
So pygeometa supports the STAC item output, the STAC item specification, and basically your configuration would end up going out as a STAC item. So that's how that would basically work. Yeah, and pycsw as a service has STAC API search capabilities, so you can use STAC Browser to browse the pycsw catalog.
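Generating that STAC flavour from the same MCF is, again, just another output schema; a sketch (the class name is an assumption; the CLI equivalent is: pygeometa metadata generate dataset.yml --schema=stac-item):

    from pygeometa.core import read_mcf
    from pygeometa.schemas.stac import STACItemOutputSchema  # class name is an assumption

    # same MCF, different output: a STAC item instead of ISO XML
    stac_item = STACItemOutputSchema().write(read_mcf('metadata/my-dataset.yml'))
    print(stac_item)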
19:08
Thank you. I have a question about this WMO standard. Is it related only to WMO data, or in general to meteorological and climate data
19:20
like ERA5, for example, or other data provided by ECMWF? I guess the answer is yes. So both from all the member states and from ECMWF and EUMETSAT and so on and so forth. There are also federated activities with the Ocean Information Hub and the Earth System Grid Federation and different partners,
19:45
but it's basically all weather, climate and water data, which includes all the countries and specialized centers such as ECMWF. So yeah. Okay, thank you. Thank you too. More?
20:05
Assuming you have a distributed setup, how fast would the updates to the metadata be, so that everyone would see the latest updates?
20:20
Is it the distributed setup with GitHub into the catalog, or the one with MQTT? The GitHub one. The GitHub one? Usually it takes tens of seconds, because the CI/CD sometimes takes some time to kick in, as it's queued, and then it more or less depends on the size of your catalog
20:42
because it goes through the records. But it's usually minutes, which sometimes is too long, because the people doing the actual edits want to see instant results, to see if they did a good job. Yes, looking at the data.
21:01
Thank you. Thank you too. I'm still in your room. Okay, so many thanks. One more time. Thanks.