
European (Inspire) Data Tour


Formal Metadata

Title
European (Inspire) Data Tour
Title of Series
Number of Parts
351
Author
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language
Production Year
2022

Content Metadata

Subject Area
Genre
Abstract
This talk provides concrete tips on how to improve your open data accessibility and discovery. We use real world analysis of what Europe has today, rather than specifications, guidelines, or theory. We recently investigated the linkage between Metadata (CSW Dataset and Service Metadata records) and actual downloadable/viewable data (WFS, WMS, WMTS, and Atom). We also looked at other linkages between the documents (for example, metadata document links, "operatesOn" links, Inspire "ExtendedCapabilities", and other MetadataURL links). Following links isn't as simple as just taking the given URL and resolving it - we will look at "fixing" the URL as well as setting request headers. We will also investigate comparing two different metadata documents (from different URLs) to see if they are "the same" even if they aren't really equivalent. If you are responsible for an INSPIRE catalogue or web service, attend this talk to learn what works (and does not work) based on real world analysis rather than theory. Or just attend to be sure you did not show up in the examples.
Transcript: English (auto-generated)
Okay, let's talk about our first topic, which is pretty focused: just downloading links. Now, if you were to ask a developer to write some code to download the contents of a URL, you might get something like what I've got up on the screen here. Just two lines of code, super easy, done, home early for beers.
But not so fast. It's not quite that easy. It turns out to be a lot more difficult than you'd expect, which is a common theme when dealing with a large variety of infrastructures and software providers. I must admit I wasn't expecting this to be difficult, but it was, so I thought I'd just talk about some of the issues I found.
The first one: just simple redirects. Websites move around all the time, and you often find redirects to send users to the correct page. Totally standard and usually invisible to you. However, if you're writing code to download, you're going to have to handle a bunch of special cases and also be security aware.
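As an illustration of the redirect bookkeeping described here, the following is a minimal sketch using only Python's standard library. The function name, the hop limit of 5, and the list of allowed schemes are assumptions made for the example; none of them come from the talk.

```python
from urllib.parse import urljoin, urlsplit

ALLOWED_SCHEMES = {"http", "https"}  # assumption: refuse file:, ftp:, and so on
MAX_HOPS = 5                         # assumption: an arbitrary hop limit


def next_redirect_target(current_url, location, hops_so_far):
    """Return the absolute URL to request next, or raise ValueError.

    `location` is the raw Location header of a 3xx response and
    `hops_so_far` is the number of redirects already followed.
    """
    if hops_so_far >= MAX_HOPS:
        raise ValueError("too many redirects")
    # The Location header may be relative, so resolve it against the
    # URL that produced the redirect.
    target = urljoin(current_url, location)
    if urlsplit(target).scheme not in ALLOWED_SCHEMES:
        raise ValueError("refusing redirect to " + target)
    return target
```

Resolving each hop by hand like this, rather than letting the HTTP client follow redirects silently, is one way to keep control over where a harvested link is allowed to lead.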
You're also going to have to deal with servers returning errors and timeouts. In this case, when you're doing a large harvest, we're making a lot, a lot, a lot of requests, and sometimes they're saturating the server that we're connecting to, so oftentimes these are our own fault. You're going to have to implement some back-off and retry code, and decide how long you're going to wait for a timeout. In our processing, I think we spent a cumulative time of a couple of days just waiting for timeouts, but they're done in parallel, so the actual clock time wasn't that bad.
We also had to deal with some troublesome HTTPS certificates, ones that were a bit dodgy but that we want to accept anyway. And Java and your browser can really have a disagreement about what's a valid and properly signed certificate.
So you're going to have to write a bunch of custom code to validate that. You're also going to find some weird security procedures that people have put in place. The one that sticks out for me: I was going to a server and it was infinitely redirecting back to itself, and I thought this was just some sort of misconfiguration. But I put it in a browser and it worked. What was actually happening is that you would make your first request, and it would send back a redirect and also attach a session cookie to the redirect. Then, when you made the second request, you would attach the session cookie and it would work. So that wasn't very fun.
But I don't want to spend too much time on security. I'm sure other people have talked a lot about that at the conference and in other areas. But you have to be careful when you're downloading a large number of random URLs from the internet. It's a fairly big attack surface, especially when you're dealing with XML documents and how you're parsing them, and you also want to protect your internal networks.
I just want to quickly talk about headers. Often you go to a URL expecting to get an XML document, but sometimes you get back an HTML page, a blank response, or an error, and you have to send a properly formatted HTTP Accept header. But there's a lot of disagreement between servers about what that is. Sometimes if you put a q-value in, the server will get confused and not respond, or if you don't put a q-value in, it'll get confused and won't respond.
This is the one that I found worked fine, through a lot of trial and error, so I suggest you replicate that, and my notes will be online. And again, for efficiency: some of these links will be to 10-gigabyte data files or a thousand-page PDF, and you really don't want to be downloading them. It's super slow and expensive. So one thing you can do is just download the first little bit of the file, check to see that it's an XML file, as you're expecting, and then make sure that the start tag is one of the tags that you're expecting. That can really make things much more efficient. You can really crank up your speed to 11.
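The "download the first little bit and check the start tag" idea can be sketched roughly as follows, using only Python's standard library. The set of accepted root element names, the 4096-byte limit, and the Accept value are assumptions chosen for the example; in particular, the Accept value is not the one shown on the speaker's slide.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Assumption: the root element names this harvester is willing to accept.
EXPECTED_ROOTS = {"WMS_Capabilities", "WFS_Capabilities", "Capabilities",
                  "feed", "MD_Metadata"}


def fetch_head(url, limit=4096, timeout=30):
    """Read at most `limit` bytes of the response body, then stop."""
    request = urllib.request.Request(url, headers={
        # Assumption: a plain Accept value, not the one from the talk.
        "Accept": "application/xml, text/xml, */*",
    })
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read(limit)


def root_tag(head):
    """Return the local name of the first start tag, or None if not XML."""
    parser = ET.XMLPullParser(events=("start",))
    try:
        parser.feed(head)  # a truncated document is fine for a pull parser
        for _event, element in parser.read_events():
            return element.tag.rsplit("}", 1)[-1]  # drop "{namespace}"
    except ET.ParseError:
        return None
    return None


def worth_downloading(head):
    """True if the first bytes look like one of the expected documents."""
    return root_tag(head) in EXPECTED_ROOTS
```

A caller would pass the result of `fetch_head(url)` to `worth_downloading` and only fetch the full document when it returns True. The standard-library parser is not hardened against hostile XML, so a real harvester would probably swap in a safer parser.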
Just a few more things on efficiency, since we're probably going to be following a million links. One is to use thread pools, so you're doing a bunch of simultaneous requests. As I mentioned before, you just have to be cautious that you're not overloading those machines, because they'll start timing out and giving you errors. Also, do some HTTP request caching: don't keep downloading the same URL, since they're often replicated. And finally, for storage, you can store your documents based on their SHA-2 hash or some other hash. That saves a lot of space, especially when you're doing multiple runs. You have all these big documents and they're often not changing, so if you reference something by SHA-2 or another hash, you're only storing it once, and that can save a huge amount of space. I could go on for hours about all the efficiencies and stuff like that, but I'm just going to mention a few.
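A small sketch of how those three ideas (a bounded thread pool, not fetching the same URL twice, and hash-addressed storage) might fit together, assuming Python's standard library. The `harvest` function and its `fetch` parameter are hypothetical names; `fetch` stands in for whatever downloader is actually used, and the in-memory dictionaries stand in for real storage.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def harvest(urls, fetch, max_workers=8):
    """Fetch each distinct URL once, in parallel, storing bodies by hash.

    `fetch` is any callable taking a URL and returning bytes.
    Returns (url_to_digest, digest_to_body).
    """
    unique_urls = list(dict.fromkeys(urls))  # de-duplicate, keep order
    # A small, fixed pool size limits how hard any one server gets hit.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        bodies = list(pool.map(fetch, unique_urls))
    url_to_digest, store = {}, {}
    for url, body in zip(unique_urls, bodies):
        digest = hashlib.sha256(body).hexdigest()  # SHA-256 is a SHA-2 hash
        url_to_digest[url] = digest
        store.setdefault(digest, body)  # identical documents are kept once
    return url_to_digest, store
```

Because documents are keyed by their digest, a second harvesting run that retrieves unchanged documents adds nothing new to the store.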
Okay, so we've been talking about really nitty-gritty technical issues, about efficiency and things like that, some hints and tricks. But the next problem is sort of a real keystone issue, and it was kind of unexpected: our links don't actually link. Those links are wrong or incomplete. You can't just go willy-nilly copying and pasting a URL out of a service document and into your browser, expecting it to work. I'll just give a couple of examples here.
This is a good one. I hope you can see that; my slides didn't translate into PowerPoint perfectly. So this is a complete link, and you can copy and paste this into a browser and it'll work just fine. You'll get a capabilities document from this, no problem. You can see there are three parameters in this request: the request type, GetCapabilities; the service, WMS; and the version of the service that you want. So this works fine.
Unlike the previous example, which was a fully qualified URL, often you get a URL that just points to the endpoint of your OGC service cluster. You can see the example highlighted in the middle of the screen. You then have to morph that URL into a GetCapabilities request, as I've shown at the bottom. So you have to add request=GetCapabilities and service=WFS, and you might also want to add a version to it. But so far, so good.
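The "morphing" step just described could look something like the sketch below, written with Python's standard library. The function name and the example.org URLs are hypothetical; the service type is passed in by the caller because the bare endpoint does not say which service it is.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def to_get_capabilities(url, service, version=None):
    """Add request, service and (optionally) version if they are missing."""
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    # OGC parameter names are case-insensitive, so compare in lower case.
    present = {name.lower() for name, _value in params}
    if "request" not in present:
        params.append(("request", "GetCapabilities"))
    if "service" not in present:
        params.append(("service", service))
    if version and "version" not in present:
        params.append(("version", version))
    # Note: urlencode re-encodes existing values (for example "/" in a value).
    return urlunsplit(parts._replace(query=urlencode(params)))
```

For example, `to_get_capabilities("https://example.org/ows", "WFS")` yields `https://example.org/ows?request=GetCapabilities&service=WFS`, while a URL that already carries all three parameters is returned unchanged.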
So, pop quiz here. That one in the middle is a typical service link, and there are four requests at the bottom, for a WMS, a WFS, a WMTS, and an Atom feed. You can see they're different for all of them. So how do you know which of those four requests to make?
Yeah, the answer is basically that you don't really know. But you can look around the document for some context, and the simple thing for context is that all the service records will be either of type view or of type download. If it's a view service, it will be for maps, so WMS or WMTS, and if it's a download service, it'll be for actual data, so an Atom feed or a WFS. And again, you can look around in the document. Here's an example, and every record kind of does this differently.
They're all snowflakes. In here you can see it says in the protocol section that it's an OGC WMS, so this is almost certainly a WMS. And if you look at the actual URL, you'll see there's WMS in the URL, so this is a WMS request, and you know what type of service you're talking to. You might think you need some type of AI or machine learning to do this, but I found that with just a few simple heuristics, a little bit of context, and looking at a bunch of examples, you can get the right answer almost all the time. So let's just summarize where we are right now.
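As a rough illustration of the kind of heuristics just described (view or download type first, then hints in the protocol text and in the URL), here is a minimal Python sketch. The function name, the order in which hints are checked, and the token matching are assumptions for the example, not the speaker's actual rules.

```python
import re

# View services serve maps; download services serve actual data.
CANDIDATES = {"view": ("WMS", "WMTS"), "download": ("WFS", "ATOM")}
ALL_SERVICES = ("WMS", "WMTS", "WFS", "ATOM")


def guess_service(service_type, protocol="", url=""):
    """Guess WMS, WMTS, WFS or ATOM from record context; None if unsure."""
    candidates = CANDIDATES.get(service_type.strip().lower(), ALL_SERVICES)
    # Assumption: the declared protocol is checked before the URL.
    for hint in (protocol, url):
        tokens = set(re.findall(r"[a-z]+", hint.lower()))
        matches = [name for name in candidates if name.lower() in tokens]
        if len(matches) == 1:
            return matches[0]
    return None  # not enough context; the caller has to try something else
```

With a view record whose protocol reads `OGC:WMS`, this returns `"WMS"`; with a view record pointing at a bare endpoint and no protocol hint, it returns None rather than guessing.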
This is for downloading. First, you find the links in the service documents; that's over there on the far left. As I just talked about, we need to use context in the record and some heuristics to transform that URL into something usable; that's the big brain. Then we need to actually go to the internet and efficiently download the capabilities document; that's the hear-no, see-no, speak-no-evil monkeys, because of all the special cases you're going to have to handle. And you end up, on the far right, with an OGC or Atom capabilities file, and it's all good. Okay, that finishes the first section. We're moving on to the last one, which is about linkage between documents.
And again, Jody and Jeroen talked about this in quite a bit more depth yesterday, so I'm not going to go through all of that again. I could talk for hours about how linking works and all the special cases you have to handle, but I'm just going to simplify and move a bit quickly. I did want to make sure there was a reference here, though, because it's not obvious from reading the specifications how to do this.
Okay, back to our scary diagram. We just talked about that top line there: the service document links to the capabilities document. What I want to talk about now is the highlighted section on the right, which is the connections between the capabilities document and the data and the metadata. So I just want to flip the capabilities part of this diagram on its side, so we can see a little more detail about what goes on in a capabilities document for INSPIRE.
So there's our capabilities document again; I've just flipped it on its side. There are two major components to an INSPIRE capabilities document. First is the INSPIRE extended metadata, in green at the top, which is a header to the document with information about the entire document. It may or may not be present. There's also a set of layers at the bottom, and I put layers in quotes because they're called feature types in a WFS and items in an Atom feed, but each contains information specific to a layer. Another note is that data sets can be comprised of multiple layers.
So, just drilling into that section, you find two parts in the INSPIRE extended metadata. One is a URL backlink to the service metadata, and it may or may not be there. We talked about the link from the service record to the capabilities at the beginning of this talk; this is the opposite, from the capabilities back to the service document. I'm not really going to talk about that right now. It might also have a set of spatial data set identifiers, usually zero, one, or several, which tell you which data sets the service provides maps or data for.
That links back, through an ID which you perform a search for, to a data set metadata record, as shown by the black arrow saying "search". But let's look at a quick example here. Up at the top, we have an INSPIRE extended capabilities section that you might find inside a capabilities document. We can see it has a service record backlink, which I said I'm not really going to talk about, and an INSPIRE code and code space, right in the middle of the screen there. Then at the bottom we have a data set metadata record, and you can see those records have a unique data set identifier, which has a code and a code space. Basically, you have to do searches between these two types of documents to link them up in either direction.
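The search-based matching described above can be sketched as an index lookup, assuming the identifiers have already been extracted from the XML. The record layout (a dict with `id` and `identifiers` keys) and both function names are hypothetical, chosen only for this illustration.

```python
def _key(code, namespace):
    """Normalise a (code, code space) pair so minor whitespace doesn't matter."""
    return (code.strip(), (namespace or "").strip())


def index_datasets(dataset_records):
    """Map (code, code space) to the ids of harvested data set records."""
    index = {}
    for record in dataset_records:
        for code, namespace in record["identifiers"]:
            index.setdefault(_key(code, namespace), []).append(record["id"])
    return index


def link_capabilities(spatial_dataset_ids, index):
    """Return the data set record ids matched by a capabilities document."""
    matched = []
    for code, namespace in spatial_dataset_ids:
        matched.extend(index.get(_key(code, namespace), []))
    return matched
```

Building the index once over all harvested data set records makes each capabilities document a cheap dictionary lookup, and the same index can be inverted to go from a data set record to the services that reference it.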
Okay, back to our diagram, going on to the lower half of this, in blue: the layer section. Each layer might have a metadata URL link, and that's a direct link to another data set document that you can download using the techniques we talked about earlier. But a big note here is that the document it points to is probably not one of the ones that you harvested. The reason is that you probably harvested from a higher-level catalog, like a country-level catalog, and these links are probably pointing to regional catalogs, and those two metadata records will almost certainly be different. I don't think I've seen any that were actually perfectly the same. The layer can also have a spatial data set identifier, which links the layer to a data set record via an ID.
Let's look at an example of that. We have an example capabilities layer up at the top. It has a spatial data set identifier; I just put one in there, so it's very original: the code is "identifier" and the namespace is "namespace". Just a note here: WFS, WMS, and Atom all have a different way of putting that identifier in there, so you just have to handle that, but that's usually not too difficult. You can use this to search by ID back into your set of data set metadata records.
Also, in the middle of the screen there, you see a metadata URL link. That's a direct link to another data set metadata XML document. There are a lot more details that I'm not really getting into here, and I'm going to stop the detailed XML explanations now and just talk about things at a high level. But I did want to make sure that I showed you what was going on, so if you're ever asked to look at one of these things, you have an idea of what's going on, because, like I said, reading through the specification is quite difficult.
But you don't need to read what's up there; I'm just going to summarize. Basically, you harvest all the service metadata records and all the data set metadata records. You follow all the links in your service metadata records to capabilities documents, then you follow all the links in your capabilities documents, and then you do a bunch of matching based on your extracted identifiers. Then you can link up your data sets and the actual OGC services. I'm simplifying quite a bit here, but, phew, that's a lot of work. Excuse me, Tristan. It's not easy to go from a data set metadata record to the data or the service. You have to process all the service records and all the capabilities documents before you can start doing that matching step. It takes a lot of work, and there are a lot of special cases, because everybody has a snowflake way of doing things. So it can be quite difficult.
But INSPIRE realized this is pretty complicated, and they're in the process of moving to a simplified link model, as I've shown here. The big difference is that there is a direct link to the capabilities document right inside the data set metadata record. Mind blown, it's totally easy. You don't have to go searching; you just follow the link. Now you can go directly from the metadata record to the capabilities document without having to harvest and search through all your service documents and do a ton of processing. And if you also add the metadata URL for each of the layers in your capabilities file, then you have really simple bi-directional links; that's the dashed arrow. Easy peasy lemon squeezy. I can't really emphasize how much easier this is, because these are just direct links, as opposed to having to search through a database of all your endpoints and deal with that. Okay, done with the hard part. Let's summarize with some rubber-meets-the-road type of advice.
Let's start with downloading links, which I talked about at the beginning. First, use the complete URL. Don't make people guess what the service is or what parameters they need to actually talk to that service. Put the full thing in there. The next thing is to make it simple to download. The easiest test is to use some type of command-line tool like curl to make sure your link resolves and returns XML. Don't just copy and paste it into a browser; your browser does a bunch of magic. And make sure your SSL certificates are both valid and widely trusted. And again, keep access simple: don't put any strange security procedures in there. Testing with curl will catch most of these. And if you remember my story about the infinite redirect looking for a session cookie...
Yeah, enough said. And don't put people in the bad dog box. This is something that happened a few times. You're following a large number of links, thousands and thousands of them, especially in a large set of metadata URL links, and a lot of them aren't correct and point to locations that return a 404 Not Found. Unfortunately, some servers detect this as a URL security scan, get a little upset with you, and don't allow you to connect to that server anymore. At that point there's not a lot you can do other than email the system administrator and ask them to be nice to you.
Finally, let's talk about linking together data and metadata. Some of the capabilities documents are a bit hilarious. They have 10 to 15 thousand layers in them, and by hilarious I mean totally not funny at all. It's just too many.
It really takes a long time to process, and there are a lot of links to follow. I recommend you have a capabilities document that's about just a single data set. That keeps things really obvious, and it solves a bunch of other problems I haven't talked about. This is really easy in something like GeoServer: just break up your layers into workspaces, one workspace per data set, and you can link to a workspace-specific capabilities document. One data set, one capabilities file. I know Jody is really interested in this, because he just did a bunch of work in GeoServer to make that more obvious.
Again, in the capabilities, for each of the layers, make sure there is a metadata URL link to the data set metadata record. Everyone will love you. Everyone. Seriously, everybody. If those links aren't there, people are probably just going to google your layer name from your WMS and then give up.
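One way a service owner might check the "metadata URL on every layer" advice is sketched below for a WMS 1.3.0 capabilities document, using Python's standard library. The function name is hypothetical, and the check only covers WMS; WFS and Atom express the same link differently and would need their own handling.

```python
import xml.etree.ElementTree as ET

WMS = "{http://www.opengis.net/wms}"  # XML namespace used by WMS 1.3.0


def layers_missing_metadata_url(capabilities_xml):
    """Return the names of named layers that have no MetadataURL child."""
    # Caution: for untrusted documents a hardened XML parser is advisable.
    root = ET.fromstring(capabilities_xml)
    missing = []
    for layer in root.iter(WMS + "Layer"):
        name = layer.findtext(WMS + "Name")
        if name is None:
            continue  # unnamed layers are only grouping containers
        # Only direct children are checked: a parent layer's MetadataURL
        # is not treated as applying to its child layers.
        if layer.find(WMS + "MetadataURL") is None:
            missing.append(name)
    return missing
```

Running this against a capabilities document before publishing gives a simple list of layers that still need a link back to their data set metadata record.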
That metadata URL link on each layer is really user-friendly and really makes things easier. And you don't want to have to do all that processing and searching that we talked about before, so use the INSPIRE simplified link model: that gives easy and obvious connections. Again, this makes life so much easier. As I said earlier, everybody will love you. It is so good. And if you use the simplified link model and also have what I said on the previous slide, which is a metadata URL for every layer, you have this really nice bi-directional connectivity. It's really awesome, and it's so easy to use.
So if you're only going to do four things: one, use the full URL and make sure it resolves with curl; two, have one data set per capabilities document; three, add a metadata URL link to all your layers, which gives you your data-to-metadata link; and four, use the INSPIRE simplified data set linkage, which gives you metadata-to-data linkage. Then everything is much easier.
So, where do we go from here? As I mentioned at the beginning, I put together an open source project that does a bunch of harvesting, follows all the links everywhere, and finds all the connections I talked about, and a lot more that I haven't talked about.
This code already exists, and you don't have to write it again. It can be leveraged to automatically fix up records to go along with all my recommendations. It's super awesome; you can really change a lot of things. We talked about those incomplete or problematic URLs at the beginning: I can tell you which ones are problematic and what they should be, so, boom, fixed. Are you missing extended capabilities? I can fix that. I can tell you what the backlink to the service metadata document is. I can tell you what the spatial data set identifiers are for that capabilities document. I can tell you how to split that document up so it's one capabilities document per data set. Are your layers missing metadata URLs or spatial data set identifiers? I can tell you what they should be, and that's a huge one, because these types of things make your services so much easier to use for people who are experts in the field and don't have a bunch of code lying around. And if you're not currently doing the INSPIRE simplified data set linkage, I can tell you what capabilities links you need to add and what they should look like inside your data set record.
And this really is a huge win, because it's so easy to use when they're set up this way. Most of the hard work is already done, and you don't have to redo it. It's open source, so you can just download it and go. To be honest, there's still a lot of work to do, and obviously I can't make magic connections between data sets and metadata when they aren't already present at some level. But I really think that tools based on this can make data managers' and editors' lives way, way easier, and make the resulting data and metadata products much easier for users and for the people actually creating that metadata. What more can you want? Thank you very much.
Thank you very much, Dave. We're going to have to go directly to our next speaker, but thank you very much.