European (Inspire) Data Tour
Formal Metadata
Number of Parts | 351
License | CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers | 10.5446/68957 (DOI)
Production Year | 2022
Transcript: English (auto-generated)
00:01
Okay, let's talk about our first topic, which is pretty focused: just downloading links. Now, if you were to ask a developer to write some code to download the contents of a URL, you might get something like what I've got up on the screen here. Just two lines of code. Super easy,
00:21
done, home early for beers. But not so fast. It's not quite that easy. It turns out to be a lot more difficult than you'd expect, which is a common theme when dealing with a large variety of infrastructures and software providers. I must admit I wasn't expecting this to be difficult, but it was, so I thought I'd just talk about some of the issues
00:44
I found. The first one is just simple redirects. Websites move around all the time, and you often find redirects that send users to the correct page. Totally standard, and usually invisible to you. However, if you're writing code to download, you're gonna have to handle a bunch of special cases and also be security aware.
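The redirect handling described above could be sketched roughly like this in Python. The talk's own code is not part of the transcript, so the function name `next_redirect_url` and the hop limit `MAX_REDIRECTS` are assumptions for illustration, and a real harvester would need more checks than these two.

```python
from urllib.parse import urljoin, urlsplit

MAX_REDIRECTS = 10  # assumed limit, not a value given in the talk


def next_redirect_url(current_url: str, location: str, hops: int) -> str:
    """Return the absolute URL to follow for a redirect, or raise ValueError."""
    if hops >= MAX_REDIRECTS:
        raise ValueError("too many redirects")
    # The Location header may be relative, so resolve it against the current URL.
    target = urljoin(current_url, location)
    # Only follow plain web redirects; refuse file:, ftp:, and similar schemes.
    if urlsplit(target).scheme not in ("http", "https"):
        raise ValueError(f"refusing non-HTTP redirect: {target}")
    return target
```

Handling each hop explicitly, instead of letting the HTTP client follow redirects silently, is one way to keep a hop counter and apply a security check before every request.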
01:07
You're also going to have to deal with servers returning errors and timeouts. And in this case, when you're doing a large harvest, we're making a lot, a lot, a lot of requests, and
01:20
sometimes they're saturating the server that we're connecting to. So oftentimes these are our own fault. You're gonna have to implement some back-off and retry code and decide how long you're going to wait for a timeout. In our processing, I think we spent a cumulative time of a couple of days just waiting for timeouts, but they were done in parallel.
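A minimal sketch of the back-off and retry idea, in Python. The delays, the attempt count, and the names `backoff_delay` and `with_retries` are assumptions rather than the values used in the harvester from the talk, and which exceptions count as retryable depends on the HTTP client you use.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential delay in seconds for a zero-based attempt, capped at `cap`."""
    return min(cap, base * (2 ** attempt))


def with_retries(fetch: Callable[[], T], max_attempts: int = 4,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `fetch` until it succeeds, sleeping longer after each failure."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller record the failure
            # Random jitter so parallel workers do not all retry at the same moment.
            sleep(backoff_delay(attempt) * random.uniform(0.5, 1.0))
    raise RuntimeError("max_attempts must be at least 1")
```

Passing `sleep` in as a parameter is only there so the function can be exercised without actually waiting.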
01:44
So the actual clock time wasn't that bad. We also had to deal with some troublesome HTTPS certificates, ones that were a bit dodgy but that we wanted to accept anyway. And Java and your browser can really have a disagreement about what's a valid and properly signed certificate.
02:04
So you're gonna have to write a bunch of custom code to validate that. You're also going to find some weird security procedures that people have put in place. The one that sticks out for me: I was going to a server and it was infinitely redirecting back to itself, and I thought this was just some sort of misconfiguration.
02:24
But I put it in a browser and it worked. What was actually happening is you would make your first request, and it would send back a redirect and also attach a session cookie to the redirect. And then when you made the second request, you would attach the session cookie and it would work. So
02:42
that wasn't very fun. But I don't want to spend too much time on security. I'm sure other people have talked a lot about that in the conference and in other areas. But you have to be careful when you're downloading a large number of random URLs from the Internet. It's a fairly big attack surface,
03:00
especially when you're dealing with XML documents and how you're parsing them, and you also want to protect your internal networks. I just want to quickly talk about headers. Often you go to a URL expecting to get an XML document, but sometimes you get back an HTML page, a blank response, or an error, and you have to send a properly
03:25
formatted HTTP or HTTPS Accept header. But there's a lot of disagreement between servers about what that is. Sometimes if you put a q-value in, it'll get confused and not respond, or if you don't put a q-value in, it'll get confused and won't respond.
03:44
This is the one that I found worked fine, through a lot of trial and error, so I suggest you replicate that, and my notes will be online. And again, for efficiency: some of these links will be to 10 gigabyte data files or a thousand-page PDF,
04:04
and you really don't want to be downloading them. It's super slow and expensive. So one thing you can do is just download the first little bit of your file, check to see if it's an XML file, which is what you're expecting, and then make sure that the start tag is one of the tags that you're expecting. That can really make things much more efficient. You can really crank up your speed to 11.
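The "download the first little bit and check the start tag" trick might look something like the following Python sketch. It inspects only the first bytes of a response with a regular expression instead of a full XML parser. The set `EXPECTED_ROOTS` is an illustrative assumption (root element names for WMS 1.3.0, WFS, WMTS, Atom, and ISO metadata documents), not the list used in the talk.

```python
import re
from typing import Optional

# Skip an optional UTF-8 BOM, then any XML declaration, processing instructions,
# comments, or DOCTYPE, and capture the local name of the first element.
_ROOT_TAG = re.compile(
    rb"^(?:\xef\xbb\xbf)?\s*"
    rb"(?:<\?.*?\?>\s*|<!--.*?-->\s*|<!DOCTYPE[^>]*>\s*)*"
    rb"<(?:[A-Za-z_][\w.\-]*:)?([A-Za-z_][\w.\-]*)",
    re.DOTALL,
)

# Assumed examples of acceptable root elements.
EXPECTED_ROOTS = {"WMS_Capabilities", "WFS_Capabilities", "Capabilities", "feed", "MD_Metadata"}


def sniff_root_tag(head: bytes) -> Optional[str]:
    """Return the local name of the first XML element in `head`, or None."""
    match = _ROOT_TAG.match(head)
    return match.group(1).decode("ascii") if match else None


def looks_expected(head: bytes) -> bool:
    """True if the first bytes look like one of the documents we want."""
    return sniff_root_tag(head) in EXPECTED_ROOTS
```

In practice `head` would be the first few kilobytes read from a streamed response, after which the connection is closed if the check fails. An HTTP `Range` request header is another option, although servers are free to ignore it.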
04:28
Just a few more things on efficiency, since we're probably going to be following a million links. One is to use thread pools, so you're doing a bunch of simultaneous requests. As I mentioned before,
04:40
you just have to be cautious that you're not overloading those machines, because they'll start timing out and giving you errors. Also do some HTTP request caching: don't keep downloading the same URL, since they're often replicated. And finally, for storage, you can store your documents based on their SHA-2 hash or some other hash.
05:00
That saves a lot of space, especially when you're doing multiple runs. You have all these big documents and they're often not changing, so if you reference something by SHA-2 or by another hash, you're only storing it once. That can save a huge amount of space. And I could go on for hours about all the efficiencies and stuff like that, but I'm just going to mention a few.
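As a rough illustration of storing documents by hash, here is a Python sketch that writes each document under its SHA-256 digest (SHA-256 is one member of the SHA-2 family). The directory layout and the function name `store_document` are assumptions, not the layout used by the project described in the talk.

```python
import hashlib
from pathlib import Path


def store_document(root: Path, content: bytes) -> Path:
    """Write `content` under its SHA-256 hex digest; identical bytes are stored once."""
    digest = hashlib.sha256(content).hexdigest()
    # A two-character subdirectory keeps any single directory from getting huge.
    path = root / digest[:2] / digest
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(content)
    return path
```

Harvest runs can then record only the digest next to each URL, so an unchanged document fetched in a later run costs no extra disk space.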
05:21
Okay, so we've been talking about really nitty-gritty technical issues about efficiency and things like that, some hints and tricks. But the next problem is sort of a real keystone issue, and it was kind of unexpected: our links don't actually link. Those links are wrong or incomplete.
05:44
You can't just go willy-nilly copying and pasting a URL from a service document into your browser expecting it to work. I'll just give a couple of examples here. This is a good one. I hope you can see that. My slides are a little bit... they didn't translate into PowerPoint
06:03
perfectly. But this is a complete link, and you can copy and paste this into a browser and it'll work just fine. You'll get a capabilities document from this, no problem. And you can see there are three parameters in this request: the request type, GetCapabilities; the service, a WMS; and the version of the service that you want.
06:26
So this works fine. Unlike the previous example, which was a fully qualified URL, often you get a URL that just points to the endpoint of your OGC service cluster. You can see the example highlighted in the middle of the screen.
06:44
You then have to morph that URL into a GetCapabilities request, as I've shown at the bottom. So you have to add request=GetCapabilities and service=WFS, and you might also want to add a version to it. But so far, so good.
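Turning a bare endpoint into a GetCapabilities request can be sketched with Python's standard `urllib.parse`. The function name `to_get_capabilities` and the example hosts are hypothetical. The sketch keeps any other query parameters already on the endpoint and replaces existing `request`, `service`, and `version` values regardless of case.

```python
from typing import Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def to_get_capabilities(endpoint: str, service: str, version: Optional[str] = None) -> str:
    """Build a GetCapabilities URL from a service endpoint URL."""
    parts = urlsplit(endpoint)
    # Keep unrelated parameters, drop any request/service/version already present
    # (parameter names are compared case-insensitively) so nothing is duplicated.
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k.lower() not in ("request", "service", "version")]
    kept += [("service", service), ("request", "GetCapabilities")]
    if version:
        kept.append(("version", version))
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

For example, `to_get_capabilities("https://example.org/geoserver/ows", "WFS", "2.0.0")` yields a URL whose query string is `service=WFS&request=GetCapabilities&version=2.0.0`.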
07:03
So, pop quiz here. The one in the middle, that's a typical service link, and there are four requests at the bottom, and these are for a WMS, a WFS, a WMTS, and an Atom feed. You can see they're different for all of them. So how do you know which of those four requests
07:24
do you make? Hey, it's a little...
07:43
Yeah, the answer is basically that you don't really know. But you can look around the document for some context, and the simple thing for context is that all the service records will be either of type view or
08:01
download. If it's a view service, it will be for maps, so WMS or WMTS, and if it's a download service, it'll be for actual data, so an Atom feed or a WFS. And again, you can look around in the document. Here's an example, and every record kind of does this differently.
08:22
They're all snowflakes. So in here you can see it says in the protocol section that it's an OGC WMS, so this is almost certainly a WMS. And if you look at the actual URL, you'll see there's WMS in the URL. So this is a WMS request, and you would know what type of service you're talking to. You might think you need some type of AI or machine learning to do
08:46
this, but I found that with just a few simple heuristics, a little bit of context, and looking at a bunch of examples, you can get the right answer almost all the time. So let's just summarize where we are right now.
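The kind of heuristic described here could be sketched as follows in Python. This is a guess at the general shape only: the actual rules in the project from the talk are not in the transcript, and the function name, the example protocol string `OGC:WMS`, and the ordering of candidates are all assumptions.

```python
import re
from typing import List

KNOWN = ("WMTS", "WMS", "WFS", "ATOM")
# View services serve maps; download services serve the actual data.
BY_SERVICE_TYPE = {"view": ["WMS", "WMTS"], "download": ["WFS", "ATOM"]}


def guess_services(url: str, protocol: str = "", service_type: str = "") -> List[str]:
    """Return candidate service kinds for a link, most likely first."""
    allowed = BY_SERVICE_TYPE.get(service_type.lower(), list(KNOWN))
    for text in (protocol, url):  # the protocol field is treated as the stronger hint
        tokens = set(re.findall(r"[a-z]+", text.lower()))
        hits = [s for s in KNOWN if s.lower() in tokens]
        if hits:
            # Prefer hits consistent with the view/download type, if there are any.
            return [s for s in hits if s in allowed] or hits
    # No explicit hint: fall back to everything the service type allows.
    return allowed
```

When more than one candidate comes back, a harvester could simply try each GetCapabilities request in turn and keep the first one that returns the expected root element.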
09:02
This is for downloading. First you find the links in the service documents, that's over there on the far left. As I just talked about, we need to use context in the record and some heuristics to transform that URL into something usable. That's the big brain. And then we need to actually go to the internet and efficiently download the capabilities document;
09:21
that's the hear-no, see-no, speak-no-evil monkeys, because of all the special cases you're gonna have to handle. And you end up, on the far right, with an OGC or an Atom capabilities file, and it's all good. Okay, that finishes the first section. We're moving on to the last one, which is about linkage between documents.
09:43
And again, Jody and Jeroen yesterday talked about this in quite a bit more depth, so I'm not gonna go through all of that again. I could talk for hours explaining how linking works and all the special cases you have to handle, but I'm just gonna simplify and move a bit quickly. But I did want to make sure there was a reference here, because it's not obvious from reading the specifications how to do this.
10:06
Okay, back to our scary diagram. We just talked about that top line there: the service document links to the capabilities document. And what I want to talk about is the highlighted section on the right, which is the connections from the metadata
10:21
to the data and vice versa. Sorry, the connections between the capabilities document and the data and the metadata. So I just want to flip the capabilities in this diagram on its side so we can see a little more detail of what goes on in a capabilities document for INSPIRE.
10:44
So there's our capabilities document again. I just flipped it on its side, and there are two major components to an INSPIRE capabilities document. First is the INSPIRE extended metadata. That's in green at the top, and it's a header to the document, which holds information about the entire document.
11:05
And it may or may not be present. There's also a set of layers at the bottom, and I put layers in quotes because they're called feature types in a WFS and items in an Atom feed, but each contains information specific to a layer. And
11:22
another note is that data sets can be comprised of multiple layers. So, just drilling into that section, you find two parts in the INSPIRE extended metadata. One is a URL backlink to the service metadata,
11:41
and it may or may not be there. We talked about going from the service record to the capabilities at the beginning of this talk. This is the opposite: capabilities back to the service document. I'm not really going to talk about that right now. You might also have a set of spatial data set identifiers, usually zero, one, or several, which tell you which data set the service provides maps or data for.
12:07
That links back through an ID, which you perform a search for, to a data set metadata record, as shown by the black arrow labeled "search". But let's look at a quick example here.
12:23
Up at the top, we have an INSPIRE extended capabilities section that you might find inside a capabilities document, and we can see it has a service record backlink, which I said I'm not really going to talk about, and an INSPIRE code and code space, right in the middle of the screen there. And then at the bottom we have a
12:45
data set metadata record, and you can see those records will have a unique data set identifier, which will have a code and a code space. Basically, you have to do searches between these two types of documents to sort of link them up in either direction.
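The search step between the two document types amounts to a lookup on the (code space, code) pair. A minimal Python sketch, assuming the identifiers have already been extracted from the XML into plain dictionaries; the field names `namespace`, `code`, and `id` are hypothetical, and real records usually need more normalization than stripping whitespace.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

DatasetKey = Tuple[str, str]  # (code space / namespace, code)


def make_key(namespace: str, code: str) -> DatasetKey:
    """Normalize an identifier pair; only whitespace is trimmed in this sketch."""
    return (namespace.strip(), code.strip())


def index_datasets(records: Iterable[dict]) -> Dict[DatasetKey, List[str]]:
    """Map each (namespace, code) to the ids of data set records that declare it."""
    index: Dict[DatasetKey, List[str]] = defaultdict(list)
    for rec in records:
        index[make_key(rec["namespace"], rec["code"])].append(rec["id"])
    return index


def link_capabilities(index: Dict[DatasetKey, List[str]],
                      identifiers: Iterable[DatasetKey]) -> List[str]:
    """Return ids of data set records matched by a capabilities document's identifiers."""
    matched: List[str] = []
    for namespace, code in identifiers:
        matched.extend(index.get(make_key(namespace, code), []))
    return matched
```

Building the index once over all harvested data set records makes each capabilities lookup a dictionary access, and the same index can serve the reverse direction by recording which capabilities documents hit each key.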
13:05
Okay, back to our diagram, going on to the lower half of this, in blue: the layer section. Each layer might have a metadata URL link, and that's a direct link to another data set document that you can download using the techniques
13:24
we talked about earlier. But a big note on this is that the document that this points to is probably not one of the ones that you harvested. The reason for that is you probably harvested from a higher-level catalog, like a country-level catalog, and these are probably pointing to regional catalogs, and those two
13:45
metadata records will almost certainly be different. I don't think I've seen any that were actually perfectly the same. And the layer also has, sorry, a spatial data set identifier that links the layer to
14:01
a data set record via an ID. Let's look at an example of that. We have an example capabilities layer up at the top. It has a spatial data set identifier. I just put one in there, so it's very original: the code is "identifier" and the namespace is "namespace". And
14:22
just a note here: WFS, WMS, and Atom all have a different way of putting that identifier in there, so you just have to handle that, but that's usually not too difficult. You can use this to search by ID back into your set of data set metadata records. And
14:45
also, sort of in the middle of the screen there, you see a metadata URL link. That's a direct link to another data set metadata XML document. There are a lot more details that I'm not really getting into here, and I'm gonna stop the sort of detailed
15:02
XML explanations now and just talk about things at a high level. But I did want to make sure that I showed you what was going on, so if you're ever asked to look at one of these things, you have an idea of what's going on, because, like I said, reading through the specification is quite difficult.
15:21
But you don't need to read what's up there. I'm just going to summarize. Basically, you harvest all the service metadata records, you harvest all the data set metadata records, you follow all the links in your service metadata records to capabilities documents, then you follow all the links in your capabilities documents, and then you do a bunch of matching based on your extracted
15:41
identifiers, and then you can sort of link up your data sets and actual OGC services. I'm simplifying quite a bit here, but, poof, that's a lot of work. Excuse me, Tristan. It's not easy to go from a data set metadata record to the data or service.
16:03
You have to process all the service records and process all the capabilities documents before you can start doing that matching step. It's not very easy to do that. It takes a lot of work, and there are a lot of special cases because everybody has a snowflake way of doing things. So it can be quite difficult.
16:22
But INSPIRE realized this is pretty complicated, and they're in the process of moving to a simplified link model, as I've shown here. The big difference is that there is a direct link to the capabilities document right inside the
16:40
data set metadata record. Mind blown, it's totally easy. You don't have to go searching; you just follow the link. It's really easy. Now you can go directly from the metadata record to the capabilities document without having to harvest and search through all your service documents and do a ton of processing. And then if you also
17:03
add the metadata URL for each of the layers in your capabilities file, then you have bi-directional links. It's really simple. That's the dashed arrow, and it's really big. You know, easy peasy lemon squeezy. I can't really emphasize how much easier this is, because these are just direct links, as opposed to
17:26
having to search through and keep a database of all your endpoints and deal with that. Okay, we're done with the hard part. Let's summarize with some rubber-meets-the-road type of advice.
17:45
Let's start with downloading links, which I talked about at the beginning. First, use the complete URL. Don't make people guess what the service is or what parameters you need to actually talk to that service. Put the full thing in there.
18:01
The next thing is to make it simple to download. The easiest test is to use some type of command-line tool like curl to make sure your link resolves and returns XML. Don't just copy and paste it into a browser; your browser does a bunch of magic. And make sure your SSL certificates
18:22
are both valid and widely trusted. And again, keep access simple: don't put any strange security procedures in there. Testing with curl will catch most of these. And if you remember my story about the infinite redirect looking for a session cookie,
18:41
yeah, enough said. So don't put people in the bad dog box. This is something that happened a few times. You're following a large number of links, thousands and thousands of these links, especially in a large set of metadata URL links, and a lot of them aren't correct and they're pointing to
19:02
locations that return a 404 Not Found. Unfortunately, some servers detect this as a URL security scan, and they get a little upset with you and don't allow you to connect to that server anymore. At that point there's not a lot you can do other than email the system administrator and ask them
19:21
to be nice to you. And finally, let's talk about linking together data and metadata. Some of the capabilities documents are a bit hilarious: they have 10 to 15 thousand layers in them. And by hilarious, I mean totally not funny at all. It's just too many.
19:47
It really takes a long time to process, and there are a lot of links to follow. I recommend you have a capabilities document that's about just a single data set. That keeps things really obvious, and it solves a bunch of other problems I haven't talked about.
20:02
This is really easy in something like GeoServer. Just break up your layers into workspaces, one workspace per data set, and you can link to a workspace-specific capabilities document. That's really easy. One data set, one capabilities file. I know Jody is really interested in this because he just did a bunch of work in GeoServer to make that more obvious.
20:24
Again, in the capabilities, for each of the layers make sure there is a metadata URL link to the data set metadata record. Everyone will love you. Everyone. Seriously, everybody. If those links aren't there, people are probably just going to google your layer name from your WMS and then give up.
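A quick way to check your own service for this is to list the named layers that have no metadata URL. The Python sketch below does that for a WMS 1.3.0 capabilities document using the standard library parser. It is an illustration only: it assumes the WMS 1.3.0 namespace, ignores WFS and Atom (which encode this differently), and `xml.etree.ElementTree` is not hardened against malicious XML, so it should only be pointed at documents you trust.

```python
import xml.etree.ElementTree as ET
from typing import List

WMS_NS = {"wms": "http://www.opengis.net/wms"}  # namespace used by WMS 1.3.0


def layers_missing_metadata_url(capabilities_xml: str) -> List[str]:
    """Return the names of WMS layers that have no MetadataURL child element."""
    root = ET.fromstring(capabilities_xml)
    missing: List[str] = []
    for layer in root.iterfind(".//wms:Layer", WMS_NS):
        name = layer.findtext("wms:Name", default=None, namespaces=WMS_NS)
        if name is None:
            continue  # grouping layers without a Name are skipped in this sketch
        if layer.find("wms:MetadataURL", WMS_NS) is None:
            missing.append(name)
    return missing
```

An empty result means every named layer carries at least one metadata URL; whether those URLs actually resolve still has to be checked separately.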
20:46
That metadata URL link on each layer is really user-friendly and really makes things easier. And you don't want to have to do all that processing and searching that we talked about before, so use the INSPIRE
21:02
simplified link model. That gives easy and obvious connections. And again, this makes life so much easier. As I said earlier, everybody will love you. Everyone. It is so good. And if you have used the simplified link model and also
21:22
have what I said on the previous slide, which is a metadata URL for every layer, you have this really nice bi-directional connectivity, and it's really awesome. It's so easy to use. So if you're only going to do four things: one, use the full URL and make sure curl resolves it;
21:42
two, one data set per capabilities document; three, add a metadata URL link to all your layers, which gives you your data-to-metadata link; and four, use the INSPIRE simplified data set linkage, which gives you metadata-to-data linkage.
22:00
Yeah, and then everything is much easier. So, where do we go from here? As I mentioned at the beginning, I put together an open source project that does a bunch of harvesting, follows all the links everywhere, and finds all the connections I talked about, and a lot more that I haven't talked about.
22:20
This code already exists and you don't have to write it again. And this code can be leveraged to automatically fix up records to go along with all my recommendations. It's super awesome. You can really change a lot of things. So we talked about those incomplete or problematic URLs at the beginning.
22:41
I can tell you which ones are problematic and what they should be, so boom, fixed. Are you missing extended capabilities? I can fix that. I can tell you what the backlink is to the service metadata document.
23:00
I can tell you what the spatial data set identifiers are for that capabilities document. I can tell you how to split that document up so it's one capabilities document per data set. And are your layers missing metadata URLs or spatial data set identifiers? I can tell you what they should be, and that's a huge one, because
23:21
these types of things make your services so much easier to use for people who aren't experts in the field and don't have a bunch of code lying around. And if you're not currently doing the INSPIRE simplified data set linkage, I can tell you what you need to add, the capabilities links and what they should look like, to go inside your data set record.
23:45
And this really is a huge win, because it's so easy to use when they're set up this way. And most of the hard work is already done. You don't have to redo it; I've already done that. So that's really good. It's open source, so you can just download it and go. And to be honest,
24:01
there's still a lot of work to do, and obviously I can't make magic connections between data sets and metadata when they aren't already present at some level. But I really think that tools based on this can make data managers' and editors' lives way, way easier, and the resulting data and
24:24
metadata products much easier for users and much easier for the people actually creating that data and metadata. What more can you want? Thank you very much, Dave. We're gonna have to go directly to our next speaker, but thank you very much.