
Spatial data and the Search engines


Formal Metadata

Title
Spatial data and the Search engines
Series Title
Part
184
Number of Parts
193
Author
License
CC Attribution 3.0 Germany:
You may use, modify, copy, distribute and make publicly available the work or its content, in unchanged or modified form, for any legal purpose, provided that you credit the author/rights holder in the manner they have specified.
Identifiers
Publisher
Year of Publication
Language

Content Metadata

Subject Area
Genre
Abstract
Why is it so hard to discover spatial data on search engines? In this talk we'll introduce you to an architectural SDI approach based on FOSS4G components that will enable you to unlock your current SDI to search engines and the web in general. The approach is based on creating a smart proxy layer on top of CSW and WFS which allows search engines (and search engine users) to crawl CSW and WFS as ordinary web pages. The research and development to facilitate this approach was carried out in the scope of the testbed "Spatial data on the web", organised by Geonovum in the first months of 2016. The developments are embedded in existing FOSS4G components (GeoNetwork) or newly released as open source software (LDproxy). We'll introduce you to aspects of improving search engine indexing and ranking, setting up a URI strategy for your SDI, the importance of URI persistence, and introducing and testing the schema.org ontology for (meta)data. We'll explain that this approach can also be used in the context of linked data and programmable data, but it is important not to mix them up. María Arias de Reyna (GeoCat bv), Clemens Portele (interactive instruments), Joana Simoes (GeoCat), Lieke Verhelst (Linked Data Factory), Paul van Genuchten (GeoCat bv)
Keywords
Transcript: English (automatically generated)
Hello, welcome to this very nice room with a few of you. So, who of you was yesterday in the DCAT presentation I gave?
Just a couple, okay. So there are a couple of duplicated slides, because, well, you'll see in the slides there are a lot of relations between these communities. The programme says María again, but it's me again. And it's Clemens Portele and Lieke Verhelst.
She was not able to be at the conference. So, this is a slide from Linda van den Brink from Geonovum, the Dutch SDI organisation, which she showed us as a kickoff of the spatial data on the web testbed
that was running in the Netherlands at the beginning of this year; the next phase is actually running now. So, as a spatial community, we're quite disconnected from the rest of the web, and in the testbed we mostly focused on the search engine spiders.
All our CSW and WFS services, the services that we use most, are quite invisible to search engines. So, she presented that with a big wall. Also, the users that are not aware of the OGC services
also get kind of... yeah, frustrated, or they just don't understand. So, I mentioned Geonovum; the other group where Geonovum is actively participating,
and also Clemens, is the joint OGC and W3C working group, which was set up last year and wants to gather a set of best practices to bring those two worlds together. That picture is from one of the presentations, I don't actually know who owns it,
but it presents Mr Globe and Mrs Cube, or the other way around, it's Mrs Globe and Mr Cube, yeah. So, there were four topics in the testbed, modern ways of data publication, a usable spatial data publication platform... I'll go into them one by one. So, topic one was cancelled because no sensible proposals were made.
But they redefined the topic, and that is phase two, which is currently running. Topic two was a usable spatial data publication platform. I'm not going too much into that. They identified CartoDB as the platform to use.
The third one was crawlable geospatial data using the ecosystem of the web. So, they took spatial data and defined best practices from the API kind of world.
What would be the best practice there? And they have a lot of interesting code here on GitHub. You can go to that and check it out. The main conclusion is that they introduce these interesting terms like developer experience instead of user experience, because they focus on developers
and this thing, time to first successful service call. I like those terms. Their APIs are very much based on Swagger, which seems to be becoming kind of a standard within the API world.
One of their findings was that search engines have limited support for content negotiation; the Swagger specs use a lot of content negotiation, but this is apparently not used by search engines. Well, it aligns with our findings, but search engines are just quite unpredictable.
Then I come to our topic, and this is where I hand the slides, and the mic, to Clemens. OK, yeah, so our work was really a research topic, that's why it's called a research topic. It's quite similar to topic three, because the idea is
how can we make spatial data accessible on the web, crawlable by search engines, usable via APIs. But topic three really didn't use anything from the spatial data infrastructure, whereas our topic was focused more on the aspect: OK, we have this spatial data infrastructure with CSW,
like we've heard in the presentation before, WFS, WMS, and we then have GIS software clients that can connect to that, and developers who are familiar with the OGC standards can also use it. So we reach those users, but the question was,
how can we get through the wall to the rest of the community? So the approach that we proposed, and that was accepted, was that we actually introduce a proxy layer, right? An additional layer with transparent proxies, so we don't really cache the data, et cetera,
but what we do is have intermediate components that on the back end act as clients to CSW and WFS. Paul will talk more about the metadata and CSW part that you see on the left-hand side, and I'll talk a little bit more about the part in the centre,
which actually uses the data. So there we try to map the principles that we have in spatial data infrastructures, which support very complex things that the GI people are interested in, to the simpler things and different representations
that the web is interested in. The result of that is also an open source project, LD proxy, the linked data proxy, which is available. It supports just WGS 84, it supports schema.org as the mechanism that search engines understand,
and it also supports content negotiation. Most importantly for crawlability, we make links to every feature in the WFS. So the address database has 8 or 9 million features and you can click through each of them.
And that makes it indexable by search engines, if the search engine will index everything. As he mentioned, there's some unpredictability about how quickly they do that or what they do, so there are still quite a few open questions related to that. But with that, we reach the search engines,
and by not making the data available just in HTML but also as GeoJSON, JSON-LD, et cetera, we can also make it available to developers using web APIs, and we also established links. So this is what you get if you go to Google; we published everything on ldproxy.net.
That's a deployment of the LD proxy software, and you can currently find 18,500 results in that set. And when you click on one of these links, so for each, if you click through from the landing page,
you can see at the top the INSPIRE address WFS. That's basically the entry page, the capabilities document, so to speak, in WFS terms. Then you click through to the feature type, the addresses, and then you can go through pages, through all the data, and then you can go to individual features
like the one, the address that you see here. And you can then also find it using searches. What we also have done is to create a Docker image, and you can get that, use Docker,
get the software up and running and connect it to your WFS in a few minutes. So that's one thing that is available to make it possible for you to try it out very quickly with your WFSs. We try to support as many WFSs as possible,
even with the limitations that some of them have. But we're also interested in hearing about your experiences. The code is on GitHub, so you can use that too. And with that, I'll hand back to Paul.
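To give a feel for what such a proxied layer looks like to a crawler or a developer, here is a minimal sketch; the service path, feature type, feature id and the "f" query parameter are assumptions made for illustration, not the documented LD proxy API.

```python
# Minimal sketch of fetching one proxied WFS feature in two representations.
# The service path, feature type, feature id and the "f" parameter are
# illustrative assumptions; check the ldproxy documentation for the real API.
import requests

BASE = "https://www.ldproxy.net/example-address-service"   # hypothetical service path
feature_url = f"{BASE}/address/12345"                      # hypothetical feature type and id

# What a search engine spider fetches: plain HTML with schema.org markup in it.
html = requests.get(feature_url, headers={"Accept": "text/html"})
print(html.status_code, html.headers.get("Content-Type"))

# What a developer fetches: the same feature as GeoJSON, requested with an
# explicit format parameter rather than content negotiation, since search
# engines barely use the latter.
geojson = requests.get(feature_url, params={"f": "json"})
print(geojson.status_code, geojson.headers.get("Content-Type"))
```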
Yes, so this is a similar experience, but for the CSW part. If you look in Google for a certain dataset title, you'll find, in this case, Open Data Cat, which was the instance that we used as a test. It will give you the dataset metadata.
So that will bring you to an HTML page showing that data. This is a slide that I showed yesterday too. We identified, more or less, there are probably a lot more, four main data communities, each of which uses its own kind of metadata standard.
So our world uses ISO 19115, and here on the search engine side we have the schema.org Dataset ontology. The DCAT ontology, that was my presentation yesterday,
I was focusing on that one. But there's another one, I don't have a slide set for it, but I think it's a very interesting community also: the linked data community, the linked open data community. The second line here is also interesting, because both in the search engine world and especially in the linked data world,
linking to common vocabularies is very important to be discoverable and to be connected. What you see here in the DCAT world is that a lot of governments define code lists that have a legal background,
which completely makes sense in their legal world. But the risk is that they're quite disconnected from the other communities, which use DBpedia as kind of the centre of their cloud, or the Google Knowledge Graph for the Google search engine.
So the point here is that if you want to make your data available, then make sure that links between these vocabularies exist, or serve your data for each of these communities if you have the capability, because that allows each of these communities to consume the data.
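As a small illustration of such a cross-vocabulary link, the sketch below describes a code-list concept that also points at its DBpedia counterpart; all URIs are placeholders, and SKOS is used here simply as one common way to express the mapping.

```python
# Sketch of linking a government code-list concept to the vocabulary other
# communities use (here DBpedia), so both worlds can follow the link.
# All URIs are illustrative placeholders.
import json

concept = {
    "@context": {"skos": "http://www.w3.org/2004/02/skos/core#"},
    "@id": "https://example.gov/codelist/theme/addresses",              # hypothetical legal code list
    "@type": "skos:Concept",
    "skos:prefLabel": "Addresses",
    "skos:exactMatch": {"@id": "http://dbpedia.org/resource/Address"},  # illustrative DBpedia URI
}
print(json.dumps(concept, indent=2))
```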
So, a couple of slides about the schema.org Dataset. Schema.org is an initiative of the main search engines to have a shared ontology for things. But there's also a community website where anybody can participate to develop the standard further.
This is what it looks like on one of those popular search engines. If I look for this building, I would usually find this type of result. But in this case, the search engine has stored some structured data
about the thing, so it knows, okay, this is a building. And this is then modelled in schema.org. So it would be really great if we had the same for datasets. However, this search engine hasn't implemented that yet.
So we publish a schema.org Dataset annotation in our HTML, and then the search engine would see that we're representing a dataset there, and could then easily make a nice dataset summary here. Well, who knows what's to come.
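For a concrete idea of what such an annotation could look like, here is a minimal schema.org Dataset description of the kind that can be embedded in a record's HTML inside a script element of type application/ld+json; the names and URLs are placeholders, not a real record.

```python
# Minimal schema.org Dataset annotation as it could be embedded in the HTML
# of a metadata record page. All names and URLs are placeholders.
import json

dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Addresses (example dataset)",
    "description": "Example description, typically taken from the metadata abstract.",
    "url": "https://example.org/catalogue/record/abc-123",
    "keywords": ["addresses", "INSPIRE"],
    "distribution": [{
        "@type": "DataDownload",
        "encodingFormat": "application/geo+json",
        "contentUrl": "https://example.org/data/addresses.json",
    }],
}

print('<script type="application/ld+json">')
print(json.dumps(dataset, indent=2))
print("</script>")
```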
This is an interesting tool, the structured data testing tool. It helps you to see, for the HTML that you created in your catalogue or in your proxy layer, what the crawler is able to extract from it
as structured data. This is where the code is for the schema.org mapping of ISO 19139. You see it's in a schema plug-in of GeoNetwork, so it's not in the core GeoNetwork code itself; any national profile plug-in can have its own customised schema.org or DCAT mapping.
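The actual mapping in GeoNetwork lives in the schema plug-in as an XSLT formatter; purely to illustrate the idea, here is a Python sketch that pulls the title and abstract out of an ISO 19139 record and places them in the corresponding schema.org Dataset fields (simplified element paths, dataset records only).

```python
# Illustration only: the real GeoNetwork mapping is an XSLT formatter in the
# schema plug-in. This sketch just shows which ISO 19139 elements typically
# feed which schema.org Dataset properties.
import xml.etree.ElementTree as ET

NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}

TITLE = (".//gmd:identificationInfo/gmd:MD_DataIdentification/gmd:citation/"
         "gmd:CI_Citation/gmd:title/gco:CharacterString")
ABSTRACT = (".//gmd:identificationInfo/gmd:MD_DataIdentification/"
            "gmd:abstract/gco:CharacterString")

def iso_to_schema_org(iso_xml: str) -> dict:
    """Map an ISO 19139 record to a (very) minimal schema.org Dataset dict."""
    root = ET.fromstring(iso_xml)

    def text(path: str) -> str:
        el = root.find(path, NS)
        return el.text.strip() if el is not None and el.text else ""

    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": text(TITLE),            # gmd:title    -> schema:name
        "description": text(ABSTRACT),  # gmd:abstract -> schema:description
    }
```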
Some challenges; we already identified a couple of them. Persistent identifiers for WFS records: there is an INSPIRE ID, so in theory this could be done if everybody filled in that unique field in the WFS.
if everybody would fill that unique field in the WFS. But we've seen a lot of WFSs which don't have a unique identifier. Then it's really hard to create persistent, unique URIs for WFS records.
We've also seen a lot of bugs and dead links in existing catalogues and existing WFSs, and Google really punishes you for that directly. It says, OK, I found 2,000 dead links on that page, that page, that page, because it goes through all these features.
Content negotiation: you really don't know what you'll get. We found that content negotiation is really unreliable with search engines, so if you want to feed data to a search engine, just give it an explicit URL for an explicit format, in this case HTML with RDFa.
Yes, so another challenge we had is that we want to keep the relations which exist in the WFS also in the proxy. And then you have to make sure that the dataset
that it links to is also exposed via an LD proxy and then linked to via the LD proxy URL. So topic one is still continuing, now as topics five and six. Triply and Alterra took on those topics.
Their mission was to build on the findings that we made and bring them into practice. And it's really nice to see that. Their work will be presented on September 8th, but they have already put out some things.
This is the LOD Laundromat from Triply. It's a big spider for linked data; it spiders the whole web to find triples, and they already have a lot of triples. And they set up the LOD Laundromat to harvest the LD proxy data
into a triple store. So all that WFS data that is exposed by LD proxy is now harvested into this one. And this one has a SPARQL endpoint. Search engines can't answer questions like: give me everything within a kilometre radius that has opening hours between this and that,
these difficult questions, there's no interface on a search engine to query them. But there is one here. So we're really keen to see, in a couple of weeks, the results of what they found.
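Just to sketch the kind of question such a SPARQL endpoint can answer and a search box cannot, here is a rough example; the endpoint URL, property names and the crude bounding-box filter are all assumptions for illustration, since a real query would depend on how the harvested data is modelled and a true distance filter would need GeoSPARQL support.

```python
# Sketch of querying a triple store over the harvested LD proxy data.
# Endpoint URL and properties are hypothetical; the bounding box stands in
# for a real "within a kilometre" filter, which would need GeoSPARQL.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://example.org/sparql")  # hypothetical endpoint
sparql.setQuery("""
PREFIX schema: <http://schema.org/>
SELECT ?place ?hours WHERE {
  ?place schema:openingHours ?hours ;
         schema:geo ?geo .
  ?geo schema:latitude ?lat ;
       schema:longitude ?lon .
  FILTER(?lat > 52.084 && ?lat < 52.102 && ?lon > 5.104 && ?lon < 5.130)
}
LIMIT 20
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["place"]["value"], row["hours"]["value"])
```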
So, being found on a search engine is not the same as linked data. Those are two different communities, so if you want to target those audiences, you have to set things up specifically, with special rules for each of those communities.
Being exposed via a search engine also helps you, because the search engine will tell you which of your data is used and is valuable. Well, of course, the LD proxy approach is interesting,
for example, for buildings and street lanterns, but it will not be valid for every dataset, because it doesn't make sense to have a coverage pixel as a single web page. So, some resources.
That was our story. Thank you, Paul. Thank you, Clemens. There is plenty of time for questions. So if I understood correctly,
the LD proxy is working with WFS only. So would it be possible to extend it to use other back ends, like REST-based APIs and things like that?
Yes, of course, it would be. But we basically use the WFS client, because that's the data that we have in the SDI. And we were actually reusing existing code that we had and just made it available open source, because we also have another development
that actually provides the Esri ArcGIS API. But now we changed the upper end to make it available to the web, and on the lower end we have the WFS client. But you could easily also create other clients if you have specific APIs that you want to support.
I have a question for the audience. So, is anybody here from, like, the deegree, GeoServer or MapServer communities? Because this is a proxy approach, so it's not the final goal.
The final goal is that the actual server implementations implement these APIs. So I would challenge the GeoServer and MapServer community to think about, okay, are OGC standards the only ones that we want to support? Or do we also want to support this type of APIs?
And maybe somebody has an opinion on that. Does anyone want to comment on that? I'm not from the GeoServer team, but I know that GeoServer has a community extension,
I guess, if I remember correctly, which publishes WFS data directly as HTML that can be crawled by search engines. But I don't know if it goes as far as this proxy. That's all I know.
Thanks for your talk. I found on the GitHub repo that you propose an extension for schema.org, and perhaps, if one of you is involved in that,
you could elaborate on this: why there's a need for a specific geo extension for the geographical domain, and how it compares to the existing geo vocabulary in schema.org. Thanks. It's great to hear that somebody actually read our report.
Yeah, so what we found is that the current location model in schema.org is just limited. It actually has errors. For example, it is not defined what the separator between a lat and a long is.
So it's very basic. And also, well, it's a shame that Lieke is not here, because she did that work. She proposed an extension, or optimisation, of the current schema.org location model.
I think one of the concerns that Lieke had was that how it has been defined isn't really consistent with how you would structure it from a linked data point of view. We had a discussion with Dan Brickley from Google, who is, I think, the main guy behind schema.org, I would say.
And there have been ongoing discussions in the schema.org community about the geo extension for a long time. My guess is, I think they know that it's kind of broken and they're trying to see what it should be, but they are still trying to find out what the right way is.
I don't think it will go the way that Lieke has proposed in our report. And I think it's more about the discussion that is also currently going on, if you look at how GeoJSON does things; at the same time, if you want to move that to JSON-LD, so to a more RDF-like representation, that doesn't work.
So there is ongoing discussion between these communities as well, and I think that will also influence how schema.org will handle geo in the future. Any more questions or comments? That's not the case, then.