Data.gov/Geoplatform.gov CSW implementation through pycsw and CKAN integration

Formal Metadata

Title
Data.gov/Geoplatform.gov CSW implementation through pycsw and CKAN integration
Series title
Number of parts
188
Author
License
CC Attribution 3.0 Germany:
You are free to use, adapt, copy, distribute and make publicly available the work or its content, in unchanged or adapted form, for any legal purpose, provided that you credit the author/rights holder in the manner they specify.
Identifiers
Publisher
Release year
Language
Producer
Production year
2014
Production place
Portland, Oregon, United States of America

Content Metadata

Subject area
Genre
Abstract
This presentation will discuss the implementation of the CSW endpoint using pycsw within the Data.gov infrastructure (architecture/enhancements/testing/deployment) and CKAN, which powers Data.gov. CSW (Catalogue Service for the Web) is an OGC (Open Geospatial Consortium) specification that defines common interfaces to discover, browse, and query metadata about data, services, and other potential resources. Data.gov provides access to its catalog via the CSW standard for both first-order and all metadata for harvested data, services and applications. Data may be referenced from federal, state, local, tribal, academic, commercial, or non-profit organizations. The first-order CSW endpoint provides collection level filtering of all metadata records. The all metadata CSW endpoint provides all levels of metadata at varying levels of granularity. Any client supporting CSW (desktop, GIS, web application, client library, etc.) can integrate the Data.gov CSW endpoints.
Transcript: English (automatically generated)
I'm here to present the data.gov CSW implementation, which is based on pycsw and also uses CKAN integration.
So in the next minutes, I'm going to show you how we implemented this project during the last six months. The presentation outline today: I'm going to give a short introduction to CKAN and pycsw, and then I'm going to talk about the components of the data.gov portal, which features we implemented during this project, and how the configuration was done.
Then I'm going to show you some demonstrations of how to access the data and how to search the catalog, and what is coming in the next months.
So basically, data.gov is the home of the U.S. government's open data. This is actually the second iteration of the project.
There you are able to find federal, state, and local data and resources in order to conduct research, build applications, and do visualizations or whatever you want with the data, since it's open data.
The data.gov project right now is run by the General Services Administration, and it's now completely open source. So you can find the source code on GitHub, and you can make modifications to it.
The new portal, version two, is based on CKAN, which is an abbreviation for Comprehensive Knowledge Archive Network. It is an open source web platform for publishing and sharing open data, and it has a really impressive history of deployments so far.
These are just a few of the well-known deployments: the EU open data portal, data.gov in the U.K., in Australia, and in other places. And there's also an extension of CKAN which gives CKAN geospatial capabilities.
It is based on PostGIS, and it has integrated OpenLayers. Actually, now it also has Leaflet support. It can also access OGC services through OWSLib, and it now has support for CSW through pycsw. And lately, GeoJSON support has been added.
So pycsw is an OGC CSW server implemented in Python. It's an open source project, and we are using the MIT license.
Currently, pycsw is fully certified by OGC, and we are now a reference implementation of CSW, and also under OSGeo incubation.
So this is an overview of the history of the project. Initially, there were discussions about integrating CKAN with pycsw, basically because both projects are Python-based. And since pycsw can be used as a library, there were initial discussions in 2011 about doing such an integration.
Then in 2012, OKFN and the data.gov team partnered up to build this version of data.gov. Then OKFN implemented the first prototype during the first months of 2013.
And then GSA took over the project and the extensions that were developed in order to continue and bring the project to deployment and to a production state. At some point, CKAN had an internal implementation of CSW, which was used in the UK.
But there were some issues with it. So they finally made the choice to drop the internal plug-in and use pycsw as the official CSW library.
Then, when the pycsw team became actively involved in this implementation, we started working on new features that would cover the needs of data.gov, things like full text search, repository filtering, and connection pooling. I will talk about that later.
And then in early 2014 we released pycsw 1.8, which was based upon the work we did for data.gov.
And now, on Saturday, we're going to release pycsw 1.10 during the code sprint here. This is the version that is going to be updated on data.gov also. It brings new features like OpenSearch. I'll talk about that later in more detail.
So the goals of the project from our perspective were: to be able to deploy pycsw as a WSGI application directly at the URL you see over there; to be able to synchronize the metadata between CKAN and pycsw, because the two projects have different database schemas;
and then to provide collection level support, because CKAN for data.gov now uses a special extension that provides data collections. Then we had the big issue of making this run very fast.
We had to optimize for performance during this project. And we did some documentation and worked on the deployment configuration. So this is a short diagram of how this project is implemented.
You can see that the core software is CKAN, which works on top of Postgres and PostGIS in order to be able to do spatial queries. But in order to perform fast queries, CKAN uses Solr.
The latest versions of CKAN can actually do spatial queries through Solr, but PostGIS is still used because it has more features.
Also, as you can see here, there are many extensions in CKAN on the left. The geodatagov extension was implemented directly for this project. And there are other plugins, like the harvesters and the spatial extension. I'll talk later about how we deploy using Ansible and RPMs.
So pycsw was used as a search engine next to Solr in order to provide the CSW interface.
CKAN is very successful because it's targeting governments, companies and organizations, and they're looking at all kinds of open data, not only geospatial. So CKAN is not only geospatial software.
The good thing is that it provides loosely coupled services, which can be turned on and off according to needs. It has been a very successful project, and it has capabilities like publishing and finding data sets, storing data, creating networks of federated nodes, harvesting between different sources of data and metadata, and editing and managing metadata.
It also has a very strong API, which can be used directly to give access to the data.
I have some screenshots of how the original CKAN looks, where you can do searches through the user interface. CKAN is a Pylons project, so it is Python-based.
Search and discovery is done through this user interface, where you just type keywords and then you get results. Based on the results, you can refine your searches, or you can use keywords or topic categories, which are provided by the user interface.
You can also look into the metadata that is provided with the data. CKAN is capable of using many kinds of metadata, and it can harvest from many sources. This is a very powerful feature of CKAN.
There's also a feature for visualizing geospatial data sources directly on a map. And it can do the same not only for maps but also for grid data, and it can create graphs.
So the spatial extension specifically adds one spatial column to the Postgres schema of CKAN, and it uses that to perform queries and display the results in the front end.
For this project, for data.gov, we tried to use only ISO XML metadata. So when we harvested from various other catalogs, we were doing transformations to ISO in order to have a unified XML representation of all the metadata.
And this is the UI as it is implemented today. There are some extra capabilities added to the CKAN core: as you see, the spatial extension adds the map on the left, where you can create bounding box searches.
You can refine the search with keywords. Another nice feature is that it can rank results by relevance to the bounding box. So it will first give you the results that best fit the bounding box that you gave.
This is a nice way to see more of the data that you are expecting to get. So this is also how it performs the filtering that I told you about.
Then, when you find the data set that you are looking for, it provides all the resources and the metadata so that you can view and download the data. From this page, which is the resource page, you can go directly to a map and see a WMS, for example, if there is such a resource.
Or you can click on the original XML file and you can directly look at it. But there are also HTML viewers for that. So you have different choices in terms of how you see your data and metadata.
A big topic in this project was being able to visualize many different kinds of resources. So things like WMS and WFS are supported. You can download shapefiles, or you can even download spreadsheets directly from the user interface.
Also, as part of CKAN, there is a data set preview extension where you can see the spatial data within the viewer, if that is possible from the original resource.
And this is how you get to see the original XML file if you click on the URL. So this is the CSW interface, which was implemented directly in the CKAN deployment.
Here we see a little more than 400,000 records within the CSW endpoint. So what are the features that we implemented?
Basically, we needed to replicate the way that data.gov does collections. Instead of showing the end user 400,000 records, data.gov uses collections as an intermediate layer. So you get to search something like 80,000 to 90,000 records, which are actually collections of metadata.
We needed to be able to do that. This is why we implemented a filtering process, where you have your catalog with 400,000 records, and you can filter those with an SQL query depending on whether a record is a collection data set or not.
This was needed because the CSW had to behave similarly to the CKAN search.
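The filtering step described here can be sketched roughly as follows. This is a minimal illustration using Python's built-in sqlite3 module, not the actual data.gov setup: the table name `records`, the `is_collection` column, the `COLLECTION_FILTER` predicate and the sample rows are all assumptions made up for the example.

```python
import sqlite3

# Hypothetical repository table; the column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records ("
    "identifier TEXT PRIMARY KEY, title TEXT, is_collection INTEGER)"
)
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [
        ("c1", "Roads collection", 1),
        ("r1", "Roads 2012", 0),
        ("r2", "Roads 2013", 0),
        ("c2", "Hydrography collection", 1),
    ],
)

# Assumed fixed predicate that a collection-level endpoint adds to every query.
COLLECTION_FILTER = "is_collection = 1"


def search(conn, keyword, repository_filter=None):
    """Return identifiers whose title contains keyword.

    repository_filter is a trusted, server-side SQL predicate (never user
    input) that restricts every search to a subset of the catalog.
    """
    sql = "SELECT identifier FROM records WHERE title LIKE ?"
    if repository_filter:
        sql += " AND (" + repository_filter + ")"
    sql += " ORDER BY identifier"
    return [row[0] for row in conn.execute(sql, ("%" + keyword + "%",))]


all_hits = search(conn, "Roads")
collection_hits = search(conn, "Roads", COLLECTION_FILTER)
print(all_hits)
print(collection_hits)
```

The same search function serves both cases; only the fixed predicate differs, which is how one catalog can back two endpoints with different granularity.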
Then we did some work on database connection pooling for WSGI in order to be able to deploy within the data.gov environment. And the most important feature in this process was the Postgres full text search.
This was the feature that made the CSW endpoint really, really fast. So there's no problem searching those 400,000 records or even more, because Postgres has this nice feature that can index all the data.
Then we did some work on link type detection. The big problem is that on data.gov you get resources from all those organizations around the US, where you have data like shapefiles within zip archives, and you don't know what is inside. So this was a pretty tough problem to solve. Actually, we didn't solve it, but we worked towards managing it.
Another problem is when you have a WMS which doesn't have the word WMS in the URL. What do you do then? You might get OWS. What is that? Is it WMS? Is it WFS? There were problems like that, and the original metadata provided by many organizations didn't have the link type in it.
So we didn't know what kind of data it was. We made some heuristics in order to solve the easy cases, but some other cases were not solved. And now we need to ask the organizations to provide new metadata, which is something that nobody wants to do, actually.
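To give a feel for what such heuristics might look like, here is a small sketch. The rules, the `guess_link_type` function and the returned labels are hypothetical and are not the heuristics used in the data.gov deployment; they only cover the "easy cases" mentioned above and return `None` for everything else.

```python
from urllib.parse import urlparse, parse_qs

# Illustrative rules only; labels and extensions are assumptions.
EXTENSION_TYPES = {
    ".zip": "archive (contents unknown)",
    ".kml": "KML",
    ".csv": "CSV",
    ".xml": "XML",
}
OGC_SERVICES = ("WMS", "WFS", "WCS", "CSW")


def guess_link_type(url):
    """Guess a resource type from a URL; None when no easy rule applies."""
    parsed = urlparse(url)
    # Parameter names in OGC key-value requests are case-insensitive.
    params = {k.lower(): v[0] for k, v in parse_qs(parsed.query).items()}
    service = params.get("service", "").upper()
    if service in OGC_SERVICES:
        return "OGC:" + service
    path = parsed.path.lower().rstrip("/")
    # A path that ends in /wms, /wfs, ... is a reasonable hint.
    for name in OGC_SERVICES:
        if path.endswith("/" + name.lower()):
            return "OGC:" + name
    for ext, link_type in EXTENSION_TYPES.items():
        if path.endswith(ext):
            return link_type
    # Generic endpoints such as .../ows stay unresolved without better metadata.
    return None


print(guess_link_type("http://example.org/geoserver/ows?SERVICE=WMS&request=GetCapabilities"))
print(guess_link_type("http://example.org/geoserver/ows"))
print(guess_link_type("http://example.org/data/roads.zip"))
```

The second call shows the unsolved case from the talk: a bare OWS endpoint gives no hint whether it is WMS or WFS, so the only real fix is a link type in the source metadata.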
And the last addition to our features is that we are the first to implement the OGC OpenSearch Geo and Time extensions, because it came out about two months before we released. And it was something that people wanted.
This is the extension that lets you do OpenSearch queries providing a bounding box and not only keywords. So this was implemented in pycsw, and it is going to be released on Saturday.
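A keyword-plus-bounding-box query of this kind might be built as in the sketch below. The endpoint `https://example.org/csw` is a placeholder, and the parameter names (`mode=opensearch`, `q`, `bbox`) are assumptions that should be checked against the pycsw documentation for the deployed version.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Placeholder endpoint; substitute the real CSW URL of the catalog.
ENDPOINT = "https://example.org/csw"


def opensearch_url(keywords, bbox=None, endpoint=ENDPOINT):
    """Build a GET URL for a keyword search with an optional bounding box.

    bbox is (min_x, min_y, max_x, max_y) in longitude/latitude order.
    Parameter names are assumed, not taken from the talk.
    """
    params = {
        "mode": "opensearch",
        "service": "CSW",
        "version": "2.0.2",
        "request": "GetRecords",
        "typenames": "csw:Record",
        "elementsetname": "brief",
        "resulttype": "results",
        "q": keywords,
    }
    if bbox is not None:
        params["bbox"] = ",".join(str(v) for v in bbox)
    return endpoint + "?" + urlencode(params)


url = opensearch_url("roads", bbox=(-125, 24, -66, 50))
print(url)
# Parse the URL back to confirm the encoded values round-trip.
decoded = parse_qs(urlparse(url).query)
print(decoded["bbox"])
```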
But it is already in the data.gov CSW deployment. Apart from what we did for this project, I'm just showing you some features that pycsw already has, like harvesting WMS and WCS and all those standards.
We also support ISO. We support FGDC. We implemented INSPIRE. We have been using the INSPIRE documentation to implement the service.
And we also support many databases. We can work off SQLite or Postgres or whatever is possible through SQLAlchemy. More features.
We try to keep it simple. The configuration of pycsw can be done in four minutes. It's actually very simple to do. And we have a very extensible plug-in architecture: if you have a database with your own metadata schema, you can create a plug-in so that pycsw can understand your schema and respond to queries according to your database.
This requires a bit of coding, though. It is already integrated with portals and other Python projects, like GeoNode and Open Data Catalog.
We can also do real-time XML schema validation and things like that. And these are the standards that we currently support. I won't go through the list. It's a long list. So how did we actually configure and deploy it?
It was a long process. It was not something that was done in a few days. Also, we didn't have any access to the physical machines. We had to provide every step of the deployment by email, or in some similar way, because we didn't have access.
So we had to automate every single thing that would otherwise be done through a terminal. Actually, there was already work done there: the data.gov people were using Ansible for that.
So we just picked that up, and we provided Ansible scripts to do the deployment. The tricky part is that Ansible is not used directly on the servers. Ansible is used to create the RPM packages that are then deployed to the GSA servers.
So it was a two-level deployment scheme, where we first create the RPM packages, and then those RPM packages are sent to the production servers. This way, every month or so, GSA updates the portal with new patches and new features. So this is an easy way for the administrators to upgrade their system.
There are three types of clusters. We have the database cluster and the front end, which is CKAN and pycsw.
And we have the cluster for the harvesters, because the harvesters are doing the hard work. They're harvesting from every source of metadata that is available around the US. So they needed a specific set of hardware. And we are using CentOS as the operating system.
Now, OK, the catalog is here. How do we use it? We have CKAN, so we can do it through the UI. But why do we have CSW? How can we use the CSW? Part of the project was to document this process.
This is why we created a set of documents so that users can first understand and then make some simple CSW requests. So here there are some links to the documentation.
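As an illustration of what a simple CSW request looks like, the sketch below builds a CSW 2.0.2 GetRecords body with a full text filter using only the standard library. It is not taken from the data.gov documentation; the function name and default values are assumptions, while the element names and namespaces follow the CSW 2.0.2 and OGC Filter 1.1 schemas.

```python
import xml.etree.ElementTree as ET

CSW = "http://www.opengis.net/cat/csw/2.0.2"
OGC = "http://www.opengis.net/ogc"
ET.register_namespace("csw", CSW)
ET.register_namespace("ogc", OGC)


def get_records_request(keyword, max_records=10):
    """Build a GetRecords POST body that searches all text for keyword."""
    root = ET.Element(
        "{%s}GetRecords" % CSW,
        {
            "service": "CSW",
            "version": "2.0.2",
            "resultType": "results",
            "maxRecords": str(max_records),
        },
    )
    query = ET.SubElement(root, "{%s}Query" % CSW, {"typeNames": "csw:Record"})
    ET.SubElement(query, "{%s}ElementSetName" % CSW).text = "brief"
    constraint = ET.SubElement(query, "{%s}Constraint" % CSW, {"version": "1.1.0"})
    flt = ET.SubElement(constraint, "{%s}Filter" % OGC)
    like = ET.SubElement(
        flt,
        "{%s}PropertyIsLike" % OGC,
        {"wildCard": "%", "singleChar": "_", "escapeChar": "\\"},
    )
    ET.SubElement(like, "{%s}PropertyName" % OGC).text = "csw:AnyText"
    ET.SubElement(like, "{%s}Literal" % OGC).text = "%" + keyword + "%"
    return ET.tostring(root, encoding="unicode")


body = get_records_request("roads")
print(body)
```

The resulting XML would be sent with an HTTP POST and a `Content-Type: application/xml` header to the CSW endpoint; a client library such as OWSLib wraps this step.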
Actually, this presentation is available on the internet. I will give you the URL on the last slide. But there are also many other tools that somebody can use to access the data. One of them is QGIS, where there is the MetaSearch plug-in, which is now a core QGIS plug-in.
So whenever you install QGIS 2.4 or a later version, when that comes, you are able to use the MetaSearch plug-in to search the metadata from data.gov or any other CSW server out there.
You can also use any other CSW client that you have. I already told you how data.gov organizes the data in collections. So these are the two endpoints that we provided for the CSW implementation.
We have the first-order endpoint with collection level filtering. This is the URL where you can search the collections. Then, once you have searched the collections, you can specify the ID of a collection in order to go to the second endpoint and search within the collection.
So there are actually two interfaces instead of one, just to be able to reproduce the workflow that you can do in the CKAN UI.
Since data.gov uses CKAN and pycsw, others became very interested in that. Suddenly, we hear from other organizations that want to reproduce this installation. I have met many people here. I'm very happy to be here and to be part of this process.
And a few weeks ago, NOAA deployed the CSW extension there. And you see, there are more coming. So we have also a live deployment map where you can find who is doing that.
What did we learn from this process? The first thing we learned is that you have to optimize your database. This cannot be done without database tuning. Databases need to be really fine-tuned in order to do this kind of heavy lifting of metadata and searches.
Also, it's very important to create packages, and you have to be very careful with the dependencies and how you deploy this software onto your production machines.
So it all came down to using the full text search of Postgres in order to be really, really fast. And this was also part of the upgrade of data.gov to the latest Postgres 9.3, which actually was very fast.
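To make the full text search point a little more concrete, here is a rough sketch of the kind of query involved. It only builds the SQL text and parameters, so it runs without a database. The table name `records`, the columns `anytext`, `identifier` and `title`, and the index name are assumptions for illustration; `to_tsvector`, `plainto_tsquery`, the `@@` operator and GIN indexes are standard PostgreSQL features.

```python
# SQL that would be run once so keyword searches can use an index
# instead of scanning every record. Names are assumed for this sketch.
CREATE_INDEX_SQL = (
    "CREATE INDEX records_fts_idx ON records "
    "USING gin (to_tsvector('english', anytext))"
)


def fts_query(keywords, limit=10):
    """Return (sql, params) for a parameterized full text search.

    The %s placeholders are the style accepted by psycopg2; the values are
    passed separately so the driver handles quoting.
    """
    sql = (
        "SELECT identifier, title FROM records "
        "WHERE to_tsvector('english', anytext) "
        "@@ plainto_tsquery('english', %s) "
        "LIMIT %s"
    )
    return sql, (keywords, limit)


sql, params = fts_query("hydrography rivers")
print(CREATE_INDEX_SQL)
print(sql)
print(params)
```

The expression in the `WHERE` clause matches the indexed expression exactly, which is what allows PostgreSQL to answer the query from the GIN index.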
Another thing that we learned was that identifiers are very, very important. What I mean by that is that the original XML records from all the sources that we harvest have some identifiers in order to be able to index the data.
CKAN does something a bit different. When it harvests, it creates new identifiers for the records in order to manage the metadata internally.
And that was very important, because we had to find ways to synchronize, or couple, the identifiers, the original ones and the new ones. So this was a bit of trouble. But in the end, it was made possible.
There were also some problems where the original metadata was not the best we could find. I mean, there were issues there, and sometimes we had to ask for updates. This is a work in progress. Some issues are still there, but they are going to be fixed eventually.
So in the future, what we want to do is a deeper integration, in terms of not synchronizing between two databases but using only one database. This is also a work in progress.
The next big thing is going to be CSW 3, which is not yet released by OGC. But we are already working on it, and I think at some point in the next few months it will be released.
So we have some time here to show you a couple of things, like the actual UI and how you can perform searches directly on CKAN. So this is the data.gov portal. We don't have enough time? OK. I have two minutes? OK. Then, well, actually, if we only have two minutes, I can show it another time.
So I will skip that. I want to thank the USGS for supporting this project, and James Bauer, with whom we have worked together for many months now, the FGDC, the GSA and data.gov development team, and the integrators of the console and REI.
And a big thanks goes to the pycsw development team, Tom Kralidis and Adam Hinz. Unfortunately, the project manager, Doug Nebert, passed away a few months ago.
So I want to dedicate this presentation to him. He was a really inspiring person who made this possible. And he was very passionate about CSW 3, so he never got to see it.
But anyway, that's life. So this is for Doug. Thank you very much.