Data.gov/Geoplatform.gov CSW implementation through pycsw and CKAN integration

Formal metadata

Title

Data.gov/Geoplatform.gov CSW implementation through pycsw and CKAN integration

Series title

FOSS4G 2014 Portland

Number of parts

188

Author

Tzotsos, Angelos

License

CC Attribution 3.0 Germany:
You may use and modify the work or its content for any legal purpose, and reproduce, distribute, and make it publicly available in unchanged or modified form, provided that you credit the author/rights holder in the manner they specify.

Identifiers

10.5446/31698 (DOI)

Publisher

Open Source Geospatial Foundation (OSGeo)

Producer

Open Source Geospatial Foundation (OSGeo)

Production year

2014

Production location

Portland, Oregon, United States of America

Content metadata

Abstract

This presentation will discuss the implementation of the CSW endpoint using pycsw within the Data.gov infrastructure (architecture, enhancements, testing, deployment) and CKAN, which powers Data.gov. CSW (Catalogue Service for the Web) is an OGC (Open Geospatial Consortium) specification that defines common interfaces to discover, browse, and query metadata about data, services, and other potential resources. Data.gov provides access to its catalog via the CSW standard through two endpoints, a first-order endpoint and an all-metadata endpoint, covering harvested data, services, and applications. Data may be referenced from federal, state, local, tribal, academic, commercial, or non-profit organizations. The first-order CSW endpoint provides collection-level filtering of all metadata records. The all-metadata CSW endpoint provides all levels of metadata at varying levels of granularity. Any client supporting CSW (desktop GIS, web application, client library, etc.) can integrate the Data.gov CSW endpoints.

FOSS4G 2014 Portland, part 71 of 188



Transcript: English (automatically generated)

00:04

I'm here to present the Data.gov CSW implementation, which is based on pycsw and integrated with CKAN.

00:20

So in the next few minutes, I'm going to show you how we implemented this project over the last six months. The outline for today: I'm going to give a short introduction to CKAN and

00:44

pycsw, and then I'm going to talk about the components of the Data.gov portal, the features we implemented during this project, and how the configuration was done. Then I'm going to demonstrate how to access the data, how

01:04

to search the catalog, and what is coming in the next months. So basically, Data.gov is the home of the U.S. government's open data. This is actually the second iteration of the project.

01:22

There you can find federal, state, and local data and resources to conduct research, build applications, create visualizations, or do whatever you want with the data, since it is open data.

01:41

The Data.gov project is currently run by the General Services Administration (GSA), and it is now completely open source, so you can find the source code on GitHub and make modifications to it. The new portal, version two, is based on CKAN, which is an abbreviation

02:05

for Comprehensive Knowledge Archive Network. It is an open source web platform for publishing and sharing open data, and it has a really impressive history of deployments so far.

02:21

These are just a few of the well-known deployments: the EU open data portal and the national open data portals of the U.K., Australia, and other places. There is also an extension of CKAN that enables CKAN

02:42

to have geospatial capabilities. It is based on PostGIS and integrates OpenLayers; it now also has Leaflet support. It can also access OGC services through OWSLib,

03:03

and it now supports CSW through pycsw. Lately, GeoJSON support has been added as well. pycsw is an OGC CSW server implemented in Python. It is an open source project, and we use the MIT license.

03:25

Currently, pycsw is fully certified by the OGC, it is now a reference implementation of CSW, and it is also in OSGeo incubation.

03:40

So this is an overview of the history of the project. Initially, there were discussions about integrating CKAN with pycsw, basically because both projects are Python-based. And since pycsw can be used as a library, there were initial discussions

04:00

in 2011 about doing such an integration. Then in 2012, OKFN (the Open Knowledge Foundation) and the Data.gov team partnered up to build this version of Data.gov. OKFN then implemented the first prototype during the first months of 2013.

04:26

Then the GSA took over the project and the extensions that had been developed, in order to continue and bring the project to deployment and to a production state. At some point, CKAN had an internal implementation of CSW, which was used in the U.K.

04:48

But there were some issues with it, so they finally chose to drop the internal plug-in and use pycsw as the official CSW library.

05:04

Once the pycsw team was actively involved in this implementation, we started working on new features to cover the needs of Data.gov, things like full-text search, repository filtering,

05:23

and connection pooling. I will talk about those later. Then, in early 2014, we released pycsw 1.8, which was based on the work we did for Data.gov.

05:40

And now, this Saturday, we are going to release pycsw 1.10 during the code sprint here. This is the version that Data.gov is going to be updated to as well. It brings new features like OpenSearch, which I'll talk about in more detail later.

06:02

So the goals of the project, from our perspective, were to deploy pycsw as a WSGI application directly at the URL you see over there, to synchronize the metadata between CKAN and pycsw, because the two projects have different database schemas,

06:24

and then to provide collection-level support, because CKAN for Data.gov uses a special CKAN extension that organizes data sets into collections. Then we had the big issue of making this run very fast.

06:45

So we had to optimize for performance during this project. We also wrote documentation and worked on the deployment configuration. This is a short diagram of how the project is implemented.

07:05

You can see that the core software is CKAN, which runs on top of PostgreSQL and PostGIS in order to be able to do spatial queries. But in order to perform fast queries, CKAN uses Solr.

07:26

The latest versions of CKAN can also do spatial queries through Solr, but PostGIS is still used because it has more features. Also, as you can see here, there

07:40

are many CKAN extensions on the left. The geodatagov extension was implemented specifically for this project. There are also other plugins, like the harvesters and the spatial extension. I'll talk later about how

08:01

we deploy using Ansible and RPMs. So pycsw was used as a search engine next to Solr in order to provide the CSW interface. CKAN is very successful because it is

08:20

targeting governments, companies, and organizations that deal with all kinds of open data, not only geospatial. So CKAN is not only geospatial software. The good thing is that it provides loosely coupled services, which can be turned on and off

08:44

as needed. It has been a very successful project, with capabilities like publishing and finding data sets, storing data, creating networks of federated nodes, harvesting between different sources

09:03

of data and metadata, and editing and managing metadata. It also has a very strong API, which can be used directly to give access to the data. I have some screenshots of what the original CKAN looks

09:22

like, where you can do searches through the user interface. CKAN is a Pylons project, so it is Python-based. Search and discovery are done through this user

09:41

interface, where you just type keywords and get results. Based on the results, you can refine your searches, or you can use the keywords or topic categories provided by the user interface. You can also look into the metadata that

10:02

are provided with the data. It can handle many kinds of metadata and harvest from many sources, which is a very powerful feature of CKAN. There is also a feature for visualizing geospatial data

10:24

sources directly on a map. It can do the same not only for maps but also for grid data, and it can create graphs. The spatial extension specifically

10:42

adds one spatial column to the PostgreSQL schema of CKAN, and it uses that column to perform queries and display the results in the front end. For this project, for Data.gov, we

11:03

tried to use only ISO XML metadata. So when we harvested from various other catalogs, we transformed the records to ISO in order to have a unified XML representation of all the metadata.

11:23

And this is the UI as it is implemented today. There are some extra capabilities added to the CKAN core; for example, the spatial extension adds the map on the left, where you can create bounding box searches.
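A bounding box search of this kind can be sketched in a few lines of Python. This is only an illustration and not the code used by CKAN's spatial extension, which runs such queries in the database or search index: the record list, the `(minx, miny, maxx, maxy)` layout, and the overlap ratio used to order the hits are all assumptions made for the example.

```python
from typing import List, NamedTuple, Tuple


class BBox(NamedTuple):
    """Bounding box as (minx, miny, maxx, maxy) in degrees."""
    minx: float
    miny: float
    maxx: float
    maxy: float

    def area(self) -> float:
        return (self.maxx - self.minx) * (self.maxy - self.miny)


def intersection_area(a: BBox, b: BBox) -> float:
    """Area shared by two boxes, or 0.0 when they do not overlap."""
    width = min(a.maxx, b.maxx) - max(a.minx, b.minx)
    height = min(a.maxy, b.maxy) - max(a.miny, b.miny)
    if width <= 0 or height <= 0:
        return 0.0
    return width * height


def fit_score(query: BBox, record: BBox) -> float:
    """Intersection over union: 1.0 is a perfect fit, 0.0 is no overlap."""
    shared = intersection_area(query, record)
    if shared == 0.0:
        return 0.0
    return shared / (query.area() + record.area() - shared)


def bbox_search(query: BBox, records: List[Tuple[str, BBox]]) -> List[str]:
    """Return titles of overlapping records, best-fitting first."""
    scored = [(fit_score(query, box), title) for title, box in records]
    hits = [item for item in scored if item[0] > 0.0]
    hits.sort(key=lambda item: item[0], reverse=True)
    return [title for _, title in hits]


# Hypothetical catalog entries, invented for this sketch.
RECORDS = [
    ("Continental US roads", BBox(-125.0, 24.0, -66.0, 50.0)),
    ("Oregon land cover", BBox(-124.6, 41.9, -116.5, 46.3)),
    ("Alaska glaciers", BBox(-170.0, 52.0, -130.0, 72.0)),
]
PORTLAND_AREA = BBox(-123.5, 45.0, -122.0, 46.0)

print(bbox_search(PORTLAND_AREA, RECORDS))
```

Intersection over union is just one possible scoring choice; it favors records whose extent is close in size to the search box over records that merely contain it.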

11:41

You can refine the search with keywords. Another nice feature is that it can rank results by spatial relevance: it gives you first the results that best fit the bounding box you provide. This is a nice way to see more of the data

12:04

that you are expecting to get. This is also how it performs the filtering that I told you about. Then, when you find the data set

12:20

that you are looking for, it provides all the resources and the metadata you need to view and download the data. From this page, which is the resource page, you can go directly to a map and view a WMS, for example, if there is such a resource.

12:43

Or you can click on the original XML file and look at it directly, and there are also HTML viewers for that. So you have different choices for how you view your data and metadata.

13:01

A big topic in this project was being able to visualize many different kinds of resources. Things like WMS and WFS are supported. You can download shapefiles,

13:21

or you can even download spreadsheets directly from the user interface. Also, as part of CKAN, there is a data set preview extension where you can see the spatial data within the viewer,

13:42

if the original resource allows it. And this is how you get to see the original XML file if you click on the URL. So this is the CSW interface, which was implemented directly in the CKAN deployment.

14:03

Here we see somewhat more than 400,000 records in the CSW endpoint. So what are the features that we implemented? Basically, we needed to replicate the way

14:21

that Data.gov handles collections. Instead of showing the end user 400,000 records, Data.gov uses collections as an intermediate layer, so you get to search something like 80,000 to 90,000

14:42

records, which are actually collections of metadata. We needed to be able to do that, and this is why we implemented a filtering process: you have your catalog with 400,000 records, and you can filter those with an SQL query,

15:06

depending on whether a record is a collection data set or not. This was needed because the CSW had to behave similarly to the CKAN search.
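The filtering idea can be sketched with an in-memory SQLite table. This is a minimal illustration under stated assumptions, not the Data.gov schema: the `records` table, its `parent_id` column, and the sample rows are hypothetical. The point is that one fixed SQL condition, chosen by the administrator, is applied as a mask to every search, so the same catalog can back both a collection-level view and an all-records view.

```python
import sqlite3
from typing import List, Optional

# Hypothetical mask: keep only top-level records (collections and
# stand-alone data sets), hiding the members of each collection.
COLLECTION_LEVEL_FILTER = "parent_id IS NULL"


def build_catalog() -> sqlite3.Connection:
    """Create a tiny in-memory catalog with one collection and its members."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE records ("
        "identifier TEXT PRIMARY KEY, title TEXT, parent_id TEXT)"
    )
    conn.executemany(
        "INSERT INTO records VALUES (?, ?, ?)",
        [
            ("col-1", "TIGER roads collection", None),
            ("ds-1", "TIGER roads, Oregon", "col-1"),
            ("ds-2", "TIGER roads, Washington", "col-1"),
            ("ds-3", "Portland bike roads", None),
        ],
    )
    return conn


def search(conn: sqlite3.Connection, keyword: str,
           repository_filter: Optional[str] = None) -> List[str]:
    """Keyword search with an optional server-side mask.

    The mask comes from trusted configuration, never from the request,
    which is why it may be concatenated into the SQL text. The keyword
    is user input and is always passed as a bound parameter.
    """
    sql = "SELECT identifier FROM records WHERE title LIKE ?"
    if repository_filter:
        sql += " AND (" + repository_filter + ")"
    sql += " ORDER BY identifier"
    rows = conn.execute(sql, ("%" + keyword + "%",)).fetchall()
    return [row[0] for row in rows]


catalog = build_catalog()
all_records = search(catalog, "roads")
first_order = search(catalog, "roads", COLLECTION_LEVEL_FILTER)
print(all_records)
print(first_order)
```

Keeping the mask in configuration rather than in the query path means that clients of the filtered endpoint cannot bypass it, whatever constraint they send.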

15:21

Then we did some work on database connection pooling for WSGI in order to be able to deploy within the Data.gov environment. The most important feature in this process was PostgreSQL full-text search.

15:43

This was the feature that made the CSW endpoint really fast. There is no problem searching those 400,000 records or even more, because PostgreSQL has this nice feature that

16:00

can index all the data. Then we did some work on link type detection. The big problem is that on Data.gov you get resources from all those organizations around the U.S., where you have data

16:21

like shapefiles within zip archives, where you don't know what is inside. This was a pretty tough problem, and we didn't actually solve it, but we worked towards managing it. Another problem is when you have a WMS that doesn't

16:43

have the word WMS in the URL. What do you do then? You might get OWS. What is that? Is it WMS? Is it WFS? There were problems like that, and the original metadata provided by many organizations didn't include the link type.

17:02

So we didn't know what kind of data it was. We wrote some heuristics to solve the easy cases, but other cases remain unsolved. Now we need to ask the organizations to provide new metadata, which is something that nobody wants to do.
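A heuristic of the kind described here might look like the following sketch. It is not the code written for Data.gov; the rules, their order, and the returned labels such as `OGC:WMS` are assumptions chosen for illustration. It solves the easy cases (an explicit `service` parameter, a service name in the path, a telling file extension) and returns `None` for the ambiguous ones, such as a bare OWS endpoint.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlparse

OGC_SERVICES = ("wms", "wfs", "wcs", "csw")

# Hypothetical mapping from file extension to a coarse link type.
EXTENSION_TYPES = {
    ".zip": "archive",      # could hold a shapefile, but we cannot tell
    ".csv": "spreadsheet",
    ".xls": "spreadsheet",
    ".xlsx": "spreadsheet",
    ".kml": "KML",
    ".xml": "XML",
}


def guess_link_type(url: str) -> Optional[str]:
    """Guess what a resource URL points to, or None if it is unclear."""
    parsed = urlparse(url)
    path = parsed.path.lower()

    # Rule 1: an explicit service=... query parameter is the best hint.
    query = {key.lower(): values[0].lower()
             for key, values in parse_qs(parsed.query).items()}
    if query.get("service") in OGC_SERVICES:
        return "OGC:" + query["service"].upper()

    # Rule 2: a service name appearing as its own token in the path.
    for service in OGC_SERVICES:
        if re.search(r"(?<![a-z])" + service + r"(?![a-z])", path):
            return "OGC:" + service.upper()

    # Rule 3: fall back to the file extension.
    for extension, link_type in EXTENSION_TYPES.items():
        if path.endswith(extension):
            return link_type

    # Anything else, e.g. ".../ows" with no parameters, stays unknown.
    return None


for example in (
    "http://example.org/geoserver/ows?service=WMS&request=GetCapabilities",
    "http://example.org/services/wfs",
    "http://example.org/data/roads.zip",
    "http://example.org/geoserver/ows",
):
    print(example, "->", guess_link_type(example))
```

Returning `None` rather than a guess keeps the unsolved cases visible, so they can be reported back to the organization that published the metadata.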

17:22

The last addition to our features is that we are the first to implement the OGC OpenSearch Geo and Time Extensions, because they appeared about two months before we released, and it was something that people wanted. This is the extension that lets

17:44

you run OpenSearch queries that provide a bounding box and not only keywords. This was implemented in pycsw, and it is going to be released on Saturday.
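As a rough sketch, an OpenSearch query with a bounding box can be assembled as a plain URL. The endpoint below is a placeholder, and the parameter names (`mode=opensearch`, `q`, `bbox`, `time`) follow pycsw's OpenSearch support as I recall it from the pycsw documentation; treat them as assumptions and check them against the OpenSearch description document that the target service publishes.

```python
from typing import Optional, Sequence, Tuple
from urllib.parse import parse_qs, urlencode, urlsplit

# Placeholder endpoint, not a real service.
ENDPOINT = "https://example.org/csw"


def opensearch_url(endpoint: str, keywords: str,
                   bbox: Optional[Sequence[float]] = None,
                   time_range: Optional[Tuple[str, str]] = None) -> str:
    """Build a keyword query, optionally limited in space and time.

    bbox is (minx, miny, maxx, maxy) in degrees; time_range is a pair
    of ISO 8601 strings (start, end).
    """
    params = {
        "mode": "opensearch",      # assumed switch for OpenSearch output
        "service": "CSW",
        "version": "2.0.2",
        "request": "GetRecords",
        "typenames": "csw:Record",
        "elementsetname": "brief",
        "resulttype": "results",
        "q": keywords,
    }
    if bbox is not None:
        params["bbox"] = ",".join(str(value) for value in bbox)
    if time_range is not None:
        params["time"] = "/".join(time_range)
    return endpoint + "?" + urlencode(params)


url = opensearch_url(
    ENDPOINT,
    "land cover",
    bbox=(-123.5, 45.0, -122.0, 46.0),
    time_range=("2013-01-01", "2014-01-01"),
)
print(url)

# Decode the query string again to show what the server would receive.
decoded = parse_qs(urlsplit(url).query)
print(decoded["q"], decoded["bbox"], decoded["time"])
```

Because the request is an ordinary GET URL, it can be pasted into a browser or wired into any OpenSearch-aware client without a CSW library.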

18:02

But it is already in the Data.gov CSW deployment. Apart from what we did for this project, I'm just showing you some features that pycsw already has, like harvesting WMS and WCS

18:21

and all those standards. We also support ISO and FGDC metadata. We implemented INSPIRE, using the INSPIRE documentation to implement the service.

18:43

We also support many databases: we can work with SQLite, PostgreSQL, or whatever else SQLAlchemy supports. More features.

19:01

We try to keep it simple. The configuration of pycsw can be done in four minutes; it is very simple. We also have a very extensible plug-in architecture: if you have a database with your own metadata schema, you can create a plug-in so

19:21

that pycsw can understand your schema and answer queries against your database. This does require a bit of coding, though. pycsw is already integrated with portals and other Python projects, like GeoNode and Open Data Catalog.

19:42

We can also do real-time XML schema validation. These are the standards that we currently support; I won't go through the list, it's a long one. So how did we actually configure and deploy it?

20:03

It was a long process, not something that was done in a few days. Also, we didn't have any access to the physical machines. We had to provide every step of the deployment by email or something similar, because we didn't have access.

20:22

So we had to automate every single thing that would otherwise be done through a terminal. There was already work done there: the Data.gov people were using Ansible for that.

20:41

So we just picked that up and provided Ansible scripts to do the deployment. The tricky part is that Ansible is not run directly against the servers. Ansible is used to create the RPM packages that are then deployed to the GSA servers.

21:04

So it was a two-level deployment scheme: we first create the RPM packages, and then those RPM packages are sent to the production servers. This way, every month or so, one month, I think,

21:22

the GSA updates the portal with new patches and new features. This is an easy way for the administrators to upgrade their system. There are three types of clusters. We have a database cluster and a front-end cluster, which is CKAN and pycsw.

21:41

And also, we have the cluster for harvesters, because harvesters are actually doing the hard work. They're harvesting from every source of metadata that is available around the US. So it needed a specific set of hardware. And we are using CentOS as the operating system to do that.

22:04

Now, OK, the catalog is here. How do we use it? We have CKAN. We can do it through the UI. But why do we have CSW? How can we use the CSW? So part of the project was to document this process.

22:23

And this is why we created a set of documents so that the users can actually understand first and then do some simple CSW requests. So here, there are some links to the documentation.
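A simple CSW request of the kind mentioned here can be sketched with the standard CSW 2.0.2 key-value-pair parameters. The endpoint URL below is an assumption for illustration and may not match the one given in the documentation; the search term is a placeholder. The sketch only builds the request URLs and does not send them.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed endpoint URL, for illustration only; take the real one from
# the data.gov CSW documentation.
CSW_ENDPOINT = "https://catalog.data.gov/csw"


def get_capabilities_url(endpoint=CSW_ENDPOINT):
    """Build a GetCapabilities request, the usual first call to a CSW."""
    params = {"service": "CSW", "version": "2.0.2", "request": "GetCapabilities"}
    return endpoint + "?" + urlencode(params)


def get_records_url(search_term, endpoint=CSW_ENDPOINT, max_records=10):
    """Build a GetRecords request with a CQL full-text constraint."""
    params = {
        "service": "CSW",
        "version": "2.0.2",
        "request": "GetRecords",
        "typeNames": "csw:Record",
        "elementSetName": "brief",
        "resultType": "results",
        "constraintLanguage": "CQL_TEXT",
        "constraint_language_version": "1.1.0",
        "constraint": f"csw:AnyText like '%{search_term}%'",
        "maxRecords": str(max_records),
    }
    return endpoint + "?" + urlencode(params)


capabilities_url = get_capabilities_url()
records_url = get_records_url("water")
print(capabilities_url)
print(records_url)
```

The resulting URLs can be pasted into a browser or fetched with any HTTP client; the response is an XML document.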

22:41

Actually, this presentation is available on the internet. I will give you the URL on the last slide. But also, there are many other tools that somebody can use to access the data. One of them is QGIS, actually, where there is the MetaSearch plug-in, which is now a core QGIS

23:05

plug-in. So whenever you install QGIS 2.4 or a later version, when that comes, you are able to use the MetaSearch plug-in to actually search the metadata from data.gov or any other CSW

23:20

server out there. And also, you can use any other CSW client that you can have. So I already told you about how data.gov organizes the data in collections. So these are the two endpoints that we provided

23:42

for the CSW implementation. So we have the first-order, collection-level filtering. This is the URL where you can search the collections. And then when you search the collections, you can actually specify the ID of the collection in order to go to the second endpoint

24:01

and search within the collection. So there are actually two interfaces instead of one, just to be able to reproduce the workflow that you can do on the CKAN UI. So since data.gov uses CKAN and pycsw,

24:22

others became very interested in that. So suddenly, we hear from other organizations that want to reproduce this installation. I have met many people here. I'm very happy to be here and being part of this process.

24:41

And a few weeks ago, NOAA deployed the CSW extension there. And you see, there are more coming. So we have also a live deployment map where you can find who is doing that.

25:01

What did we learn from this process? The first thing that we learned is that you have to optimize your database. This is something that cannot be done without database tuning. And actually, databases need to be really fine-tuned

25:22

in order to be able to do this kind of heavy lifting of metadata and searches and everything. Also, it's very important, well, you have to create packages. And you have to be very careful of the dependencies and how you deploy this software onto your production

25:44

machines. So it all came down to using the full text search of Postgres in order to be able to be really, really fast. And this is also part of the upgrade of data.gov to the latest Postgres 9.3, which actually was very fast.
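To give an idea of what a Postgres full text search looks like, here is a minimal sketch that builds a parameterized query using the built-in `to_tsvector` and `plainto_tsquery` functions. The table name `records` and the columns `identifier` and `anytext` are hypothetical, and the `%s` placeholders assume a psycopg2-style driver; this is not necessarily how the data.gov deployment structures its queries.

```python
def full_text_query(search_term, table="records", column="anytext", limit=10):
    """Return (sql, params) for a Postgres full text search.

    Table and column names are hypothetical. The search term is passed
    as a bound parameter, never interpolated into the SQL text.
    """
    sql = (
        f"SELECT identifier FROM {table} "
        f"WHERE to_tsvector('english', {column}) "
        "@@ plainto_tsquery('english', %s) "
        "LIMIT %s"
    )
    return sql, (search_term, limit)


sql, params = full_text_query("water quality")
print(sql)
print(params)
```

In a real deployment the `tsvector` would normally be stored in its own column with a GIN index, so that it is not recomputed for every row at query time.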

26:08

Another thing that we learned was that identifiers are very, very important. And what I mean by that, the original metadata files had,

26:32

so the original XML records from all the sources that we harvest have some identifiers in order

26:43

to be able to index the data. So CKAN does something a bit differently. When it harvests, it creates new identifiers for the records in order to be able to manage the metadata internally.

27:00

And that was very important, because we had to find ways to synchronize or to couple the identifiers, the original ones and the new ones. So this was a bit of trouble. But in the end, it was made possible. Also, there were some problems where the original metadata were not

27:26

the best we could ever find. I mean, there were issues there. And sometimes we had to actually ask for updates. And this is a work in progress. And some issues are still there. But they are going to be fixed eventually.
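The coupling of original and CKAN-generated identifiers described a moment ago can be pictured as a two-way lookup. The sketch below is purely illustrative: the function name, the data shape, and the sample identifiers are all hypothetical and do not reflect how the actual synchronization was implemented.

```python
def couple_identifiers(harvested):
    """Build two-way lookups between original and CKAN identifiers.

    `harvested` is an iterable of (original_id, ckan_id) pairs; this data
    shape is an assumption made for the example.
    """
    by_original = {}
    by_ckan = {}
    for original_id, ckan_id in harvested:
        # A duplicate on either side would make the coupling ambiguous.
        if original_id in by_original or ckan_id in by_ckan:
            raise ValueError(f"duplicate identifier: {original_id!r} / {ckan_id!r}")
        by_original[original_id] = ckan_id
        by_ckan[ckan_id] = original_id
    return by_original, by_ckan


# Hypothetical sample identifiers.
pairs = [
    ("gov.example.agency:dataset-001", "ckan-id-aaa"),
    ("gov.example.agency:dataset-002", "ckan-id-bbb"),
]
by_original, by_ckan = couple_identifiers(pairs)
print(by_original["gov.example.agency:dataset-001"])
print(by_ckan["ckan-id-bbb"])
```

With both lookups available, a CSW request that uses the original identifier can be resolved to the internal record, and the other way around.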

27:43

So in the future, what we want to do is to do a deeper integration in terms of not synchronizing between two databases, but using only one database. And this is also a work in progress. The next big thing is going to be

28:01

CSW3, which is actually not yet released by OGC. But we are already working on that. And I think at some point in the next few months, it will be released. So we have some time here to show you a couple of things, like the actual UI and how you can perform searches directly

28:23

on CKAN. So this is the data.gov portal. Don't have enough time? OK. I can two minutes. OK. Then I'll just, well, actually, if we have two minutes, I can show it another time.

28:42

So I will skip that. So I want to thank the USGS for supporting this project, and James Bauer, with whom we have worked together for many months now, the FGDC, and the GSA and data.gov

29:00

development team, and the integrators of the console and REI. And a big thanks goes to the pycsw development team, Tom Kralidis and Adam Hinz. Unfortunately, the project manager, who was Doug Nebert, passed away a few months ago.

29:23

So I want to dedicate this presentation to him. He was a really inspiring person who made this possible here. And he actually, he was very passionate about CSW3. So he never got to see it.

29:43

But anyway, that's life. So this is for Doug. Thank you very much.
