Supporting Open Data with Open Source
Formal Metadata

Title: Supporting Open Data with Open Source
Number of Parts: 188
License: CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/31643 (DOI)
Production Year: 2014
Production Place: Portland, Oregon, United States of America
FOSS4G 2014 Portland, 126 / 188
Transcript: English(auto-generated)
00:00
Good afternoon, folks. Thank you all for coming late in the day. My name is Micah Wengren. I'm with NOAA, the National Oceanic and Atmospheric Administration, a U.S. federal agency. On behalf of my co-author, Jeff de La Beaujardière, the NOAA data management architect, I'm
00:21
going to be discussing the topic of supporting open data with open source.
00:43
Moving right along. This talk is kind of divided into two different parts. The first segment, I'm just going to talk a little bit background on the topic of open data and what it means in the context of this presentation and also in the context of the U.S. federal government at this stage.
01:04
Back in the middle of 2012, there was a presidential memorandum released federal government-wide entitled Building a 21st Century Digital Government. The real message of that was trying to codify some specific ways in which the government
01:25
could increase usage of its services and also just improve the overall digital experience of the citizens of the U.S. That was intended as a broad umbrella document with more specific follow-on policies to come
01:45
later on. Most relevant here is what's called the Project Open Data or Open Data Policy, which followed last year in May 2013 in the form of an executive order titled Making Open and Machine Readable the New Default for Government Information.
02:04
So this was a specific policy that had some requirements placed on federal agencies and departments to release their data where appropriate in open interoperable formats
02:20
with open licenses as well. So the main message of this policy was really just to treat government data and investments in government data as an asset, so recognizing the intrinsic value of those investments and the intrinsic value of the data itself.
02:41
So the policy actually cited a few examples of historic releases of open data by the government, and those included both the GPS system, which I think is particularly relevant here. Everyone knows of the value of GPS in our current lives. So that initially was a private closed system developed by the Department of Defense.
03:03
It was released in the early 90s when it was completed for public use. The second example is actually weather data that's released by my agency, NOAA. And NOAA has traditionally been an open data agency in that regard. And in both cases, there are really large industries that have been built exclusively
03:24
off of that data. So crafty developers and entrepreneurs who have innovated and created value-added services on top of the data. So really the core of the Project Open Data executive order is to delineate a specific
03:47
metadata schema, which consists of both a vocabulary and a data format for describing data sets that an agency releases. So the format used in the policy is JSON, which we're probably all familiar with.
04:01
And the vocabulary is sourced from what had been previously common geospatial metadata or other metadata descriptive vocabulary words. I should also mention that the schema itself is released on GitHub in the spirit of open
04:21
source. So the creators of the policy really wanted to embrace the spirit of open source and to get input from both users of the actual data and the schema, as well as implementers like federal workers ourselves, such as myself.
04:42
So a little bit more detail on the actual files themselves. The executive order essentially mandated that each federal department list its open data at a particular prescribed location on the web. So public users could count on accessing these data.json files, sometimes very massive
05:07
files, just a word of warning. Don't try to parse them at home on your 486 or something. So basically, the policy dictated that these be published to a particular URL.
05:22
So there was some consistency there. And I also wanted to, I don't know if that's really visible, but this is just a small example screen capture of part of one data set that NOAA produced to comply with the policy. And I also listed a few of the schema elements.
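As a rough, hypothetical sketch of what the talk is describing (the actual NOAA file is far larger), a single dataset entry in the data.json style might be built and serialized like this; the field names echo the Project Open Data vocabulary mentioned above, but the values and the dataset itself are invented for illustration:

```python
import json

# Hypothetical dataset entry in the Project Open Data data.json style.
# Field names follow the schema discussed in the talk; the values are
# invented for illustration and are not a real NOAA record.
dataset = {
    "title": "Example Coastal Bathymetry Survey",
    "description": "Illustrative bathymetric survey dataset.",
    "keyword": ["bathymetry", "coastal", "oceans"],
    "modified": "2014-01-01",
    "publisher": "Example Program Office",
    "identifier": "example-dataset-001",
    "accessLevel": "public",
    "accessURL": "https://example.gov/data/bathymetry",
}

# An agency's data.json is essentially a list of such entries,
# published at a prescribed URL.
print(json.dumps([dataset], indent=2))
```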
05:40
So if you're familiar with geospatial metadata, you can get the idea that there's some carryover between some of the common language used there. So in order to meet this mandate, NOAA, as I mentioned before, has traditionally been an open data agency. And we're composed of several data centers that have been releasing data available for
06:07
free online for a number of years and who have, as a result, developed their own catalog systems, have their own inventories to kind of facilitate that data access. However, we needed a way to essentially merge that existing information into a single
06:26
output file, this data.json file, which would then be fed up the chain to the Department of Commerce. NOAA is actually an agency under DOC. And in order to do that, the decision was made to deploy a centralized data catalog
06:45
that would be able to harvest from these existing remote catalogs. That catalog is based on CKAN software, which is open source, and it was actually a collaboration between NOAA and the Department of the Interior through an existing interagency
07:02
working group called the Federal GeoCloud to kind of co-develop these systems, one to be deployed for our Department of Interior use and then NOAA to deploy our own. And the way this system works is first by harvesting the remote data inventories
07:22
and making use of a plug-in that's been developed for CKAN related to Project Open Data that can handle that translation from the native metadata format to data.json. So just a little workflow diagram, I guess, of what the catalog does. It takes in the existing data and does that translation.
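A loose sketch of that merge step, assuming each data center's inventory is already a list of dicts carrying a hypothetical "identifier" key (real CKAN harvesting is far more involved than this):

```python
def merge_inventories(*inventories):
    """Merge per-catalog dataset lists into one data.json-style list.

    Loosely mimics the centralized harvest described in the talk: the
    first catalog to supply a given identifier wins, so duplicates seen
    in later catalogs are dropped. This is an illustrative sketch, not
    the actual CKAN harvester logic.
    """
    merged = {}
    for inventory in inventories:
        for dataset in inventory:
            merged.setdefault(dataset["identifier"], dataset)
    return list(merged.values())

# Hypothetical per-center inventories, invented for the example.
center_a = [{"identifier": "sst-monthly", "title": "Monthly SST"}]
center_b = [
    {"identifier": "dem-coastal", "title": "Coastal DEM"},
    {"identifier": "sst-monthly", "title": "Monthly SST (duplicate)"},
]
combined = merge_inventories(center_a, center_b)
```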
07:42
It also adds the benefit of a CSW endpoint for query and data access, as well as a native web API that CKAN provides. So that's kind of some context, I guess, for the rest of my talk.
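For a feel of what a CSW query looks like on the wire, here is a sketch of building a GetRecords request in its key-value (GET) form. The parameter names are standard OGC CSW 2.0.2 keywords, but the endpoint URL is a placeholder and the exact request the NOAA catalog issues may differ:

```python
from urllib.parse import urlencode

def csw_getrecords_url(endpoint, query, max_records=10):
    """Build a CSW 2.0.2 GetRecords request URL in KVP (GET) form.

    The parameter names are standard OGC CSW keywords; the endpoint
    is a placeholder, not the actual NOAA catalog address.
    """
    params = {
        "service": "CSW",
        "version": "2.0.2",
        "request": "GetRecords",
        "typeNames": "csw:Record",
        "elementSetName": "summary",
        "resultType": "results",
        "maxRecords": max_records,
        "constraintLanguage": "CQL_TEXT",
        "constraint": "AnyText LIKE '%{}%'".format(query),
    }
    return endpoint + "?" + urlencode(params)

url = csw_getrecords_url("https://catalog.example.gov/csw", "bathymetry")
```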
08:01
What I want to focus on is a particular full open source stack that we're, I guess, experimenting with deploying in NOAA. It's not necessarily operationally used at the moment, but I just wanted to take some time to kind of illustrate how a few well-known open source projects that we're all familiar
08:24
with here can work together in compliance with Project Open Data. So the first of those is GeoServer. As we all know, that's a spatial data hosting platform for OGC services. The second is GeoNode, which is essentially a web-based geospatial content management system
08:42
that's built to sit on top of GeoServer and kind of provide a dynamic, modern user interface to allow users to discover and access the underlying GeoServer services. And the last, of course, is CKAN, which I've spoken about before.
09:02
So NOAA's background with GeoServer, you know, historically over the years, GeoServer's certainly been used kind of piecemeal in different offices in the agency, along with other open source spatial data hosting systems. However, it hadn't really been used as an enterprise-wide solution until 2011, 2012,
09:27
when the NOAA High Performance Computing and Communications Project chose to fund, basically, a project to set up a prototype GeoServer that could be deployed agency-wide and used by individual office data providers who maybe didn't have the resources
09:45
to run GeoServer themselves, who could just rely on a shared solution to publish their data. So funding through that project was provided to OpenGeo to provide a few enhancements to GeoServer, the first of which was to kind of finalize some work that had been done on the security subsystem
10:06
to enable some enterprise integration capabilities like LDAP authentication. And the second was having first-class support for isolation. So essentially, you know, improved user management permission system
10:21
so that you could restrict users to only have access to their information and not across the board, which is obviously essential for an enterprise deployment. So as a result, the NOAA GeoServer hosting environment has been online for about two years for testing and evaluation at the URL listed here.
10:43
This has been a prototype and wasn't really planned for operational transition. However, just wanted to highlight that this past year, the Weather Service, as part of their integrated dissemination program, chose GeoServer alongside Esri ArcGIS Server for production geospatial hosting service.
11:05
So there will be some production web services running off of GeoServer and NOAA in the near future, which I think is pretty cool. So I'm going to kind of step through this stack, this open source, open data stack that we've been testing out.
11:21
So the first layer, I guess, obviously is GeoServer. This is a bit of a simplification. GeoServer provides many additional service types. I just wanted to highlight WMS and WFS, which is what we've primarily used in our incubator prototype system.
11:41
And of course, PostGIS should also be mentioned, because PostGIS and PostgreSQL are the underlying data storage backbone for our GeoServer instance. And they're also used in each of the other components of the stack as well. So I guess the second tier in the system is GeoNode.
12:02
And for those who aren't familiar, GeoNode is a web-based geospatial content management system. It's really pretty tightly coupled with GeoServer. So you essentially pair a GeoNode instance with a GeoServer instance. And it gives that kind of modern web user interface.
12:22
It's really good for data discovery. It has fine-grained permission controls and other things. So NOAA's history with GeoNode goes back a few years as well. It was actually included as part of the Federal Geocloud Interagency Working Group in 2012.
12:42
So a NOAA group participated and had a proposal accepted to participate in that, which basically had set up a shared infrastructure for transition of agency-hosted geospatial services to the cloud, to Amazon Web Services.
13:01
So we collaborated with them, tuned the system a little bit so GeoNode would run on it, and have kind of been tinkering with it ever since, I guess. However, even though our NOAA node system isn't publicly deployed yet, partway through the project the Department of Energy came along and decided that they were actually interested in using GeoNode.
13:24
So they were able to essentially use our infrastructure as a starting point and deploy their own GeoNode-based system called NEPAnode at the URL below. And that's related to the National Environmental Policy Act. So I'm going to kind of step through some quick highlights of GeoNode features for those who don't know.
13:44
So this is a screen capture. It kind of brings to life individual data layers within GeoServer service. So a user can go to the GeoNode site, search by some common fields such as title, abstract, can filter by ISO topic category, keywords, and of course if there's temporal information,
14:06
they can filter by that as well. GeoNode also includes an integrated CSW service. This is critical for this overall stack design, as you'll see later. By default that's based off of pycsw currently, but it can also be kind of plug and play with other systems.
14:26
So if you want to use something like GeoNetwork, that's available as well. So that provides a good connection point with desktop GIS. So for a QGIS user who is using the spatial search extension
14:40
or any other extension that can talk to a CSW service, Esri ArcMap as well, it's a great data discovery tool. So data access, GeoNode, as I mentioned, is pretty tightly coupled with GeoServer, so it understands the output formats that GeoServer provides. So once a user has logged in and found the data that they're looking for,
15:04
it provides a convenient endpoint list. It's very easy to download information directly. Additionally, there's kind of two different ways where you can upload data to GeoNode. So it can be configured so that a user can log in through the web interface.
15:25
If they have a spatial data set they want to share, they can interactively basically push it to GeoNode, fill out some relevant metadata, and GeoNode will push it back to the GeoServer level automatically.
15:41
There's also the opposite approach, which is actually taking data from an existing GeoServer and sucking it into GeoNode. So either way, once your GeoNode instance is populated with data layers, you get some neat capabilities. There's an integrated metadata editor, which I have here,
16:04
so if there's some information lacking from the native metadata, you have the option to fill it out via the user interface. There's also pretty fine-grain access control, so you can share data with particular users if you want, groups of users, or you can just publish publicly as well.
16:25
Very recently within GeoNode, there's been some work done on some pretty cool new features, first of which is remote services. So, you know, GeoNode is really meant to run off of GeoServer. However, it does have the capability, sort of fledgling capability,
16:42
to connect to a remote ArcGIS server endpoint and be able to parse layers from the REST API for ArcGIS server, as well as remote WMS and WFS servers and some others. Secondly among those is GeoGit. So for those who don't know, GeoGit is very similar to Git.
17:04
It's basically a versioned editing for geospatial data. And with some recent work done through some GeoNode partners, GeoNode provides kind of a read access. So if you configure GeoServer with a GeoGit repository,
17:21
that edit history for your geospatial data can be read by GeoNode, displayed within the user interface. And there's also an external mapping client called MapLoom that will actually handle the editing side of it as well. So if you have a spatial data set, you can configure a GeoNode instance
17:41
to work with MapLoom and provide disconnected editing and also two-way sync with a remote GeoGit repository. So it's a pretty powerful data editing workflow. What did you call the editor? That's MapLoom. MapLoom? Yeah, there's actually a presentation tomorrow or Friday. I forget which, so check it out.
18:03
So the last feature that I wanted to mention in GeoNode is Maps. So of course, once you populate it with this variety of layers, you can create this integrated map mashup, have the same permissions to share it with users who you choose.
18:23
So getting back to the architecture diagram, NOAA Node is our NOAA-themed GeoNode instance. It sits mostly on top of GeoServer. It talks to GeoServer via the REST API and adds that CSW endpoint for data discovery
18:41
as well as the interactive catalog. So moving along, CKAN. So via that CSW endpoint in GeoNode, CKAN can actually take that as a remote harvesting point.
19:01
So as I mentioned before, in our NOAA data catalog, we are harvesting several remote catalogs currently. And via the GeoNode CSW, that integration can happen there as well. So any layers that you include in your GeoNode instance can be automatically harvested by CKAN.
19:22
And there's maybe some similarities between the two products. CKAN kind of takes a little bit more of a data catalog approach to presentation. It does a good job of parsing out fields from spatial metadata and presenting it in an approachable, user-friendly way.
19:41
It's good at parsing out online resource linkages so users can have direct access there to the endpoints you want them to use to access your data. It's also pretty efficient in terms of search. So it has a back-end Apache Solr instance that can be configured to handle spatial search as well.
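That search is also exposed programmatically through CKAN's action API. As a hedged sketch, the package_search action is a standard part of CKAN's API, though the base URL below is a placeholder rather than the actual NOAA deployment:

```python
from urllib.parse import urlencode

def ckan_search_url(base, query, rows=20):
    """Build a CKAN Action API package_search request URL.

    package_search is a standard CKAN action; the base URL used here
    is a placeholder, not necessarily how data.noaa.gov is deployed.
    """
    return "{}/api/3/action/package_search?{}".format(
        base, urlencode({"q": query, "rows": rows})
    )

url = ckan_search_url("https://data.example.gov", "sea surface temperature")
```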
20:04
So it's pretty powerful. And it kind of really sits alongside GeoNode in this system. And of course, it can handle the data.json translation. So if that's of interest to anyone out there, especially federal users.
20:23
So the other thing that I wanted to mention is CKAN has some interactive mapping capabilities as well. So if your GeoNode instance provides, or really any spatial metadata provides, a WMS GetCapabilities endpoint, CKAN has a native map preview tool.
20:43
So good interactive capability there. GeoNode actually has to be a little bit modified. I had to tweak it myself to provide this capability. But that's something that hopefully will be merged back in with the core at some point.
21:02
So in our diagram here, data.noaa.gov, that's our CKAN instance. So you can see I put them side by side. Really, it just kind of complements your existing GeoNode site, provides that remote harvest capability,
21:20
as well as integration with any external catalogs that you may want to use. So lastly, data.gov, for those who aren't familiar, this is sort of the behemoth of US federal open data. This is the federal government's open data catalog.
21:43
It's also CKAN-based. And it works very similarly to the NOAA data catalog. It does remote harvests of a whole variety of existing federal geospatial metadata sources. I think the plan is to have it, at some point,
22:02
exclusively harvest the data.json files. I don't think that's quite implemented yet. But nonetheless, via one means or another, it's the merged collection of all federal open data, according to the open data policy. So again, it's kind of just a bit alongside the core of the stack
22:25
that I wanted to highlight. But nonetheless, in the federal space, data.gov is certainly important and based off of some of the same software. So just a few take-home points that I wanted to make. Hopefully, I've kind of shown how these open source technologies
22:45
can be used together to create a full open data stack for geospatial data that complies with the federal open data policy, if that's of interest to you. NOAA, as an agency, is trying to continue its role and leadership in the open data world.
23:05
We're keeping up, of course, with the latest policies as much as possible. Lastly, I think really, getting back to the original slide, which I mentioned the digital government strategy, one of the main goals of that was to develop a shared platform
23:23
for federal IT infrastructure. And I think the work that's been done on CKAN related to the open data policy really kind of illustrates a good example of leveraging open source software. So I think if you really read the digital government strategy that way,
23:42
it's really kind of encouraging not only the use of open source software, but also contributions. So as a community of IT users in the federal government, why shouldn't we work together to kind of develop a common product on our own and collaborate as opposed to sit around and wait for someone else to do it
24:03
or to go out and buy the same thing many, many times? It doesn't really make sense. So lastly, I just wanted to mention a lot of this work that I've been involved with over the last few years. It wouldn't have been possible without the support of a guy by the name of Doug Nebert.
24:21
He passed away this year, tragically, but really, a lot of this and a lot of other advancements in the federal geospatial space wouldn't have been possible without Doug's leadership. So I just wanted to give credit where credit is due. I owe a great debt of gratitude to him.
24:40
If anyone has any questions, I'd be happy to try to answer them, or you can reach out to either myself or Jeff via email or Twitter. First, thank you. No questions at all?
25:03
Apparently, you've seen... oh, yeah, there's one. Has anybody created AMIs for GeoNode and CKAN that are publicly available? There must be.
25:20
I don't know for sure. The work done with the GeoCloud, I think there's no reason that it couldn't be just baked into an AMI. I don't know for sure, but I'm guessing they have to be out there.
25:47
This is probably going to expose my ignorance, but how does the GeoJSON deal with raster datasets? Oh, the data.json? Yeah. Yeah, so it's really leveraging JSON as kind of a metadata format.
26:05
In terms of actual encoding of spatial data, it doesn't really do that. I probably should have made that graphic a little bigger, but it basically provides the associated metadata to the dataset,
26:21
along with, say, like an access URL. So whether it's just an open dataset published on the web, or if it's a web API, it'll contain that link. But in terms of actual data encoding, it doesn't really do that. Anybody else? I'm not familiar with GeoNode and CKAN as much as I'd like,
26:46
but when would users use one versus the other, since you're speaking about both? Yeah, that's a good question. I think really CKAN is kind of a good entry point to the actual data,
27:06
or to a dataset that exists in GeoNode. So if you do that connection, the CSW connection between the two, one thing that CKAN does well is it's indexed well by Google.
27:22
So for instance, if someone does a Google search, they can find the page on a CKAN site and then be directed to GeoNode to kind of have the more interactive mapping capabilities. So I think it would kind of flow that way, most likely. Maybe I could add a note to this.
27:42
There are two different worlds, the so-called open data world and the so-called open geodata world. They each have slightly different standards. They kind of don't communicate with each other that well yet, and we have to talk to the... We, as geo guys, should talk more to the open data guys
28:02
to connect those worlds more in a standard way. Okay. I assume everybody is looking forward to the next session, which is called Drinks in the Hall, I think. So thank you for coming.