
Supporting Open Data with Open Source


Formal Metadata

Title: Supporting Open Data with Open Source
Number of Parts: 188
Author: Micah Wengren; Jeff de La Beaujardière
License: CC Attribution 3.0 Germany. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Production Year: 2014
Production Place: Portland, Oregon, United States of America

Content Metadata

Abstract
Within the US Federal Government, there is a trend towards embracing the benefits of open data to increase transparency and maximize potential innovation and resulting economic benefit from taxpayer investment. Recently, an Executive Order was signed specifically requiring federal agencies to provide a public inventory of their non-restricted data and to use standard web-friendly formats and services for public data access. For geospatial data, popular free and open source software packages are ideal options to implement an open data infrastructure. NOAA, an agency whose mission has long embraced and indeed centered on open data, has recently deployed or tested several FOSS products to meet the open data executive order. Among these are GeoServer, GeoNode, and CKAN, or Comprehensive Knowledge Archive Network, a data management and publishing system. This talk will focus on how these three FOSS products can be deployed together to provide an open data architecture exclusively built on open source. Data sets hosted in GeoServer can be cataloged and visualized in GeoNode, and fed to CKAN for search and discovery as well as translation to open data policy-compliant JSON format. Upcoming enhancements to GeoNode, the middle tier of the stack, will allow integration with data hosting backends other than GeoServer, such as Esri's ArcGIS REST services or external WMS services. We'll highlight NOAA's existing implementation of the above, including the recently-deployed public data catalog, https://data.noaa.gov/, and GeoServer data hosting platform, as well as potential build out of the full stack including the GeoNode integration layer.
Transcript: English (auto-generated)
Good afternoon, folks. Thank you all for coming late in the day. My name is Micah Wengren. I'm with NOAA, the National Oceanic and Atmospheric Administration, a U.S. federal agency. On behalf of my co-author, Jeff de La Beaujardière, the NOAA data management architect, I'm going to be discussing the topic of supporting open data with open source.
Moving right along. This talk is divided into two parts. In the first segment, I'm going to give a little background on the topic of open data and what it means in the context of this presentation and of the U.S. federal government at this stage.
Back in the middle of 2012, there was a presidential memorandum released across the federal government entitled Building a 21st Century Digital Government. The real message of that was to codify some specific ways in which the government could increase usage of its services and improve the overall digital experience for citizens of the U.S. It was intended as a broad umbrella document, with more specific follow-on policies to come
later on. Most relevant here is what's called Project Open Data, or the Open Data Policy, which followed last year, in May 2013, in the form of an executive order titled Making Open and Machine Readable the New Default for Government Information.
So this was a specific policy that had some requirements placed on federal agencies and departments to release their data where appropriate in open interoperable formats
with open licenses as well. So the main message of this policy was really just to treat government data and investments in government data as an asset, so recognizing the intrinsic value of those investments and the intrinsic value of the data itself.
So the policy actually cited a few examples of historic releases of open data by the government. The first is the GPS system, which I think is particularly relevant here; everyone knows the value of GPS in our lives today. That initially was a closed system developed by the Department of Defense.
It was released for public use in the early 90s when it was completed. The second example is weather data released by my agency, NOAA, which has traditionally been an open data agency in that regard. And in both cases, there are really large industries that have been built exclusively
off of that data. So crafty developers and entrepreneurs have innovated and created value-added services on top of the data. So really, the core of the Project Open Data executive order is to delineate a specific metadata schema, which consists of both a vocabulary and a data format for describing the data sets that an agency releases. The format used in the policy is JSON, which we're probably all familiar with.
And the vocabulary is drawn from terms that were previously common in geospatial and other descriptive metadata. I should also mention that the schema itself is released on GitHub in the spirit of open
source. So the creators of the policy really wanted to embrace the spirit of open source and to get input both from users of the actual data and the schema, and from implementers such as federal workers like myself.
So a little bit more detail on the actual files themselves. The executive order essentially mandated that each federal department list its open data at a particular prescribed location on the web, so that public users could count on accessing these data.json files. These are sometimes very massive files, just a word of warning; don't try to parse them at home on your 486 or something. So basically, the policy dictated that these be published to a particular URL, which gave some consistency there.
I also wanted to show, though I don't know if it's really visible, a small example screen capture of part of one data set record that NOAA produced to comply with the policy. And I also listed a few of the schema elements.
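To make that concrete, here is a minimal sketch of what one record in an agency data.json file looks like, written in Python for illustration. The field names follow the Project Open Data schema as published on GitHub, as I recall it; the values, identifiers, and URLs below are invented for illustration, so check the schema repository for the authoritative field list.

```python
import json

# A minimal sketch of one dataset entry in an agency data.json file.
# Field names follow the Project Open Data schema; all values here are
# illustrative, not an actual NOAA record.
dataset = {
    "title": "Example Bathymetric Survey",
    "description": "Illustrative gridded bathymetry dataset.",
    "keyword": ["bathymetry", "oceans"],
    "modified": "2014-01-15",
    "publisher": "National Oceanic and Atmospheric Administration",
    "contactPoint": "Jane Datamanager",          # hypothetical contact
    "mbox": "jane.datamanager@example.gov",      # hypothetical email
    "identifier": "gov.noaa.example:12345",      # hypothetical identifier
    "accessLevel": "public",
    "distribution": [
        {"accessURL": "https://example.gov/data/bathy.tif",
         "format": "GeoTIFF"}
    ],
}

# An agency data.json is essentially a JSON array of such records.
print(json.dumps([dataset], indent=2))
```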
So if you're familiar with geospatial metadata, you can get the idea that there's some carryover with some of the common language used there. Now, in order to meet this mandate: NOAA, as I mentioned before, has traditionally been an open data agency. We comprise several data centers that have been releasing data for free online for a number of years and, as a result, have developed their own catalog systems and inventories to facilitate that data access. However, we needed a way to essentially merge that existing information into a single output file, this data.json file, which would then be fed up the chain to the Department of Commerce (NOAA is actually an agency under DOC). To do that, the decision was made to deploy a centralized data catalog
that would be able to harvest from these existing remote catalogs. That catalog is based on CKAN software, which is open source, and it was actually a collaboration between NOAA and the Department of the Interior, through an existing interagency working group called the Federal GeoCloud, to co-develop these systems: one deployed for Department of the Interior use, and our own deployed by NOAA. The way this system works is first by harvesting the remote data inventories, making use of a plug-in developed for CKAN, related to Project Open Data, that can handle the translation from the native metadata format to data.json. So here's a little workflow diagram, I guess, of what the catalog does. It takes in the existing data and does that translation.
It also adds the benefit of a CSW endpoint for query and data access, as well as a native web API that CKAN provides. So that's some context, I guess, for the rest of my talk.
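As an aside, that web API is easy to exercise directly. This is a short sketch querying the package_search action, a standard CKAN v3 API call, against data.noaa.gov; the search term is arbitrary, and the exact results depend on the live catalog.

```python
import json
import urllib.request

# Query the CKAN action API on NOAA's public catalog. package_search is
# a standard CKAN v3 action; "bathymetry" is just an example search term.
url = "https://data.noaa.gov/api/3/action/package_search?q=bathymetry&rows=5"
with urllib.request.urlopen(url) as resp:
    body = json.load(resp)

result = body["result"]
print("matching datasets:", result["count"])
for pkg in result["results"]:
    print("-", pkg["title"])
```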
What I want to focus on is a particular full open source stack that we're experimenting with deploying at NOAA. It's not necessarily operationally used at the moment, but I wanted to take some time to illustrate how a few well-known open source projects that we're all familiar with here can work together in compliance with Project Open Data. The first of those is GeoServer; as we all know, that's a spatial data hosting platform for OGC services. The second is GeoNode, which is essentially a web-based geospatial content management system that's built to sit on top of GeoServer and provide a dynamic, modern user interface that allows users to discover and access the underlying GeoServer services. And the last, of course, is CKAN, which I've spoken about before.
So, NOAA's background with GeoServer: historically, over the years, GeoServer has certainly been used piecemeal in different offices in the agency, along with other open source spatial data hosting systems. However, it hadn't really been used as an enterprise-wide solution until 2011-2012, when the NOAA High Performance Computing and Communications Project chose to fund a project to set up a prototype GeoServer that could be deployed agency-wide and used by individual office data providers who maybe didn't have the resources to run GeoServer themselves and could instead rely on a shared solution to publish their data. Funding through that project was provided to OpenGeo for a few enhancements to GeoServer, the first of which was to finalize some work that had been done on the security subsystem
to enable some enterprise integration capabilities like LDAP authentication. The second was first-class support for isolation: essentially, an improved user management and permission system so that you could restrict users to have access only to their own information and not across the board, which is obviously essential for an enterprise deployment. As a result, the NOAA GeoServer hosting environment has been online for about two years for testing and evaluation at the URL listed here.
This has been a prototype and wasn't really planned for operational transition. However, I just wanted to highlight that this past year, the Weather Service, as part of their Integrated Dissemination Program, chose GeoServer alongside Esri ArcGIS Server as a production geospatial hosting service. So there will be some production web services running on GeoServer at NOAA in the near future, which I think is pretty cool. Now I'm going to step through this stack, this open source, open data stack that we've been testing out.
So the first layer, I guess, obviously is GeoServer. This is a bit of a simplification. GeoServer provides many additional service types. I just wanted to highlight WMS and WFS, which is what we've primarily used in our incubator prototype system.
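For a sense of what sits at this layer, here's a short sketch using OWSLib, a common Python client for OGC services, to list the layers a GeoServer WMS advertises. The endpoint URL is a placeholder, not an actual NOAA address.

```python
from owslib.wms import WebMapService  # pip install OWSLib

# Connect to a GeoServer WMS endpoint and list the layers it advertises.
# The URL below is a placeholder; substitute your own GeoServer instance.
wms = WebMapService("https://geoserver.example.gov/geoserver/ows",
                    version="1.1.1")

for name, layer in wms.contents.items():
    print(name, "-", layer.title, layer.boundingBoxWGS84)
```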
And of course PostGIS should also be mentioned, because PostGIS and PostgreSQL are the underlying data storage backbone for our GeoServer instance, and they're also used in each of the other components of the stack. So, the second tier in the system is GeoNode.
And for those who aren't familiar, GeoNode is a web-based geospatial content management system. It's pretty tightly coupled with GeoServer: you essentially pair a GeoNode instance with a GeoServer instance, and it gives you that kind of modern web user interface.
It's really good for data discovery, and it has fine-grained permission controls and other things. NOAA's history with GeoNode goes back a few years as well. It was actually included as part of the Federal GeoCloud interagency working group in 2012.
A NOAA group had a proposal accepted to participate in that, which basically set up a shared infrastructure for transitioning agency-hosted geospatial services to the cloud, to Amazon Web Services.
So we collaborated with them, tuned the system a little bit so GeoNode would run on it, and have kind of been tinkering with it ever since. However, even though our NOAA Node system isn't publicly deployed yet, partway through the project the Department of Energy came along and decided that they were actually interested in using GeoNode.
So they were able to essentially use our infrastructure as a starting point and deploy their own GeoNode-based system, called NEPAnode, at the URL below; that's related to the National Environmental Policy Act. Now I'm going to step through some quick highlights of GeoNode features, for those who don't know them.
So this is a screen capture. GeoNode kind of brings to life individual data layers within a GeoServer service. A user can go to the GeoNode site and search by common fields such as title and abstract, can filter by ISO topic category and keywords, and of course, if there's temporal information,
they can filter by that as well. GeoNode also includes an integrated CSW service. This is critical for this overall stack design, as you'll see later. By default that's currently based on pycsw, but it can also be kind of plug and play with other systems.
So if you want to use something like GeoNetwork, that's available as well. That provides a good connection point with desktop GIS: for a QGIS user who is using the spatial search extension, or any other extension that can talk to a CSW service, and for Esri ArcMap as well, it's a great data discovery tool. As for data access: GeoNode, as I mentioned, is pretty tightly coupled with GeoServer, so it understands the output formats that GeoServer provides. Once a user has logged in and found the data they're looking for, it provides a convenient endpoint list, and it's very easy to download information directly. Additionally, there are two different ways you can upload data to GeoNode. It can be configured so that a user can log in through the web interface; if they have a spatial data set they want to share, they can interactively push it to GeoNode, fill out some relevant metadata, and GeoNode will push it back to the GeoServer level automatically.
There's also the opposite approach, which is taking data from an existing GeoServer and pulling it into GeoNode. Either way, once your GeoNode instance is populated with data layers, you get some neat capabilities. There's an integrated metadata editor, which I have here, so if there's some information lacking from the native metadata, you have the option to fill it out via the user interface. There's also pretty fine-grained access control, so you can share data with particular users if you want, or groups of users, or you can just publish publicly as well.
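To show what that integrated CSW endpoint enables programmatically, here's a hedged sketch using OWSLib to run a free-text search against a GeoNode-style pycsw endpoint. The URL is a placeholder; GeoNode deployments commonly expose CSW under a path like /catalogue/csw, but check your own instance.

```python
from owslib.csw import CatalogueServiceWeb  # pip install OWSLib
from owslib.fes import PropertyIsLike

# Search a GeoNode-style CSW endpoint (pycsw under the hood) for records
# whose free-text index matches a term. The URL is a placeholder.
csw = CatalogueServiceWeb("https://geonode.example.gov/catalogue/csw")

query = PropertyIsLike("csw:AnyText", "%bathymetry%")
csw.getrecords2(constraints=[query], maxrecords=10)

for rec in csw.records.values():
    print(rec.title, "->", rec.identifier)
```

This is the same protocol the QGIS and ArcMap CSW clients mentioned above speak, which is what makes GeoNode a good desktop GIS discovery point.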
Very recently within GeoNode, there's been some work done on some pretty cool new features, the first of which is remote services. GeoNode is really meant to run off of GeoServer; however, it does have a sort of fledgling capability to connect to a remote ArcGIS Server endpoint and parse layers from the ArcGIS Server REST API, as well as remote WMS and WFS servers and some others. The second among those features is GeoGit. For those who don't know, GeoGit is very similar to Git: it's basically versioned editing for geospatial data. And with some recent work done through some GeoNode partners, GeoNode provides read access. So if you configure GeoServer with a GeoGit repository,
that edit history for your geospatial data can be read by GeoNode, displayed within the user interface. And there's also an external mapping client called MapLoom that will actually handle the editing side of it as well. So if you have a spatial data set, you can configure a GeoNode instance
to work with MapLoom and provide disconnected editing and also two-way sync with a remote GeoGit repository. So it's a pretty powerful data editing workflow. What did you call the editor? That's MapLoom. MapLoom? Yeah, there's actually a presentation tomorrow or Friday. I forget which, so check it out.
So the last feature that I wanted to mention in GeoNode is Maps. So of course, once you populate it with this variety of layers, you can create this integrated map mashup, have the same permissions to share it with users who you choose.
So getting back to the architecture diagram: NOAA Node is our NOAA-themed GeoNode instance. It sits mostly on top of GeoServer; it talks to GeoServer via the REST API and adds that CSW endpoint for data discovery
as well as the interactive catalog. So, moving along: CKAN. That CSW endpoint in GeoNode actually allows CKAN to use it as a remote harvesting point.
As I mentioned before, in our NOAA data catalog we are currently harvesting several remote catalogs, and via the GeoNode CSW, that integration can happen there as well. So any layers that you include in your GeoNode instance can be automatically harvested by CKAN.
And there are maybe some similarities between the two products. CKAN takes a little bit more of a data catalog approach to presentation. It does a good job of parsing out fields from spatial metadata and presenting them in an approachable, user-friendly way.
It's good at parsing out online resource linkages, so users can have direct access there to the endpoints you want them to use to access your data. It's also pretty efficient in terms of search: it has a back-end Apache Solr instance that can be configured to handle spatial search as well.
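To illustrate that spatial search: with the CKAN spatial extension (ckanext-spatial) enabled, package_search accepts an ext_bbox parameter for bounding-box filtering. A sketch, assuming that extension is enabled on the target catalog; the bounding box and search term are purely illustrative.

```python
import json
import urllib.request

# With ckanext-spatial configured, CKAN's package_search accepts an
# ext_bbox parameter (minx,miny,maxx,maxy) for bounding-box filtering.
# The bbox below roughly covers the Gulf of Mexico; purely illustrative.
url = ("https://data.noaa.gov/api/3/action/package_search"
       "?q=temperature&ext_bbox=-98,18,-80,31&rows=5")
with urllib.request.urlopen(url) as resp:
    body = json.load(resp)

for pkg in body["result"]["results"]:
    print(pkg["title"])
```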
So it's pretty powerful. And it kind of really sits alongside GeoNode in this system. And of course, it can handle the data.json translation. So if that's of interest to anyone out there, especially federal users.
The other thing that I wanted to mention is that CKAN has some interactive mapping capabilities as well. So if your GeoNode instance, or really any spatial metadata, provides a WMS GetCapabilities endpoint, CKAN has a native map preview tool.
So good interactive capability there. GeoNode actually has to be a little bit modified. I had to tweak it myself to provide this capability. But that's something that hopefully will be merged back in with the core at some point.
So in our diagram here, data.noaa.gov is our CKAN instance. You can see I put them side by side. Really, it just complements your existing GeoNode site and provides that remote harvest capability,
as well as integration with any external catalogs that you may want to use. So lastly, data.gov, for those who aren't familiar, this is sort of the behemoth of US federal open data. This is the federal government's open data catalog.
It's also CKAN-based, and it works very similarly to the NOAA data catalog. It does remote harvests of a whole variety of existing federal geospatial metadata sources. I think the plan is to have it, at some point,
exclusively harvest the data.json files; I don't think that's quite implemented yet. But nonetheless, by one means or another, it's the merged collection of all federal open data, according to the open data policy. So again, it sits a bit alongside the core of the stack
that I wanted to highlight. But nonetheless, in the federal space, data.gov is certainly important and based off of some of the same software. So just a few take-home points that I wanted to make. Hopefully, I've kind of shown how these open source technologies
can be used together to create a full open data stack for geospatial data that complies with the federal open data policy, if that's of interest to you. NOAA, as an agency, is trying to continue its role and leadership in the open data world.
We're keeping up, of course, with the latest policies as much as possible. Lastly, I think really, getting back to the original slide, which I mentioned the digital government strategy, one of the main goals of that was to develop a shared platform
for federal IT infrastructure. And I think the work that's been done on CKAN related to the open data policy really illustrates a good example of leveraging open source software. So I think if you read the digital government strategy that way,
it's really kind of encouraging not only the use of open source software, but also contributions. So as a community of IT users in the federal government, why shouldn't we work together to kind of develop a common product on our own and collaborate as opposed to sit around and wait for someone else to do it
or to go out and buy the same thing many, many times? It doesn't really make sense. So lastly, I just wanted to mention that a lot of this work I've been involved with over the last few years wouldn't have been possible without the support of a guy by the name of Doug Nebert.
He passed away this year, tragically. Really, a lot of this, and a lot of other advancements in the federal geospatial space, wouldn't have been possible without Doug's leadership. So I just wanted to give credit where credit is due; I owe a great debt of gratitude to him.
If anyone has any questions, I'd be happy to try to answer them, or you can reach out to either myself or Jeff via email or Twitter. First, thank you. No questions at all?
Apparently, you've seen... oh, yeah, there's one. Has anybody created AMIs for GeoNode and CKAN that are publicly available? There must be.
I don't know for sure. The work done with the GeoCloud, I think there's no reason that it couldn't be just baked into an AMI. I don't know for sure, but I'm guessing they have to be out there.
This is probably going to expose my ignorance, but how does the GeoJSON deal with raster datasets? Oh, the data.json? Yeah. Yeah, so it's really leveraging JSON as kind of a metadata format.
In terms of actual encoding of spatial data, it doesn't really do that. I probably should have made that graphic a little bigger, but it basically provides the associated metadata to the dataset,
along with, say, an access URL. So whether it's just an open dataset published on the web, or if it's a web API, it'll contain that link. But in terms of actual data encoding, it doesn't really do that. Anybody else? I'm not familiar with GeoNode and CKAN as much as I'd like,
but when would users use one versus the other, since you're presenting both? Yeah, that's a good question. I think CKAN is really kind of a good entry point to the actual data,
or to a dataset that exists in GeoNode. So if you make that connection, the CSW connection between the two, one thing that CKAN does well is that it's indexed well by Google.
So, for instance, if someone does a Google search, they can find the page on a CKAN site and then be directed to GeoNode for the more interactive mapping capabilities. So I think it would flow that way, most likely. Maybe I could add a note to this.
There are two different worlds, the so-called open data world and the so-called open geo data world. They each have somewhat different standards, and they don't communicate with each other that well yet. We, as geo folks, should talk more to the open data folks
to connect those worlds in a more standard way. Okay. I assume everybody is looking forward to the next session, which is called Drinks in the Hall, I think. So thank you for coming.