Metadata Management for Spatial Data Infrastructures

Video in TIB AV-Portal: Metadata Management for Spatial Data Infrastructures

Formal Metadata

Title: Metadata Management for Spatial Data Infrastructures
Title of Series:
License: CC Attribution - NonCommercial - ShareAlike 3.0 Germany — You are free to use, adapt, copy, distribute, and transmit the work or content, in adapted or unchanged form, for any legal, non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor, and the work or content, also in adapted form, is shared only under the conditions of this license.
Release Date:
Production Year:
Production Place: Seoul, South Korea

Content Metadata

Subject Area
This presentation focuses on creating geospatial metadata for spatial data infrastructures. The growing emphasis on data management practices in recent years has underscored the need for well-structured metadata to support the preservation and reuse of digital geographic information. Despite its value, the creation of geospatial metadata is widely recognized as a complex and labor-intensive process, often creating a barrier to the effective identification and evaluation of digital datasets. We will discuss our set of best practices for describing a variety of spatial content types using the ISO series of geographic metadata standards. We will share a series of Python and XSLT routines that automate the creation of ISO-compliant metadata for geospatial datasets, web services, and feature catalogs. These auto-generation tools are designed to work directly with XML documents, making them suitable for use within any XML-aware cataloging platform. Our goals are to make metadata creation simpler for data providers and to increase standardization across organizations, in order to improve the potential for metadata sharing and data synchronization among the geospatial community.
My name is Kim Durante, and I work on metadata for geospatial and scientific data at Stanford University. Probably unlike a lot of you, I work in a library and university setting, and we're kind of new to GIS compared to what some of you are doing. Over the last decade there's been extreme growth in the number of people coming in asking for instruction in GIS, and people are now starting to use GIS in scholarly research, so we've had really good growth, and similarly with metadata. I'm going to talk about templates, scripting, and XSLT; I won't have time to demonstrate those live, so I'll point you to where those
documents are at the end of this presentation. A little bit about GIS at Stanford: I'm physically located in the Earth Sciences Library on campus, and this is where the majority of the cartographic research happens (one of my pictures got sliced off, sorry). The library I work at is primarily responsible for purchasing or downloading most of the GIS data that we then distribute to our users. We also have a physical space, the Stanford Geospatial Center, where a lot of people come in for geospatial instruction, and a teaching lab — I think we taught maybe a thousand classes last year — so this has really grown in terms of the kinds of people coming in and wanting to use GIS. We are also the maps library for campus, so we have everything from general topographic maps and road maps to rare and historic maps. I don't know if you're familiar with David Rumsey, the map collector; he has donated his map collection to Stanford, and with the David Rumsey Map Center opening in the middle of 2016, we have people coming in and starting to georeference maps and create geospatial data out of all of this — and now it's sort of ours to deal with. At the center of all of this — people coming in asking for data, us distributing data — it was clear after about ten years of doing this that we needed not only a better way to gather resources but a more efficient way to serve those resources out. Historically, people would physically come to the library to check out a DVD or something, or they would email or contact one of us and we would essentially find the data for them. I don't know how much you know about libraries, but people don't really come to the desk and ask for help anymore; they Google things. So we
had to kind of move into that space. Outside of GIS, libraries deal a lot with digital files — think about scanned books or digitized photographs — and there's a lot you can do with digital files, but otherwise they really don't have that much in common. That is why, for a number of years, Stanford has been involved in what's called the Hydra project. This is an open-source, community-based project with a number of institutions involved, as well as radio stations and news organizations — pretty much anybody that has digital assets they need to manage and, more specifically, preserve and uniquely identify within a repository environment. As I said, this is completely open source: it's built using the Ruby language, the Fedora repository, an Apache Solr index, and the Blacklight discovery platform. Blacklight, if you haven't heard of it, was developed as an alternative to a lot of the commercially purchased library catalog software out there. Our library catalog, for instance, is built using Blacklight, and we have several other interfaces also built off of Blacklight where we can sort of sculpt customizable interfaces for discovery of different types of content.
Given all of this, and the fact that this infrastructure already existed — and I should say one more thing: everything that comes in and out of the repository is assigned a persistent URL, a PURL, as well as a persistent URL page. If you go to that URL, we are guaranteeing that, at least at the identifier level, those data will always be associated with that universally unique alphanumeric string. Right now you might access it through HTTP, but within the repository every piece of data is identified by this unique identifier. And this is more than just a landing page for the data: there's some descriptive information here, but it also allows you to cite the data. If someone has written an article and used this data, they can say, "I referenced this"; if someone takes the data and uses it for another project, it is always going to be available at this particular URL. The metadata here is stored in a descriptive standard called MODS — has anyone heard of that? It's a digital library standard, and it's really quite common. Everything that flows in and out of our repository gets what's called a
MODS record. So, given the fact that we had this repository in place and we already had these PURL services for data, it made sense for us to adapt the existing Blacklight platform to accommodate geospatial resources, and this led to the development of what we call GeoBlacklight. We have branded that software on the front end locally at Stanford as a product called EarthWorks — you can go to earthworks.stanford.edu and see it in action. I wanted to give you a live demonstration, but I don't think that's a very good idea, so feel free to browse around on your own. One of the benefits of this catalog is that it aggregates metadata from several institutions — mostly universities right now. Anyone that has metadata, we can take it, distill it down, and make it searchable in GeoBlacklight. You can also do a map search, and we employ faceted searching in this interface as well. To
see an example of one of the resources somebody might find: this is the descriptive page for the metadata — really basic stuff, a simple view of the PURL page for this item. If you want, you can click on one of the data points and start to view the attribute values associated with it; you can download the shapefile, and do other sorts of library things like bookmark it or email it to yourself. Those features were all pulled from the existing Blacklight software, so it was really just a plugin we had to adapt to make this happen. To get to this point, from a metadata standpoint, we had to move from storing ISO metadata, crosswalk that to the MODS I just mentioned to support the PURL functionality, and then finally distill that down into what we call the GeoBlacklight schema. That is a set of JSON documents for our Solr index that makes a layer available for someone to search and find on the web. It's pretty much the Dublin Core standard with some geo fields tacked on — basically just for end-user discovery of geospatial data within a web interface.
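As a rough illustration of that last step, a GeoBlacklight-style discovery document is just a small JSON record loaded into Solr. The field names below follow published GeoBlacklight schema conventions, but the layer, druid, and bounds are invented for illustration:

```python
# A sketch of a GeoBlacklight-style Solr document: Dublin Core-ish fields
# plus geo fields, serialized as JSON for the Solr index. The values here
# (title, druid, bounds) are hypothetical examples.
import json

doc = {
    "dc_title_s": "Geothermal Power Plants: California, 2012",   # hypothetical layer
    "dc_identifier_s": "https://purl.stanford.edu/xx000xx0000",  # placeholder druid
    "dct_provenance_s": "Stanford",
    "dc_rights_s": "Public",
    "layer_geom_type_s": "Point",
    "solr_geom": "ENVELOPE(-124.4, -114.1, 42.0, 32.5)",  # W, E, N, S bounds
}

solr_json = json.dumps([doc])  # Solr's update handler accepts a list of docs
```

Posting `solr_json` to a Solr update endpoint is what makes the layer findable in the faceted catalog interface.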
Now, I don't want to make the metadata for these sound extremely hard — it kind of looks easy, and maybe the examples I'll put up make it look somewhat easy — but it's actually kind of involved. When we begin creating metadata, we usually start at the collection level, because when we buy things they used to come on a DVD, or layers are served out together thematically. So we register what's called a collection object into the repository, we create metadata for it in that MODS standard I mentioned earlier, and then we register it according to a rights policy. This could be restricted data — we carry a lot of that because we purchase it — or public domain data, or any of the variants of the Creative Commons licenses that we set up. Once we register that into the repository, it gets one of those unique identifiers. The layer-level workflow is a little more complex and can certainly vary depending on the type of metadata you're dealing with; often you are handling one layer at a time, and we're trying to speed this up as much as possible. The first thing you have to do, before you even think about putting a layer into a repository for preservation purposes, is to check the data properties and projections. I can't really say enough about how missing or incorrect projections cause problems downstream; it's probably the biggest issue we've had in trying to expose something on the web. Everyone's nodding, so that's great. If you take anything away from this: if you are creating data, please use a standard projection and say what it is in the data properties. Things like file names also matter: weird characters and spaces in file names, or data extents extending beyond the side of the bounding box, are things we really have to hunt down in the data. So once we've got the data properties in place, we can
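The pre-deposit checks described above can be sketched with the standard library alone. This is a minimal illustration, not the actual Stanford workflow: it only verifies that a shapefile declares a projection (via its `.prj` sidecar) and has a clean file name; real validation would also open the data (for example with GDAL/OGR) and verify the extent.

```python
# A minimal sketch of pre-accessioning sanity checks: a declared
# projection and clean file names, the two problems called out in the talk.
import re
from pathlib import Path

def check_layer(shp_path):
    """Return a list of problems found before a layer is accessioned."""
    problems = []
    shp = Path(shp_path)
    # Shapefiles carry their CRS in a .prj sidecar; a missing one means
    # the projection is undeclared -- the biggest downstream problem.
    if not shp.with_suffix(".prj").exists():
        problems.append("missing .prj (projection undeclared)")
    # Spaces and odd characters in file names break downstream tooling.
    if not re.fullmatch(r"[A-Za-z0-9_\-]+", shp.stem):
        problems.append("file name contains spaces or special characters")
    return problems
```

A robotic workflow can run a check like this over a staging directory and reject layers before they ever reach the repository.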
then actually register that layer with the collection we've already created. We can then assign a rights policy to it, which is usually the same as the collection's, but sometimes layers do differ in terms of rights, so you may want to change that. Then there are sort of two scenarios every layer can fall into. If you're dealing with items that have no metadata, you usually start out with some kind of XML template with very controlled, descriptive fields; you can import that, and that generally works pretty well. But we also get a lot of metadata that already has all this stuff filled in, and that's not something we want to just erase and rewrite, so for that process we use an XSLT to normalize the metadata. From both paths we move to localized editing, and this is where you have to go in and actually edit fields that are specific to that layer: describing what it's about, defining attributes, maybe adding a human-readable title (preferably not just the file name). The next thing we do is auto-generate a bunch of metadata using Python — this is usually scripting in all of the identifiers, which is not information you want to be typing in by hand. Finally, once that's done, we deposit it back into a staging area, and from there it's picked up by a series of robotic workflows: one packages it up for accessioning into the repository, and the other packages it up for delivery to the user upon request.
A little bit more about identifiers, because I think the druid is really probably the single most important piece of metadata you can have in a record. We call it a digital repository unique identifier — a druid. At the layer level you've got this alphanumeric string and you've got its URL. We also create metadata file identifiers according to it: there's a sort of convention for assigning file identifiers where you use your string at the end, and everything is in the Stanford namespace, so that number is guaranteed to be unique within edu.stanford.purl. The URL actually appears several times within a record, so it's nice to have this scripted: we can go write those in, and we don't have to type a URL into several fields. Collection-level metadata is the same: you've got a druid, and that fills in two fields for the ISO 19139 — the collective title as well as the identifier for the parent metadata.
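Because the same druid recurs in several fields, it is a natural candidate for scripting. Here is a small sketch under stated assumptions: the PURL pattern follows purl.stanford.edu as described in the talk, but the exact `fileIdentifier` string format and field names are illustrative, not the actual Stanford convention.

```python
# A sketch of deriving the recurring identifier fields from one druid,
# so no URL is ever typed by hand. The fileIdentifier format and field
# names here are assumptions for illustration.
def identifier_fields(druid):
    purl = f"https://purl.stanford.edu/{druid}"
    return {
        "fileIdentifier": f"edu.stanford.purl:{druid}",  # unique in the namespace
        "dataSetURI": purl,
        "distribution_url": purl,  # the same PURL recurs in several fields
    }
```

A script can then write each value into the corresponding ISO 19139 element instead of a cataloger typing the URL three or four times.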
Now the fun stuff — this is probably pretty easy. For the layer I showed earlier about geothermal power: when we started, we had a whole collection of data dealing with renewable energy, so right away there's a bunch of metadata we can put into a template that we don't have to enter by hand. Between the identifiers and the template, we've already got a pretty good record shaping up — a nice ISO record. I would say creating XML templates, as opposed to normalizing something with XSLT, is really a lot easier, and you have much more control over what you're putting into your system. This is a pretty straightforward example of how you would make a template — I won't put the XML up here, but all of these fields go into every single record, so you make a template and either apply it in your editing tool or import it into the XML directly; either way it
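The template idea can be sketched in a few lines. This assumes a drastically simplified element structure — real ISO 19139 uses `gmd:`/`gco:` namespaces and much deeper nesting — but it shows the principle of stamping the same collection-level boilerplate into every record:

```python
# A sketch of filling a controlled XML template with collection-level
# values. The flat element names are a simplification of ISO 19139.
import xml.etree.ElementTree as ET

TEMPLATE = """<metadata>
  <useConstraints/>
  <pointOfContact/>
  <collectionTitle/>
</metadata>"""

def apply_template(values):
    root = ET.fromstring(TEMPLATE)
    for field, text in values.items():
        root.find(field).text = text  # same boilerplate in every record
    return ET.tostring(root, encoding="unicode")
```

Run once per layer, this fills in everything shared across the collection, leaving only the layer-specific fields for localized editing.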
works. Localized editing, again, is pretty straightforward, but when you're making titles for things there are patterns worth following. Specifically, I notice a lot of temporal extents are left out of data, and if you have something that says "average annual rainfall" but there are no dates, that doesn't mean anything. So: what is it, where is it, and when does the data cover — a very simple pattern for titles. Description is really straightforward here: what is the data, what does it represent? Keywords generally differ, and we're slightly constrained as a library because we have to follow hard-and-fast rules about controlled vocabularies: we pull keywords from the Library of Congress, and names for researchers from either VIAF or the Library of Congress. I think we rely a lot more on controlled names than maybe some other institutions do, so when we pull in data we have to make sure those correspond to an existing vocabulary. And again, temporal extents — I can't say it enough.
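The "what, where, when" pattern is simple enough to enforce in code — a tiny sketch (the comma-separated format is my own choice, not a standard):

```python
# A sketch of the title pattern from the talk: what, where, when.
# Requiring a year means a temporal extent can never be left out.
def layer_title(what, where, start_year, end_year=None):
    when = f"{start_year}-{end_year}" if end_year else str(start_year)
    return f"{what}, {where}, {when}"
```

So "average annual rainfall" with no dates becomes, for example, `layer_title("Average Annual Rainfall", "California", 1990, 2000)`.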
Updating existing metadata, like I said, can be kind of tough, and it varies significantly by collection. The reason we've chosen to do things this way is for when you get data with very complex lineage information attached, or where someone has filled out attribute definitions. We don't want to erase that — it's something I wouldn't know just by looking at the data — so in order to preserve it we run an XSLT over the metadata, which essentially works like a template in reverse: it applies all the normalized elements to the existing XML. Doing this takes a little more time; you really have to survey all the XML you're looking at before you can write an XSLT that works cleanly with your metadata, and then you have to go in, just like with the other process, and edit things locally — title, abstract, keywords, and so forth. As a final step, we've gotten a lot more used to using Python to script as many fields as possible. This is because more and more URIs are appearing in metadata, and no one wants to be copying and pasting or typing in that kind of information, so we have some Python scripts that insert all of these into the record without us having to touch them. After that, there are really only a few fields you have to go in and enter to have a really nice record; everything else hopefully will be automated. You can also use Python to add things like credit statements, if you're supplying those, by pulling variables from other records — authors, titles, dates. And if you want to add file names or a web services link, you can use something like that as well, as opposed to going in and typing the name of a file into the linkage sections.
Now, feature catalog metadata — this was probably one of the most difficult things to deal with. When you're looking at data, you can open up an attribute table and have no idea what it means. With labels like "city" and "country" we can all pretty much infer the meaning, or take a really good guess, but with things like demographic data or household information we get coded lists, and we have no idea what they mean; at that point the data is pretty much useless for people doing analysis. People have been giving us these giant codebooks for the attribute tables, so we are now requesting them as a CSV, and we've created Python routines that script the entire ISO 19110 record from the CSV file, because that is most often how we receive this metadata. I don't think anyone's ever gone into a form and actually filled out the definitions for the attribute labels; a CSV, I think, is an easier way for researchers and data providers to supply a codebook, and this has worked pretty well for us.
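A minimal sketch of that CSV-to-feature-catalog routine, under stated assumptions: the codebook CSV has `label` and `definition` columns, and the XML below is a simplified stand-in for the real ISO 19110 `gfc:` elements rather than the actual Stanford script.

```python
# A sketch of scripting a feature catalog from a codebook CSV.
# Element names are simplified stand-ins for ISO 19110 structures.
import csv, io
import xml.etree.ElementTree as ET

def feature_catalog(csv_text):
    root = ET.Element("FC_FeatureCatalogue")
    for row in csv.DictReader(io.StringIO(csv_text)):
        attr = ET.SubElement(root, "carrierOfCharacteristics")
        ET.SubElement(attr, "memberName").text = row["label"]
        ET.SubElement(attr, "definition").text = row["definition"]
    return ET.tostring(root, encoding="unicode")
```

Given a row like `POP2010,Total population in 2010`, the cryptic attribute label gets a human-readable definition in the catalog record.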
And so, after we've created our metadata, it kicks off into this robotic workflow that separates the data and the metadata. For the metadata piece, on the right-hand side, everything starts out, like I said, with ISO 19139; we crosswalk that into a MODS record to support the PURL functionality, and then we run a little routine that generates new URIs for the place names and the attributes. When that's finished, it kicks off the discovery workflow. There's some data wrangling that has to happen here, which relates to the data-properties issues I mentioned before: a robot checks to make sure that all the data properties are valid. After that,
they are joined together and kicked off into a discovery workflow where they can be made available to people on the front end. Once we have the MODS metadata, we generate metadata in the GeoBlacklight schema, in JSON; we load that JSON into our Solr index and point it at the EarthWorks catalog, which is where people can go in and search for the data. We then export the metadata to GitHub, and I'll show you that in just a second.
That's where the workflows end — this might actually be my last slide; I'm probably under time because I can talk about this forever. When I first started having to make metadata for a bunch of resources, I was just googling things all the time — googling filenames, or emailing a guy at Harvard and asking him to email me XML — and it just wasn't working. So we have started a repository for all of our metadata on GitHub called OpenGeoMetadata. I think there are maybe about ten institutions in it right now, and you can store metadata in any format you want — just put it up there. What we do is gather the metadata from all the other institutions and pull it into the EarthWorks catalog that way. What's nice about this is that I no longer have to write to someone to ask if there's a record: I can search for the title in Google and at least get near it, or I can search for it right in here. I think there are about 25 to 30 thousand records in there right now, and for somebody who has to make metadata all day it's been a huge help — I can't tell you how much time it has saved. We're starting to see more people sign on to this, and there are also some tools built around it — a tool called GeoCombine will actually help people apply stylesheets to transform metadata in
different formats. And I think that's actually about it. If you want more information about Hydra, GeoBlacklight, or OpenGeoMetadata — as I mentioned, the templates, the XSLT, and the Python routines I talked about are all up on GitHub, since I wasn't able to demonstrate them live. You can also email me, or find me on Twitter. I want to say thanks to all my contributors — this is a pretty large project. Any questions?
[Audience question: when one dataset comes in meters and another in feet, in different projections, how do you reconcile the different types you get each year?] What happens is, because it's a preservation environment, we take the data in whatever custom projection it comes in, and then everything is transformed into EPSG:4326 from there. If there's a standard projection, GDAL can read it; it's the custom stuff where we see problems downstream, and it varies case by case. [Audience question: what counts as a standard projection?] Right now, anything that's registered in the EPSG registry. [Audience question: I come from a GIS role in Europe, where the INSPIRE initiative drives much of our metadata and services work — why build your own stack rather than use those components?] I think it's a great question. Because we already had all this technology existing, it just made sense for us to adapt it. When there are operational changes, it's part of our technology stack; if we were running something like the INSPIRE tooling, those other tools would require almost completely separate personnel to manage and operate, and they update at different times. Now, if Hydra is updated, or Fedora, or Blacklight, EarthWorks gets it — it's part of the package — so we don't have these external applications running off to the side. That's just what made sense for us. As for the metadata itself, I don't think it matters, because if we follow the standard we can share it with whomever and load it into whatever system. [One last question, about georeferencing services that mint URIs — crowdsourced tools like Old Maps Online, which is kind
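The normalization step in that first answer — reproject whatever arrives into EPSG:4326 — is typically a one-line GDAL call. A sketch (this builds the standard `ogr2ogr` command line; actually running it requires GDAL to be installed, and the file names are hypothetical):

```python
# A sketch of the reprojection step: normalize any incoming vector layer
# to EPSG:4326 with ogr2ogr (destination comes before source).
def reproject_cmd(src, dst, target="EPSG:4326"):
    return ["ogr2ogr", "-t_srs", target, dst, src]
```

Usage, on a machine with GDAL available: `subprocess.run(reproject_cmd("parcels.shp", "parcels_4326.shp"), check=True)`.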
of crowdsourced — or things like the New York Public Library's tool.] There are some we've looked at, and others people have had questions about, and we've seen it happen. But the issue, I think, with some of those georeferencers — I don't know about Old Maps Online — is that there's actually no file created; it stays as tiles, so we're not getting a GeoTIFF back in receipt. I know — and I don't want to call anyone out, and I'm not sure exactly — but someone from the British Museum had georeferenced data and there were no files to be extracted; it was just tiled data at that point. So we need to have it in some file format.