Mainstreaming metadata into research workflows to advance reproducibility and open geographic information science
Formal Metadata

Title: Mainstreaming metadata into research workflows to advance reproducibility and open geographic information science
Number of Parts: 351
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifier: 10.5446/68901 (DOI)
Production Year: 2022
Transcript: English (auto-generated)
00:00
Thank you. So, this work was completed with the collaboration of Dr. Peter Kedron at Arizona State University and with the support of a National Science Foundation grant for improving the reproducibility and replicability in the geographic sciences through a project-based graduate and undergraduate methods curriculum.
00:22
These slides and the associated paper are available in the ISPRS Archives, on GitHub, and on OSF, the Open Science Framework. Essentially, this talk is the story of an academic researcher trying to improve the reproducibility of his own research projects by documenting metadata to the standards suggested in the literature, and getting really frustrated with
00:45
a lack of tools that I could discover with a reasonable amount of effort to do so. And so, I'm kind of formalizing that review of the standards expected for reproducible research and the options that I found available in the OSGeo software ecosystem.
01:02
So, I actually hope that someone proves me incorrect today and gives me the silver bullet to solving this problem. So, I'll look forward to that and the questions. So, first, we need some motivation for a 20-minute talk about metadata. I am motivated because I have reluctantly concluded that metadata is required to enhance the reproducibility of geographic research.
01:24
In turn, enhanced reproducibility is expected to increase the pace and credibility of knowledge production and knowledge quality in the geographic sciences. Furthermore, I am optimistic that integrating metadata into everyday research practices will facilitate more efficient and open research life cycles.
01:45
This research is based on experience applying the broad consensus reports by the National Academies on open science and on reproducibility and replicability to the specific challenges in the geographic sciences. So, what's happening in geography? For some context from our research project, we have surveyed geographers about their research practices,
02:05
finding that most folks would say, I'm familiar with reproducibility and my research is also reproducible. However, when asked about metadata, they would say, no, I've never used that before and I'm not sure quite what that is or how to do it.
02:20
So, unfortunately, providing data or even code alongside a research publication without the metadata is sort of like publishing a map with no title or legend, leaving serious questions about the data and its proper use unanswered. This may or may not be COVID rates as of August 2020 in the United States. We have reluctantly and painfully come to this conclusion and the need for this paper after
02:44
attempting seven reproduction studies and publishing each as a reproducible research compendium on our GitHub organization. So, the links are there. So, today I'm hoping to convince you of these three points. First, open science and reproducibility require standardized metadata.
03:01
Second, researchers use, create, and modify information about their research projects and their research data throughout their whole research life cycle. And third, we need better open source geospatial software to support a metadata-rich research life cycle. But first, what is this reproducibility that I'm talking about?
03:20
So, reproducibility is a core motivation of open science, but it's more than simply reproducing another researcher's computations, as illustrated by this table on the right. At the top left, if we use the same data and the same methods as a prior study, then that reproduction study can provide a check of the internal validity of the study.
03:45
However, in our seven works so far, we've kind of found that most studies have some problems in their procedures or in their data that we feel the need to correct as we are reproducing their work, moving us to the top right in a reanalysis, where we actually start altering the methods but use the same input data as the original researchers.
04:05
If we want to externally validate a study, we move to the bottom left, where we apply the same methods to new data in a replication study. And this is where you start trying to generalize geographic knowledge to new case studies and new examples. And finally, the bottom right is where science makes further progress by extending previous studies with varied methods and new data.
04:27
So what's the role of metadata in all of this? Geospatial metadata essentially is information about spatial data. Drawing on the iceberg cliche, consider that geospatial data is just the attractive data floating above the ocean surface.
04:42
Meanwhile, a mass of information, contextual information, lies beneath the surface of the waves, keeping that ice above the water and posing serious dangers to anyone using it without knowledge of what's beneath the surface. We can think of metadata as one of the irreplaceable cogs in the gears of an open science system.
05:02
So metadata provides essential social and ontological context for data's meaning, interoperability, and appropriate use. Good metadata is also an ethical issue, particularly with regard to problems of privacy and the quality of big data for humanitarian response. And in open science, metadata is key to FAIR open data.
05:23
So data becomes findable with project-level metadata, including a study's geographic and temporal extent and keywords from standard dictionaries. Data is accessible when metadata specifies open licenses or access protocols, or when metadata provides enough detail about the data used in a study that someone could recreate or simulate it, even if the original data is no longer accessible.
05:47
And data is interoperable when metadata adheres to machine-readable international standards, and data is reusable when metadata provides enough context and detail for the recreation and appropriate reuse of the original data. In a recent publication, Wilson et al. wrote about a five-star guide
06:04
to reproducible research, distinguishing three of the five stars by metadata. So just providing the data and code with a study gets you one star in their method. Two stars are achieved with a little bit of metadata documentation. Three stars come with complete metadata documentation.
06:24
And the fourth star comes from following international standards when you specify that metadata. So what are these standards? Metadata standards for geographic information may come from spatial data infrastructures, for example, the Federal Geographic Data Committee (FGDC) in the United States or INSPIRE in Europe.
06:44
These data infrastructures were developed to enhance data standards and interoperability between governments and government agencies. Metadata standards have also been developed by other organizations. So, for example, the ISO 19100 series is increasingly adopted and extended by both the American FGDC and the European INSPIRE.
07:10
And furthermore, the Dublin Core standard developed by librarians for managing digital archives is ideally suited for documenting metadata about your overall research project.
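To make that project-level idea concrete, here is a minimal sketch of a Dublin Core record built with Python's standard library. The element names come from the fifteen-element Dublin Core set; the values and output file name are placeholders, not metadata from any actual project in this talk.

    import xml.etree.ElementTree as ET

    DC = 'http://purl.org/dc/elements/1.1/'
    ET.register_namespace('dc', DC)

    record = ET.Element('metadata')
    for element, value in [
        ('title', 'Reproduction of an original geographic study'),
        ('creator', 'Example Researcher'),
        ('subject', 'reproducibility; geographic information science'),
        ('identifier', 'https://doi.org/10.xxxx/placeholder'),  # placeholder DOI
        ('rights', 'CC BY 4.0'),
        ('date', '2022'),
    ]:
        ET.SubElement(record, f'{{{DC}}}{element}').text = value

    # Serialize the project-level record alongside the compendium's readme
    ET.ElementTree(record).write('project_metadata.xml',
                                 xml_declaration=True, encoding='utf-8')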
07:21
Finally, the Open Geospatial Consortium offers plenty of guidance on open data storage formats, but generally leaves the metadata standards to these other organizations. So let's look more closely at the metadata types that we should be using in our reproducible research. Again, I'd suggest that the ISO 19115 standard and related standards on the left
07:41
-hand side is ideal for documenting individual databases or data layers used in the research, while the Dublin Core standard on the right-hand side is more ideally suited for documenting the overall research publication or the research project. So the first five rows in that table answer the basic what, why, and when questions of the data and make
08:06
your projects findable by using a digital object identifier for the unique identifier and controlled vocabularies for the topic and subject keywords. The next two rows answer the who question about the data, who's responsible for authoring, creating, publishing, and maintaining that data.
08:22
The next row covers the legal issues of constraints and rights. Ideally, all of your data in a reproducible research compendium will be published with open licenses, without which copyright protection is implied and the reuse of that data is forbidden. From this point on in the table, the ISO 19115 standard provides much more detail for geographic data layers than the Dublin Core,
08:44
including spatial data model and spatial and temporal extents and resolutions. Content information for vector data is essentially a data dictionary of all the attributes, data types, measurements, and even descriptive statistics, while content information for gridded imagery, like drone images, contains the details of the sensors and measurements taken by the sensors.
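As a rough illustration of gathering that content information, here is a minimal sketch assuming the vector layer can be read with geopandas, a tool assumed here rather than one named in the talk; the file path is hypothetical.

    import geopandas as gpd

    # Hypothetical vector layer in a compendium's data folder
    gdf = gpd.read_file('data/derived/public/covid_rates.gpkg')

    print(gdf.crs)                    # coordinate reference system
    print(gdf.total_bounds)           # spatial extent: minx, miny, maxx, maxy

    attrs = gdf.drop(columns='geometry')
    print(attrs.dtypes)               # field names and data types
    print(attrs.describe().round(2))  # descriptive statistics for numeric fields

The output would then be written, ideally automatically, into the content-information section of the layer's ISO 19115 record.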
09:08
Finally, the ISO standard also contains information on data quality and proper usage. Its lineage feature on the bottom of the table is robust enough to essentially
09:21
provide a toolchain of steps used to process the data into its final form, whereas the lineage information in the Dublin Core is more about a chain of custody for an artifact or an image or something like that. How does all this information fit within a research lifecycle? Let's take a look.
09:41
Hopefully by now that first point is clear: open science and reproducibility require standardized metadata. On to the research lifecycle. The National Academies' Open Science by Design report envisions a research lifecycle of six phases, with expanded research opportunities and improved research quality enabled by open science practices.
10:06
In the provocation phase, we search and review existing literature to generate new research ideas. In the ideation phase, we plan the research, including human subjects protocols for ethics review, research funding proposals with data management plans,
10:22
and pre-analysis registrations of our study protocols. In the knowledge generation phase, we actually create, collect, and analyze our data and start documenting metadata for it. In the validation phase, we analyze and share preliminary results, which may be aided with our OSF or Figshare registrations.
10:43
An example of that is today, sharing work at a conference and getting feedback before final publication. Then in the dissemination phase, we undergo our beloved peer review process, revision, and formal publication. And then finally, in the preservation phase, we archive our data and code in an open access repository,
11:02
and finalize project metadata for searching and data layers for reusability. What I'd like to do is reimagine that lifecycle using a research compendium, so that we actually work on the preservation phase, with metadata, throughout the entire research lifecycle from the beginning.
11:26
So Peter Kedron and I have developed a template research compendium for reproduction and replication studies, or for original research that you want to be reproducible. It provides structure for project-level metadata and for organizing all of the data, metadata, procedures and code, documents and manuscripts,
11:43
and resulting figures, tables, and model outputs related to the research project. This compendium is managed as a Git repository for version tracking, comparing differences or changes between the versions of your research, and branching or merging alternative research designs as the research progresses.
12:02
At the provocation phase, at the beginning of the research project, an open science literature review in this world would be enabled by project-level metadata from other published research, supporting a spatially explicit literature review, meta-analysis, or bibliometric analysis. At the ideation phase, then, we would create a new research compendium with the structure shown on the left and with project-level metadata.
12:28
We would research and imagine the data that we intend to use and create in our research, and we generate standardized metadata for this in our metadata folder. So we actually start creating our metadata before we create the data itself.
12:41
We then use our metadata to help us organize and write our research proposals, our data management plans, our ethical human subjects review protocols, and our pre-analysis plan documents. We would then register the plans on the Open Science Framework (OSF) or a similar service using project-level metadata,
13:05
and link our OSF project to the Git repository before moving on to the knowledge generation phase when we actually create and collect the data. At that phase of knowledge generation, we would create the data, update our metadata documents if we had to change any of our protocols,
13:21
and enable visualizations of those changes so that we can document them as unplanned deviations to our research plan. Ideally, metadata tools would support cataloging our data, updating our metadata, and building a directory for the whole compendium. Then at the validation phase, we would document any unplanned deviations, write and register our reports,
13:43
and develop open access preprints and conference presentations like this one associated with our compendium. Once we get to the dissemination phase then, our compendium would provide unprecedented access to the details of our research to the reviewers of the work,
14:04
and we would make sufficient metadata details of embargoed, restricted, or proprietary data available so that other people would be able to simulate, access, or recreate similar data. Then finally, at the preservation phase, we realize that we've been preserving our research work all along,
14:20
and all that's left to do is basically register a DOI, or digital object identifier, for our research compendium and link that to the publication. So let's look at an example of a real publication compendium here. So here you see one of our reproduction studies on GitHub with the data, docs, procedure, and results directory structure,
14:43
a citation file, an open access license, and a project-level readme document, shown on the right. The readme ideally contains project-level metadata and a directory of data files, and ideally we would be able to maintain this top-level readme automatically with some software support.
15:00
Each significant data layer should be documented with metadata in the metadata folder, and here is an example of the top of an XML file using the FGDC standard to describe American Community Survey data from the US Census. Currently we are maintaining a list of data paths, file names, formats, and metadata files in a CSV file,
15:25
or comma-separated text file, that renders as a table like this on GitHub, and ideally we could also do this with software. So hopefully my second point is now clear: researchers use, create, and modify information about their research projects and research data throughout the full research lifecycle.
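Picking up the "ideally we could also do this with software" point, here is a minimal sketch, using only the Python standard library, that walks the data folder and rebuilds that CSV index of data paths, file names, formats, and matching metadata files. The folder layout, file formats, and one-XML-record-per-layer naming convention are assumptions, not prescriptions from the talk.

    import csv
    from pathlib import Path

    DATA_DIR = Path('data')
    META_DIR = Path('data/metadata')
    INDEX = DATA_DIR / 'data_index.csv'
    FORMATS = {'.gpkg', '.shp', '.geojson', '.csv', '.tif'}

    rows = []
    for f in sorted(DATA_DIR.rglob('*')):
        if (f.is_file() and f.suffix.lower() in FORMATS
                and META_DIR not in f.parents and f != INDEX):
            meta = META_DIR / f'{f.stem}.xml'   # assumed naming convention
            rows.append({'path': str(f.parent), 'file': f.name,
                         'format': f.suffix.lstrip('.'),
                         'metadata': meta.name if meta.exists() else ''})

    with open(INDEX, 'w', newline='') as out:
        writer = csv.DictWriter(out, fieldnames=['path', 'file', 'format', 'metadata'])
        writer.writeheader()
        writer.writerows(rows)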
15:43
So where does FOSS4G software come into play in this? There are several types of software that we might be able to use to support us in this work, and I'm only considering software with metadata editing functionality, open licenses, and no associated fees.
16:01
So the first type of software we might use is desktop GIS, which could include QGIS, GRASS, or SAGA. From the spatial data science world, we reviewed the geometa package in R and the pygeometa package in Python. OSGeo projects also include a catalog server, GeoNetwork, and a content management server called GeoNode.
16:23
And in our software search we also discovered two specialized metadata editing software programs. Metadata Wizard is published by the USGS in the US, and mdEditor is a web-based system created by the Alaska Data Integration Working Group.
16:40
And finally, the o2r (Opening Reproducible Research) project is developing a containerized executable research compendium with a metadata tool called o2r-meta. And if you're interested in reproducibility, you should definitely spend some time on o2r.info and see the work that they're doing.
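For the pygeometa route mentioned above, here is a hedged sketch following the metadata control file (MCF) workflow described in pygeometa's documentation: a YAML control file is rendered into ISO 19139 XML. The class and module names reflect recent pygeometa releases and may differ in the version you install; the file paths are hypothetical.

    from pygeometa.core import read_mcf
    from pygeometa.schemas.iso19139 import ISO19139OutputSchema

    mcf = read_mcf('metadata/covid_rates.yml')    # YAML metadata control file
    iso_xml = ISO19139OutputSchema().write(mcf)   # render as ISO 19139 XML

    with open('metadata/covid_rates.xml', 'w') as f:
        f.write(iso_xml)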
17:00
So given this suite of possible software tools, what do we want the software to be able to do for us? Based on our practical experience and the literature review, we suggest the following software needs. First, metadata software must be easy to use for students, research assistants, and faculty with limited time. Installation, startup, and learning of the software should be easy, and a
17:21
graphical user interface should support editing and provide guidance throughout the editing process. The software must support open standards, including support for all of the fields and controlled vocabularies of the ISO and Dublin Core standards, as well as encoding in standardized formats, especially XML. Ideally, working with metadata should be facilitated by as much automation as
17:43
possible, including features to parse a directory for spatial data and catalog it, extract geographic metadata like the coordinate reference system and extent, extract attribute metadata like field names, types, and descriptive statistics, validate metadata documents for completeness and conformity to the standards,
18:01
and track provenance, or basically create a detailed history of data revisions. So how did the FOSS4G software stack up against those requirements? On the easy-start criteria on the left, the metadata editors received double checks for very fast installation and easy use.
18:24
The desktop GIS received a single check for straightforward installation and use. The servers and code packages have numerous barriers, unfortunately, for a novice to install and start using. GeoNetwork and GeoNode especially have a lot of functionality, but they're really designed for a large organization
18:40
and someone maintaining a server rather than a small research team working on a cluster of computers. In terms of a graphical user interface, the desktop GIS software, the servers, and the metadata editors all have really easy-to-use GUIs, but SAGA and GRASS could not easily use the GUI for editing metadata.
19:00
In terms of support for standards, only the GeoNetwork and GeoNode servers supported both the ISO standards and the Dublin Core standards. A single black check indicates support for ISO only, and Metadata Wizard supports only the FGDC standard, unfortunately, even though it's a really easy-to-use software program.
19:22
Most of the software could save metadata encoded in machine-readable XML or JSON formats. These are essential for extending metadata to other purposes like autofilling our research documents or facilitating our registrations online. Only o2r had a true cataloging feature meant to discover all of the spatial data layers in a research compendium directory.
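To illustrate extending metadata to other purposes, here is a small sketch that pulls the title and abstract out of an ISO 19139 record with the Python standard library, for example to autofill a report template. The namespace URIs are the standard ISO/TC 211 ones; the record path is hypothetical.

    import xml.etree.ElementTree as ET

    NS = {
        'gmd': 'http://www.isotc211.org/2005/gmd',
        'gco': 'http://www.isotc211.org/2005/gco',
    }

    tree = ET.parse('metadata/covid_rates.xml')   # hypothetical ISO 19139 record
    title = tree.find('.//gmd:title/gco:CharacterString', NS)
    abstract = tree.find('.//gmd:abstract/gco:CharacterString', NS)

    print('Title:', title.text if title is not None else '(missing)')
    print('Abstract:', abstract.text if abstract is not None else '(missing)')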
19:47
QGIS and SAGA are both capable of viewing at least some of the spatial data formats in a directory, for example through the QGIS browser, but not in a format usable for documenting our metadata. Six of the software options contained at least semi-automated features for extracting geographic metadata, and four
20:04
of those also contained features for automatically extracting, or at least viewing, some of the attribute metadata. Six of the software options also contained features to validate the metadata records, but thus far, only Metadata Wizard from the USGS has a full implementation of validation together with automated geographic and attribute metadata features.
20:26
Finally, only one software package, SAGA, tracks provenance and records it as metadata attached to a geographic layer. In SAGA, you can actually view that metadata, copy it as a toolchain, and then re-execute the whole process that you used to create the data layer in the first place.
20:43
It's a beautiful implementation of provenance. The key message from this table is that we need better open source geospatial software to support metadata-rich research. There's no easy one-stop shop where a small research team can install a program and use it
21:03
to maintain metadata in their research compendium and support their research, at least not that I know of. So this gives me a chance to just reiterate those three points that open science and reproducibility do require standardized metadata. As researchers, we are creating metadata all the time. We're using, creating, and modifying information about the
21:25
research projects and research data throughout the full research lifecycle, even if it isn't formalized as standardized metadata. And we need better open source geospatial software to support this kind of metadata-rich research vision. So questions, corrections, and comments are warmly welcome, especially offers of collaboration from the FOSS4G community.
21:46
If anyone's interested in co-authoring a grant to create a software tool like this, I would be more than willing to put the work into that and would be really excited about it. And here are some links to our overall research projects. And thank you.