Inconsistent XML as a barrier to reuse of Open Access Content

River Valley TV

Mietchen, Daniel; Maloney, Chris; Dagsson Moskopp, Nils

Formal Metadata

Title

Inconsistent XML as a barrier to reuse of Open Access Content

Title of Series

JATS-Con 2013

Author

Mietchen, Daniel

Maloney, Chris

Dagsson Moskopp, Nils

License

CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Identifiers

10.5446/21791 (DOI)

Publisher

River Valley TV

Release Date

2016

Language

English

Production Place

Washington, D.C.

Content Metadata

Subject Area

Computer Science

Genre

Conference/Talk

Abstract

In this paper, we describe the current state of article tagging within the PMC Open Access subset. As a case study, we use our experience developing the Open Access Media Importer, a tool that harvests content from the OA subset and automatically uploads it to Wikimedia Commons. Tagging inconsistencies span several aspects of the articles, ranging from licensing to keywords to the MIME types of supplementary materials. While all of these complicate large-scale reuse, the unclear licensing statements in particular required us to implement text mining-like algorithms to determine accurately whether specific content was compatible with reuse on Wikimedia Commons. Besides presenting examples of incorrectly tagged XML from a range of publishers, we also explore past and current efforts towards standardization of license tagging, and we offer a set of recommendations to content producers on how best to tag such data so that it is compatible with existing standards, consistent, and machine-readable.
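
To illustrate the kind of license detection the abstract refers to, here is a minimal Python sketch. It is not the Open Access Media Importer's actual code. It assumes the license appears in a JATS `<license>` element, either as an `xlink:href` URL (on `<license>` itself or on a nested `<ext-link>`) or only as free text. The function names, the regular expressions, and the simplified set of Commons-compatible licenses are all hypothetical choices made for this example.

```python
import re
import xml.etree.ElementTree as ET

# Clark-notation name of the xlink:href attribute used in JATS.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Assumed pattern for Creative Commons license URLs, for example
# http://creativecommons.org/licenses/by/3.0/
CC_URL = re.compile(r"creativecommons\.org/licenses/([a-z-]+)/(\d\.\d)", re.I)

# Assumed fallback phrases for statements that carry no URL at all.
# More specific phrases come first so they win over the generic one.
CC_PHRASES = [
    (re.compile(r"creative\s+commons\s+attribution\s+non-?commercial", re.I), "by-nc"),
    (re.compile(r"creative\s+commons\s+attribution\s+license", re.I), "by"),
]

# Simplified assumption: only these codes are treated as reusable on
# Wikimedia Commons (non-commercial and no-derivatives terms are not).
COMMONS_COMPATIBLE = {"by", "by-sa"}


def guess_license(article_xml):
    """Return a short license code such as 'by', or None if undetermined."""
    root = ET.fromstring(article_xml)
    for lic in root.iter("license"):
        text = " ".join(lic.itertext())
        # Look for a URL in the attribute, in nested links, then in the text.
        candidates = [lic.get(XLINK_HREF, "")]
        candidates += [el.get(XLINK_HREF, "") for el in lic.iter("ext-link")]
        candidates.append(text)
        for candidate in candidates:
            match = CC_URL.search(candidate)
            if match:
                return match.group(1).lower()
        # No URL anywhere: fall back to matching known phrases.
        for pattern, code in CC_PHRASES:
            if pattern.search(text):
                return code
    return None


def is_commons_compatible(article_xml):
    """True only when a license was found and is in the compatible set."""
    return guess_license(article_xml) in COMMONS_COMPATIBLE
```

A well-tagged article is resolved from the attribute alone. The text-matching fallback is only needed when the license is stated in prose, which is the inconsistency the abstract describes.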