
How to describe the past Web? A data model for web archiving

Formal Metadata

Title
How to describe the past Web? A data model for web archiving
Title of Series
Number of Parts
17
Author
Contributors
License
CC Attribution - ShareAlike 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
The DNB makes its data accessible as linked open data. Since 2012, the DNB has operated a web archive, which is being redeveloped in-house as open source to increase capacity. It was determined that the web archive’s holdings are not adequately described in the Linked Data Service. Additionally, we aim to provide information about the holdings in a form adapted to the web medium. Thus far, the web archive’s collection is represented in the catalog, but the metadata is limited to the title, URL, and date of the snapshot. It has become apparent that current bibliographic standards cannot reflect the complexity and characteristics of the web as a resource. We aim to develop a vocabulary that can describe the holdings of a web archive and encompass types such as web pages, websites, collections, time slices, records, and web resources. In web archiving, the central data format is WARC, which logs the communication between web browser and web server for archiving and provision purposes. A WARC file includes identifiers (mainly URLs), communication metadata, and web resources, including HTML, stylesheets, JavaScript, and images. These resources regularly carry their own metadata, such as Microformats, OpenGraph, Schema.org, and general JSON-LD. The DOWARC vocabulary, an initial draft by The National Archives (UK), makes it possible to describe the contents of a WARC file. Describing the holdings of the web archive as linked data is therefore an obvious choice. By reusing existing data and vocabularies and adding the necessary terms, we can create a data model that ensures interoperability at the international and national levels. This model will also facilitate access to the DNB web archive data for researchers and users. The project is in an early phase in which, in addition to considering our own requirements, we have asked other web archives how they handle bibliographic metadata.
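To make the relationship between a WARC record and a linked-data description more concrete, the following Python sketch parses the header block of one record and maps a few of its fields to a JSON-LD-style dictionary. It is only an illustration under stated assumptions: the sample header is invented, and the `ex:` terms and the `https://example.org/vocab#` namespace are placeholders, not terms from DOWARC or from any DNB vocabulary. Only the WARC header field names (`WARC-Type`, `WARC-Record-ID`, `WARC-Target-URI`, `WARC-Date`) come from the WARC format itself.

```python
import json

# Invented sample: the header block of a single WARC "response" record.
SAMPLE_HEADER = (
    "WARC/1.0\r\n"
    "WARC-Type: response\r\n"
    "WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000001>\r\n"
    "WARC-Target-URI: https://example.org/\r\n"
    "WARC-Date: 2024-01-01T12:00:00Z\r\n"
    "Content-Type: application/http; msgtype=response\r\n"
    "Content-Length: 0\r\n"
    "\r\n"
)


def parse_warc_header(block: str) -> dict:
    """Split a WARC header block into the version line plus named fields."""
    lines = block.strip().split("\r\n")
    fields = {"version": lines[0]}
    for line in lines[1:]:
        # Split on the first colon only; values such as URIs contain colons.
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return fields


def to_linked_data(fields: dict) -> dict:
    """Map selected header fields to a JSON-LD-style description.

    The "ex:" terms are hypothetical placeholders for illustration only.
    """
    return {
        "@context": {"ex": "https://example.org/vocab#"},
        "@id": fields["WARC-Record-ID"].strip("<>"),
        "@type": "ex:Record",
        "ex:recordType": fields["WARC-Type"],
        "ex:targetUri": fields["WARC-Target-URI"],
        "ex:capturedAt": fields["WARC-Date"],
    }


if __name__ == "__main__":
    description = to_linked_data(parse_warc_header(SAMPLE_HEADER))
    print(json.dumps(description, indent=2))
```

A real data model would go further than this, for example by grouping records into web pages, websites, collections, and time slices, and by reusing terms from existing vocabularies instead of the placeholder namespace used here.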