How to describe the past Web? A data model for web archiving

ZBW - Leibniz-Informationszentrum Wirtschaft

Hochschulbibliothekszentrum des Landes Nordrhein-Westfalen (hbz)

Arndt, Tracy; Arndt, Natanael

Formal Metadata

Title

How to describe the past Web? A data model for web archiving

Title of Series

SWIB25 - Semantic Web in Libraries

Number of Parts

Author

Arndt, Tracy

Arndt, Natanael

Contributors

Kasprzik, Argie (Moderation)

License

CC Attribution - ShareAlike 4.0 International:
You are free to use, adapt, copy, distribute, and transmit the work or content, in adapted or unchanged form, for any legal purpose, as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content, if adapted, is shared only under the conditions of this license.

Identifiers

10.5446/72405 (DOI)

Publisher

ZBW - Leibniz-Informationszentrum Wirtschaft

Hochschulbibliothekszentrum des Landes Nordrhein-Westfalen (hbz)

Release Date

2025

Language

English

Content Metadata

Subject Area

Computer Science, Information Science

Genre

Conference/Talk

Abstract

The German National Library (DNB) makes its data accessible as linked open data. Since 2012, the DNB has operated a web archive, which is being redeveloped in-house as open source software to increase its capacity. We found that the web archive’s holdings are not adequately described in the Linked Data Service. In addition, we want to provide information about the holdings in a form suited to the web as a medium. So far, the web archive’s collection is represented in the catalog, but the metadata is limited to the title, URL, and date of the snapshot. Current bibliographic standards cannot reflect the complexity and characteristics of the web as a resource. We therefore aim to develop a vocabulary that can describe the holdings of a web archive and covers types such as web pages, websites, collections, time slices, records, and web resources. The central data format in web archiving is WARC, which logs the communication between web browser and web server for archiving and later provision. This log includes identifiers (mainly URLs), communication metadata, and the web resources themselves, including HTML, stylesheets, JavaScript, and images. These resources regularly carry their own metadata, such as Microformats, OpenGraph, Schema.org, and general JSON-LD. The DOWARC vocabulary, an initial draft by The National Archives (UK), makes it possible to describe the contents of a WARC file. Describing the holdings of the web archive as linked data is an obvious choice. By reusing existing data and vocabularies and adding the necessary terms, we can create a data model that ensures interoperability at the national and international levels. This model will also make the DNB web archive data easier to access for researchers and users. The project is in an early phase: in addition to considering our own requirements, we have asked other web archives how they handle bibliographic metadata.
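
To illustrate how WARC record metadata could be expressed as linked data, the following is a minimal sketch in Python, not the data model presented in the talk. It parses the header block of a single WARC record and emits N-Triples. The header field names (WARC-Type, WARC-Target-URI, WARC-Date, WARC-Record-ID) come from the WARC format; the namespace https://example.org/webarchive# and the property names recordType, targetUri, and capturedAt are hypothetical placeholders and are not taken from DOWARC or any DNB vocabulary. The sample record is invented for illustration.

```python
# Hypothetical namespace: illustrative only, not DOWARC or a DNB vocabulary.
EX = "https://example.org/webarchive#"
XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"

# Invented sample header block of one WARC "response" record.
SAMPLE_HEADER = (
    "WARC/1.1\r\n"
    "WARC-Type: response\r\n"
    "WARC-Target-URI: https://example.org/\r\n"
    "WARC-Date: 2025-01-15T12:00:00Z\r\n"
    "WARC-Record-ID: <urn:uuid:6d0a3f1e-0000-4000-8000-000000000001>\r\n"
    "Content-Type: application/http; msgtype=response\r\n"
    "Content-Length: 0\r\n"
    "\r\n"
)


def parse_warc_header(text):
    """Split one WARC header block into its version line and named fields.

    Simplification: folded (multi-line) header values are not handled.
    """
    lines = text.split("\r\n")
    version = lines[0]
    fields = {}
    for line in lines[1:]:
        if not line:
            break  # a blank line ends the header block
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return version, fields


def record_to_ntriples(fields):
    """Describe a WARC record as N-Triples using the placeholder terms."""
    subject = fields["WARC-Record-ID"]  # already written as <urn:uuid:...>
    record_type = fields["WARC-Type"]
    target = fields["WARC-Target-URI"]
    date = fields["WARC-Date"]
    triples = [
        (subject, "<" + EX + "recordType>", '"' + record_type + '"'),
        (subject, "<" + EX + "targetUri>", "<" + target + ">"),
        (subject, "<" + EX + "capturedAt>",
         '"' + date + '"^^<' + XSD_DATETIME + ">"),
    ]
    return [s + " " + p + " " + o + " ." for s, p, o in triples]


if __name__ == "__main__":
    _, header_fields = parse_warc_header(SAMPLE_HEADER)
    for triple in record_to_ntriples(header_fields):
        print(triple)
```

In this sketch the WARC record ID serves as the subject of each triple, so that statements about the same record from different sources could be merged. Whether records, web pages, or time slices should act as the primary described entity is one of the modeling questions the abstract leaves open.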