The DNB makes its data accessible as linked open data. Since 2012, the DNB has operated a web archive, which is currently being redeveloped in-house as open-source software to increase capacity. The web archive’s holdings are not yet adequately described in the Linked Data Service, and we also aim to provide information about the holdings in a form suited to the web as a medium. So far, the web archive’s collection is represented in the catalog, but the metadata is limited to the title, URL, and date of the snapshot. It has become apparent that current bibliographic standards cannot reflect the complexity and characteristics of the web as a resource. We therefore aim to develop a vocabulary that can describe the holdings of a web archive and that encompasses types such as web pages, websites, collections, time slices, records, and web resources. In web archiving, the central data format is WARC, which logs the communication between web browser and web server for archiving and later provision. This log includes identifiers (mainly URLs), communication metadata, and the web resources themselves, including HTML, stylesheets, JavaScript, and images. These resources regularly carry their own metadata, such as Microformats, OpenGraph, Schema.org, and general JSON-LD. The DOWARC vocabulary, an initial draft by The National Archives (UK), makes it possible to describe the contents of a WARC file. Describing the holdings of the web archive as linked data is therefore an obvious next step. By reusing existing data and vocabularies and adding the terms that are still missing, we can create a data model that ensures interoperability at the national and international levels. This model will also facilitate access to the DNB web archive data for researchers and other users.
The project is in an early phase: in addition to analyzing our own requirements, we have asked other web archives how they handle bibliographic metadata.
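To make the idea of describing a snapshot as linked data more concrete, the following minimal sketch reads the header of a single WARC record and maps the three fields currently held in the catalog (title, URL, snapshot date) to a JSON-LD description. It is an illustration under stated assumptions, not the DNB data model: the sample record, the function names, and the choice of the Schema.org terms `WebPage`, `name`, `url`, and `dateCreated` are all hypothetical, and the vocabulary under development may use different classes and properties.

```python
import json

# Hypothetical sample: the header block of one WARC "response" record.
# The field names (WARC-Type, WARC-Target-URI, WARC-Date, WARC-Record-ID)
# are standard WARC header fields; the values are invented for illustration.
SAMPLE_WARC_HEADER = (
    "WARC/1.1\r\n"
    "WARC-Type: response\r\n"
    "WARC-Target-URI: https://example.org/\r\n"
    "WARC-Date: 2024-01-15T10:30:00Z\r\n"
    "WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000001>\r\n"
    "Content-Type: application/http; msgtype=response\r\n"
    "Content-Length: 0\r\n"
    "\r\n"
)


def parse_warc_header(block: str) -> dict:
    """Split a WARC header block into a dict of field name -> value."""
    lines = block.strip().splitlines()
    fields = {"version": lines[0]}
    for line in lines[1:]:
        # Split on the first colon only, since values such as URIs contain colons.
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return fields


def snapshot_as_jsonld(fields: dict, title: str) -> dict:
    """Map WARC header fields to a JSON-LD node (assumed Schema.org mapping)."""
    return {
        "@context": "https://schema.org/",
        "@type": "WebPage",
        # The record ID is wrapped in angle brackets in the WARC header.
        "@id": fields["WARC-Record-ID"].strip("<>"),
        "name": title,
        "url": fields["WARC-Target-URI"],
        "dateCreated": fields["WARC-Date"],
    }


if __name__ == "__main__":
    header = parse_warc_header(SAMPLE_WARC_HEADER)
    # The title is not part of the WARC header; here it is supplied by hand.
    print(json.dumps(snapshot_as_jsonld(header, "Example Domain"), indent=2))
```

In practice the title would come from the archived HTML or from embedded metadata such as OpenGraph or JSON-LD rather than being passed in manually, and a real WARC file would be read with a dedicated parsing library instead of this hand-written header parser.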