Order from Chaos: Potential and Limits of CRF-based Reference Extraction from Footnotes

Zitieren

Boulanger, Christian

Wagner, Andreas

Max-Planck-Institut für Rechtsgeschichte und Rechtstheorie

Boulanger, Christian

Formale Metadaten

Titel

Order from Chaos: Potential and Limits of CRF-based Reference Extraction from Footnotes

Serientitel

New Approaches For Extracting Heterogenous Reference Data

Anzahl der Teile

Autor

Boulanger, Christian

Mitwirkende

Boulanger, Christian

Lizenz

CC-Namensnennung 3.0 Deutschland:
Sie dürfen das Werk bzw. den Inhalt zu jedem legalen Zweck nutzen, verändern und in unveränderter oder veränderter Form vervielfältigen, verbreiten und öffentlich zugänglich machen, sofern Sie den Namen des Autors/Rechteinhabers in der von ihm festgelegten Weise nennen.

Identifikatoren

10.5446/68327 (DOI)

Herausgeber

Boulanger, Christian

Wagner, Andreas

Max-Planck-Institut für Rechtsgeschichte und Rechtstheorie

Erscheinungsjahr

2023

Sprache

Englisch

Produzent

Reinold, Fabian

Produktionsjahr

2023

Produktionsort

Frankfurt am Main

Inhaltliche Metadaten

Fachgebiet

Information und Dokumentation

Genre

Konferenz/Talk

Abstract

Citations are a central component of a knowledge graph that maps the history of knowledge production in a scientific field. However, for many academic domains, particularly those where the language is not English, citation data either does not exist or is fragmentary. In these domains, citations have to be extracted from the source material. For literature that provides the cited references in well-organized, consistently styled bibliographies, technologies exist that have sufficient performance to produce the needed data. However, in the Humanities and parts of the Social Sciences, this is not the case: whereas humans have no problem understanding what is being referenced, from a computational perspective, the reference data is inconsistent, fragmentary, and buried in a lot of noise. This area of reference extraction is understudied. I present an end-to-end framework for citation extraction from PDF documents that addresses this problem. The framework includes a web application for annotating documents and training a model to semi-automate annotations, and a CLI which uses the trained model to for the extraction of references from PDFs and validates and enriches the extracted data from various sources such as the Web of Science or OpenAlex. The extraction workflow relies the AnyStyle extraction engine, which is based on a conventional CRF-Model. GROBID is additionally used to extract affiliation data from the documents. The use case is the extraction of references from socio-legal literature, which contains many scholarly works which contain references solely in the footnotes, in heterogeneous formats and often heavily mixed with non-reference text such as commentary. The default model of AnyStyle exhibits poor performance in handling citations based on footnotes. However, the performance can be substantially enhanced by using a moderately-sized training corpus of annotations for two distinct models: the “finder” model, which predicts whether a line comprises a reference or not, and the “parser” model, which predicts the constituent parts of the extracted lines that belong to a particular element of a reference. Trained with one dataset of annotated documents from the Journal of Law and Society (25 “finder” sequences and 1500 “parser” sequences), the recognized references can be validated by entries in citation indexes such as the Web of Science in roughly 90% of cases. The limits of “old-style” CRF models are obvious: No amount of training data will be able to cover all edge cases when applying simple statistical prediction algorithms to messy data. In contrast, commercial large language models such as GPT-3 show spectacular performance without any prior training, as they can “understand” the semantics of the tokens in the text and are therefore able to distinguish information from noise much better. Yet, as long as these new technologies are, at least at scale, slow and expensive, and given their proprietary nature, further work is warranted to improve the performance of the existing fast and open source reference extraction technologies. In this context, deep learning techniques will be needed to overcome the limitations of the existing solutions.