We're sorry but this page doesn't work properly without JavaScript enabled. Please enable it to continue.
Feedback

Order from Chaos: Potential and Limits of CRF-based Reference Extraction from Footnotes

Formale Metadaten

Titel
Order from Chaos: Potential and Limits of CRF-based Reference Extraction from Footnotes
Serientitel
Anzahl der Teile
7
Autor
Mitwirkende
Lizenz
CC-Namensnennung 3.0 Deutschland:
Sie dürfen das Werk bzw. den Inhalt zu jedem legalen Zweck nutzen, verändern und in unveränderter oder veränderter Form vervielfältigen, verbreiten und öffentlich zugänglich machen, sofern Sie den Namen des Autors/Rechteinhabers in der von ihm festgelegten Weise nennen.
Identifikatoren
Herausgeber
Erscheinungsjahr
Sprache
Produzent
Produktionsjahr2023
ProduktionsortFrankfurt am Main

Inhaltliche Metadaten

Fachgebiet
Genre
Abstract
Citations are a central component of a knowledge graph that maps the history of knowledge production in a scientific field. However, for many academic domains, particularly those where the language is not English, citation data either does not exist or is fragmentary. In these domains, citations have to be extracted from the source material. For literature that provides the cited references in well-organized, consistently styled bibliographies, technologies exist that have sufficient performance to produce the needed data. However, in the Humanities and parts of the Social Sciences, this is not the case: whereas humans have no problem understanding what is being referenced, from a computational perspective, the reference data is inconsistent, fragmentary, and buried in a lot of noise. This area of reference extraction is understudied. I present an end-to-end framework for citation extraction from PDF documents that addresses this problem. The framework includes a web application for annotating documents and training a model to semi-automate annotations, and a CLI which uses the trained model to for the extraction of references from PDFs and validates and enriches the extracted data from various sources such as the Web of Science or OpenAlex. The extraction workflow relies the AnyStyle extraction engine, which is based on a conventional CRF-Model. GROBID is additionally used to extract affiliation data from the documents. The use case is the extraction of references from socio-legal literature, which contains many scholarly works which contain references solely in the footnotes, in heterogeneous formats and often heavily mixed with non-reference text such as commentary. The default model of AnyStyle exhibits poor performance in handling citations based on footnotes. However, the performance can be substantially enhanced by using a moderately-sized training corpus of annotations for two distinct models: the “finder” model, which predicts whether a line comprises a reference or not, and the “parser” model, which predicts the constituent parts of the extracted lines that belong to a particular element of a reference. Trained with one dataset of annotated documents from the Journal of Law and Society (25 “finder” sequences and 1500 “parser” sequences), the recognized references can be validated by entries in citation indexes such as the Web of Science in roughly 90% of cases. The limits of “old-style” CRF models are obvious: No amount of training data will be able to cover all edge cases when applying simple statistical prediction algorithms to messy data. In contrast, commercial large language models such as GPT-3 show spectacular performance without any prior training, as they can “understand” the semantics of the tokens in the text and are therefore able to distinguish information from noise much better. Yet, as long as these new technologies are, at least at scale, slow and expensive, and given their proprietary nature, further work is warranted to improve the performance of the existing fast and open source reference extraction technologies. In this context, deep learning techniques will be needed to overcome the limitations of the existing solutions.