Citations are a central component of a knowledge graph that maps the history of knowledge production in a scientific field. However, for many academic domains, particularly those where the language of publication is not English, citation data either does not exist or is fragmentary. In these domains, citations have to be extracted from the source material. For literature that provides the cited references in well-organized, consistently styled bibliographies, existing technologies perform well enough to produce the needed data. In the Humanities and parts of the Social Sciences, however, this is not the case: whereas humans have no problem understanding what is being referenced, from a computational perspective the reference data is inconsistent, fragmentary, and buried in noise. This area of reference extraction is understudied.

I present an end-to-end framework for citation extraction from PDF documents that addresses this problem. The framework includes a web application for annotating documents and training a model to semi-automate annotation, and a CLI that uses the trained model to extract references from PDFs and validates and enriches the extracted data against sources such as the Web of Science or OpenAlex (see the sketch below). The extraction workflow relies on the AnyStyle extraction engine, which is based on a conventional CRF model; GROBID is additionally used to extract affiliation data from the documents. The use case is the extraction of references from socio-legal literature, which includes many scholarly works whose references appear solely in the footnotes, in heterogeneous formats, and often heavily mixed with non-reference text such as commentary. The default AnyStyle model performs poorly on footnote-based citations. However, its performance can be substantially improved with a moderately sized training corpus of annotations for two distinct models: the “finder” model, which predicts whether a line contains a reference, and the “parser” model, which predicts which parts of an extracted line belong to which element of a reference. Trained on one dataset of annotated documents from the Journal of Law and Society (25 “finder” sequences and 1,500 “parser” sequences), the recognized references can be validated against entries in citation indexes such as the Web of Science in roughly 90% of cases.

The limits of “old-style” CRF models are obvious: no amount of training data will cover all edge cases when simple statistical prediction algorithms are applied to messy data. In contrast, commercial large language models such as GPT-3 show spectacular performance without any prior training, as they can “understand” the semantics of the tokens in the text and are therefore much better able to distinguish information from noise. Yet, as long as these new technologies remain slow and expensive at scale, and given their proprietary nature, further work is warranted to improve the performance of existing fast and open-source reference extraction technologies. In this context, deep learning techniques will be needed to overcome the limitations of the existing solutions.
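As a rough illustration of the validation step described above, the following Python sketch matches an extracted and parsed reference against the public OpenAlex works endpoint by title and publication year. It is not the framework's actual CLI code: the function name `validate_reference` and the simple year-based matching heuristic are assumptions made for illustration; a production pipeline would also compare authors and container titles, and could query the Web of Science instead.

```python
# Minimal sketch (assumed helper, not the framework's CLI): look up a parsed
# reference in OpenAlex and return the best-matching work, if any.
import requests

OPENALEX_WORKS = "https://api.openalex.org/works"

def validate_reference(title: str, year: int | None = None) -> dict | None:
    """Return the best-matching OpenAlex work for a parsed reference, or None."""
    params = {"filter": f"title.search:{title}", "per-page": 5}
    response = requests.get(OPENALEX_WORKS, params=params, timeout=30)
    response.raise_for_status()
    for work in response.json().get("results", []):
        # Accept a candidate if the publication year agrees (when known);
        # a real pipeline would also check authors and the container title.
        if year is None or work.get("publication_year") == year:
            return work
    return None

if __name__ == "__main__":
    match = validate_reference("The Concept of Law", 2012)
    if match:
        print(match["id"], "-", match["display_name"])
    else:
        print("No plausible match found in OpenAlex.")
```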