In a prior study (Cioffi & Peroni, 2022), we analysed the available reference extraction tools to assess their off-the-shelf performance, i.e. using them as distributed, without prior training. We evaluated them against a corpus of 56 PDF articles (our gold standard) published in 27 subject areas (Computer Science, Arts and Humanities, Mathematics, etc.). That analysis identified the two most promising tools for bibliographic reference extraction and parsing, AnyStyle and GROBID, both based on Conditional Random Fields (CRF).

We want to extend that study by training the two tools on the same gold standard used in the previous analysis, to measure how much their performance improves. To this end, we will also revise the code used for testing and comparing the reference extraction software, making it available for others to reuse in similar analyses. The final aim of this work is to develop a reference extraction service that takes the PDF of a scholarly article as input and returns citation data and bibliographic metadata for all the references cited in that article, in a format that enables their ingestion into OpenCitations (Peroni & Shotton, 2020).

As a first step, we performed a series of tests to check whether the two tools had been updated recently. During this testing phase, some differences emerged between the results obtained in the current study and those reported in (Cioffi & Peroni, 2022). With those differences in mind, we are going to modify the evaluation code (Cioffi, 2022) used in the prior study to adapt it to the current versions of the tools.

Afterwards, we will proceed with the training phase. For AnyStyle, the documentation provides instructions for training the tool on a custom gold standard (ours: Cioffi, 2022). For GROBID, instructions for training the tool on specific data are likewise available.

The work is currently ongoing; in the presentation, we will therefore show the results obtained by mid-May. The next steps will consist of evaluating the trained versions of AnyStyle and GROBID and creating a new gold standard, so as to compare the results of the evaluation and training phases against the previous version of the standard. The results will then be processed so that the data can be integrated into the OpenCitations infrastructure.
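As an illustration of the envisioned pipeline, the following minimal sketch uses the GROBID Python client cited below to extract the parsed references of a batch of PDFs as TEI XML. The configuration file and directory paths are placeholders, and this is a sketch of the idea rather than the final service code.

```python
from grobid_client.grobid_client import GrobidClient

# Connect to a running GROBID server; host and port are read from the
# configuration file (path is a placeholder)
client = GrobidClient(config_path="./config.json")

# Extract and parse only the bibliographic references of each PDF,
# producing one TEI XML file per article (directories are placeholders)
client.process("processReferences", "./input_pdfs", output="./tei_out", n=10)
```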
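For the training phase, both tools document command-line entry points. The sketch below shows how these could be driven from a Python script; the file names, model name, and GROBID installation path are hypothetical, and the exact commands should be checked against each tool's current documentation.

```python
import subprocess

# Train a custom AnyStyle parser model from XML-annotated references
# (command per the anystyle-cli documentation; file names are placeholders)
subprocess.run(["anystyle", "train", "gold_standard.xml", "custom.mod"], check=True)

# Retrain GROBID's citation model after placing the annotated TEI files
# under grobid-home's training directories (Gradle task per the GROBID
# training documentation; the installation path is a placeholder)
subprocess.run(["./gradlew", "train_citation"], cwd="/path/to/grobid", check=True)
```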
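For the evaluation of the trained models, the comparison against the gold standard works at the level of individual reference fields (author, title, year, etc.). As a rough illustration of such a metric (this is not the actual code of Cioffi, 2022), one could compute micro-averaged precision and recall over (field, value) pairs of aligned gold and predicted references:

```python
from typing import Dict, List, Tuple

def field_prf(gold: List[Dict[str, str]],
              predicted: List[Dict[str, str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over (field, value) pairs
    of aligned gold/predicted references."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        g_pairs, p_pairs = set(g.items()), set(p.items())
        tp += len(g_pairs & p_pairs)   # fields parsed with the exact gold value
        fp += len(p_pairs - g_pairs)   # spurious or mis-parsed fields
        fn += len(g_pairs - p_pairs)   # gold fields the tool missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: the year is parsed correctly, but the author string differs
gold = [{"author": "Peroni, S.", "year": "2020"}]
pred = [{"author": "Peroni S", "year": "2020"}]
print(field_prf(gold, pred))  # (0.5, 0.5, 0.5)
```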
References

Anystyle. https://archive.softwareheritage.org/swh:1:snp:92bc79eb31b8e7fd760c985aa62f313ced3976bf;origin=https://github.com/inukshuk/anystyle-cli

Cioffi, A. (2022). Data for Testing and Evaluating References Extraction and Parsing Tools (1.0). Zenodo. DOI:10.5281/zenodo.6182066

Cioffi, A. (2022). Code for converting different formats to TEI XML and evaluating results (1.0). Zenodo. DOI:10.5281/zenodo.6182128

Cioffi, A., & Peroni, S. (2022). Structured References from PDF Articles: Assessing the Tools for Bibliographic Reference Extraction and Parsing. In G. Silvello, O. Corcho, P. Manghi, G. M. Di Nunzio, K. Golub, N. Ferro, & A. Poggi (Eds.), Linking Theory and Practice of Digital Libraries – 26th International Conference on Theory and Practice of Digital Libraries, TPDL 2022, Padua, Italy, September 20–23, 2022, Proceedings (Vol. 13541, pp. 425–432). Springer International Publishing. DOI:10.1007/978-3-031-16802-4_42

Grobid. https://archive.softwareheritage.org/swh:1:snp:4f734b61d425809bfd1f1d8d7b8b160edf81ef2b;origin=https://github.com/kermitt2/grobid_client_python

Peroni, S., & Shotton, D. (2020). OpenCitations, an infrastructure organization for open scholarship. Quantitative Science Studies, 1(1), 428–444. DOI:10.1162/qss_a_00023