Leveraging Large Language Models (LLMs) for automated soil health data extraction from ecological research papers

OpenGeoHub Foundation

Hackenberger, Domagoj K.; Hackenberger, Branimir K.; Derd, Tamara

Formal Metadata

Title

Leveraging Large Language Models (LLMs) for automated soil health data extraction from ecological research papers

Series Title

Artificial Intelligence for Soil Health

Number of Parts

Author

Hackenberger, Domagoj K.

Hackenberger, Branimir K.

Derd, Tamara

License

CC Attribution 3.0 Germany:
You may use and modify the work or its content for any legal purpose, and reproduce, distribute and make it publicly available in unchanged or modified form, provided that you credit the author/rights holder in the manner they have specified.

Identifiers

10.5446/69682 (DOI)

Publisher

OpenGeoHub Foundation

Year of Publication

2025

Language

English

Place of Production

Doorwerth

Content Metadata

Subject Area

Computer Science; Life Sciences / Biology

Genre

Conference/Talk

Abstract

The vast and growing ecological literature is an invaluable resource for improving our understanding of soil health and related environmental issues. However, accessing and extracting key information, such as the availability of datasets and associated code, remains a labour-intensive task that limits the scalability of research synthesis and meta-analysis. In response to this challenge, we are developing a novel AI-powered tool that uses Large Language Models (LLMs) to automate the extraction of key metadata from published ecological research papers. Our system includes several important features:

1. Literature search: The tool performs targeted literature searches based on keywords, author names, topics or predefined lists of Digital Object Identifiers (DOIs), simplifying the identification of relevant ecological studies.
2. Automatic retrieval of articles: Research papers are automatically downloaded for further processing.
3. Extraction of data and code availability: Using a fine-tuned LLM, the system "reads" the papers and determines whether the authors have provided the associated research data and/or code. This extraction process focuses on identifying explicit statements in the text of the paper, in supplementary materials or in associated repositories.
4. Tabular output: The extracted information is stored in a structured, tabular format that allows for easy analysis, comparison and integration into broader research workflows.
5. Automatic data download: When available, the system automatically retrieves the datasets and code and stores them for further analysis or replication studies.

We have developed and tested this package in the context of ecological and soil health research, focussing in particular on studies investigating the effects of environmental pollutants and land use on soil ecosystems.
Initial tests of the tool have shown promising results in identifying publicly available datasets and code, and have significantly improved the efficiency of data extraction and synthesis compared to manual approaches. The application of this system in ecology holds significant potential to advance soil health assessment by enabling researchers to more easily access and utilise existing data for meta-analyses, machine learning model development and evidence-based decision-making. Further development will focus on refining the LLM-based extraction methodology, improving the accuracy of data and code identification, and expanding the tool’s capabilities to process additional metadata types. This tool provides a scalable and efficient solution to overcome the bottlenecks associated with manual data extraction and represents a valuable contribution to the ecological research community.
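
The abstract does not describe the implementation, so the following is only a minimal sketch of what steps 3 and 4 (availability extraction and tabular output) might look like. It deliberately swaps the fine-tuned LLM for a simple keyword heuristic so that the example runs without any model or network access. All names here (`AvailabilityRecord`, `extract_availability`, `to_csv`, the cue patterns, and the placeholder DOIs and URLs) are assumptions made for illustration and are not taken from the authors' package.

```python
import csv
import io
import re
from dataclasses import asdict, dataclass

# ASSUMPTION: hypothetical cue phrases standing in for the LLM-based reader.
DATA_CUES = re.compile(
    r"\b(?:data availability|"
    r"data(?:sets?)? (?:are|is) (?:openly |publicly |freely )?available|"
    r"deposited (?:in|at|on))\b",
    re.IGNORECASE,
)
CODE_CUES = re.compile(
    r"\b(?:code availability|"
    r"code (?:is|are) (?:openly |publicly |freely )?available)\b",
    re.IGNORECASE,
)
# ASSUMPTION: a small, illustrative list of repository hosts.
REPO_URL = re.compile(
    r"https?://(?:doi\.org|zenodo\.org|github\.com|figshare\.com|datadryad\.org)"
    r"/[^\s)\]>,;]+",
    re.IGNORECASE,
)


@dataclass
class AvailabilityRecord:
    """One row of the tabular output (hypothetical schema)."""
    doi: str
    data_available: bool
    code_available: bool
    links: str  # semicolon-separated repository URLs found in the text


def extract_availability(doi: str, text: str) -> AvailabilityRecord:
    """Flag explicit data/code availability statements in a paper's text."""
    # Strip a trailing full stop that belongs to the sentence, not the URL.
    links = sorted({url.rstrip(".") for url in REPO_URL.findall(text)})
    return AvailabilityRecord(
        doi=doi,
        data_available=bool(DATA_CUES.search(text)),
        code_available=bool(CODE_CUES.search(text)),
        links=";".join(links),
    )


def to_csv(records: list[AvailabilityRecord]) -> str:
    """Serialise the records as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=["doi", "data_available", "code_available", "links"],
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buf.getvalue()


if __name__ == "__main__":
    # Placeholder DOI and text; not real publications.
    sample = (
        "Data availability: all datasets are publicly available at "
        "https://zenodo.org/records/0000000. Analysis code is available at "
        "https://github.com/example/soil-analysis."
    )
    print(to_csv([extract_availability("10.0000/example-a", sample)]))
```

A heuristic like this would, for instance, flag "data are available upon request" as available, which is the kind of case where a language model that reads the full statement could plausibly do better. In a fuller pipeline, the `links` column could feed the automatic download step.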