AUTOMATING LOD - Annif: leveraging bibliogr. metadata for automated subject indexing & classification

Zitieren

Zugehöriges Material

ZBW - Leibniz-Informationszentrum Wirtschaft

Hochschulbibliothekszentrum des Landes Nordrhein-Westfalen (hbz)

Suominen, Osma

Formale Metadaten

Titel

AUTOMATING LOD - Annif: leveraging bibliogr. metadata for automated subject indexing & classification

Serientitel

SWIB18 - Semantic Web in Libraries

Anzahl der Teile

Autor

Suominen, Osma

Mitwirkende

N. N. (Moderation)

Lizenz

CC-Namensnennung - Weitergabe unter gleichen Bedingungen 4.0 International:
Sie dürfen das Werk bzw. den Inhalt zu jedem legalen Zweck nutzen, verändern und in unveränderter oder veränderter Form vervielfältigen, verbreiten und öffentlich zugänglich machen, sofern Sie den Namen des Autors/Rechteinhabers in der von ihm festgelegten Weise nennen und das Werk bzw. diesen Inhalt auch in veränderter Form nur unter den Bedingungen dieser Lizenz weitergeben.

Identifikatoren

10.5446/60332 (DOI)

Herausgeber

ZBW - Leibniz-Informationszentrum Wirtschaft

Hochschulbibliothekszentrum des Landes Nordrhein-Westfalen (hbz)

Erscheinungsjahr

2018

Sprache

Englisch

Produktionsort

Bonn, Germany

Inhaltliche Metadaten

Fachgebiet

Sonstige

Genre

Konferenz/Talk

Abstract

Manually indexing documents for subject-based access is a very labour-intensive intellectual process. A machine could perform similar subject indexing much faster. However, an algorithm needs to be trained and tested with examples of indexed documents. Libraries have a lot of training data in the form of bibliographic databases, but often only a title is available, not the full text. We propose to leverage both title-only metadata and, when available, already indexed full text documents to help indexing new documents. To do so, we are developing Annif, an open source tool for automated indexing and classification. After feeding it a SKOS vocabulary and existing metadata, Annif knows how to assign subject headings for new documents. It has a microservice-style REST API and a mobile web app that can analyse physical documents such as printed books. We have tested Annif with different document collections including scientific papers, old scanned books and current e-books, Q&A pairs from an “ask a librarian” service, Finnish Wikipedia, and the archives of a local newspaper. The results of analysing scientific papers and current books have been reassuring, while other types of documents have proved more challenging. The new version currently being developed is based on a combination of existing NLP and machine learning tools including Maui, fastText and Gensim. By combining multiple approaches, Annif can be adapted to different settings. The tool can be used with any vocabulary and with suitable training data, documents in many different languages may be analysed. With Annif, we expect to improve subject indexing and classification processes especially for electronic documents as well as collections that otherwise would not be indexed at all.