AUTOMATING LOD - Annif: leveraging bibliographic metadata for automated subject indexing & classification
Formal Metadata
Number of Parts | 16
License | CC Attribution - ShareAlike 4.0 International: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers | 10.5446/60332 (DOI)
Production Place | Bonn, Germany
Transcript: English (auto-generated)
00:12
Explorer. I'll let you do it. So our next presenter, I know, but I will take my paper
00:20
anyway, is Osma Suominen from the National Library of Finland, who's going to talk about leveraging bibliographic metadata again for automation, but this time for automated indexing and classification.
00:46
OK. Yeah, hello everybody. So I'm going to talk about Annif, which is a way of using bibliographic metadata to improve subject indexing and classification. So first of all, you probably know, but just to recap,
01:04
what is subject indexing or classification? Libraries, archives, and museums have a lot of material, different documents, and to make it possible to find them, we usually attach subjects or tags or classes to them.
01:23
And this is called subject indexing or classification. Usually you use a thesaurus or a classification scheme. But this is a lot of manual work, so it would help if we had something like a program that at least helps us do this more efficiently.
01:41
And this is the idea of Annif. Because as libraries, we do have a lot of metadata. For example, in Finland, we have the Finna discovery interface, which aggregates a lot of metadata from libraries, archives, and museums. So currently, it holds about 15 million records, and many of these are tagged with subjects.
02:05
And the idea is to do some machine learning or just statistics or other algorithms using this existing metadata. So we have a lot of just metadata
02:21
without the actual documents, but then we also have a much smaller amount of full text documents. And the idea is to use both of these for training. And I started doing this last year. I did a quick prototype using metadata from Finna,
02:44
and it was really put together rather quickly, but it worked well enough to get people excited. So beginning of this year, I started work on a successor that was on a more solid basis in terms of code.
03:01
So the prototype was just a loose collection of scripts, but the new one is a Flask web application. And the other main difference is that the prototype was built around an Elasticsearch index used in a special way, but the new one currently has three different algorithms for automated subject indexing.
03:24
Other than that, they share the same ideas: multilingual, multi-vocabulary, a REST API and so on. About the algorithms: when you want to do automated subject indexing, there are two main approaches.
03:40
There are lexical and associative methods. The general idea of a lexical method is that you take in text and match terms within the text against terms in your vocabulary. So for example, if you have a sentence like this, "renewable resources are a part of Earth's natural environment" and so on,
04:01
this comes from Wikipedia. With a good algorithm, you can match it with the concept of renewable natural resources, because all of those words appear in the sentence, even though they are not in the same order. So this is a potential match, and in this case, it would probably be correct.
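As a rough illustration of the lexical idea, here is a minimal sketch in Python. It is a toy matcher, not Maui's actual algorithm: it flags a concept when all the words of its label occur somewhere in the text, regardless of word order. The vocabulary entry and URI are hypothetical.

```python
# Toy lexical matcher (illustrative only, not Maui's actual algorithm).
# A concept matches when every lowercased word of its label appears
# somewhere in the input text, regardless of word order.

def lexical_matches(text, vocabulary):
    """vocabulary: dict mapping concept URI -> preferred label."""
    text_tokens = set(text.lower().split())
    hits = []
    for uri, label in vocabulary.items():
        label_tokens = set(label.lower().split())
        if label_tokens <= text_tokens:  # every label word occurs in the text
            hits.append((uri, label))
    return hits

vocab = {"http://example.org/c1": "renewable natural resources"}  # hypothetical URI
sentence = "Renewable resources are a part of Earth's natural environment and so on"
print(lexical_matches(sentence, vocab))
```

A real lexical backend would of course also handle stemming, punctuation and ambiguous matches.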
04:21
So this is a relatively simple method. You only need the vocabulary and you need the text. But then there are associative approaches, for which you need a lot of training data. And you use that data to learn which concepts are correlated with certain terms or combinations of terms in documents.
04:42
And then, for each concept, you form a model that could be something like this tag cloud, so correlations with different words. So in Annif, I've used both kinds of algorithms. There are two associative algorithms.
05:01
One is a very simple TF-IDF similarity calculation, which is similar in spirit to what is done by text indexes like Lucene and Elasticsearch. And the other one is fastText, created by Facebook Research. So this is a machine learning algorithm for text classification.
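To make the TF-IDF idea concrete, here is a minimal sketch assuming scikit-learn; it is illustrative, not Annif's actual implementation. Each subject is represented by the concatenated text of training documents tagged with it, and a new document is ranked against all subjects by cosine similarity. The subject labels and texts are hypothetical.

```python
# Minimal sketch of TF-IDF-based associative matching (illustrative,
# not Annif's actual implementation). Each subject is represented by the
# concatenated text of training documents tagged with it; a new document
# is compared to every subject vector by cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical training data: subject label -> concatenated document text.
subject_texts = {
    "renewable energy": "wind power solar energy biofuel renewable resources",
    "ice hockey": "ice hockey league goal puck players season",
}

subjects = list(subject_texts)
vectorizer = TfidfVectorizer()
subject_matrix = vectorizer.fit_transform([subject_texts[s] for s in subjects])

def suggest(text, limit=5):
    """Return (subject, score) pairs sorted by descending similarity."""
    doc_vector = vectorizer.transform([text])
    scores = cosine_similarity(doc_vector, subject_matrix)[0]
    ranked = sorted(zip(subjects, scores), key=lambda p: p[1], reverse=True)
    return ranked[:limit]

print(suggest("Solar and wind power are renewable sources of energy."))
```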
05:22
With fastText, you give it a bunch of classified documents and it creates a model, and it does fancy stuff like word embeddings. But in the end, you can use it to classify new documents. Then, for the lexical approach, I used Maui, which is a tool created at the University of Waikato
05:42
in New Zealand originally. And to make it work within Annif, we created a microservice wrapper so that it exposes a REST API that can be integrated. And these algorithms can be used either alone
06:02
or in combinations, which are called ensembles. So just to make an analogy, I have some pictures of musicians here. But these algorithms also tend to make silly mistakes. And there are many good and bad reasons for this,
06:21
but some reasons are, for example, that your training data is usually never perfect. So there are problems with it, and it can be skewed. Then also, with associative methods, you can discover correlations in the data that are not causative. So some terms happen to co-occur with certain subjects, but there is actually no connection.
06:41
Then there are problems with homonyms, like rock can mean either stone or a kind of music, or you have names in text that can be interpreted as concepts or words. Or then, of course, there's just random noise, especially in machine learning. But when you have an ensemble of algorithms,
07:01
of different kinds of algorithms, what usually happens is that the algorithms will make different kinds of mistakes. So again, if you have a band of amateur musicians, then each of them has a certain handicap, but they are all different. And if you're leading the orchestra, you're trying to figure out, okay,
07:20
these guys all have their problems, how can I sort of still make them sound good? And to solve this, going back to algorithms, we can do another level of training, so like a second order learning, if we have some additional training documents. And a nice way of doing this is called isotonic regression.
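A minimal sketch of such score calibration with isotonic regression, assuming scikit-learn (whose IsotonicRegression uses the pool-adjacent-violators approach); this is illustrative, not Annif's actual fusion code, and the calibration data below is hypothetical.

```python
# Minimal sketch of isotonic-regression (PAV) score calibration with
# scikit-learn; illustrative, not Annif's actual fusion implementation.
# For one concept and one backend, we learn a monotone mapping from the
# backend's raw scores to calibrated values, based on whether the concept
# was actually correct in a set of calibration documents.
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Hypothetical calibration data: raw scores the backend gave for a concept,
# and whether a human indexer actually assigned that concept (1) or not (0).
raw_scores = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.75, 0.90])
was_correct = np.array([0, 0, 1, 0, 1, 1, 1, 1])

calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(raw_scores, was_correct)

# Calibrated score for a new raw score of 0.5 from this backend:
print(calibrator.predict([0.5]))
```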
07:42
Isotonic regression is a statistical method to estimate the trustworthiness of a particular algorithm in recognizing a particular concept, and it's usually implemented by the PAV algorithm. So I've used that in Annif. I did some evaluation of these algorithms,
08:03
and I used some test corpora, which are quite different in nature. So these are all full text documents, existing full text documents that have been indexed with either YSA or YSO as subject vocabulary. So the first one is ARTO.
08:21
Those are articles from the ARTO database at our institution. It's both academic papers, but also less formal publications. Then we have some theses, master's and doctoral theses from the University of Jyväskylä, which are pretty long,
08:42
and they are usually centered on a single topic or a small number of topics. Then we have questions from Ask a Librarian Service, which has been running for many years at public libraries. So they are questions that come in from people, and then librarians answer them.
09:01
And then they have also been indexed with subjects. And finally, we have the digital archives of a regional newspaper called Satakunnan Kansa. There are over 100,000 documents, which were not indexed, but we took a random sample of 50 documents and gave it to four librarians,
09:20
and each of them would index them separately. This way we could sort of compare also between two humans. Some of these are available on GitHub, not all of them, because we can't share all of them for copyright reasons. And so I tried each algorithm in separation,
09:40
and the numbers here are F1 scores, so those are a combination of precision and recall. And now these numbers may seem rather low, but the problem with subject indexing is that it's really hard to get even two people to agree on the subject of a single document. So I would say that it depends a lot on the documents
10:02
and many other things, but generally speaking, a human level is somewhere between 0.3 and 0.5, according to many studies that have been done. So these are actually pretty good levels. So the three first bars are each algorithm separately,
10:22
and we can see that Maui, the yellow one, is performing best. Then the gray one is a simple ensemble where we just take the assignments of each algorithm and take a simple average of the scores they give for each concept. And the ensemble is always better than each algorithm separately,
10:42
so it can sort of cover some of the mistakes the algorithms are making. But then the two last ones are two variations of the PAV approach of trying to do second-order learning, and we can see that it usually improves results over a plain ensemble.
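For reference, here is how a set-based F1 score for one document's subject assignments can be computed; the figures in the talk are averages over whole test corpora and further measures such as NDCG are also used. The function and example subjects below are hypothetical.

```python
# Small illustration of how an F1 score combines precision and recall for
# one document's subject assignments (set-based; real evaluations average
# over a whole corpus and also use measures such as NDCG).

def f1_score(suggested, gold):
    suggested, gold = set(suggested), set(gold)
    if not suggested or not gold:
        return 0.0
    true_positives = len(suggested & gold)
    precision = true_positives / len(suggested)
    recall = true_positives / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: two of three suggestions are correct, and two gold
# subjects were missed -> precision 2/3, recall 2/4, F1 ~ 0.57.
print(f1_score({"forests", "climate change", "rock music"},
               {"forests", "climate change", "forestry", "carbon sinks"}))
```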
11:01
So by building on combinations of algorithms, we can get better results. About the architecture of Annif: we start with a large amount of metadata from Finna, and also a smaller number of full-text documents.
11:21
And then we train the three different algorithms. The Maui one is in a separate box because it's a separate microservice. And we use both the metadata and some full-text documents for training. Then there is a separate fusion layer; this is the one that does the ensemble thing and the PAV. And for the PAV, we also need some more training data
11:43
in the form of full text documents. And then everything is packaged into a web app that provides a command line interface for administration and the REST API so that other systems can integrate with it. So in principle, any metadata or document management system
12:03
like an institutional repository, can hook into the REST API to get suggestions about subjects for documents. There are also some mobile apps, which I'll talk about later, that can use the REST API. There's a website at annif.org where you can go and test it.
12:22
There's a box where you can type or paste text and press the analyze button and get suggestions about topics. Here, you just have to choose a good model. So this one, I took the text from the SWIB website and used the YSO English ensemble,
12:42
and it predicted that this conference is about open data, semantic web, linked open data, and so on. Libraries are somewhere a bit lower down, but they are there. But then, just for fun, I tried to create another model using Wikidata and Wikipedia.
13:02
Oops, sorry. Here, yeah. So I took the top 50,000 entities from Wikidata ranked by the number of site links. So these are the ones that appear in many different language versions of Wikipedia.
13:20
And I did the same thing, training on Wikipedia, English Wikipedia documents. And here, we can see that with this model, we can get the most relevant Wikidata entities for SWIB. But now we can see a problem here.
13:41
The first one, Lod, is actually a city in Israel. So it's a mistake, but this thing happens. But the other ones, many of them are pretty good, so it's not bad. So if you have some text and would like to know corresponding Wikipedia, oh no, Wikidata entities, you can try this out.
14:02
Okay, here's the command line interface. So it's really quite simple to make a new model. You have to load a vocabulary, and then you train with existing documents. And the formats that are required are really simple TSV files, or also SKOS for the vocabulary.
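A hedged sketch of that workflow, driven from Python via subprocess: the subcommand names, project name and arguments below are assumptions based on the talk, not a definitive reference, so check the project wiki for the actual CLI syntax.

```python
# Hedged sketch of the command-line workflow described in the talk, driven
# from Python via subprocess. Subcommand names, project name and argument
# order are assumptions for illustration; consult the Annif documentation
# for the actual CLI syntax.
import subprocess

def annif(*args, **kwargs):
    return subprocess.run(["annif", *args], check=True, text=True, **kwargs)

project = "yso-en"                                # hypothetical project name
annif("loadvoc", project, "yso-skos.ttl")         # load a SKOS vocabulary (hypothetical file)
annif("train", project, "training-corpus.tsv")    # train on an existing TSV corpus
annif("eval", project, "test-corpus.tsv")         # compare against manually indexed documents

# Suggest subjects for a piece of text (assumed to be read from stdin):
annif("analyze", project,
      input="Renewable resources are part of the natural environment.")
```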
14:22
And then you can just test the model by analyzing documents from files. And then there's also an evaluation command. So if you have existing documents that are manually indexed, you can check how close the algorithm's own results get to them
14:43
using different measures. There's a REST API. So here's just a very simple example, but you have some text, and then you give it to the API and say, analyze this, and the API will respond with a very simple JSON structure
15:01
with URIs, labels, and scores. So it's really quite simple to hook into this. And there's also a Swagger specification, so you get nice interactive documentation. So what kind of things can you do with Annif?
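To make the integration concrete, here is a hedged sketch of calling such an API with the requests library. The base URL, endpoint path, parameter names and response keys are assumptions based on the talk's description; the Swagger specification describes the real interface.

```python
# Hedged sketch of calling the REST API with the requests library. The host,
# endpoint path, parameter names and response keys are assumptions based on
# the talk; the Swagger specification documents the real interface.
import requests

API = "https://api.annif.org/v1"          # hypothetical base URL
project = "yso-en"                        # hypothetical project identifier

response = requests.post(
    f"{API}/projects/{project}/suggest",  # assumed endpoint name
    data={"text": "Linked open data connects library metadata on the web.",
          "limit": 5},
)
response.raise_for_status()

for hit in response.json().get("results", []):  # assumed response envelope
    print(f'{hit["score"]:.3f}  {hit["uri"]}  {hit["label"]}')
```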
15:20
This is one early adopter: the University of Jyväskylä. They have an institutional repository where they store the master's and doctoral theses of their students. So they use the REST API so that when a student uploads their thesis there, they use Annif to scan it
15:40
and suggest subject keywords for the document, so that you don't have to know, for example, what is there in the vocabulary. You get a list where you can choose which ones are correct. Then I did an experiment, early on with the prototype, where I took the Finnish Wikipedia,
16:01
which at the time had about 400,000 articles, and just performed subject indexing on this, and so it took about seven hours on a laptop. I think it would take a bit longer with humans doing it manually. So I got topics, one to three topics per article.
16:21
And here are just some random examples of articles and what kind of topics. Most of them are pretty good, but the red ones were wrong. So there are mistakes there. But anyway, I could get an overview. So I could find the most common topics in Finnish Wikipedia, and it turns out that they are football, ice hockey, warships, and pop music.
16:45
Then there are also some mobile apps. So the first one was a prototype I did early on. It's a web app, so it runs in the browser of your phone. So you can use it to scan, to take a picture of a document,
17:02
and it will do OCR using a cloud service, and then give the text to Anif, and then show the subjects. And then another one was created more recently by my colleague, which is an Android native app, which does the OCR on the device itself.
17:21
So it's much faster than the cloud service. You can just scan a document, and within seconds you will have the suggestions for what it's about. Okay, then we organized a hackathon recently at the National Library, and there was a competition to make the best product.
17:42
And this was the winning entry. It's a Chrome browser extension called Finna recommends. And the idea is that you have to install it first, of course, but you get this button in your browser. So first you select some text, some English language text in this case, from any webpage, and then you press the button,
18:02
and it will suggest books about the subject of the text. So it will analyze the text using Anif, and then figure out what it's about, and then search books on the topic. That's really neat. Okay, so how to get it?
18:21
Annif is on GitHub. It's a Python code base. I would say it's pretty good quality. I use a lot of quality analysis tools to make sure the code is clean, and it has good coverage of unit tests, and there's some documentation in the wiki about how to use it. And it's also on PyPI because it's a Python package,
18:41
so you can just install it basically with a single command. And it has quite a lot of dependencies, so please use a virtual environment. You can apply it on your own data. So the idea is that you just choose the vocabulary
19:01
that's important to you, and then prepare a corpus from existing metadata. Then you load the vocabulary and train the model using the corpus, and then you can start using it to index new documents. I've also been thinking about starting
19:20
some kind of community group on do-it-yourself automated subject indexing. So we could discuss different kinds of applications, use cases, algorithms, whatever, experiences. So if you're interested, please contact me after the conference. Thank you.
19:47
So. Yes, of course. Okay, normally I don't say thank you for a talk because this is your job, but this time I do, this was terrific.
20:01
Thank you. And I'm asking myself how many people were working on this fascinating project. And yeah, all my questions were answered. We can use it, I think, but maybe one question. Do you have some formal way of controlling your success?
20:26
Okay, so let's start with the easy one. I was mostly working alone on this. There was another developer early on, but unfortunately he left the library to work elsewhere, so I was left alone. I, of course, collaborate a lot with colleagues,
20:40
with the data and corpora and things like this. But then the other question is how to validate this. And so I've tried to use many metrics to compare against gold standard subjects for existing, I mean, existing manually annotated documents.
21:01
And mostly I found the F1 scores and also a measure called NDCG to be useful. But there's always the problem that subject indexing, from an empirical perspective, is quite subjective. If you have two people index the same document, they will never agree completely.
21:22
So really what you need to do is to have somebody evaluate the result after it's produced. And it's not so difficult to do, but it's difficult to do on a large scale. So we did organize a workshop last year
21:44
where we had librarians index a small number of documents in parallel, and then we could get some measure of how closely their indexing aligned. And it was one third. So one third of the concepts on average
22:00
were the same between two people. And then we could compare it with what the algorithms, at the time that was the prototype, produced, which was about 0.22 or something. So that's one way of measuring it, but that's not necessarily the best way, because as I said, the algorithm makes silly mistakes
22:21
and a measure like this will not detect whether some concept is completely wrong versus just slightly wrong. So you won't see the distinction. So we are planning to do more evaluations and also to integrate this into, for example, our own document repository so that we could get some more,
22:44
more evaluation, more experience. And also the people at the University of Jyväskylä are collecting data about their students using this tool: which topics did it suggest, which ones were accepted, which ones were rejected.
23:03
So I hope to get afterwards some data on how well it works for them. Another question, sure. Thank you for the talk.
23:21
I wanted to ask, was this used on two languages, like Finnish and Swedish? Yes, currently we have models for Finnish, Swedish, and English, so three languages. But there is no inherent limitation, I think,
23:42
at least when it comes to, well, I don't know, maybe there would be problems with Chinese or Japanese or languages which have a completely different structure in how text is expressed, but this, in principle, is multilingual. So for example, we use NLTK
24:05
to do some pre-processing on the text, and it supports quite a few languages. And if it doesn't support something, then probably there is another toolkit that could be plugged in instead. We may have time for one final question
24:21
before we move on to our last presentation. Good, so thank you again. Oh, no, one more, yeah.
24:41
No, I haven't tried it, but I think part of the appeal here was to do it yourself using open source and our own data. I think, coming from a rather small country with a language that is not understood elsewhere, it probably wouldn't work that well,
25:01
but I haven't tried it. Good question. Thank you.