AUTOMATING LOD - Annif: leveraging bibliographic metadata for automated subject indexing & classification
Formal Metadata
Number of Parts | 16
License | CC Attribution - ShareAlike 4.0 International: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers | 10.5446/60332 (DOI)
Production Place | Bonn, Germany
Transcript: English (auto-generated)
00:12
Explorer. I'll let you do it. So our next presenter, I know, but I will take my paper
00:20
anyway, is Osma Suominen from the National Library of Finland, who's going to talk about leveraging bibliographic metadata again for automation, but this time for automated indexing and classification.
00:46
OK. Yeah, hello everybody. So I'm going to talk about Annif, which is a way of using bibliographic metadata to improve subject indexing and classification. So first of all, you probably know, but just to recap,
01:04
what is subject indexing or classification? Libraries, archives, and museums have a lot of material, different documents, and to make it possible to find them, we usually attach subjects or tags or classes to them.
01:23
And this is called subject indexing or classification. Usually you use a thesaurus or a classification scheme. But this is a lot of manual work, so it would help if we had something like a program that at least helps us do this more efficiently.
01:41
And this is the idea of Annif. Because as libraries, we do have a lot of metadata. For example, in Finland, we have the Finna discovery interface, which aggregates a lot of metadata from libraries, archives, and museums. So currently, it holds about 15 million records, and many of these are tagged with subjects.
02:05
And the idea is to do some machine learning or just statistics or other algorithms using this existing metadata. So we have a lot of just metadata
02:21
without the actual documents, but then we also have a much smaller amount of full text documents. And the idea is to use both of these for training. And I started doing this last year. I did a quick prototype using metadata from Finna,
02:44
and it was really put together rather quickly, but it worked well enough to get people excited. So beginning of this year, I started work on a successor that was on a more solid basis in terms of code.
03:01
So the prototype was just a loose collection of scripts, but the new one is a Flask web application. And the other main difference is that the prototype was built around an Elasticsearch index used in a special way, but the new one currently has three different algorithms for automated subject indexing.
03:24
Other than that, they share the same ideas: multilingual, multi-vocabulary, a REST API and so on. About the algorithms: when you want to do automated subject indexing, there are two main approaches.
03:40
There are lexical and associative methods. The general idea of a lexical method is that you take in text and match terms within the text against terms in your vocabulary. So for example, if you have a sentence like this, "renewable resources are a part of Earth's natural environment" and so on,
04:01
this comes from Wikipedia. With a good algorithm, you can match it with the concept of renewable natural resources, because all of those words appear in the sentence, even though they are not in the same order. So this is a potential match, and in this case, it would probably be correct.
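As a rough illustration of the lexical idea, here is a minimal sketch in Python. It is a toy matcher, not Maui's actual algorithm: it flags a concept when all the words of its label occur somewhere in the text, regardless of word order. The vocabulary entry and URI are hypothetical.

```python
# Toy lexical matcher (illustrative only, not Maui's actual algorithm).
# A concept matches when every lowercased word of its label appears
# somewhere in the input text, regardless of word order.

def lexical_matches(text, vocabulary):
    """vocabulary: dict mapping concept URI -> preferred label."""
    text_tokens = set(text.lower().split())
    hits = []
    for uri, label in vocabulary.items():
        label_tokens = set(label.lower().split())
        if label_tokens <= text_tokens:  # every label word occurs in the text
            hits.append((uri, label))
    return hits

vocab = {"http://example.org/c1": "renewable natural resources"}  # hypothetical URI
sentence = "Renewable resources are a part of Earth's natural environment and so on"
print(lexical_matches(sentence, vocab))
```

A real lexical backend would of course also handle stemming, punctuation and ambiguous matches.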
04:21
So this is a relatively simple method. You only need the vocabulary and you need the text. But then there are associative approaches, for which you need a lot of training data. And you use that data to learn which concepts are correlated with certain terms or combinations of terms in documents.
04:42
And then, for each concept, you form a model that could be something like this tag cloud, so correlations with different words. So in Annif, I've used both kinds of algorithms. There are two associative algorithms.
05:01
One is a very simple TF-IDF similarity calculation, which is similar in spirit to what is done by text indexes like Lucene and Elasticsearch. And the other one is fastText, created by Facebook Research. So this is a machine learning algorithm for text classification.
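To make the TF-IDF idea concrete, here is a minimal sketch assuming scikit-learn; it is illustrative, not Annif's actual implementation. Each subject is represented by the concatenated text of training documents tagged with it, and a new document is ranked against all subjects by cosine similarity. The subject labels and texts are hypothetical.

```python
# Minimal sketch of TF-IDF-based associative matching (illustrative,
# not Annif's actual implementation). Each subject is represented by the
# concatenated text of training documents tagged with it; a new document
# is compared to every subject vector by cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical training data: subject label -> concatenated document text.
subject_texts = {
    "renewable energy": "wind power solar energy biofuel renewable resources",
    "ice hockey": "ice hockey league goal puck players season",
}

subjects = list(subject_texts)
vectorizer = TfidfVectorizer()
subject_matrix = vectorizer.fit_transform([subject_texts[s] for s in subjects])

def suggest(text, limit=5):
    """Return (subject, score) pairs sorted by descending similarity."""
    doc_vector = vectorizer.transform([text])
    scores = cosine_similarity(doc_vector, subject_matrix)[0]
    ranked = sorted(zip(subjects, scores), key=lambda p: p[1], reverse=True)
    return ranked[:limit]

print(suggest("Solar and wind power are renewable sources of energy."))
```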
05:22
With fastText, you give it a bunch of classified documents and it creates a model, and it does fancy stuff like word embeddings. But in the end, you can use it to classify new documents. Then, for the lexical approach, I used Maui, which is a tool created at the University of Waikato
05:42
in New Zealand originally. And to make it work within Annif, we created a microservice wrapper so that it exposes a REST API that can be integrated. And these algorithms can be used either alone
06:02
or in combinations, which are called ensembles. So just to make an analogy, I have some pictures of musicians here. But these algorithms also tend to make silly mistakes. And there are many good and bad reasons for this,
06:21
but some reasons are, for example, that your training data is usually never perfect. So there are problems with it, and it can be skewed. Then also, with associative methods, you can discover correlations in the data that are not causative. So some terms happen to co-occur with certain subjects, but there is actually no connection.
06:41
Then there are problems with homonyms, like rock can mean either stone or a kind of music, or you have names in text that can be interpreted as concepts or words. Or then, of course, there's just random noise, especially in machine learning. But when you have an ensemble of algorithms,
07:01
of different kinds of algorithms, what usually happens is that the algorithms will make different kinds of mistakes. So again, if you have a band of amateur musicians, then each of them has a certain handicap, but they are all different. And if you're leading the orchestra, you're trying to figure out, okay,
07:20
these guys all have their problems, how can I sort of still make them sound good? And to solve this, going back to algorithms, we can do another level of training, so like a second order learning, if we have some additional training documents. And a nice way of doing this is called isotonic regression.
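A minimal sketch of such score calibration with isotonic regression, assuming scikit-learn (whose IsotonicRegression uses the pool-adjacent-violators approach); this is illustrative, not Annif's actual fusion code, and the calibration data below is hypothetical.

```python
# Minimal sketch of isotonic-regression (PAV) score calibration with
# scikit-learn; illustrative, not Annif's actual fusion implementation.
# For one concept and one backend, we learn a monotone mapping from the
# backend's raw scores to calibrated values, based on whether the concept
# was actually correct in a set of calibration documents.
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Hypothetical calibration data: raw scores the backend gave for a concept,
# and whether a human indexer actually assigned that concept (1) or not (0).
raw_scores = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.75, 0.90])
was_correct = np.array([0, 0, 1, 0, 1, 1, 1, 1])

calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(raw_scores, was_correct)

# Calibrated score for a new raw score of 0.5 from this backend:
print(calibrator.predict([0.5]))
```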
07:42
Isotonic regression is a statistical method to estimate the trustworthiness of a particular algorithm in recognizing a particular concept, and it's usually implemented by the PAV algorithm. So I've used that in Annif. I did some evaluation of these algorithms,
08:03
and I used some test corpora, which are quite different in nature. So these are all full text documents, existing full text documents that have been indexed with either YSA or YSO as subject vocabulary. So the first one is ARTO.
08:21
Those are articles from the ARTO database at our institution. It's both academic papers, but also less formal publications. Then we have some theses, master's and doctoral theses from the University of Jyväskylä, which are pretty long,
08:42
and they are usually centered on a single topic or a small number of topics. Then we have questions from Ask a Librarian Service, which has been running for many years at public libraries. So they are questions that come in from people, and then librarians answer them.
09:01
And then they have also been indexed with subjects. And finally, we have the digital archives of a regional newspaper called Satakunnan Kansa. There are over 100,000 documents, which were not indexed, but we took a random sample of 50 documents and gave it to four librarians,
09:20
and each of them would index them separately. This way we could sort of compare also between two humans. Some of these are available on GitHub, not all of them, because we can't share all of them for copyright reasons. And so I tried each algorithm in separation,
09:40
and the numbers here are F1 scores, so those are a combination of precision and recall. And now these numbers may seem rather low, but the problem with subject indexing is that it's really hard to get even two people to agree on the subject of a single document. So I would say that it depends a lot on the documents
10:02
and many other things, but generally speaking, a human level is somewhere between 0.3 and 0.5, according to many studies that have been done. So these are actually pretty good levels. So the three first bars are each algorithm separately,
10:22
and we can see that Maui, the yellow one, is performing best. Then the gray one is a simple ensemble where we just take the assignments of each algorithm and take a simple average of the scores they give for each concept. And the ensemble is always better than each algorithm separately,
10:42
so it can sort of cover some of the mistakes the algorithms are making. But then the two last ones are two variations of the PAV approach of trying to do second-order learning, and we can see that it usually improves results over a plain ensemble.
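For reference, here is how a set-based F1 score for one document's subject assignments can be computed; the figures in the talk are averages over whole test corpora and further measures such as NDCG are also used. The function and example subjects below are hypothetical.

```python
# Small illustration of how an F1 score combines precision and recall for
# one document's subject assignments (set-based; real evaluations average
# over a whole corpus and also use measures such as NDCG).

def f1_score(suggested, gold):
    suggested, gold = set(suggested), set(gold)
    if not suggested or not gold:
        return 0.0
    true_positives = len(suggested & gold)
    precision = true_positives / len(suggested)
    recall = true_positives / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: two of three suggestions are correct, and two gold
# subjects were missed -> precision 2/3, recall 2/4, F1 ~ 0.57.
print(f1_score({"forests", "climate change", "rock music"},
               {"forests", "climate change", "forestry", "carbon sinks"}))
```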
11:01
So by building on combinations of algorithms, we can get better results. About the architecture of Annif: we start with a large amount of metadata from Finna, and also a smaller number of full-text documents.
11:21
And then we train the three different algorithms. The Maui one is in a separate box because it's a separate microservice. And we use both the metadata and some full-text documents for training. Then there is a separate fusion layer; this is the one that does the ensemble thing and the PAV. And for the PAV, we also need some more training data
11:43
in the form of full text documents. And then everything is packaged into a web app that provides a command line interface for administration and the REST API so that other systems can integrate with it. So in principle, any metadata or document management system
12:03
like an institutional repository, can hook into the REST API to get suggestions about subjects for documents. There are also some mobile apps, which I'll talk about later, that can use the REST API. There's a website at annif.org where you can go and test it.
12:22
There's a box where you can type or paste text and press the analyze button and get suggestions about topics. Here, you just have to choose a good model. So this one, I took the text from the SWIB website and used the YSO English ensemble,
12:42
and it predicted that this conference is about open data, semantic web, linked open data, and so on. Libraries are somewhere a bit lower down, but they are there. But then, just for fun, I tried to create another model using Wikidata and Wikipedia.
13:02
Oops, sorry. Here, yeah. So I took the top 50,000 entities from Wikidata ranked by the number of site links. So these are the ones that appear in many different language versions of Wikipedia.
13:20
And I did the same thing, training on Wikipedia, English Wikipedia documents. And here, we can see that with this model, we can get the most relevant Wikidata entities for SWIB. But now we can see a problem here.
13:41
The first one, Lod, is actually a city in Israel. So it's a mistake, but this thing happens. But the other ones, many of them are pretty good, so it's not bad. So if you have some text and would like to know corresponding Wikipedia, oh no, Wikidata entities, you can try this out.
14:02
Okay, here's the command line interface. So it's really quite simple to make a new model. You have to load a vocabulary, and then you train with existing documents. And the formats that are required are really simple TSV files, or also SKOS for the vocabulary.
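A hedged sketch of that workflow, driven from Python via subprocess: the subcommand names, project name and arguments below are assumptions based on the talk, not a definitive reference, so check the project wiki for the actual CLI syntax.

```python
# Hedged sketch of the command-line workflow described in the talk, driven
# from Python via subprocess. Subcommand names, project name and argument
# order are assumptions for illustration; consult the Annif documentation
# for the actual CLI syntax.
import subprocess

def annif(*args, **kwargs):
    return subprocess.run(["annif", *args], check=True, text=True, **kwargs)

project = "yso-en"                                # hypothetical project name
annif("loadvoc", project, "yso-skos.ttl")         # load a SKOS vocabulary (hypothetical file)
annif("train", project, "training-corpus.tsv")    # train on an existing TSV corpus
annif("eval", project, "test-corpus.tsv")         # compare against manually indexed documents

# Suggest subjects for a piece of text (assumed to be read from stdin):
annif("analyze", project,
      input="Renewable resources are part of the natural environment.")
```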
14:22
And then you can just test the model by analyzing documents from files. And then there's also an evaluation command. So if you have existing documents that are manually indexed, you can check how close the algorithm's own results get to them
14:43
using different measures. There's a REST API. So here's just a very simple example, but you have some text, and then you give it to the API and say, analyze this, and the API will respond with a very simple JSON structure
15:01
with URIs, labels, and scores. So it's really quite simple to hook into this. And there's also a Swagger specification, so you get nice interactive documentation. So what kind of things can you do with Annif?
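To make the integration concrete, here is a hedged sketch of calling such an API with the requests library. The base URL, endpoint path, parameter names and response keys are assumptions based on the talk's description; the Swagger specification describes the real interface.

```python
# Hedged sketch of calling the REST API with the requests library. The host,
# endpoint path, parameter names and response keys are assumptions based on
# the talk; the Swagger specification documents the real interface.
import requests

API = "https://api.annif.org/v1"          # hypothetical base URL
project = "yso-en"                        # hypothetical project identifier

response = requests.post(
    f"{API}/projects/{project}/suggest",  # assumed endpoint name
    data={"text": "Linked open data connects library metadata on the web.",
          "limit": 5},
)
response.raise_for_status()

for hit in response.json().get("results", []):  # assumed response envelope
    print(f'{hit["score"]:.3f}  {hit["uri"]}  {hit["label"]}')
```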
15:20
This is one early adopter: the University of Jyväskylä. They have an institutional repository where they store the master's and doctoral theses of their students. So they use the REST API so that when a student uploads their thesis there, they use Annif to scan it
15:40
and suggest subject keywords for the document, so that you don't have to know, for example, what is there in the vocabulary. You get a list where you can choose which ones are correct. Then I did an experiment, early on with the prototype, where I took the Finnish Wikipedia,
16:01
which at the time had about 400,000 articles, and just performed subject indexing on this, and so it took about seven hours on a laptop. I think it would take a bit longer with humans doing it manually. So I got topics, one to three topics per article.
16:21
And here are just some random examples of articles and what kind of topics. Most of them are pretty good, but the red ones were wrong. So there are mistakes there. But anyway, I could get an overview. So I could find the most common topics in Finnish Wikipedia, and it turns out that they are football, ice hockey, warships, and pop music.
16:45
Then there are also some mobile apps. So the first one was a prototype I did early on. It's a web app, so it runs in the browser of your phone. So you can use it to scan, to take a picture of a document,
17:02
and it will do OCR using a cloud service, and then give the text to Anif, and then show the subjects. And then another one was created more recently by my colleague, which is an Android native app, which does the OCR on the device itself.
17:21
So it's much faster than the cloud service. You can just scan a document, and within seconds you will have the suggestions for what it's about. Okay, then we organized a hackathon recently at the National Library, and there was a competition to make the best product.
17:42
And this was the winning entry. It's a Chrome browser extension called Finna recommends. And the idea is that you have to install it first, of course, but you get this button in your browser. So first you select some text, some English language text in this case, from any webpage, and then you press the button,
18:02
and it will suggest books about the subject of the text. So it will analyze the text using Anif, and then figure out what it's about, and then search books on the topic. That's really neat. Okay, so how to get it?
18:21
Annif is on GitHub. It's a Python code base. I would say it's pretty good quality. I use a lot of quality analysis tools to make sure the code is clean, and it has good coverage of unit tests, and there's some documentation in the wiki about how to use it. And it's also on PyPI because it's a Python package,
18:41
so you can just install it basically with a single command. And it has quite a lot of dependencies, so please use a virtual environment. You can apply it on your own data. So the idea is that you just choose the vocabulary
19:01
that's important to you, and then prepare a corpus from existing metadata. Then you load the vocabulary and train the model using the corpus, and then you can start using it to index new documents. I've also been thinking about starting
19:20
some kind of community group on do-it-yourself automated subject indexing. So we could discuss different kinds of applications, use cases, algorithms, whatever, experiences. So if you're interested, please contact me after the conference. Thank you.
19:47
So. Yes, of course. Okay, normally I don't say thank you for a talk because this is your job, but this time I do, this was terrific.
20:01
Thank you. And I'm asking myself how many people were working on this fascinating project. And yeah, all my questions were answered. We can use it, I think, but maybe one question. Do you have some formal way of controlling your success?
20:26
Okay, so let's start with the easy one. I was mostly working alone on this. There was another developer early on, but unfortunately he left the library to work elsewhere, so I was left alone. I, of course, collaborate a lot with colleagues,
20:40
with the data and corpora and things like this. But then the other question is how to validate this. And so I've tried to use many metrics to compare against gold standard subjects for existing, I mean, existing manually annotated documents.
21:01
And mostly I found the F1 scores and also a measure called NDCG to be useful. But there's always the problem that subject indexing, from an empirical perspective, is quite subjective. If you have two people index the same document, they will never agree completely.
21:22
So really what you need to do is to have somebody evaluate the result after it's produced. And it's not so difficult to do, but it's difficult to do on a large scale. So we did organize a workshop last year
21:44
where we had librarians index a small number of documents in parallel, and then we could get some measure of how closely their indexing aligned. And it was one third. So one third of the concepts on average
22:00
were the same between two people. And then we could compare it with what the algorithms, at the time that was the prototype, produced, which was about 0.22 or something. So that's one way of measuring it, but that's not necessarily the best way, because as I said, the algorithm makes silly mistakes
22:21
and a measure like this will not detect whether some concept is completely wrong versus just slightly wrong. So you won't see the distinction. So we are planning to do more evaluations and also to integrate this into, for example, our own document repository so that we could get some more,
22:44
more evaluation, more experience. And also the people at the University of Jyväskylä are collecting data about their students using this tool: which topics did it suggest, which ones were accepted, which ones were rejected.
23:03
So I hope to get afterwards some data on how well it works for them. Another question, sure. Thank you for the talk.
23:21
I wanted to ask, was this used on two languages, like Finnish and Swedish? Yes, currently we have models for Finnish, Swedish, and English, so three languages. But there is no inherent limitation, I think,
23:42
at least when it comes to, well, I don't know, maybe there would be problems with Chinese or Japanese or languages which have a completely different structure in how text is expressed, but this, in principle, is multilingual. So for example, we use NLTK
24:05
to do some pre-processing on the text, and it supports quite a few languages. And if it doesn't support something, then probably there is another toolkit that could be plugged in instead. We may have time for one final question
24:21
before we move on to our last presentation. Good, so thank you again. Oh, no, one more, yeah.
24:41
No, I haven't tried it, but I think part of the appeal here was to do it yourself using open source and our own data. I think, coming from a rather small country with a language that is not understood elsewhere, it probably wouldn't work that well,
25:01
but I haven't tried it. Good question. Thank you.