Automating metadata extraction and cataloguing: experiences from the National Libraries of Norway and Finland
Formal Metadata
Number of Parts: 15
License: CC Attribution - ShareAlike 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers: 10.5446/69649 (DOI)
Transcript: English (auto-generated)
00:05
My name is Pierre Beauguitte, from the National Library of Norway. I work in the team working on text production, mostly focusing on legal deposit of material. I will present this with Osma from Finland, sharing our experiences on automating metadata extraction and cataloguing.
00:29
So as the National Library of Norway, we are responsible for legal deposit, which is both physical, in terms of physical preservation of delivered items, and, more and more, digital delivery of material.
00:45
So anything published in Norway, written or otherwise, has to be delivered to us. We have a digital form, looking like this, for digital legal deposit, where publishers or writers can send us
01:01
material in PDF or other formats, and document themselves all of the descriptive metadata of the record. So when it is delivered in this way, we get correct, minimal cataloguing metadata, but enough to put the record in our catalogue. We also allow a simpler delivery of files, which some big institutions and
01:28
publishers have used as well, where they can just send us files with no metadata, which has resulted in a situation where we have thousands of PDF files waiting to be catalogued, with no annotation at all.
01:43
In that case, our cataloguers need to go through these piles of files and annotate them themselves. We have developed a tool that we internally call "digitalt mottak", a digital
02:01
intake of files, where it is possible to do this same sort of minimal cataloguing, which gets translated into a MARC record and uploaded to our catalogue. But grey material is a huge domain. We have way too many files to handle manually.
02:26
We have scientific reports from public agencies, from universities, from research centres, and the sheer number of documents is too big to handle manually. The good thing with these documents is that most of them are templated and follow a quite strict layout.
02:49
This is an example from NIBIO; here you see the first and second pages. The second page is actually a tabular presentation of basically all, or most of, the descriptive cataloguing metadata.
03:02
We need dates, title, authors, ISBN, ISSN. This is what we are interested in. Starting last year, in January 2023, we began exploring this with the hypothesis that, using the text layer from a PDF,
03:26
and by text layer we mean not only the text content, but also the position, the font and the font size, so the actual layout of the text, together with a set of quite simple rules, regular expressions, keywords and heuristics as to where we guess the information will be,
03:46
we can automatically extract enough metadata for simple cataloguing. So we developed this software called Meteor, for metadata extraction from "offentlige rapporter", public reports, which is open source under the Apache license.
04:03
This is the GitHub link, also documented in the abstract for the presentation. It is written in Python, using the PyMuPDF library, which is one of the most flexible and powerful PDF libraries we found. Meteor is usable as a Python module or as a REST API. We'll have some demos later on.
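To illustrate the kind of text-layer extraction this builds on, here is a minimal sketch using PyMuPDF; the ISBN regular expression and the "largest text on page one is the title" heuristic are simplified assumptions for illustration, not the actual Meteor rules.

```python
import re
import fitz  # PyMuPDF

# Keyword/regex heuristic: ISBNs are usually announced by the string "ISBN".
ISBN_RE = re.compile(r"ISBN[\s:]*([0-9][0-9 \-]{8,16}[0-9Xx])")

def extract_candidates(pdf_path):
    """Collect text spans with layout information and apply simple heuristics."""
    doc = fitz.open(pdf_path)
    spans = []
    for page_number, page in enumerate(doc, start=1):
        for block in page.get_text("dict")["blocks"]:
            for line in block.get("lines", []):   # image blocks have no "lines"
                for span in line["spans"]:
                    spans.append({
                        "text": span["text"],
                        "font": span["font"],
                        "size": span["size"],
                        "bbox": span["bbox"],
                        "page": page_number,
                    })
    isbns = [m.group(1) for s in spans if (m := ISBN_RE.search(s["text"]))]
    # Layout heuristic: guess that the title is the largest text on the first page.
    first_page = [s for s in spans if s["page"] == 1 and s["text"].strip()]
    title = max(first_page, key=lambda s: s["size"])["text"] if first_page else None
    return {"title": title, "isbn": isbns}

print(extract_candidates("report.pdf"))
```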
04:25
And this has already been integrated in our main production pipeline for intake of digital material. The response time is very fast; it's only a simple algorithm on text, so it's quite fast.
04:41
And it's integrated as a suggestion provider. That was a pragmatic solution we found to have some automation without needing to trust it fully. We also had a focus on explainability, as you can see with these yellow stars on the right here. It shows that this author was filled in automatically from information found in the colophon on page two.
05:10
So a cataloger that is curious as to where this comes from can go in and find this information. Finally, we've saved the suggestions systematically for evaluation, which gives us some great insight into the performance of this approach.
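The suggestion-provider idea, with its origin information and systematic logging, could be sketched as a small data structure like this; the field names are assumptions for illustration, not Meteor's actual schema.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class Suggestion:
    """One suggested metadata value, kept separate from the record it may end up in."""
    field: str    # e.g. "author", "isbn", "year"
    value: str
    page: int     # where in the PDF the value was found, for explainability
    method: str   # which rule or heuristic produced it

    def to_log_entry(self) -> str:
        """Serialise the suggestion so it can be stored and evaluated later."""
        return json.dumps(asdict(self), ensure_ascii=False)

# A cataloguer sees the value plus its origin, and can accept or reject it.
s = Suggestion(field="author", value="Kari Nordmann", page=2, method="colophon keyword match")
print(s.to_log_entry())
```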
05:27
Since it was introduced in production, we've run over 1,800 reports through it. On those, ISBN and ISSN have been correct above 80% of the time.
05:43
Publication year is in the same range, 86%. Language is correctly predicted 95% of the time; these models come from the language technology group in Tromsø. These four values are quite simple to extract:
06:05
ISBN and ISSN have a fixed space of values they can take. Authors, on the other hand, can be much harder to get. And when we look at strict equality for the list of first names and last names of the authors, we are at just above 50%.
06:25
If we look at when we get at least one correct name in that list, we are close to 70%. So that's a good indication that the algorithm looked into the right place to find information. Looking at title, it's about the same thing. We are at just below 50% if we look at strict equality.
06:45
If we allow for casing differences, like title case, sentence case, upper case, and ignore those, we go above 60%. If we allow for some prefix and suffix, so just some overlap with some extra bit of string at the beginning or end, we are above 80%.
07:03
So this doesn't mean that we are happy with this wrong casing or wrong start or end of the title. But it suggests that the algorithm is looking at the right place. So that is quite promising.
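The three levels of title matching described here (strict equality, equality up to casing, and prefix/suffix overlap) could be expressed roughly like this; this is an illustrative reconstruction, not the actual evaluation code.

```python
def title_match_level(predicted: str, truth: str) -> str:
    """Classify how closely a predicted title matches the catalogued one."""
    p, t = predicted.strip(), truth.strip()
    if p == t:
        return "strict match"
    if p.casefold() == t.casefold():
        return "match up to casing"
    # Allow an extra prefix or suffix on either side, e.g. a missing subtitle.
    if p.casefold() in t.casefold() or t.casefold() in p.casefold():
        return "prefix/suffix overlap"
    return "no match"

print(title_match_level("ANNUAL REPORT 2023", "Annual report 2023"))            # match up to casing
print(title_match_level("Annual report", "Annual report 2023: main findings"))  # prefix/suffix overlap
```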
07:22
Meteor is essentially limited because it's based on, let's say, an expert system and simple algorithms, but also because PDFs themselves can be hard to work with. PDFs can typically contain text as images.
07:40
This bit here, for example, is a logo. It's not text, it's an image. So to get that text, we would need to run some OCR on it, which would increase the response time, and it would possibly introduce some errors. The text layer itself can be messy. Here on the right, you see, I have highlighted the text layer.
08:02
And there is some sort of lorem ipsum placeholder text that is invisible on the printed version of the page, but that is present in the text layer, so that would throw the algorithm off as well. Lastly, some questions we have are: what is the threshold for full automation?
08:21
When can we go from suggestions to actually writing the metadata? What is good enough? We need a roadmap as to when it will be good enough, and some proper evaluation. Lastly, one thing we will explore in the near future is integration with catalogues and authority registries,
08:44
which we already have in some small parts. For example, where we have an ISSN, we can query the catalogue to get the publisher name, which would enrich the metadata. This would mean higher accuracy and better metadata,
09:00
but also a loss of generality, because we would implement this against specific catalogues and registries. We haven't yet found a way to implement this while remaining general. I will pass the word to Osma.
09:23
Okay, thank you, Pierre. Hello everyone, I'm Osma from the National Library of Finland, and I will continue from here. We have also been working on metadata extraction for some years now, sort of in the background. One of the important services that we provide at the National Library
09:41
is repository services, in practice installations of the DSpace software, to many organizations, such as university libraries, but also, for example, government organizations. And in those repositories there are many material types, like reports, theses, articles, books, and so on.
10:02
And that's quite a big pile of grey literature. All the users of these repositories, when they add new documents, need to enter metadata, so it would be helpful to have some kind of suggestion service for that. And of course there are other use cases as well, including the legal deposit services,
10:25
just like Pierre explained for Norway. In the past we have tested tools like GROBID, and also Meteor when it was released as open source, but the results were not very promising for our materials, mainly because there are so many different kinds and they are so heterogeneous.
10:45
But we also did some early testing with large language models, and they were promising; we actually gave a lightning talk at last year's SWIB on this. One of the findings from that was that we need better curated data in order to do this properly.
11:02
So, please show the next slide. We created a dataset called FinGreyLit, which includes links to PDFs and curated metadata from nine different DSpace repositories. This includes academic libraries, but also government institutions.
11:23
It has about 800 documents currently, which are split into a 75% training set and a 25% test set for machine learning purposes. It has this kind of simple Dublin Core style metadata format, with currently about 10 fields that are well curated,
11:43
and then some more that are not so well curated yet. There are documents in three languages currently, Finnish, Swedish, and English, but we are adding some Northern Sami language documents as well. It's there on GitHub if you want to take a look. And the idea of our experiments is similar to what Pierre explained.
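For illustration, a curated record in such a dataset might look roughly like this; the field names and all values here are placeholders, so check the FinGreyLit repository on GitHub for the actual schema.

```python
# Illustrative only: a simple Dublin Core style record of the kind described above.
record = {
    "url": "https://example-repository.fi/handle/10024/12345",  # link to the PDF (placeholder)
    "language": "fin",
    "title": "Esimerkkiraportti : alaotsikko",                  # main title : subtitle
    "creator": ["Virtanen, Matti"],
    "publisher": "Esimerkkiyliopisto",
    "year": "2023",
    "e-isbn": ["978-952-00-0000-0"],
    "e-issn": "0000-0000",
}
```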
12:05
We want to predict the metadata based on the text layer of a document, but in our case we are using fine-tuned language models. So we take the first few pages and also the last pages of a PDF document, take the text from there, and then fine-tune a language model to produce the corresponding metadata.
12:27
So ideally, when you put in a document like this doctoral thesis here, feed it through the model, then the model will produce metadata that has information like title, creator, publisher, date, and all the identifiers.
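To make that input/output pairing concrete, here is a minimal sketch of how a training example could be assembled, assuming PyMuPDF for the text layer and a record like the one sketched above as the target; the prompt wording and page counts are assumptions, not the actual FinGreyLit training code.

```python
import json
import fitz  # PyMuPDF

def build_training_example(pdf_path: str, metadata: dict,
                           first_pages: int = 3, last_pages: int = 2) -> dict:
    """Pair the text of the first and last pages of a PDF with its curated metadata."""
    doc = fitz.open(pdf_path)
    if len(doc) > first_pages + last_pages:
        selected = list(range(first_pages)) + list(range(len(doc) - last_pages, len(doc)))
    else:
        selected = list(range(len(doc)))
    text = "\n".join(doc[i].get_text() for i in selected)
    return {
        "prompt": "Extract Dublin Core metadata as JSON from this document text:\n" + text,
        "completion": json.dumps(metadata, ensure_ascii=False),
    }
```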
12:43
Well, actually, the output is in JSON format, but this was simpler to show on the slide. So we have been testing many different language models, sort of smallish language models from these different families like Mistral and Llama.
13:02
And some of them we have also published on the Hugging Face Hub, which is a platform for sharing machine learning models and datasets. And our process of working with these models goes like this. So first, in the middle here is the yellow repository, which is the Hugging Face Hub.
13:24
And we take a base model from there, we download it, and then we take it to a fine-tuning environment. We are happy to be able to use the University of Helsinki high-performance computing environment for this. So within that, we fine-tune the model on the metadata prediction task
13:41
using the train subset of the FinGreyLit dataset. Then, when we are happy with the results, we can publish the model back on the Hugging Face Hub and further deploy it on our own inference service. So that's another service where we can run the model.
14:04
And then finally, we have added a new module to the Meteor tool that can make use of a language model backend through an API. So we can then feed new documents to Meteor, and it will use the language model to predict the metadata instead of the built-in heuristics that were originally created in Norway.
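Conceptually, such a language-model backend does something like the following with the Hugging Face transformers library; the model identifier and the prompt format are placeholders rather than the names of the actually published models.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "your-org/metadata-extraction-model"  # placeholder for a fine-tuned model on the Hub

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

def predict_metadata(page_text: str) -> str:
    """Feed the extracted page text to the fine-tuned model and return its JSON answer."""
    prompt = "Extract Dublin Core metadata as JSON from this document text:\n" + page_text
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=300)
    # Keep only the generated continuation, not the echoed prompt tokens.
    generated = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(generated, skip_special_tokens=True)
```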
14:28
And now for a demo, which I think Pierre will show. Basically, this will show the two different Meteor instances: the one from Norway and then our enhanced one.
14:41
Yes. Let's see, I'm taking over here. Right. Can you hear me? Yeah. So I'll take as an example this PDF file in English
15:05
just to have a language that is supported by both instances. This is the Norwegian instance of Meteor. Here are the values extracted in real time.
15:21
So here: language, title, publisher. The publisher has been matched against the authority registry. This takes a bit of time, so I'll let it load on the side. And the origin, so the explainability of the model, is shown here. So the year here comes from the copyright statement on page two.
15:40
It's possible to go here and look at where this comes from. If I run the same document through the Finnish instance here, not using the built-in heuristics but the LLM,
16:03
it takes a very short time as well. We get the language, and comparing the two, we see that the year was not found for some reason. The rest is very close. The subtitle is taken here, so that's quite nice.
16:23
The model learned by itself to pick this up. And there are some differences in the name formatting as well: this one was split into first name and last name, and this one was a bit different. That was it for the demo.
16:51
Okay, thank you for the demo. Let's move on with the presentation. Please show the next slide.
17:03
So one thing that we have noticed here is that we can train the language models to follow some of the cataloguing conventions. Even though in principle it's just picking up the text as it is from the document, it can also transform it a little. So here is an example. This screenshot here is from a doctoral thesis.
17:23
And the name of the author is Tino Kärkönen. Yeah, please show that. So the model can split that into first and last names. It sort of learns or knows how to do that with most common names. And then for the title, the title was written in all capitals on the document.
17:41
And actually that document doesn't have it spelled out in any other way, so it's only available in all capitals. But the language model can turn that into sentence case and also separate the main title from the subtitle using the ISBD syntax, with a colon in the middle. So it can do these kinds of small adjustments to the information at the same time,
18:03
which the traditional heuristics can't do, at least not as well. Okay, next slide. And here's the little evaluation table. It's a little hard to read, but it's all color-coded. So the information in red is related to a Norwegian dataset of 130 reports with metadata.
18:27
And the blue information is related to our FinGreyLit dataset. So the first column here, in red, shows Meteor evaluated with Norwegian documents. On average it got around 76 percent correct, as shown at the bottom.
18:44
The second column shows Meteor evaluated with our documents from Finland, and it got only 63 percent right. Our documents are more heterogeneous, so they are more difficult for these heuristics. The three remaining columns are all variations of language models.
19:03
The first one, Qwen2, is a very small language model, only half a billion parameters, and it got 85 percent correct. That's something we could run, for example, on a laptop; it's such a small model. The other one, Mistral NeMo, is a larger model with 12 billion parameters,
19:24
and it got 91 percent of the information correct. The final one is the same model, Mistral NeMo, but we used not only our own documents for training, we also added the Norwegian documents on top for training.
19:42
And that way we got a slightly better result of 92 percent. So this shows that pooling our datasets can also help in getting good quality results. So, some lessons learned from our side. The first one is that we wanted to know whether these multilingual models,
20:04
like Mistral and StableLM, actually work on Finnish- and Swedish-language documents. And they work surprisingly well, even though officially they are only trained on the big European languages. The second point is that this fine-tuning approach allows for data-driven expansion.
20:22
So if we want to identify a new element from the documents, for example the DOI in some articles, we just need more training data that shows this aspect, but we don't have to write any new code, which we would have to do if we used heuristics. The third point is that we were a little suspicious about these language models,
20:48
because they are so resource-intensive, whether they can be used in real time. But it turns out that these smaller models, like around 10 billion parameters, can work in reasonable time, 3 to 5 seconds,
21:03
and we can run them on our local infrastructure. So we don't need to send those documents anywhere else. We can just process them locally in our own computing environment. And the last point is that this ecosystem of language models,
21:20
of fine-tuned versions, of the tools and of everything related, is moving extremely fast. So it can sometimes be very frustrating to work with, because everything you do today might be obsolete tomorrow or next month. So you have to stay on top of things and be prepared to switch to something better when it comes out.
21:43
Okay, so regarding future work and our collaboration: of course one thing we're going to continue is expanding the dataset, probably adding some more documents to get better results. And the pooling seems to be a good idea, so we want to do more of that.
22:01
And of course we want to fine-tune and evaluate newer language models and sort of try new things until we are happy with the results, if ever. And then the second point, to indicate the origin of information, that's something that the language model cannot currently do
22:20
as we are using it. So to be able to do that like the heuristics can do, we would need some more specific training data. So our curated training data would have to indicate where this information can be found in the document. Then we can also train the model to return this information. But right now we don't have that training data.
22:41
And in general we are just working towards a solution that could one day be used in production settings. Right now this is just very experimental and we haven't even figured out the final use cases or which systems could actually make use of this. All right, I'll pass it over to Pierre. Other questions that remain concern the evaluation metrics.
23:04
Some fields, like year or ISBN, are really easy to evaluate with a yes-no answer, but a title or a list of authors can be harder to evaluate for a model. And we also want to improve the text extraction from PDFs,
23:21
looking at text in images and cleaning up the text layer when possible. And lastly, we want to resolve names to identifiers in authority registries and catalogues. The support we have now for authority registries is only tailored to our own Norwegian authority registry.
23:42
We want to expand that, possibly using the W3C Reconciliation API to find identifiers. Thank you for your attention. Yeah, thank you, Pierre and Osma, for your talk.
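As a sketch of that reconciliation idea: the query and response shapes below follow the W3C/OpenRefine Reconciliation API convention, but the endpoint URL and the example name are placeholders.

```python
import json
import requests

ENDPOINT = "https://example.org/reconcile"  # placeholder reconciliation service for an authority registry

def reconcile_name(name: str, limit: int = 3) -> list:
    """Ask a reconciliation service for candidate authority identifiers for a name."""
    queries = {"q0": {"query": name, "limit": limit}}
    response = requests.post(ENDPOINT, data={"queries": json.dumps(queries)})
    response.raise_for_status()
    # Each candidate carries an id, a name, a score and a boolean "match" flag.
    return response.json()["q0"]["result"]

for candidate in reconcile_name("Kari Nordmann"):
    print(candidate["id"], candidate["name"], candidate["score"], candidate["match"])
```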
24:01
Now we're coming to the Q&A. In the chat there's one question by Kat for Osma: roughly how much work went into creating the FinGreyLit dataset? Okay, thanks. Good question. So we haven't really kept a tally of all the hours,
24:21
but originally we had a summer intern who started this curation. So we took the metadata from these repository systems and then had a person compare that with the documents and did this kind of first round. It took maybe a few weeks.
24:41
And then later on we have refined that; my colleagues and I have been working on it. I would say maybe two to four weeks of work if you had been doing it full-time, but we are sort of doing it occasionally, now and then.
25:01
The challenge there is that the original metadata in the repositories wasn't really that good. And the other challenge is that we need to enter the metadata in exactly the way that matches the document. It often happens that, for example, the cataloguer who put the document in a repository
25:21
knows more about the document or the subject than is actually available in the document. So we have to make them match; otherwise it's not good for training. Okay, thank you. Perhaps a follow-up question. I'm happy that you are stressing the open and transparent approach on this,
25:42
sharing your models and data. I wonder, have you documented somewhere how one could approach such a project to be open and transparent up front? Because people might like to do similar things but don't know how to approach it. Is there something there already?
26:08
I'm not sure I understood the question. The question is if I wanted to approach a similar project and develop some models and create some test data and want to do it up front in an open and transparent way,
26:22
are there instructions anywhere on how to approach such a project if I haven't done it, if I don't have the experience that you already have? Well, we did have experience with other tools, and I think at the National Library of Norway there is also some experience in this,
26:42
because you decided to make this open source and you shared information about it from the start. I don't know, it's just maybe best practice in the community to try to do this in the open as much as possible. But there aren't any guidelines I could follow?
27:02
Well, part of this is just working with open source, but then there are of course the open models and open training data aspects, and there I think it's useful to look at the machine learning and language model community. For example, there are Reddit groups and other communities
27:23
where these kinds of things are commonly shared, so just look for these kinds of communities for inspiration. There is a lot happening on the Hugging Face Hub, so that's certainly one place to follow. I would just comment that, Osma, the work you did and the FinGreyLit dataset on GitHub
27:42
is very well documented as well. You have written up how the data was collected and annotated, so that will be a good entry point to see the process more precisely. Okay, great. Thanks.