CONTROLLED VOCABULARIES - Semi-automated methods for BIBFRAME work entity description
Formal Metadata
Number of Parts: 14
License: CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/60261 (DOI)
Transcript: English (auto-generated)
00:00
Up next, we have Jim Hahn from Penn Libraries to talk about semi-automated methods for BIBFRAME work entity description. You are now the presenter.
00:21
Okay, thank you. Let's see. Continuing our adventure through automation and reuse of Annif, I'm going to talk about how we can help catalogers who are using this new vocabulary, BIBFRAME, and how it
00:48
could be integrated into an RDF linked data editor. What I did was configure Library of Congress Subject Headings from the Library of Congress linked data service and use that vocabulary.
01:06
In Annif I used some combined sources from libraries: there's a consortium of libraries called Ivy Plus, from which I used selected records, and also Share-VDE,
01:35
to find ways that might help catalogers in creating new linked data descriptions
01:42
in the BIBFRAME vocabulary. It's the Work description into which you're pulling subjects. It's resource intensive to create these linked data descriptions, and in some ways it's actually more time consuming than
02:01
the traditional ways, because you have these external authority entities. At this web conference, you understand that we really need to do careful selection and referencing of external authority entities. It can be time consuming, but it's very important, and I think you're increasing
02:23
the value of what you're doing when you add these external resources. So how can we help in describing the BIBFRAME Work entity in an RDF editor, for the cataloger who is not only in a new editor but also working with a new
02:42
vocabulary while trying to describe this resource? How can we help them? Here's a little screenshot of the RDF editor and my attempt. This is a selection of the Sinopia RDF editor, developed at Stanford; Jeremy Nelson has a good article in Code4Lib on its development.
03:06
What I trained it against was title to subject, and I'll show that next. So there's this title label here, and potential suggestions that,
03:21
in the proposed system, our cataloger would be able to select or deselect, because not all of these are going to be relevant. Everyone had an overview of Annif in the last talk, and I really appreciate that it's open source and that you can use it to generate these subject suggestions. What I did was
03:45
train against titles. We have this title field, the 245 $a, and if the record has a Library of Congress subject heading, that was used. That was the training corpus, and it was trained in the way I thought it might be used, which was against BIBFRAME title elements.
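As an aside, a minimal sketch of building that kind of title-to-subject training corpus in Python is shown below. It assumes MARC input read with pymarc, hypothetical file names, and roughly Annif's TSV document-corpus format (worth checking against the Annif documentation); it is not the speaker's actual pipeline.

```python
# Sketch: extract 245 $a / LCSH pairs as an Annif-style TSV corpus.
# "records.mrc" and "title-subject-pairs.tsv" are hypothetical file names.
from pymarc import MARCReader

with open("records.mrc", "rb") as marc, open("title-subject-pairs.tsv", "w") as out:
    for record in MARCReader(marc):
        if record is None:          # skip unreadable records
            continue
        title_field = record["245"]
        if title_field is None or title_field["a"] is None:
            continue
        title = title_field["a"].strip(" /:;,.")
        # Keep only 650 fields whose $0 points at id.loc.gov LCSH URIs.
        uris = [
            field["0"]
            for field in record.get_fields("650")
            if field["0"] and "id.loc.gov/authorities/subjects" in field["0"]
        ]
        if uris:
            # One line per title: text, a tab, then subject URIs in angle brackets.
            out.write(title + "\t" + " ".join(f"<{u}>" for u in uris) + "\n")
```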
04:06
Of course there's more we can do there, and I'll talk about that later, but this was the initial training. In the Mattermost you'll find this link: this is the linked data vocabulary converted, because LCSH SKOS wasn't at the time available in TTL and I needed to convert it.
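Since the LCSH download is distributed as RDF/XML, one way to produce Turtle is sketched below in Python with rdflib; the speaker used a C syntax library instead, and the file names here are hypothetical. The full LCSH file is large, so a streaming converter may be preferable in practice.

```python
# Minimal sketch: convert an LCSH SKOS download from RDF/XML to Turtle with rdflib.
from rdflib import Graph

graph = Graph()
graph.parse("lcsh-skos.rdf", format="xml")                 # RDF/XML input (hypothetical path)
graph.serialize(destination="lcsh-skos.ttl", format="turtle")
```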
04:32
Before using the really large corpus, the 9 million title-subject pairs, I first tested it out just on a Penn Share-VDE enrichment.
04:49
We have a little under 2 million records just at Penn. Some libraries have made investments here; Duke, for example, has quite a
05:01
few subject references in the 650 field. Now, that's not the only place you could put a subject reference, and there are also some limitations in the LCSH vocabulary: it's possible to construct subject assignments that you might not find in the id.loc.gov download.
05:21
So we really are looking at subsets of data; I'm not asserting that it's comprehensive, but this is where the records came from initially and where you could find title-subject pairs. When I did an analysis of Ivy Plus libraries, it seemed that about 25% of the libraries had
05:47
subject references, and I was only looking at LCSH with the id.loc.gov references. So it's a rather new area, and of course it's not required, but as more libraries move into linked data cataloging, you may expect this to grow.
06:07
At the same time, these were the data available when I was doing the training; there are likely more data that could be pulled in now. And here's a sort of
06:22
baseline analysis of genres; this is the 655 $a field. If a library has made use of this field, and some do and some don't, you can use it to target suggestions. I'll just go ahead and say that at the Penn Library you can see we're using this more
06:50
for films. At Stanford, of course it's a small selection of what they put in, but there's more music. So that's kind of an interesting
07:01
separation of concerns. The reason I note that is because of the RDF editor: suppose there's a genre target, say the cataloger has entered into the form that there's a specific genre; the editor may select a
07:21
Stanford-specific API, so if they select music as a genre, you might use that endpoint. You can have either all schools combined into one endpoint or separate APIs based on genre type. You could slice this a couple of different ways, but I feel like genre type is
07:44
one of the ways in which you can target this. As much as we're trying to help catalogers, this is also a way to help the machine learning system: you could say, hey, we trained you more on this genre, and that's what has been cataloged, so perhaps this will give you higher scores.
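A minimal sketch of that genre-based routing idea follows; the project identifiers and the fallback are hypothetical, and the real mapping would live in the editor's business logic.

```python
# Sketch of genre-based routing: pick a suggestion project/endpoint from the
# genre the cataloger entered. All identifiers here are hypothetical examples.
from typing import Optional

GENRE_TO_PROJECT = {
    "music": "lcsh-music",    # e.g. a model trained mostly on music records
    "films": "lcsh-films",    # e.g. a model trained mostly on film records
}
DEFAULT_PROJECT = "lcsh-all"  # combined model across all contributing schools

def pick_project(genre: Optional[str]) -> str:
    """Return the suggestion project to query for a given genre value."""
    if genre is None:
        return DEFAULT_PROJECT
    return GENRE_TO_PROJECT.get(genre.strip().lower(), DEFAULT_PROJECT)

print(pick_project("Music"))  # -> "lcsh-music"
print(pick_project(None))     # -> "lcsh-all"
```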
08:03
And of course, testing out the data, you really get pretty good results without having to do too much configuration of algorithms. I'll go ahead and say that I think, as
08:28
machine learning has advanced, the field is moving towards a data-centric approach, which means making your data and your labels as good as they can be and putting less emphasis on algorithm development, because,
08:44
sure, you can tune the ensembles, but at a certain point it's going to come down to how good your training data are rather than the algorithms. I think that for a long time machine learning focused on algorithms over data.
09:02
So here's an example workflow; it's not super complicated. You have your RDF editor, and the cataloger is going to type in a title within the BIBFRAME Work description. The title gets sent to the API, and from there you return suggestions with scores. There can be business logic in Sinopia, which is basically a set of instructions that will tell you,
09:25
for those that are above some threshold, maybe display those. Those are populated to the subject form field, but the cataloger always has the choice to select what they would like.
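A sketch of that suggestion step is below; the REST path follows Annif's suggest endpoint as I understand it, and the host, project name, and threshold are assumptions to adapt.

```python
# Sketch: send the work title to a suggestion API and keep results above a
# score threshold for display in the editor. Verify the URL pattern against
# your Annif instance; "lcsh-all" and the threshold are hypothetical.
import requests

ANNIF_URL = "http://localhost:5000/v1/projects/{project}/suggest"

def suggest_subjects(title: str, project: str = "lcsh-all", threshold: float = 0.2):
    """Return (uri, label, score) tuples for suggestions above the threshold."""
    response = requests.post(ANNIF_URL.format(project=project), data={"text": title})
    response.raise_for_status()
    results = response.json().get("results", [])
    return [(r["uri"], r["label"], r["score"]) for r in results if r["score"] >= threshold]

# The editor would populate the subject form field with these suggestions,
# leaving the cataloger free to select or deselect each one.
```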
09:40
We're hoping that this might be able to auto-suggest subject attributes. This is in contrast to completely automated cataloging, and I think it in some ways tries to combine the best of both, in the sense that cataloger choice is always
10:02
available. We're also trying to extend professional expertise: catalogers are learning how to work with machine learning tools at the same time that those machine learning tools are learning to
10:23
make use of human expertise as well. What I'm planning at Penn is that we're going to focus on some user evaluation of the machine learning outputs, and of course you could study some of the algorithms, which Annif can support, for future experimentation.
10:46
I have also started to look at other ways we might be able to auto-suggest things; this introduces kind of an older API, which perhaps some of you have seen before. It's rather old, and it isn't necessarily machine learning. I think
11:02
what OCLC does here is more along the lines of data mining, but they have an endpoint that, provided an input of, say, an ISBN or ISSN, can give you back suggested authors, FAST subjects with their URIs, and a Library of Congress call number.
11:27
That's useful in the sense that it meets the objective of trying to streamline entity description using already existing data, actually using the work that catalogers have already put in.
11:41
I think what Classify, this classification API, does is go through much of what OCLC already holds and say: these are the values most assigned for this ISBN; this is the author most assigned, if available; these are the FAST subjects most assigned; and
12:03
this is the LC number that's most represented. The cataloger always has the option to select, so they can say, okay, I see that LC number, I'm going to tweak it a little. What cataloger wouldn't want to do that, right? The data flow is very similar. This is a suggestion that I very recently provided to
12:25
the service that might pull this in for the RDF editor: basically, take either your ISSN or ISBN, send it to the Classify web API, which returns XML, parse that, and then the business logic again could evaluate which parts are relevant to pull in.
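A sketch of that data flow is below; the endpoint URL and the XML element and attribute names are assumptions from memory of the experimental Classify service (which has since been retired), so treat it purely as an illustration of the request-parse-select pattern.

```python
# Sketch: send an ISBN to a Classify-style classification service and pull a
# few fields out of the XML response. URL, parameters, and element names are
# assumptions and would need to be checked against the service documentation.
import requests
import xml.etree.ElementTree as ET

CLASSIFY_URL = "http://classify.oclc.org/classify2/Classify"
NS = {"c": "http://classify.oclc.org"}

def classify_isbn(isbn: str) -> dict:
    """Return whatever author / FAST / LCC recommendations the XML exposes."""
    response = requests.get(CLASSIFY_URL, params={"isbn": isbn, "summary": "true"})
    response.raise_for_status()
    root = ET.fromstring(response.content)

    work = root.find(".//c:work", NS)
    lcc = root.find(".//c:lcc/c:mostPopular", NS)
    fast = [h.text for h in root.findall(".//c:fast//c:heading", NS)]

    return {
        "author": work.get("author") if work is not None else None,
        "lcc": lcc.get("nsfa") if lcc is not None else None,
        "fast_subjects": fast,
    }

# Business logic in the editor could then decide which of these values to
# surface, with the cataloger selecting which properties to keep.
```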
12:48
Again, the cataloger selects the relevant properties. So that's my talk. I have to give much appreciation to the Annif tutorial; that was really excellent.
13:04
Also to the RDF syntax library: I chose to use the C version of it. I know people have built things on top of this, but I used it to convert
13:20
the SKOS from the Library of Congress into TTL. And then of course OCLC Research's experimental classification service. So I will stop there; I'm done talking and will be happy to take questions.