
Indexing of Special Collections for Increased Accessibility


Formal Metadata

Title
Indexing of Special Collections for Increased Accessibility
Title of Series
Number of Parts
14
Author
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Challenge of Discovery: Recent large-scale initiatives focused the attention of the Smathers Libraries at the University of Florida on the need for significantly expanded and enhanced metadata for our digital collections, both retrospective and prospective. This requires new tools and changing roles and responsibilities for cataloging/metadata staff, including the application of automated processes, improved and consistent metadata practices, and the development of new taxonomies. Projects described include the new genealogical initiatives with the Internet Archive and FamilySearch, the Portal of Florida History, the Digital Library of the Caribbean (dLOC), and the Cuban Heritage Initiatives. We have concluded the pilot project on automated indexing and metadata generation for the Portal of Florida History, using 25,000 full-text records and the JSTOR thesaurus instead of the Library of Congress subject headings to automatically index the collection and create a search portal. This paper reports on the results of the pilot and the early application of the process to the entire UF Digital Collections (over 600,000 items/12 million pages) to extract information for the Portal. Results of the pilot show significantly increased retrievability and greater depth of accessibility via detailed subject metadata; the paper then explores application to the entire digital collection. The idea of more automated processes going forward, allowing traditional cataloging to focus on the things that need individual attention while automated tools develop and improve metadata for other materials, is explored, since the large number of pages (over 12 million, growing at 100,000 pages per month) makes it impractical to use traditional means. We are working with tools that were developed for information products and services but can be applied effectively to library collections. This paper covers three of the main themes of the conference: 1) Exposing Grey Literature to Wider Audiences, through an open search portal; 2) Confronting Obstacles in Accessing Grey Literature, where limited metadata tagging keeps collection access obscure as subsets of the library OPAC; and 3) Digital Preservation as the Lifeline for Grey Resources: the project covers everything from old newspapers to personal papers and county historical collections, but unless the results are widely available through the internet and deeply tagged with a controlled vocabulary, the simple scanning of papers only creates a new microfilm dilemma, locking data away in inaccessible places.
Transcript: English (auto-generated)
We are going to do this as a team because we want to share with you this project that we're doing with the assistance of Access Innovations. The libraries at the University of Florida are transforming access to library collections by transforming the way we catalog our collections and acknowledging the decreasing reliance on the online public access catalog for discovery.
The vision statement of the libraries calls for providing service at the point of need, and digital collections are essential to that. Charles mentioned that Georgetown is a Research 1 university. The University of Florida is also; we're a very large university, about 53,000 students.
Over a third of them are in graduate school or professional degree programs. We have six medical and health science colleges, over $770 million in research. So it's a very large, complex university.
We're also a land-grant university, so we have agricultural extension offices in every county in Florida, and we also have numerous research centers and other facilities outside of our home campus in Gainesville. Most on-campus students take one or more online courses per semester, and remote online degree programs are growing.
Service to users who are not physically in the libraries is very important, even though student use of the libraries is very high and growing. Historically, when we digitized materials from our print collections, the MARC record was the initial and often the only descriptive metadata, and Library of Congress subject
headings from the library were often the only controlled vocabulary assigned to the digital item. I'm not a fan of Library of Congress subject headings. Google has taught people to search with natural language, and using LCSH is often like asking our users to learn a foreign language.
So our UF Digital Collections are very large and growing very quickly, so we are looking for automated processes to expand and enhance our metadata. Improved and consistent metadata practices must be defined and then rigorously applied prospectively, and metadata for existing content must be brought up to these new standards. This requires new tools and changing roles and responsibilities for our cataloging and metadata staff. We have a very large repository, the UF Digital Collections, which has more than 300
digital collections, over 545,000 items, and over 13 million pages submitted by UF libraries and partner institutions. We have several major initiatives that are driving us to make these changes in our metadata.
For example, Florida History is one of our preeminent collections at UF, both in print and digital form, but the content is drawn from many different collections: rare books, manuscripts, political papers, newspapers, maps, government documents, university archives, et cetera. The challenge is to identify all of that digital content so it can be aggregated and presented in a coherent body of information, which we call the Portal of Florida History. As a result of this project, we turned to Access Innovations for assistance with this effort, and we jointly decided to begin with the digital and digitized theses and dissertations of UF graduates. So what you see here is the Portal of Florida History. In 2016, we began a pilot project, and we worked specifically with the theses and dissertations. And as you probably well know, if you were also involved in digitization, digital can be just a new microfilm: straightforward to create, but very difficult to search and retrieve. The Florida Thesis Project, as we called it, took all the digital and digitized theses and dissertations, a collection of about 25,000 items covering a broad range of disciplines, to see if we couldn't enhance the content metadata for those items to make them more retrievable. For about the last 10 years, the University of Florida has been receiving digital deposit of electronic theses and dissertations and performing retrospective scanning of the older dissertations, and they have also begun retrospective scanning of all the previous theses, so it's a good-sized collection. What we did was take software from the Data Harmony collection from Access Innovations and build a repository to hold the data. The data was extracted from the digital collection, including the existing metadata, which is in MARC and METS XML. We tested against three different controlled vocabularies, three thesauri with very broad coverage: the News Indexer thesaurus, the National Information Center for Educational Media (NICEM) thesaurus, and the JSTOR thesaurus, which covers a large repository. After we did all the testing and assessment, we determined that we really needed two thesauri. One would be the topical one; in this case, we chose JSTOR. The second one would need to be Florida-specific authority terms. So we have a combination of the JSTOR Thesaurus modified for Florida, and a Geographic and Great Floridians file as well.
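As a rough illustration of that extraction step, here is a minimal Python sketch that pulls existing subject terms out of the MODS metadata inside a METS package. The file name and the choice to collect only topical and geographic terms are assumptions for the example, not details from the talk.

import xml.etree.ElementTree as ET

# Standard Library of Congress MODS namespace.
NS = {"mods": "http://www.loc.gov/mods/v3"}

def extract_subject_terms(path):
    """Collect existing topical and geographic subject terms
    from the MODS metadata inside a METS package."""
    root = ET.parse(path).getroot()
    terms = []
    for subject in root.findall(".//mods:subject", NS):
        for tag in ("topic", "geographic"):
            for el in subject.findall(f"mods:{tag}", NS):
                if el.text and el.text.strip():
                    terms.append((tag, el.text.strip()))
    return terms

if __name__ == "__main__":
    # "item_mets.xml" is a placeholder file name for one digitized item.
    for field, term in extract_subject_terms("item_mets.xml"):
        print(f"{field}: {term}")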
We also tagged the data for the Florida Thesis Project, and this shows you some before and after sample records. On your left, you see the original catalog record, which is very thin: there aren't very many keywords, and there's not much in the line of genre. On the right, you see the enhanced record, which has considerably more terms listed, those that come from JSTOR as well as the additional ones from the Florida thesaurus, and you can see that there's a great deal more information for you to hang on to. Students can add their own keywords as they submit their theses or dissertations, but those are not accurate metadata. And libraries often add the LCSH headings, but as you know, particularly for special or grey literature collections, those are very thin, and they end up with words like biography, theses, and electronic theses or dissertations. So out of the four genre headings, three of them are not particularly useful for searching, particularly for conceptual searching. As you see from these examples, the first two and the last four lines remain the same in both records, but the automated process added 12 additional controlled vocabulary terms, both topical terms and terms specific to the geography of Florida, and therefore made the records automatically inclusive for the Portal of Florida History. What you see here on the left is an example of some of the University of Florida collections, and then
we move to the arrow on the top, which shows XML records exported from the University of Florida digital collections into the SIS repository. From that repository, we can export MARC records
to OCLC, we can export records for staff internal use and review, we can put them into a local repository with the enhanced metadata, and we can return them to the University of Florida Digital Collections. So as a result of all this work, the University of Florida now has a taxonomy of Florida-specific
terms that it can maintain and expand to use in managing both print and digital collections. But also, because we use the taxonomy as the default search in the University of Florida Digital Collections' Lucene software, instead of the normal default in Lucene, which is full text, we get
80% or better accuracy in retrieval. Retrieval in just full text would be 55 to 60%, which gives an awful lot of false drops and frustration to the user groups. If we just add the terms to the full text, we only get a 6 to 7
percent increase in accuracy, so searching on the controlled vocabulary terms is very, very important. So we have done the theses, we have proved the workflow, and we have begun work on some additional collections as well.
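The talk describes switching the default search field from full text to the taxonomy field in SobekCM's Lucene index. As a language-neutral illustration of the same idea, here is a minimal sketch using Whoosh, a pure-Python search engine with a Lucene-like model; the field names, identifier, and sample subject terms are invented for the example.

import os
from whoosh import index
from whoosh.fields import Schema, TEXT, KEYWORD, ID
from whoosh.qparser import QueryParser

# One full-text field plus a controlled-vocabulary subjects field.
schema = Schema(item_id=ID(stored=True),
                fulltext=TEXT,
                subjects=KEYWORD(stored=True, commas=True, lowercase=True))

os.makedirs("ufdc_index", exist_ok=True)
ix = index.create_in("ufdc_index", schema)

writer = ix.writer()
writer.add_document(item_id="UF00001",
                    fulltext="Full OCR text of a thesis about citrus in Florida ...",
                    subjects="citrus greening,agricultural economics,Florida")
writer.commit()

# Default the parser to the controlled-vocabulary field instead of full text.
with ix.searcher() as searcher:
    query = QueryParser("subjects", ix.schema).parse("Florida")
    for hit in searcher.search(query):
        print(hit["item_id"], "->", hit["subjects"])

Defaulting the parser to the subjects field is the whole trick: queries hit curated terms first, which is what produced the precision gain the speakers report.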
We have also built the crosswalk between the UFDC collections and the SobekCM catalog at the University of Florida, and now that we've done the assessment of the search results and enhanced the metadata, the pilot itself is concluded.
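The core of the concluded pilot workflow, proposing controlled vocabulary terms by matching thesaurus entries against an item's full text, can be sketched in a few lines. This is a deliberately naive stand-in with invented sample terms; the actual pilot used Access Innovations' Data Harmony rule-based indexing, which is far more sophisticated than substring matching.

import re

# A tiny stand-in thesaurus; the pilot used the JSTOR thesaurus plus
# Florida-specific terms. These four entries are invented for illustration.
THESAURUS = ["citrus greening", "agricultural economics",
             "Everglades", "land-grant universities"]

def propose_terms(fulltext, thesaurus=THESAURUS):
    """Very naive automated indexing: suggest a controlled term
    whenever it appears verbatim in the item's full text."""
    found = []
    for term in thesaurus:
        # Whole-word, case-insensitive match of the thesaurus entry.
        if re.search(r"\b" + re.escape(term) + r"\b", fulltext, re.IGNORECASE):
            found.append(term)
    return found

sample = "This thesis studies citrus greening and its effect on the Everglades."
print(propose_terms(sample))   # ['citrus greening', 'Everglades']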
I will say that we were very pleased with the results of the pilot. The one record is obviously just a tiny sample, but many of the records ended up with an equivalent quality of search terms. So we're very eager now to apply these tools and processes to the rest of our digital collections. SobekCM, which Margie mentioned, is the open source software platform that we use, and it supports 11 subject metadata fields, which have not been used consistently over time. This is certainly not a
unique occurrence, but it's one that needs to be addressed to improve access to our digital collections. We need to standardize the use of the fields. That's essential to development and application of the enhanced metadata and to support advanced search focused on one or more fields. All topical subject terms will be assigned to a single field. Geographic terms, place names,
and corporate names will each be assigned a single field, and they'll have authority files. The existing terms will be mapped to the appropriate fields and replaced with controlled vocabulary terms. This will, as Margie said, give much greater precision. So, for example, Jose Marti could be an author, he could be a subject, or he could be part of a place or corporate name.
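Here is a hypothetical sketch of that field-routing step, with tiny stand-in authority files (the real ones would come from the JSTOR/Florida thesaurus work and name authority files described above). Note how a bare "José Martí" falls through to review, which is exactly the ambiguity just described.

# Hypothetical authority files, invented for the example.
TOPICAL = {"citrus greening", "agricultural economics"}
GEOGRAPHIC = {"gainesville (fla.)", "everglades (fla.)"}
CORPORATE = {"biblioteca nacional de cuba josé martí"}

def route_term(term):
    """Assign a legacy subject term to a single typed metadata field."""
    key = term.strip().lower()
    if key in GEOGRAPHIC:
        return "subject_geographic"
    if key in CORPORATE:
        return "subject_corporate"
    if key in TOPICAL:
        return "subject_topical"
    return "needs_review"   # ambiguous terms go to a human review queue

for t in ["Citrus greening", "Everglades (Fla.)", "José Martí"]:
    print(t, "->", route_term(t))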
We're doing a major project with the Biblioteca Nacional de Cuba José Martí, and there it's going to show up as a corporate name in that context. As with the Portal of Florida History project, we need to apply automated tools and techniques to existing collections in the Digital Library of the Caribbean and also to new collections submitted to dLOC, including
our new Cuban Heritage Collections. This will be particularly important for improved retrieval across collections that have come from different institutions with different metadata schemas, vocabularies, and languages. We hope that this process will allow us to continue to have metadata in the native language of the submitted material, but also in English
so that there will be an easier common search across disciplines. One of the new collections that will go into the Digital Library of the Caribbean results from our agreement with the Biblioteca Nacional in Cuba to create deep, broad open access to the Cuban heritage collections in dLOC.
The Biblioteca Nacional estimated that 58 percent of its Cuban heritage materials are uniquely held in Cuba, and it is committed to digitizing those materials. But it's asking UF and its partners outside of Cuba to identify sources and digitize the other 42 percent, making bibliographic control
essential to avoid duplication of effort and make the collection as comprehensive as possible. They have shared their bibliographic records with UF, and we have agreed to establish an OCLC symbol for them, which we will manage, and we will make sure that all of their Cuban heritage records are available in WorldCat. We will also track them in the project management database.
However, 16,000 of the records that they provided to us, while digital, are not in MARC format. They're merely scanned images of catalogue cards, such as you can see on the slide. So we once again turned to Access Innovations for assistance with conversion of these catalogue cards. Since we've already had the three-minute warning, I'll just point out very quickly that those of you who are of an older generation might actually recognize what these are. On the left, you see some of the conversion that the Biblioteca Nacional did. But for those where there was not a conversion to MARC, what we got was a card that was undifferentiated in any way; it was just text. So we tried to separate that text, and we were fairly successful in separating it into a full bibliographic record. For search purposes, we needed to separate out not only the call number and the name and the title and so on, but also the place of publication, the date of publication, and the publisher, because for searching, and in grey literature in general, those are important fields. We didn't want just a blob of undifferentiated material.
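For a sense of what that separation involves, here is a minimal sketch that splits the OCR text of one card into rough bibliographic fields, assuming a very regular card layout (call number, author, title, then an imprint line of the form "Place : Publisher, Year"). Real cards are far less regular, which is why the conversion needed expert help; the sample card below is invented.

import re

def parse_card(card_text):
    """Split one OCRed catalogue card into rough bibliographic fields.
    Assumes line 1 = call number, line 2 = author, line 3 = title,
    line 4 = imprint. Real cards vary widely."""
    lines = [ln.strip() for ln in card_text.strip().splitlines() if ln.strip()]
    record = {"call_number": lines[0], "author": lines[1], "title": lines[2]}
    # Imprint pattern: "Place : Publisher, Year"
    m = re.match(r"(?P<place>[^:]+):\s*(?P<publisher>[^,]+),\s*(?P<year>\d{4})",
                 lines[3])
    if m:
        record.update(place=m.group("place").strip(),
                      publisher=m.group("publisher").strip(),
                      year=m.group("year"))
    return record

card = """F1783
Martí, José, 1853-1895.
Obras completas.
La Habana : Editorial Lex, 1946"""
print(parse_card(card))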
What you find through this process that we fairly quickly described to you is that we are actually creating the records in the SIS database, no longer creating them in the cataloging application or the ILS. And that means that with the records created in SIS, we are
exporting to OCLC and the OPAC as appropriate, and to the University of Florida Digital Collections as well. So this is a more up-to-date version of the diagram that we showed you.
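To make that "inverted" export concrete, here is a minimal sketch that serializes an SIS-style metadata dictionary as a bare-bones MARCXML record. The field choices (245 for title, 650 for topical subjects) are standard MARC, but the dictionary layout and values are invented; a real export would also produce the leader, control fields, and vocabulary source codes.

import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
ET.register_namespace("marc", MARC_NS)

def to_marcxml(metadata):
    """Build a minimal MARCXML record from a metadata dict."""
    record = ET.Element(f"{{{MARC_NS}}}record")
    def datafield(tag, ind1, ind2, subfields):
        df = ET.SubElement(record, f"{{{MARC_NS}}}datafield",
                           {"tag": tag, "ind1": ind1, "ind2": ind2})
        for code, value in subfields:
            sf = ET.SubElement(df, f"{{{MARC_NS}}}subfield", {"code": code})
            sf.text = value
    datafield("245", "1", "0", [("a", metadata["title"])])
    for term in metadata.get("subjects", []):
        # Second indicator 4 = subject source not specified.
        datafield("650", " ", "4", [("a", term)])
    return ET.tostring(record, encoding="unicode")

print(to_marcxml({"title": "Citrus greening in Florida",
                  "subjects": ["Citrus greening", "Agriculture"]}))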
You see the SIS input panel, then the repository itself, and from the repository we're exporting the records to all of the other places. So we are not cataloging originally in MARC, but rather inverting the cataloging process to create a metadata record in SIS and export it as a catalog record. This gives us a very streamlined workflow. It goes title by title, and catalogers find it a very fast way to work. Once a record is in the repository, we can output it in any number of formats, including HTML, and the records can cover both print and electronic materials. So to conclude, I wanted to go back, assuming it will let me change the page, to the original title, which Margie
actually created for us, the death of the library catalog with a big question mark. And I want to come back to that title because I think when we look at these new metadata tools and techniques, you can ask the question, does this signal the death of the library catalog? And clearly most libraries continue to invest significant time and resources in
cataloging and in the integrated library systems that host our catalogs. But our students who have grown up with Google are much less likely to turn to the OPAC for discovery. So the answer to the question is not yet. But as research libraries like ours continue to place more and more emphasis on digital collections, there will be reduced emphasis on traditional title by
title cataloging and greater emphasis on automated metadata generation. As Margie said, we're inverting the traditional cataloging process. Automated metadata will become the source of MARC records rather than MARC records continuing to be the source of metadata.