Characteristics of a Well-Developed Grey Literature Repository: The Case of the International Nuclear Information System
Formal Metadata
Title: Characteristics of a Well-Developed Grey Literature Repository: The Case of the International Nuclear Information System
Number of Parts: 30
License: CC Attribution 3.0 Germany. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/59873 (DOI)
Transcript: English (auto-generated)
00:01
Hello, I'm Brian Bales, the coordinator of the International Nuclear Information System at the IAEA, and I would like to talk about the characteristics of a well-developed grey literature repository, and specifically about the case of the International Nuclear Information System.
00:20
So there's been a great deal of change in the information landscape and in scientific publishing, especially in the area of openness. If we look at OpenDOAR, the Directory of Open Access Repositories, 15 years ago there were only 78 repositories listed; now there are over 6,000, and there are so many sources of freely available
00:42
open information in science, including Crossref, the Directory of Open Access Journals, arXiv with its preprints, CORE, and one of my favorites, PubMed Central. And so we are moving to free and open materials that would otherwise sit behind a paywall.
01:06
And in this changing information landscape, I've been thinking about the International Nuclear Information System. It has its roots in the very founding of the International Atomic Energy Agency, where the third purpose of the agency was to foster the exchange of scientific and
01:24
technical information. Now, if we fast-forward to this year, INIS is authorized to provide member states and other users with relevant, reliable, and up-to-date information in the area of nuclear science and technology.
01:43
So INIS was founded by 23 countries and two international organizations, and it continues to operate in this way; it has since grown to 132 countries and 17 international organizations. But the basic idea is that member states and organizations will send us their nuclear
02:06
science and technology literature, including grey literature, and grey literature makes up a lot of what makes INIS special and important in the world. So INIS originally existed in print and microfiche form.
02:24
In fact, the first computer, as we can see here, was bought by the agency for the purpose of INIS. Here we can see microfiche being created. Microfiche is where the references were held. The so-called Atomindex file was printed, and you would search in the Atomindex book
02:45
and then find the appropriate microfiche. INIS quickly, though, became a computer-searchable database and was at the forefront of computer science. And if you think about all of the equipment and all of the time that was put into the
03:02
production of microfiche and the purchase of a computer, that was a great investment in information management, and we are the inheritors of that today and the custodians bringing it into the future.
03:22
So INIS continued to develop and eventually created a web-enabled interface, the INIS Repository Search, and this is, to me, quite user-friendly. It allows you a simple search or an advanced search. In fact, here I've done a search on grey literature, and we see 8,000 references to grey literature
03:44
in our repository already. The first one is a reference by Kiyoshi Ikeda. Then we have a full text from Dobrica Savić. And we also have a DOI reference to a paper by Joachim Schöpfel.
04:04
So the growth in users in recent years has been quite remarkable. In fact, when INIS became an open, Google-searchable repository, it experienced a great jump in the number of users. In fact, in the 10 years from 2011 to 2021, it experienced an 11,241 percent
04:29
increase in that time. So it's been a remarkable journey in the number of users and the usefulness of INIS. I would say it's gone from being a specialized repository that a few experts knew about
04:45
to a general repository for the general public around the world. And so by most measures, INIS has been a great success. It has over 4.5 million records. Over 2 million of these lead to full text.
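For a sense of scale, the growth figure can be unpacked with a little arithmetic: an 11,241 percent increase over ten years implies roughly 113 times as many users, or a compound annual growth rate of about 60 percent. A minimal sketch:

```python
# Unpack the talk's growth statistic: an 11,241 % increase over 10 years.
increase_pct = 11_241
years = 10

ratio = 1 + increase_pct / 100   # final users / initial users = 113.41x
cagr = ratio ** (1 / years) - 1  # implied compound annual growth rate

print(f"{ratio:.2f}x, {cagr:.1%}/year")  # 113.41x, 60.5%/year
```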
05:02
About 617,000 full texts are hosted by us locally. But then there are some ways in which it could improve. And I think that by looking at other repositories, we can see and model ways in which it could improve. For example, INIS harvests sporadically.
05:21
In other words, we go out and find material in our scope, but we do this when we have time or when it's requested by the repository owner. Manual operations that have developed over the 50 years of INIS have slowed the ingest of materials.
05:42
In fact, there is lag time (I put an asterisk there to explain it): lag time means that a piece of literature comes out, and then how many days, months, or years pass until that piece of literature appears
06:00
in INIS? Well, the lag time is sometimes years. The corpus created in our scope I estimate at approximately 250,000 pieces of literature per year. But the corpus that is ingested is approximately 125,000 per year, so about half.
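Both measures mentioned here, lag time and coverage, are simple to compute once ingest dates and production estimates are tracked. A minimal sketch in Python, using the talk's illustrative figures:

```python
from datetime import date

def lag_days(published: date, ingested: date) -> int:
    """Lag time: days between publication and appearance in the repository."""
    return (ingested - published).days

# Illustrative annual figures from the talk: ~250,000 in-scope items
# produced per year, ~125,000 actually ingested.
produced_per_year = 250_000
ingested_per_year = 125_000
coverage = ingested_per_year / produced_per_year

# A record published in March 2019 but only ingested in March 2021:
print(lag_days(date(2019, 3, 1), date(2021, 3, 1)))  # 731 (two years behind)
print(f"{coverage:.0%}")  # 50%
```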
06:22
And that includes both non-grey and grey literature, conventional and non-conventional. So as we think about redesigning INIS, or improving INIS in some ways, honoring what's gone on in the past and the success that it's had, but also seeing
06:44
what's out there, incorporating what we can, and preparing for the future, I thought to look at a couple of repositories. And I've looked at many more than this. But here are just a couple of examples that kind of embody what I'm talking about.
07:00
If we look at the Astrophysics Data System, which is obviously in the scope of astrophysics, it has a similar simple interface, but it also has examples of how to do more advanced searches.
07:20
In fact, if we look at some of the facts of the Astrophysics Data System: it has over 13.3 million references, although it's only existed for about 20 years, where INIS has existed for 50 years. It harvests automatically with a daily frequency. Records always lead to a full text, and it's less concerned with accuracy, and it
07:45
invites user correction, so it has somewhat outsourced the QA to users. If users see a problem with a record, they can click a button and suggest an improvement to it. It has an API for automated harvesting of its content, and it has done a special project
08:06
where it's gone back and comprehensively harvested the historic coverage of core journals, and, as I showed you, it has an advanced and very specific search available.
08:24
Another example is InspireHEP from CERN. It's very similar, isn't it? I mean, it has a simple search at the top, and then gives you examples of how to do advanced searches below, through a kind of free-text search.
08:45
The workflow of InspireHEP is quite interesting, and it's perhaps something that INIS could emulate. On the left side, we see the automated workflow, where periodically (daily, as I said before) a crawler goes to arXiv, to the Proceedings of Science, and to other
09:01
publishers, and extracts records. These are then sent to a literature workflow, where keywords are extracted, where it is determined whether a record is in scope, and where references are extracted. These are then sent to a curator who accepts or declines the submission.
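The two-track routing described here can be sketched in a few lines. Everything below is a hypothetical illustration: the source names, scope vocabulary, and routing rules are assumptions for the example, not InspireHEP's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Record:
    title: str
    source: str
    keywords: list = field(default_factory=list)
    in_scope: bool = False

# Hypothetical sources whose records are known to be well curated
# and can therefore bypass manual review entirely.
TRUSTED_SOURCES = {"arxiv:nucl-ex", "proceedings-of-science"}

SCOPE_TERMS = {"nuclear", "reactor", "isotope"}  # toy scope vocabulary

def enrich(record: Record) -> Record:
    """Literature workflow: extract keywords and decide scope."""
    words = {w.strip(".,").lower() for w in record.title.split()}
    record.keywords = sorted(words & SCOPE_TERMS)
    record.in_scope = bool(record.keywords)
    return record

def route(record: Record) -> str:
    """Accept automatically for trusted sources, otherwise queue for a curator."""
    if not record.in_scope:
        return "declined"
    if record.source in TRUSTED_SOURCES:
        return "accepted"          # fully automated path
    return "curator-review"        # a human accepts or declines

records = [
    Record("Neutron flux in a research reactor", "arxiv:nucl-ex"),
    Record("Reactor shielding benchmarks", "institutional-repo"),
    Record("Medieval manuscript binding", "institutional-repo"),
]
decisions = [route(enrich(r)) for r in records]
print(decisions)  # ['accepted', 'curator-review', 'declined']
```

The design point is the split: automation handles the sources that are reliably in scope, while everything else still passes through a curator.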
09:25
Now, I've heard from people that in certain cases and for certain publications, it's completely automated. So, if they know that a journal is going to have in-scope records with well-developed keywords, then those will go automatically into the
09:42
repository. A second workflow is that literature is submitted by authors or other people into the author workflow, and these are either accepted or declined depending on whether they're of significance, whether they're in scope, and whether the
10:01
metadata is well-developed, and then they are accepted into the repository. So, these are a couple of ideas that perhaps we could bring into INIS. InspireHEP is only one of several repositories that are run by CERN. You have the CERN Document Server. Zenodo came from CERN.
10:22
Anyway, right now, it has over 1.5 million references. As I said, it harvests automatically with daily frequency, and like the other one, it is less concerned with accuracy. It invites user correction and even user submission. It has an API for open extraction of its own content, and it has
10:45
an advanced and very specific search available. So, having looked at some repositories that I admire, I also thought to look at some standards, and there are definitely standards for repositories.
11:00
There's FAIR, OpenAIRE, Plan S, CoreTrustSeal, and others. But one thing they have in common is that all of the standards encourage openness. And let's look at each one of these in a little bit of detail. So, starting with FAIR, this says that science, open
11:22
science, should be findable, accessible, interoperable, and reusable. And it has detailed recommendations on how to achieve these aims. Also, it has a collaboration with CoreTrustSeal that combines these two into a capability maturity model.
11:45
OpenAIRE has a detailed standard and defines the recommended data fields that are used in automated exchange between repositories. So, that's an interesting one as well.
12:03
With Plan S, several science funders have gotten together and have defined how the science that they fund should be open access, and that includes open access repositories. It recommends things like permanent IDs for deposited
12:24
publications and authors, and the use of JATS XML, which is a standard for data exchange, as well as an open API for the exchange of content between repositories.
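To make those recommendations concrete, here is a minimal, hypothetical JATS front-matter fragment showing permanent IDs for a deposited publication (a DOI) and its author (an ORCID). The identifiers, title, and name are placeholders, not real records:

```xml
<article>
  <front>
    <article-meta>
      <!-- permanent ID for the deposited publication -->
      <article-id pub-id-type="doi">10.0000/example-doi</article-id>
      <title-group>
        <article-title>Example report title</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <!-- permanent ID for the author -->
          <contrib-id contrib-id-type="orcid">https://orcid.org/0000-0000-0000-0000</contrib-id>
          <name><surname>Doe</surname><given-names>J.</given-names></name>
        </contrib>
      </contrib-group>
    </article-meta>
  </front>
</article>
```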
12:40
The CoreTrustSeal Trustworthy Data Repositories Requirements give you 16 areas that a repository should address and document in order to receive that certification, such as preservation, security, data reuse, licensing, et cetera. So, having looked at the successful repositories, as well as
13:03
the standards that are encouraged out there, these five characteristics seem to be those that are shared by the successful repositories and that are also compliant with the standards. And if we look at these, they are timeliness, openness,
13:22
preservation, user friendliness, and comprehensiveness. And just by chance, this spells topic. So, if we cover a topic well, then we will have these characteristics well in hand. So, let's look at these individually.
13:43
So, the definition of timeliness is being done at a favorable or useful time. And a piece of science, a publication in science is most useful, the closer it is to the idea having been formulated or the study having been done or the report
14:01
having been written. Each day that goes by, each month or year that goes by, the research becomes less and less valuable. So, if we look at the repositories I mentioned, each of these harvests on a daily basis and as soon as
14:21
something is published by the sources being monitored, it will appear in their repository. So, I would set a goal for INIS that our journal articles, and the grey literature that is sent to us, be input within one week of publication or
14:41
one week of being turned in to us. Now, for grey literature, we're meeting this. But for non-grey literature, we're not; often, we're years behind. So, this is something that we need to work on. Openness. Openness is a characteristic of most of the successful
15:00
repositories I've looked at. And one question is how does openness benefit a repository? Why should a repository be open? Well, if you think about the mission of these sponsoring organizations such as the IAEA, does the IAEA want knowledge, information, science to be
15:22
the exclusive purview of wealthy and well-developed countries? Or does it want to level the playing field, basically, and make science available to everyone in the world, not just those from wealthy countries but also those from less developed countries?
15:42
That should be the goal of most organizations. Most public organizations have this goal. And so, openness encourages this. Preservation. If we think about the mission of a repository, preservation ensures that ingested materials will
16:01
continue to be accessible and with their integrity intact. And it's all about the appropriate level of care when we're considering preservation. And I'm coming from an archival background, and there are definitely best practices in preservation and a preservation maturity model which says that
16:21
you should do periodic checksums, you should have redundant storage in multiple locations, and protections against malicious or accidental deletion. And I could share that standard with you if you would like. But that's something that we at INIS need to adopt.
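The periodic-checksum practice mentioned here can be illustrated with a small fixity check: record a cryptographic hash at ingest time, then recompute it on a schedule and compare. A minimal sketch; the file name and content are stand-ins:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Compute a fixity checksum by streaming the file in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# At ingest time: store the checksum alongside the file (toy example file).
doc = Path("report.pdf")
doc.write_bytes(b"full text of an ingested report")
recorded = sha256_of(doc)

# Later, a periodic audit recomputes the hash and compares it:
assert sha256_of(doc) == recorded, "fixity check failed: file changed"
print("fixity OK")
```

In practice the recorded checksums would live in a database, and a mismatch would trigger restoration from one of the redundant storage locations.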
16:41
Additionally, user-friendliness. It should be obvious that we want a simple and understandable design that brings users back. Overcomplication and cluttered web pages were the old web. Nowadays, people are expecting a Google-like interface: to be able to simply type in a search
17:02
term, hit enter, and get results. But people are also expecting an advanced search for advanced, scientific users, alongside that simple interface I talked about for the great majority of users. Comprehensiveness means that the site covers its
17:22
scope as completely as possible. We probably aren't going to get everything, but we should get as close to comprehensiveness as we can, and comprehensiveness should be the goal. And this means that our site would be a one-stop shop for everything that users need in nuclear
17:41
science and technology. And this means not only the most recent records, which is a great goal, as I talked about with timeliness, but we should also go back in time when we can. The most recent records should be the priority, but those going back into the past are valuable as well.
18:03
So INIS could adopt the best practices of the best repositories in the world. It could take from the standards that have been set. It could improve in timeliness, openness, preservation, user-friendliness, and comprehensiveness. And perhaps more attributes could be found.
18:22
If you have any to suggest, please let me know. And furthermore, a maturity model in these areas could be developed. That could be the subject of a future paper. And thank you very much for your attention.