Challenges and Opportunities of Harvesting Open Access Materials
Formal Metadata
Title: Challenges and Opportunities of Harvesting Open Access Materials
Author: 0000-0003-1452-2785 (ORCID)
License: CC Attribution 4.0 International. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/59419 (DOI)
Transcript: English (auto-generated)
00:00
I want to talk about the challenges and opportunities of harvesting open access materials, a subject I've been thinking about for a while. First, I'd like to tell you what led to this topic. Previously, I worked at the UN's international tribunal concerning the genocide in Rwanda
00:22
and it was in Arusha, Tanzania. And here you can see the courtroom on the right, Mount Meru in the background in the shadow of Mount Kilimanjaro. I thought this was neat, so I took a picture as I was leaving one evening. Anyway, there at the tribunal,
00:41
I created a platform for public access to the materials, the trial transcripts, the judgments, the appeals, the filings, judicial records of all kinds. It was an electronic counterpart to the physical archives that you see on the right. There I am in the physical archives.
01:02
Well, one day I was working there, reading the news, and I saw an interesting scientific paper, "Techniques of Neutralization and Identity Work Among Accused Genocide Perpetrators." This is a little bit off topic,
01:21
but I thought that their findings were interesting. In the paper, they analyzed trial transcripts and found that the alleged perpetrators used a variety of what they called neutralization techniques to minimize their role in the genocide. I suppose we would expect that.
01:42
They also found that there was no significant textual difference between those who were convicted, those who were acquitted, and those who appealed. So I was really interested. I read this paper, contacted the lead author, and asked: did you use our repository in your work?
02:03
She replied, well, I think I've heard of it, but no, I used our university's database. I found out that the university, actually several universities, had copied the judicial records from our repository, since it was all public access,
02:21
and also from a predecessor repository. That's where many students were accessing the trial transcripts and other materials; they were hardly aware that the original repository existed. I had been planning to use this paper to demonstrate the usefulness of our repository
02:43
to our management: interesting scientific work is being done based on our repository. Well, after talking to the lead author, I couldn't say that, because she didn't get it from our repository. So that led me to the conclusion
03:00
that providing open harvesting inherently means a loss of control and credit for the originating repository. But a related question is whether this matters. If we view ourselves as facilitators providing information to the public
03:21
rather than as owners of information, then we should be happy to see the information going out to as many places as possible. In this role, I'm not an author, so there's no repository copyright. Still, it would be interesting to know how other repositories feel about this loss of control
03:42
over what they might think of as their materials. So that's one question. Now let's turn to other aspects of open harvesting, especially the other end: getting, rather than providing, open access materials. To do that, I'll turn to INIS.
04:02
Now, INIS was founded in 1970 to meet one of the purposes of the founding of the IAEA: to foster the exchange of scientific and technical information on the peaceful uses of atomic energy. The scope has since expanded beyond that.
04:27
Over the years, the goal has become to provide member states with access to relevant, reliable and up-to-date information in the area of nuclear science and technology. You see, it's changed from just atomic energy
04:41
to nuclear science and technology. And that includes things like the preservation of cultural heritage items, soil remediation, biological experiments, patient protection, medical physics, all kinds of things like that. So the question is, are we meeting that goal?
05:05
If we look back at the history, INIS originated in 1970, originally in print and microfiche form. The agency's first computer was bought for INIS, and INIS quickly became a computer-searchable database
05:22
at the forefront of computer science. Here's a video from the 70s that shows what INIS was about. Seeing all the equipment in this video, it's amazing what an investment the IAEA and the member states made in this repository
05:41
over the years. You see here the early database. And here is Hans Groenewegen, who wrote that the store of information relating to the peaceful uses of nuclear energy in 1977 was estimated at some 80,000 documents.
06:02
Looking back at that year, 1977, they added 68,831 records, that is, 86% of the estimated possible output. I think that's pretty good. But over the years things have changed: the scope has expanded, as I mentioned,
06:22
but so has the volume of scientific materials. I estimate that as of 2022, some 250,000 records are created worldwide each year within our scope, yet we're only inputting about 125,000,
06:40
that is, about half. We're exceeding the traditional yearly goal of 100,000 set by the agency and past practice, but we're not fully covering the scope, and so not really fulfilling our mission to cover it comprehensively. Therefore, we have the opportunity of filling the gap
07:03
by harvesting open materials, but this opportunity is not without its challenges. First, let us recognize that there are many sources of open and harvestable materials: there's Crossref, DOAJ, arXiv,
07:22
CORE, and PubMed Central, which is one of the partners and a really valuable one. SCOAP3 is an initiative from CERN, one that we participate in, which in effect pays to liberate materials and make them open when they would otherwise be closed.
07:42
So that's a valuable thing. For less well-known repositories, we can look at OpenDOAR, a directory of institutional, national and subject repositories, which has experienced amazing growth over the years, reaching some 6,000 repositories. Most of these can be harvested over the standard OAI-PMH protocol, as sketched below.
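Since OAI-PMH is the common denominator for harvesting from these repositories, here is a minimal sketch of an OAI-PMH harvester using only the Python standard library. The endpoint URL is hypothetical; the verb, metadataPrefix, and resumptionToken mechanics are standard OAI-PMH.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical OAI-PMH endpoint; substitute a real repository's base URL.
BASE_URL = "https://repository.example.org/oai/request"
OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

def harvest(base_url):
    """Yield (identifier, title) pairs, following OAI-PMH resumption tokens."""
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    while True:
        url = base_url + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as response:
            root = ET.fromstring(response.read())
        for record in root.iter(f"{OAI}record"):
            header = record.find(f"{OAI}header")
            identifier = header.findtext(f"{OAI}identifier")
            title = record.findtext(f".//{DC}title")
            yield identifier, title
        # An empty or absent resumptionToken means the list is complete.
        token = root.findtext(f".//{OAI}resumptionToken")
        if not token:
            break
        params = {"verb": "ListRecords", "resumptionToken": token}

for identifier, title in harvest(BASE_URL):
    print(identifier, title)
```

A real harvester would also need error handling, polite rate limiting, and incremental harvesting via the from/until date parameters that the protocol provides.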
08:03
So it's almost like there's a fire hose of available material. At the same time, the environment is changing. There are groups, at least in Europe, such as FAIRsFAIR and cOAlition S with its Plan S, in which funders are joining together
08:21
to require that the research that they fund appear in open journals and open repositories and therefore be available for our repositories to harvest. Now, even with more and more information out there, there are several challenges that I've identified
08:40
and perhaps you can think of some more. One is varying copyright terms. As an example, I was discussing the possibility of opening INIS to automatic harvesting, and the representative from Germany let me know that they had provided records that the originators had allowed
09:01
to be published in INIS, but not passed further on, so we would have to accommodate that. Another example is this snippet taken from PubMed Central. There, you find a variety of copyright terms, all of which have to be respected,
09:20
which is especially difficult when you're harvesting items automatically. Of course, it's not insurmountable; most of these just require you to give attribution. But while some licenses are machine readable, others are not: there's no license at all, or there's a custom license,
09:41
and so we have to deal with those on a case-by-case basis. For the machine-readable ones, though, an automated first pass is possible, as the sketch below shows.
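As a rough illustration of that first pass, here is a minimal sketch of an allow-list license filter. The license strings, record layout, and identifiers are hypothetical, and anything without a recognized machine-readable license goes to manual review rather than being ingested or dropped automatically.

```python
# Licenses we can comply with automatically (attribution-only terms).
# Both the license strings and the record layout are hypothetical.
AUTO_OK = {"CC BY", "CC BY 4.0", "CC0"}

def triage(records):
    """Split harvested records into auto-ingestable and manual-review piles."""
    auto, review = [], []
    for record in records:
        # e.g. a normalized value taken from the source's license metadata
        license_tag = record.get("license")
        if license_tag in AUTO_OK:
            auto.append(record)
        else:
            # Missing, custom, non-commercial, or share-alike licenses
            # go to a human for a case-by-case decision.
            review.append(record)
    return auto, review

auto, review = triage([
    {"id": "PMC0000001", "license": "CC BY 4.0"},
    {"id": "PMC0000002", "license": None},
])
print(len(auto), "auto-ingestable;", len(review), "for manual review")
```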
10:02
Another consideration is the ShareAlike license. It holds that if you do harvest and republish, you must offer the material under the same license and without technological restrictions that prevent others from doing what the license permits, for example, requiring that the full-text PDF be viewed only in a special viewer when the license allows the whole thing to be downloaded.
10:23
So that's another consideration; again, not insurmountable, but something to think about when you're harvesting. Now, one problem we have run into is varying standards and varying character sets. In this example, we were harvesting from our partner in Cyprus.
10:42
They're using a DSpace repository, which is fine; DSpace is a very common platform, and many of the repositories in OpenDOAR that I mentioned are using it. It makes it easy to enter records and also to harvest them,
11:00
but by default it lacks some fields that we like, such as the author affiliation. Secondly, in this example we had some difficulty because our carrier language is English, along with the accompanying Latin alphabet. So we expect that things like the description
11:21
or author would be in that alphabet, and they provide that only inconsistently. If we look back at this record from their repository, the title, as we can see at the top here, is in Greek, with Greek characters.
11:42
But in the data behind it, we see that it has an alternative title. So that's fine, we can work with that. But if you look at the format, it says 149 and then a Greek character, it says 30 and then two Greek characters. So those we would have to translate in some way.
12:02
So that's a problem: we harvest automatically, but the metadata isn't really in our character set, and then we have to fix each record manually. It's not as easy as it might look on the surface, though a simple screen for non-Latin metadata, sketched below, can at least flag the records that need attention.
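As a rough sketch of flagging such records automatically, the following checks whether Dublin Core fields are in the Latin script and prefers an alternative title when the primary one is not. The record shown is a hypothetical stand-in for a harvested DSpace record.

```python
import unicodedata

# Hypothetical flat Dublin Core record, as harvested from a DSpace repository.
record = {
    "dc.title": "Τίτλος στα ελληνικά",
    "dc.title.alternative": "Title in English",
    "dc.format.extent": "149 σ.",  # page count with a Greek abbreviation
}

def is_latin(text):
    """True if every alphabetic character in the text is from the Latin script."""
    return all(
        "LATIN" in unicodedata.name(ch, "")
        for ch in text
        if ch.isalpha()
    )

def pick_title(rec):
    """Prefer the primary title; fall back to a Latin-script alternative."""
    title = rec.get("dc.title", "")
    if is_latin(title):
        return title, False
    alt = rec.get("dc.title.alternative", "")
    if alt and is_latin(alt):
        return alt, False
    return title, True  # flag for manual transliteration or translation

title, needs_review = pick_title(record)
fields_to_fix = [key for key, value in record.items() if not is_latin(value)]
print(title, needs_review, fields_to_fix)
```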
12:25
Another problem is open journals where it's difficult to filter the applicable material, but where many interesting materials are published. PLOS ONE is a good example, with 272,000 articles across all areas of science; the same goes for Scientific Reports, with 166,000 articles,
12:43
and many of these others, the so-called mega-journals. These are open mega-journals whose metadata we could harvest, and yet it would be difficult for us to consistently pick out the articles that are applicable to our repository; a crude keyword screen, sketched below, is one starting point, but that is just another challenge.
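As one crude way to screen a mega-journal feed for scope, the following scores each record against a keyword list. The terms and the threshold are hypothetical; a production filter would need a curated vocabulary or a trained classifier.

```python
# Hypothetical in-scope terms for a nuclear science and technology repository.
SCOPE_TERMS = {
    "nuclear", "isotope", "radiation", "reactor",
    "dosimetry", "radionuclide", "irradiation",
}

def scope_score(title, abstract):
    """Count how many scope terms appear in the title or abstract.

    Substring matching is deliberately crude (e.g. "radiation" also
    matches inside "irradiation"); it is only a first-pass screen.
    """
    text = f"{title} {abstract}".lower()
    return sum(term in text for term in SCOPE_TERMS)

def in_scope(record, threshold=2):
    """Accept records mentioning at least `threshold` scope terms."""
    return scope_score(record["title"], record["abstract"]) >= threshold

candidate = {
    "title": "Isotope ratios as tracers of soil remediation",
    "abstract": "We measured radionuclide uptake after irradiation of samples.",
}
print(in_scope(candidate))  # True
```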
13:05
Another thing that I've noticed, and I don't mean to pick on Cyprus too much, is information loss. If we look at the record from their repository on the left,
13:21
specifically at the authors, down at the bottom left, we see that one of them has an ORCID, the one who works at their institution, but the others do not.
13:42
If we look at the publisher's website, ScienceDirect from Elsevier, we see that the affiliations exist in the published article, but they didn't make it into the Cyprus repository.
14:05
So if we harvest from the Cyprus repository, we've lost some of the affiliation information. And since we don't fully support ORCID yet, we might also lose the one ORCID that is there.
14:20
So we lose information as it passes from repository to repository, almost like losing fidelity as you copy an audio recording or run copies through a copier. A simple field-by-field comparison, sketched below, can at least quantify what a given source has dropped.
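To make that loss measurable, here is a minimal sketch that compares a harvested record against the publisher's richer version and lists the fields that were dropped. The records, field names, and ORCID value are hypothetical.

```python
def report_loss(source, harvested):
    """List fields present in the source record but empty or absent
    in the harvested copy."""
    return [
        field
        for field, value in source.items()
        if value and not harvested.get(field)
    ]

publisher_record = {
    "title": "Some article",
    "authors": ["A. Author", "B. Author"],
    "affiliations": ["University X", "Institute Y"],
    "orcids": ["0000-0000-0000-0000"],  # placeholder ORCID
}
repository_record = {
    "title": "Some article",
    "authors": ["A. Author", "B. Author"],
    "affiliations": [],  # dropped on the way into the repository
    "orcids": [],        # dropped as well
}
print(report_loss(publisher_record, repository_record))
# ['affiliations', 'orcids']
```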
14:43
So another problem you might have is predatory journals. Standards are needed here, and I've seen another repository, as maybe many of you do, check the list of predatory journals and make sure they're not harvesting from them. Of course, on the other hand, for the authors who provided the articles,
15:02
perhaps that's valuable information that should be preserved. But overall, I think we don't want to support their practices, such as charging excessive publishing fees. Down below is another aspect: relying on the certification
15:23
that a journal is indexed by DOAJ, because DOAJ has good standards regarding predatory journals and the like; journals have to meet certain standards to be indexed there. You might think, well, we can piggyback off of them. But some journals display the DOAJ logo
15:45
and claim that they're indexed by DOAJ when they're not. So you also have to check against the journal list that DOAJ itself provides, as the sketch below illustrates.
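As a sketch of such a verification step, the following asks DOAJ's public API whether an ISSN is actually indexed. The endpoint path reflects DOAJ's documented v2 search API as I understand it, so check the current documentation before relying on it; the ISSN shown is a placeholder.

```python
import json
import urllib.request

def indexed_in_doaj(issn):
    """Ask DOAJ's public search API whether a journal ISSN is indexed.

    Endpoint per DOAJ's v2 API docs at the time of writing;
    see https://doaj.org/api/ for the current version.
    """
    url = f"https://doaj.org/api/v2/search/journals/issn:{issn}"
    with urllib.request.urlopen(url) as response:
        results = json.load(response)
    return results.get("total", 0) > 0

# Placeholder ISSN; a journal displaying the DOAJ logo should pass this check.
print(indexed_in_doaj("1234-5678"))
```

The same check could instead be run offline against the journal metadata spreadsheet that DOAJ publishes, which is what the talk refers to.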
16:02
So harvesting open materials is not straightforward, but the benefits definitely outweigh the drawbacks; these are just things you need to think about when doing it. Perhaps first we could talk about various aspects, and then I have some discussion questions here. But before we look at those,
16:20
perhaps there might be some who would like to comment on the presentation. So I guess it's a good time to stop sharing and we'll pause for any other comments.