Welcome Address: Open Science More than Access to Scholarly Papers
Formal Metadata
Number of parts: 19
License: CC Attribution 3.0 Germany: You may use and modify the work or its content for any legal purpose, and reproduce, distribute, and make it publicly accessible in unchanged or modified form, provided you credit the author/rights holder in the manner specified by them.
Identifiers: 10.5446/39648 (DOI)
Transcript: English (automatically generated)
00:00
So before I get to the featured attraction of our keynote and opening address speakers, I get to speak to you for a few minutes about open science and how open science has essentially evolved from open access, which centered largely around the journal publication, to
00:20
now encompass all of the research objects in the research landscape: the data sets, the software, the gray literature, and all other forms of research objects. And let me just say that my presentation and the examples I will give you come very largely from my perspective as a member of a U.S. federal science agency. So please keep in mind that many of
00:48
the examples I show you will be framed from that perspective. So why do open science in the first place? We know from open access that the big push was to provide free access to the
01:09
journal article for the purpose of accelerating scientific discovery, for democratizing science, and all that's true for open science, but open science is also much more about the concept of
01:26
reproducibility. And for science to be very high quality, for it to have integrity, for it to be believable, you need to have reproducibility. And to have reproducibility you need access to much more than the publication. You need access to all of the underlying research objects
01:45
in the science landscape. You need access to software, to data, to all gray literature, and many, many other forms of research objects. And that's, if you can see it from where you are, kind of humorously displayed in this cartoon where you have two scientists, and one has written
02:04
out a very elaborate formula, and in the middle of that formula he wrote, then a miracle occurs. And his colleague, his somewhat doubtful colleague says, I think you need to be more explicit here in step two. And so that's really what's meant by reproducibility. Don't just accept that a miracle
02:25
occurs, require that they lay out how that miracle occurred, and so that's what we're trying to get at with reproducibility. So this is just another illustration of the same concept,
02:40
and this came from a PLOS Biology article in 2016 which described there being three pillars of open science, data, code, and papers. And of course we're all very familiar with papers, that's largely the journal article, but also the gray literature. But there's so much more in terms of
03:02
what really went into the science, and enabling open science means that you have to have access to the scientific code, the data sets that underlie it, and so forth. And let me just tease this out a little bit. In terms of papers, we have the journal article, the conventional literature, but we also have gray literature, the technical reports, the patents, the many forms of
03:24
literature that comprise that. And if you really think about it, data and code, with a few exceptions, are not commercially published entities. So I think for this community we should really start thinking about data and code
03:43
not necessarily as literature, but as gray research objects, as gray science, and perhaps as a mission of this organization we should even think more broadly than just literature. In any case, there are a large number of initiatives and communities that are
04:04
cropping up all over the world really to serve open science. Some of those are nationally sanctioned by governments, some are organically cropping up to serve certain communities. In Europe we have the European Open Science Cloud; in the U.S., the Subcommittee on Open Science
04:23
under the National Science and Technology Council; the not-for-profit Center for Open Science out of Charlottesville, Virginia; DataONE, which the National Science Foundation has funded around earth observation data; and in the realm of open source software we have this very organic
04:40
community under GitHub where so much open source software development occurs. And these are just illustrative; there are many, many others. But again, speaking from my experience in the Department of Energy, much of this movement came about from open access centered around the
05:02
journal article. And these policies go back at least 10 to 15 years, where first we saw research universities putting out policies that any journal articles authored by their funded researchers needed to be made publicly accessible, mostly through deposits in the
05:22
institutional repositories. And then funding agencies like mine, the U.S. Department of Energy, have started requiring the same kind of deposits for articles funded with their research monies. And most of those would fall under what I would call green open access,
05:41
and that is where the author deposits into an institutional repository. Publishers, being very mindful of this movement, of course, have established the business model of gold open access, where authors will pay anywhere from $500 to $5,000 to the publisher to
06:01
make the article immediately and freely accessible. And then there are various other kinds of initiatives like SCOAP3 in the particle physics community, where physics-related organizations have redirected their journal article subscription money to pay publishers en masse
06:23
to make entire journals openly accessible. So back to my experience in the U.S. government, and I'll refer here to the Office of Science and Technology Policy memo of 2013, which came from the President's Science and Technology Advisor to the heads of all federal
06:46
science agencies, which told them to develop public access plans to lay out how they were going to provide increased access to the journal articles and to the digital data that result from
07:00
their research. And with respect to publications, that was defined as public access where the public can read, download, analyze in digital form final peer-reviewed manuscripts or final published documents using a 12-month post-publication embargo period. Again, that started in 2013.
07:20
That was under President Obama. But what we're seeing in this current administration is that same policy is being reinforced. There have been some working groups and now a subcommittee under the National Science and Technology Council, which essentially are carrying that same policy forward. So we think it's a fairly secure policy at this point in time. In the U.S. government,
07:47
of course, you have these agencies which are investing tens and tens of billions of dollars in research. And those tens of billions of dollars of research result in journal articles. And on an
08:01
annual basis, these agencies produce somewhere between 150,000 and 200,000 journal articles per year. And you can see some of the top agencies there, the National Institutes of Health, the National Science Foundation, my agency, the Department of Energy, producing 20,000 to 25,000 articles a year and so on. So we have a very, very large body of journal articles resulting from
08:24
federal research funding and which are all subject to this policy that told them to make these articles accessible. So in DOE, we, just like every other agency under the auspices of that memo, had to produce a public access plan, which we did. And we launched it in July of 2014.
08:47
And it defined our model for increasing access to publications and data. I've mostly focused on publications here, but the policy did require that agencies lay out how they would provide increased access to the digital data sets. And our Secretary of Energy at that time,
09:05
Ernest Moniz, put out a memo to his direct reports, the associate directors of various programs in the national laboratories saying that researchers needed to provide their accepted manuscripts to OSTI, my organization, the Office of Scientific and Technical Information.
09:23
It told them that they need to develop data management plans for any funding proposals, to say how the data would be managed, preserved, and made accessible, and that these requirements went into effect in October 2014. On the publication side, we had to have a tool to make
09:40
these things accessible. And we developed what we call the DOE Public Access Gateway for Energy and Science, or DOE PAGES, which we started in 2014. And over that time frame, we have acquired 62,000 articles that are now freely accessible, that never would have been made accessible had it not been for this policy. And like every other agency, our model is one
10:06
where we're implementing this so that public access occurs within 12 months of publication. So in DOE, we have 17 national laboratories. You've heard of many of them: Los Alamos National Laboratory, Sandia National Laboratory, Argonne, Berkeley, Oak Ridge National Laboratory,
10:27
17 laboratories. And in the very first year of implementation of this policy, those labs reached a median comprehensiveness level of 30.2 percent. They were working with OSTI to make
10:41
30.2 percent of all the articles they produced accessible. Pretty respectable, being the first year of implementation, and most of that was simply socialization, where we're telling these people what the requirement is. So we couldn't expect to have a huge number in the very first year of implementation. In the second year of implementation, those labs had a median
11:03
comprehensiveness level of almost 50 percent. And in the third year, they had reached almost 70 percent. So very, very good progress by our national laboratories in making these 62,000 articles accessible. Now, the benchmark in all this across the entire government is the
11:21
National Institutes of Health, which had been doing public access long before that 2013 memo. In fact, they had been doing it even before 2008. And they've reached a comprehensiveness level of somewhere in the range of 85 percent. So that's the goal that we're striving for with public access.
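As a side note on the metric: comprehensiveness here is, roughly, the share of the articles a lab produced that ends up publicly accessible through OSTI, with the program-wide figure taken as the median across the laboratories. A minimal sketch with made-up counts:

```python
from statistics import median

# Made-up per-lab numbers, for illustration only: articles a lab published in a
# year versus accepted manuscripts it submitted to OSTI for public access.
labs = {
    "Lab A": {"published": 1200, "submitted": 380},
    "Lab B": {"published": 800, "submitted": 410},
    "Lab C": {"published": 2500, "submitted": 700},
}

# Comprehensiveness for one lab = submitted / published; the figures quoted above
# correspond to the median of these per-lab rates.
rates = {name: counts["submitted"] / counts["published"] for name, counts in labs.items()}

print({name: f"{rate:.1%}" for name, rate in rates.items()})
print("Median comprehensiveness:", f"{median(rates.values()):.1%}")
```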
11:43
So as I said, I wanted to start out with publications and focus on that primarily, but there's so much more to research these days in the modern science landscape. The final paper is sort of the culmination of that, but so much more went into it: the underlying data sets, the methods, the software, and so forth. And here you sort of see the model
12:06
of how DOE does its R&D investment. We put $12 billion a year into R&D. That money flows to our 17 national laboratories that you see here and to several hundred grantees, mostly research universities, in any given year. And the most immediate outcomes of this investment are the
12:25
various kinds of scientific and technical information you see here, text, which we've talked about, which includes the journal articles and the gray literature, technical reports, patents, and so forth, data, various kinds of data sets, and software code.
12:41
We estimate very conservatively that there are about 50,000 such STI products produced annually. And it's the mission of my organization, the Office of Scientific and Technical Information, to make all of these research objects, all of the unclassified ones (DOE does a lot of classified work, but this is all about the unclassified side of it), to make all of those research
13:01
results as accessible and useful as possible. And so ultimately our mission is open science. So we kind of have it nailed down how we make publications accessible, but this challenge of making data and software accessible is a little bit of a different nut to crack.
13:23
And all of you, I'm sure, have heard of the FAIR principles of data, findable, accessible, interoperable, reusable. So we certainly try to promote all of those to ensure discovery and access to data. And I think the first and foremost of these four principles is findable.
13:44
The information has to be findable for it to be of much use. And to us, findable means that those research objects like data sets and software need persistent identifiers like digital object identifiers and very good metadata. And I'll talk much more about how we go about that
14:01
in the next slide. But this is very much a two-way street. We can do all we can possibly do to sort of set up the infrastructure for giving identifiers to these research objects, but the research community itself has to play its part, too. When an author writes a research paper and they refer to a data set or software, they need to be the ones to have insisted that
14:24
they get a DOI for those research objects. And organizations like OSTI will help them do that, but they need to be in the practice of getting those digital object identifiers and citing them in their papers. Otherwise, if they're not citable, they're not going to be
14:42
picked up in the Google indexing of all this content, so it's not going to be listed there for discoverability. So especially for those of you in research universities, I really encourage you to encourage your authors to make sure that they're citing
15:01
these research objects. That's the key to it. But how do we go about giving these identifiers for data and software? At OSTI and DOE, we rely very, very heavily on digital object identifiers, and in that, on our relationship with the international
15:22
organization that supports the issuance of digital object identifiers, called DataCite. And so the way it works is we have data clients across DOE who have data sets that they want to be cited. They provide metadata to us. We then work with DataCite to get a DOI
15:41
assigned for that data set. We supply that DOI back to the data client, who is then able to take that DOI and cite it in journal articles and in other places. We also take those DOIs and that metadata and include them in our databases, which are then crawled by Google.
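To make that workflow concrete, here is a minimal sketch of what a direct registration call against the DataCite REST API can look like. This is not OSTI's actual client interface; the prefix, credentials, landing page, and metadata values are all placeholders.

```python
import json
import requests

# Hypothetical values throughout: prefix, credentials, and metadata are placeholders.
payload = {
    "data": {
        "type": "dois",
        "attributes": {
            "prefix": "10.5072",                    # test prefix; suffix is auto-generated
            "event": "publish",                     # mint and make findable in one call
            "titles": [{"title": "Example neutron scattering data set"}],
            "creators": [{"name": "Doe, Jane"}],
            "publisher": "US DOE Office of Scientific and Technical Information",
            "publicationYear": 2018,
            "types": {"resourceTypeGeneral": "Dataset"},
            "url": "https://example.gov/dataset/landing-page",  # landing page the DOI resolves to
        }
    }
}

resp = requests.post(
    "https://api.datacite.org/dois",
    data=json.dumps(payload),
    headers={"Content-Type": "application/vnd.api+json"},
    auth=("REPO.ACCOUNT", "password"),              # repository credentials (placeholder)
)
resp.raise_for_status()
print("Minted DOI:", resp.json()["data"]["id"])
```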
16:01
So it really sets the stage for a much broader discovery. We have about 40 DOE clients for whom we're providing these services already. We have about eight interagency clients; through our membership in DataCite, we're able to provide DOIs for data sets from the Department of Transportation, the National Institutes of Health,
16:23
and we do those kinds of things on a cost-reimbursable basis. Since 2012, when we started doing these DOIs, we've issued 76,000 DOIs for data sets, a significant number for improving discovery of data sets across both DOE and other agencies.
16:48
So that's what we've done for data sets, and we've been doing that, as I said, since 2012. Last year, we turned our attention to software, because modern science depends almost completely on data, on the supercomputers that generate
17:07
data and the software that drive those supercomputers or other instruments to produce data. And so in the realm of open science, it's not just necessary to have access to the data, but also to the software that helped to generate those data.
17:25
And so in 2017, we launched this product, DOE Code, and I could go on and on about it, but it is essentially a platform where DOE software scientists are able to develop code and import it into DOE Code. Or they're able to go into DOE Code and do
17:46
the development there because we have our own GitHub and GitLab installations within DOE Code. Or if they're already developing code on GitHub, for example, we will go out to GitHub and scrape the metadata from GitHub and pull it into DOE Code. So our goal is to have a very
18:03
comprehensive collection of software within DOE Code. And one of the features here is that, just like we assign DOIs for data sets,
18:21
we do the same thing for software. So anybody who has produced a software package and wants to be able to cite that software in a journal article, the key to doing that is the DOI, so we will assign DOIs to software as well. Since its launch, we have about 1,800 projects in DOE Code. That's not comprehensive, but it's a big step forward in making this software accessible.
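As a rough illustration of the GitHub harvesting idea mentioned above, here is a minimal sketch of pulling a repository's metadata from the public GitHub API. The repository name and the field mapping are invented for illustration and are not DOE Code's actual pipeline.

```python
import requests

# Hypothetical repository slug; the fields shown are a small subset of what
# the GitHub repositories API returns.
owner, repo = "example-lab", "example-simulation-code"

resp = requests.get(
    f"https://api.github.com/repos/{owner}/{repo}",
    headers={"Accept": "application/vnd.github+json"},
)
resp.raise_for_status()
info = resp.json()

# Map a few GitHub fields onto the kind of metadata a software record needs
# before a DOI can be assigned to it.
record = {
    "title": info["name"],
    "description": info["description"],
    "repository_url": info["html_url"],
    "license": (info.get("license") or {}).get("spdx_id"),
    "last_updated": info["pushed_at"],
}
print(record)
```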
18:46
So all of these things together are setting the stage for a very tangible illustration of open science. A lot of these things are kind of textbook, but this is what we
19:04
see as where the rubber meets the road. A person comes into OSTI's databases at OSTI.gov, and they find a publication. And because of the work we've done to issue digital object identifiers to data sets and software, they're able to seamlessly link from the publication
19:22
that they find to the underlying data set and the underlying software. And so on the publication side that you see over there, that could have been a journal article or it could be gray literature. So gray literature plays a very important role in this, and I think we need to really think about that, so that software and data are cited in gray literature
19:44
just as much as they are in journal articles. But the idea is that from one platform you're able to go to all these diverse research objects seamlessly, and that's really the definition of open science. So besides some of the things we've done to enable broader discovery with the
20:03
examples that I just showed you, we're taking advantage of other things that both the private sector and government are doing to broaden the discovery of DOE's data sets and software. Many of you have heard of Google Dataset Search, a beta product launched recently,
20:21
and what they need to be able to find data sets is DOIs in the constituent collections like those at OSTI. And so were it not for the DOIs and the metadata that we're assigning for data sets and OSTI products, Google couldn't find these things. We've heard anecdotes from some of our data clients who said, hey, my data is showing up in searches of Google Dataset Search, and that's because of what we've done. And so that's really a means of broader discovery for data.
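For context, Google Dataset Search reads structured metadata from data set landing pages (schema.org Dataset markup) alongside identifiers like DOIs. Here is a minimal sketch, with invented values and a made-up DOI, of the kind of record a landing page can embed for crawlers to pick up.

```python
import json

# Hypothetical record: schema.org/Dataset markup of this general shape is what
# dataset search engines typically read; all values below are invented.
dataset_jsonld = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example combustion simulation outputs",
    "description": "Illustrative data set record with a persistent identifier.",
    "identifier": "https://doi.org/10.5072/example-doi",
    "creator": {"@type": "Organization", "name": "Example National Laboratory"},
    "license": "https://creativecommons.org/licenses/by/4.0/",
}

# Embedded in the landing page HTML so crawlers can pick the record up.
html_snippet = (
    '<script type="application/ld+json">\n'
    + json.dumps(dataset_jsonld, indent=2)
    + "\n</script>"
)
print(html_snippet)
```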
20:45
And also in the government, there have been efforts, because no one wants to try to go across 20 or 25 different U.S. science agencies
21:05
and find their data and software collections. So out of OMB, they set up data.gov and code.gov to be sort of combined catalogs of data collections and software collections across the entire U.S. government. And my organization in the Department of Energy feeds
21:27
the data collections and software into those catalogs. So thereby we're increasing discovery of DOE's content there, too. And then there are the very powerful federated search products like science.gov and WorldWideScience, where the advantage that these two products hold over
21:45
Google, for example, is that they are able to search in real time, whereas Google is usually based on an index that might be days or weeks old. So they're searching in real time, and they're searching very authoritative sources. Sometimes Google may be searching commercial sources or other things that may not be what we call true science, so these federated products are really getting at some hardcore science.
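To illustrate the real-time, federated model in the abstract, here is a toy sketch of a portal fanning one query out to several authoritative sources at search time and merging the hits. The endpoints and response shape are invented and are not science.gov's or WorldWideScience's actual interfaces.

```python
import requests
from concurrent.futures import ThreadPoolExecutor

# Entirely hypothetical endpoints and response shape, for illustration only.
SOURCES = {
    "source_a": "https://example-agency-a.gov/api/search",
    "source_b": "https://example-agency-b.gov/api/search",
}

def query_source(name, url, term):
    # Each source is queried live at search time rather than from a crawled index.
    resp = requests.get(url, params={"q": term}, timeout=10)
    resp.raise_for_status()
    return [{"source": name, **hit} for hit in resp.json().get("results", [])]

def federated_search(term):
    # Fan the query out to all sources concurrently, then merge the hits.
    with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
        futures = [pool.submit(query_source, name, url, term) for name, url in SOURCES.items()]
        return [hit for future in futures for hit in future.result()]

# merged = federated_search("open science")
```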
22:00
And both of these products have shifted from just searching publications, the government and national scientific publications, to these other research objects, software and data. And tomorrow, in a breakout session on worldwidescience.org,
22:21
we'll hear much more about how that's being done. So stay tuned for that presentation. And in closing, in terms of next open science opportunities, I think there's a lot of potential to go beyond just the publications and the conventional and gray literature and the
22:42
software and the data to find other objects in the research landscape, like electronic lab notebooks, to assign digital object identifiers to the facilities and instruments that generate outcomes of research. So I see us heading in those directions in the future.
23:02
You already know about ORCID in terms of giving identifiers for individuals, but whereas I've talked about research objects, giving identifiers for people is just as important because we want to be able to connect people to the publications and the data and so forth. And that also sets up opportunities for collaboration with those people.
23:27
And finally, with artificial intelligence and machine learning, we're not even scratching the surface of what these kinds of tools and technologies can do for open science. In many ways, it's possible for them
23:44
to even replace the digital object identifiers by understanding what the data say and what the publications say, and to establish those connections for us. But then on the front end of that, where you're interfacing with this content, I see artificial intelligence really being
24:03
able to predict where science is going and where there are ripe opportunities for advances in science. So I think there's a lot of opportunity for machine learning, artificial intelligence, and tomorrow we have a session on worldwidescience.org and we'll have a couple of friends and colleagues from IBM Watson
24:22
who will talk about some of the potential here.