Open for what? Looking beyond 'open' as the goal for data
Formal Metadata
Title |
| |
Title of Series | ||
Number of Parts | 30 | |
Author | ||
License | CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor. | |
Identifiers | 10.5446/58441 (DOI) | |
Publisher | ||
Release Date | ||
Language |
Content Metadata
Subject Area | |
Genre |
Transcript: English(auto-generated)
00:03
Hello everyone, it would have been lovely to meet you all in person, but I appreciate anyone who has an opportunity to listen to this recording. And considering this is an asynchronous keynote opportunity, I've kept my remarks brief and hope that we can take this offline for some more interaction.
00:22
So as my title says, today we're going to be talking about rethinking our goals for open data. So first of all, who am I? Hello everyone, my name is Daniella Lowenberg, and for those who I haven't had the opportunity to meet: over the last 10 years, I've been involved in open data through various policy, infrastructure, and advocacy initiatives.
00:42
Most recently I oversaw product development at Dryad, the open data repository, and I was the principal investigator for the Sloan Foundation-funded Make Data Count initiative, focused on the development of data metrics. I also co-founded the FORCE11 and COPE Research Data Publishing Ethics Working Group. And
01:03
before that I was at PLOS journals where I implemented the data policy across the operational journals. But today I am here representing as an IPA at the Administration for Children and Families at the US Department of Health and Human Services. So in other words, what does that mean? It means that I
01:23
am on loan from the University of California Office of the President to ACF, where I'm focusing on open data strategies. And so, like the little emblem here says, I am at OPRE, the Office of Planning, Research, and Evaluation within ACF, where we focus on data and evidence to advance the well-being of children and families.
01:47
And so today I'm hoping to weave similarities between experiences through academic research data and administrative data to speak kind of agnostically about where we are with open data. So to kick it off, let's start with the basics and the first premise of this is data are important.
02:05
Data are everywhere, data are the oil, data are monetized and are going to keep being monetized, they are across every industry, and they are driving every industry. And the concepts of open data, data infrastructure, and data sharing are not limited to academic research.
02:23
And data are creating jobs. So we've seen a rise in data science programs as well as divisions within organizations. The University of Virginia here is notable, of course, because they created the School of Data Science, and they published this blog post noting their prediction that there will be
02:41
a 31% increase in jobs in data science, which will be the largest growth in jobs. And so I'm sure that even though this number is staggering, it's replicable across all other countries that are investing in this as well. And so I'm pointing this out really because it should have us thinking about how we're going to support the influx of data capacity.
03:03
And of course, there are the policies. So there are research funder policies, like the NIH with their upcoming data management and sharing policy, which goes into effect this coming January, as well as the recent OSTP Nelson memo, which makes data management and open data a priority.
03:21
We've also been meeting the moment with how data have advanced treatments and have allowed us to track strains and the epidemiology of the COVID pandemic. And we also know that data are driving advances in the climate crisis and increasingly we're seeing more and more requirements for data sharing across various climate fields because of this.
03:43
In the United States there's the Foundations for Evidence-Based Policymaking Act, which came into legislation in 2018, and that calls for creating data infrastructure across the agencies: things like creating a chief data officer, having metadata catalogs
04:02
of data, and having public access plans for data. Okay, so premise one, we all agree: data are important, the world is depending on them, and we're starting to see everyone embracing this. But the next premise is: yes, I'm here to talk about open data, but I believe that open and accessible are means to the end, or as my title says,
04:23
open is not the goal for data. So taking a step back, what are data? Well, data are factual information, so that can be qualitative or quantitative. And I think that there are two points worth noting specifically about data, which is that data can and should be
04:41
linked and don't need to be valuable just on their own, but data do each contribute unique information, so each data set is holding a unique body of information. But the idea of open data, and of putting in mandates and infrastructure for it, is not new. Genomics researchers saw the need for this sharing, and they started building repositories in the 70s and 80s.
05:06
The NCBI at the NIH started launching these repositories in 1988. Ecology and evolution researchers, as well as physics researchers, have been sharing their data for decades, and their journals started to pick up on this as well by releasing policies.
05:20
In the 2010s we started to see larger, multidisciplinary journals begin implementing open data policies, and we started to see funders have a slighter shift in this. Institutionally, we've been seeing a big emphasis on data management and data rights and retention as well. We've also seen that this is not just in the US.
05:43
Beyond the US, we are seeing in Japan, Europe, Denmark, Sweden, and Germany large national-scale approaches to data sharing, as well as nationwide infrastructure and repository networks. Countries all over the world are prioritizing open data infrastructure. This is just a snippet of that, but we're seeing policies and strategies for government administrative data as well
06:07
as research data. And I raise this because I often hear at conferences that in the academic world folks refer to open data or data publishing as the new track or the innovative track, as though this is a new
06:22
topic or idea, but in the larger scheme of the world it's not, and we're actually playing catch-up. But in academic research we are seeing that folks are moving beyond the concept of just talking about open, and there's been a shift to focus on both quality and the responsible handling of data.
06:40
So of course there's FAIR (findable, accessible, interoperable, reusable), which is the most popularized acronym, but it's also the least prescriptive, as it's a set of principles as opposed to workflows or guidelines. And so there are various disciplines that are looking into operationalizing FAIR and thinking about what it looks like across various data types.
07:02
Recently there was also the release of the CARE principles, thinking about how to sensitively handle Indigenous data. And last year, as I mentioned, FORCE11 and COPE teamed up and we released working group recommendations, policy text, and flowchart guidance on the handling of data across repositories and publishers.
07:23
Last year, and that's this last year, 2022, the NIH announced the GREI initiative, the Generalist Repository Ecosystem Initiative, which focused on supporting multidisciplinary generalist repositories in a coopetition. So that's to try and advance the practices for non-NCBI repositories, those that are not disciplinary repositories like, say, gene sequencing,
07:48
in opening up and managing their data. And this was important to come together on, to really support this new policy that's coming into play in January. But I want to make this very clear: open and accessible include secure and restricted-access repositories like
08:08
Dataverse and ICPSR. Institutional repositories, commercial entities on the industry side, and repositories at the agency level have invested more in making data available that require these access controls and secure environments. This is open data. Open does not have to mean that
08:27
it's available to everyone, but it should mean that the metadata are broadly available and findable. FAIR, CARE, TRUST, ethics: all of these principles and guidelines apply to data behind access controls. It's
08:43
important to reiterate this point, to not restrict our understanding of open data to public-use files. So yes, I'm trying to point out here that open and accessible are moving along swiftly and are intermediary points in trying to prove the value of data.
09:01
But even with this history and these advancements, the uptake and adoption of open data practices are not where they should be in academia. Much of this issue lies with the lack of compliance with funder and publisher policies, for instance focusing solely on data availability statements instead of on data curation, proper data publishing, and data citation. There are other barriers
09:26
like the cost of quality curation at scale, and the need for data management that accompanies that. And then there's also the need for large data infrastructure that doesn't yet exist. So it's important to recognize, for everyone who's
09:41
listening to this and everyone that we're talking to at this conference and beyond, that we have to invest in data infrastructure and make workflows as easy as possible for researchers. And this differs broadly by discipline. In a recent study by Tedersoo et al. that's linked here, they found that not only did the discipline of data affect open data practices, but so did the types of data within those
10:07
disciplines. So everything is kind of varied right now, but this is really to say that it's happening more than we think. And those old-school discussions of "I'm going to get scooped": we've moved past that, and the infrastructure has moved past that.
10:23
So, okay, we've acknowledged that data are important. I've acknowledged that open and accessible are key features of data. But then, in the grand scheme of things, why do we even want to push for open data? Why do we want this broad adoption? Well, broad adoption of open data is step one. Step two is embarking on a goal
10:43
that data are routinely used as evidence. So let's talk about how we get there. So again, data are factual information. Evidence are data that are relevant to a question and furnish proof that supports a conclusion. So raw data are not inherently evidence: data can be inaccurate, incomplete, or irrelevant.
11:07
And let's talk about an example of this, looking at a weather dashboard here. Dashboards provide data, and we rely on them in many aspects of our work and personal lives, but it's not a given that these data are sufficient to support a conclusion.
11:22
So these raw data can be factual points, but they don't evidence a specific question, and that's okay. It doesn't mean that these data don't provide value. So data out there in the world existing, that's great. It's not enough. Data being open and accessible, important, but still not
11:42
enough. Data archived in a repository with the citation, crucial, an absolute step that needs to happen and should really be what we're
12:01
Having data in a repository with a citation allows the community to assess, validate, and review the data, but this is still not enough. So the question is: why invest in hundreds of thousands of data sets being preserved and made available every year, and in the cost of the infrastructure to do so, if they're not usable?
12:21
And the community discussion has often centered on the success of policies, repositories, infrastructure, everything, based on the number of data sets published or data availability statements simply noting that data are available. But the value and return on investment lie in data usability, and without that we're really just running at a loss.
12:44
So how do we get from data to evidence? The key to that is making data usable. Features and attributes of data reuse are going to vary across disciplines and data types, but broadly speaking there are some key areas. The first is machine-readable and accessible file formats, and I don't mean that just for data files but
13:05
for the metadata as well. They need to be able to be executed on and run. When I refer to comprehensive metadata, I don't just mean the schema. I'm also thinking about a descriptive README and a data dictionary that allow someone to understand all the variables and how they were used.
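As a minimal sketch of what a machine-readable data dictionary makes possible, the Python below checks a CSV file against its dictionary so a secondary user can see at a glance whether every variable is documented. The column names, the dictionary layout, and the function name `check_against_dictionary` are hypothetical illustrations, not something shown in the talk.

```python
import csv
import io

# Hypothetical data dictionary: variable name -> description and expected type.
DATA_DICTIONARY = {
    "site_id": {"description": "Identifier of the collection site", "type": str},
    "temp_c": {"description": "Air temperature in degrees Celsius", "type": float},
    "count": {"description": "Number of individuals observed", "type": int},
}


def check_against_dictionary(csv_text, dictionary):
    """Return a list of problems; an empty list means the file matches."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []

    # Every column in the file should be described in the dictionary.
    for name in columns:
        if name not in dictionary:
            problems.append(f"undocumented column: {name}")

    # Every documented variable should actually be present in the file.
    for name in dictionary:
        if name not in columns:
            problems.append(f"documented but missing column: {name}")

    # Every value should parse as the documented type.
    for line_number, row in enumerate(reader, start=2):  # header is line 1
        for name, spec in dictionary.items():
            value = row.get(name)
            if value is None:
                continue  # column absent; already reported above
            try:
                spec["type"](value)
            except ValueError:
                problems.append(
                    f"line {line_number}: {name}={value!r} "
                    f"is not {spec['type'].__name__}"
                )
    return problems


# Illustrative inputs (made up for this sketch).
GOOD_CSV = "site_id,temp_c,count\nA1,21.5,3\n"
BAD_CSV = "site_id,temp_c,notes\nA1,warm,ok\n"

if __name__ == "__main__":
    print(check_against_dictionary(GOOD_CSV, DATA_DICTIONARY))
    for problem in check_against_dictionary(BAD_CSV, DATA_DICTIONARY):
        print(problem)
```

A check like this only works when both the data and the metadata are in formats a program can read, which is the point being made about file formats above.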
13:23
PIDs here, so persistent identifiers: we should be PID-ifying everything, and that's for findability. So for funders think about the Crossref Funder Registry, for institutions the Research Organization Registry or ROR, for people ORCID, and for data, DataCite, preferably.
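To make the idea of persistent identifiers concrete, here is a minimal sketch of two sanity checks a repository or curator might run on submitted metadata. The ORCID check uses the ISO 7064 MOD 11-2 check digit that ORCID documents for its iDs. The DOI pattern is only a loose shape check chosen for this sketch, not an official grammar, and the function names are hypothetical.

```python
import re

# Loose shape check for a DOI (an assumption of this sketch, not a full grammar):
# "10.", a 4-9 digit registrant prefix, a slash, then a non-empty suffix.
DOI_SHAPE = re.compile(r"^10\.\d{4,9}/\S+$")


def looks_like_doi(value):
    """True if the string has the general shape of a DOI."""
    return bool(DOI_SHAPE.match(value))


def orcid_check_digit(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for 15 base digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """True if a hyphenated or compact ORCID iD has a correct check digit."""
    compact = orcid.replace("-", "")
    if not re.fullmatch(r"\d{15}[\dX]", compact):
        return False
    return orcid_check_digit(compact[:15]) == compact[15]


if __name__ == "__main__":
    # 10.5446/58441 is the DOI listed in this recording's metadata.
    print(looks_like_doi("10.5446/58441"))
    # 0000-0002-1825-0097 is ORCID's fictitious example record.
    print(is_valid_orcid("0000-0002-1825-0097"))
```

Checks like these catch typos at submission time. They do not confirm that an identifier is registered, which would require resolving it against the relevant registry.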
13:42
Having REST APIs for pushing and pulling data, and also allowing for file previews, is essential. And then lastly, I've put proper data citation. So why would I put proper data citation for data reuse? It's because attribution is key. And while we don't know if data citation is going to be the proper metric or indicator for all types of data or disciplines of data,
14:05
the infrastructure is ready for folks to cite data, and I recommend checking out housing at all for recent bibliometric advancements in data metrics. You can find that through the Make Data Count website.
14:20
So promoting data reuse, though, is not limited to technical infrastructure or those technical features that I was just referring to. Here are some examples of how libraries, research offices, funders, publishers, agencies, anyone who's listening to this can support data reuse. And so note that open data best practices mean not simply tossing data over a wall into a repository and calling that a success,
14:45
but investing in the usability of the data, both for the submitters and for the secondary users of the data. So building up the capacity for researchers to use published data, to mix it, link it, run it, is as important as investing in folks publishing the data.
15:03
I'm going to give a second here so that you can check out the rest of these that I've been talking over. So when data are usable, when they can be run, computed against, and understood, then the scientific rigor
15:21
can be assessed, the data can be trusted and the data can be used with other sources. Importantly, the data can be developed as evidence. So, if open and accessible are means to the end, and reusable data principles are key to data becoming evidence, then evidence is the means to the beginning.
15:41
So looking back at this slide, here's each point about what it is that we should be thinking about with open data, but really the North Star here is that data are routinely reused as evidence to drive discovery. The principles of scientific discovery, the US Evidence Act and Open Data Act, and national and international initiatives globally
16:03
are all reliant on the ideas of data-driven discovery and building trust to advance policy and well-being. Importantly, in the current state of the world, when data are trusted and used to build evidence, and evidence is used to influence change, it helps promote a positive feedback loop, reinforcing trust and driving discovery along the way. Across open data initiatives
16:29
in academia, government, and cross-sector, the conversation has been distracted by making data open: put it in a repository, full stop. And there's a prime opportunity for us to shift our thinking and prioritization to meet the
16:43
infinite potential of using data as evidence, and this is reliant on shifting the conversation to data reuse. And it's happening. This isn't theoretical. Here in this great piece that was published as a blog, they point out examples from three different countries where this is happening. The
17:02
first in Brazil, where they opened up income and expenses to expose corrupt practices. The second in New Zealand, where infrastructure was down after a natural disaster and they were able to use open data maps. And the third in Australia, where they used open data globally to build evidence for the cause of the death of swarms of bees, saving the
17:23
ag industry that relies on pollination. And so closing this out I want to reinforce the importance of open reusable data, adding the word reuse into all open data conversations. I also want to press again that infrastructure does not mean technical, always.
17:42
It can include human capacity, data curation, and controlling costs, both for technical infrastructure and for users, and data reuse is really focused on data quality. Of course then I also put in rewarding, and rewarding includes investing in the bibliometric work and infrastructure to support data metrics.
18:02
And lastly is publicizing. We should be building on the success, noting when data-driven discovery has really changed society, science, whatever that may be, and then building off of it so that we can continue to do so. And so I hope I've evidenced why open and accessible are key concepts in the longer journey of data being reused to build
18:27
evidence and drive change and discovery. So please reach out if you have any questions, and I hope to continue the conversation with you there. And thank you so much for your time.