Learnings on encouraging open data sharing in science with open source tools
Formal Metadata
Number of Parts | 14
License | CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers | 10.5446/53432 (DOI)
Transcript: English(auto-generated)
00:01
Hi everybody, my name is Lilly Winfree, and I'm really excited to talk to you today about learnings on encouraging open data sharing in science with open source tools. I'm the product manager of the Frictionless Data project, which is an open source project overseen by the Open Knowledge Foundation. Today I'll be talking to you about some of the collaborations with scientists that we've been doing as part of this project. And hopefully by the end
00:24
of this talk you will have some tips and tricks for setting up your own collaborations to promote open science. I have a lot of links within this talk, so I've published the slides here, and they're also published with the conference. And here's my contact information if you would like to reach out and ask me any questions;
00:43
I'd be happy to talk. So the theme for this talk is "let's work together for an open future." This theme works really well with the values of the Open Knowledge Foundation, which works towards a fair, free, and open future. It also works well with the values of
01:02
the Frictionless Data for Reproducible Research project. Today we'll be going into some of the details about this Frictionless Data project. So what is this project? We focus on removing the friction in research data so you can move from data to insight faster. And we are an
01:24
open source project. Here's the link to our GitHub repository where you can see our code. And we're also community focused. And by that I mean that we really depend on our community to give us feedback about our work and to ask us for improvements or changes as they need them.
01:41
This is a picture of my co-workers, my team that works on this project with me within the Open Knowledge Foundation. So I've been saying frictionless data, but what are some frictions in data? And if we were together in person, this is the time when I would ask you to answer this question. But hopefully you are watching this and you will pause and think about your answer
02:04
to this. Here is my answer. When I think about friction in data, I think about questions like: What does this column name mean? How was the analysis done? Who created the data? And what do I need to know to check the quality of the data? These are all issues that make data
02:23
more difficult to work with. You know, you can think of them like data cleaning tasks, and these tasks are oftentimes very time consuming and difficult. So the inspiration behind this project was to make all of these tasks easier so that it's easier to understand and work with your data. This project is a set of specifications for data and metadata
02:46
interoperability. It's a collection of open source software libraries, and it's a range of best practices for data management. Importantly, it's platform agnostic, meaning it's very interoperable. The big picture question that I want us to focus on today during this talk
03:04
is how can we collaborate to solve research data management problems? And to answer that question, we're going to be talking about two different types of collaborations that we do as part of the frictionless data project. And these types of collaborations are possible because we have funding from the Sloan
03:23
Foundation. So we actually do three different types of collaborations as part of this grant. The first is a fellows program, which is where we work with early career researchers to help them become advocates for open science and frictionless data tooling. The second is a
03:43
novel open tooling effort based on our open source code. And then the third are the pilots. These are more intensive, hands-on collaborations where we're working with a research group that has an identified data management problem that they need to get resolved. Today I'll be telling
04:02
you first about the pilots and giving you a specific use case, and then I'll end by telling you about our fellows program. So what does a pilot mean? A pilot is a collaboration between developers and researchers, whom I'll call users. And this is using a real-life use case.
04:25
So we work with a research group that has a use case that is a real problem in their research. This is a win-win situation. It solves a problem for the researchers, but it also gives us feedback and helps us improve our code. And all of this is done in the spirit of openness,
04:45
so we try to be as open and transparent as we can when we're doing these pilots. My example use case of a pilot for you today is frictionless data management with BCO-DMO. BCO-DMO is the Biological and Chemical Oceanography Data Management Office.
05:02
They're funded by the NSF, and researchers submit data to BCO-DMO, whose data managers then clean and process that data, host it, and provide access for the public or for other researchers. BCO-DMO really values FAIR data, so we were going to work with them
05:25
to make this data cleaning process easier and try to keep in mind the FAIR data principles. BCO-DMO's main problem is that they get really messy data. They get really cool data as well,
05:40
but it's oftentimes very messy. So what they wanted to do is take this messy data, clean it, and then host it. And very importantly, they wanted to make this process reproducible. So this was the problem that they came to us with, and we decided to do this pilot collaboration where we were going to work together and try to solve this problem
06:01
using the Frictionless tools. So our solution here was to create a Frictionless data processing pipeline that answers the questions that the data managers needed answered. The data managers need to know what the data is. They need the metadata or a description. Does the data seem valid? Are there any missing values or values that don't
06:24
make sense? Does the data need to be transformed or cleaned in some way? And how can this process be transparent? And it was also really important for the data managers to be able to talk with the researchers or give them feedback about their data while the researchers were submitting their
06:41
data. Because as you can imagine, as a researcher, you submit your data, then you go off and forget about it. And it's difficult to get that feedback to researchers when they're on a boat in the middle of the ocean. You can't really reach out to them and say, hey, did you remember to record the metadata? So it was important for BCO-DMO to be able
07:01
to communicate with the researchers during this data ingest process. There are four ways that we worked with BCO-DMO that make up the Frictionless framework. This is our main Python code, and these are the four main functions that underlie it. If you want to look at that code in more detail, there's a link at the bottom
07:23
of the slides to our repository. And so the four main functions here are describe, which is where you're able to automatically infer and edit a file's metadata. So you can pass in a file and get back metadata and a schema that describe that file. Extract, which reads and
07:43
normalizes data. Validate, which detects errors in a file and gives you a report stating what the errors are so you can fix them. And transform, which changes a file's data and metadata or cleans it. All of these are based on the Frictionless specifications for data and
08:03
metadata. I'm not going to go into a whole lot more technical detail here. If you would like more details, please contact me and I'd love to chat with you about this. I do want to talk about how we collaborated in the open, because I think this is one of our biggest wins from this collaboration. We used GitHub to asynchronously communicate between
08:27
our developer team, BCO-DMO's developer team, and their data managers. Here's one issue that I thought exemplified this, and the link to it is here if you want to look at it later.
08:42
So one of the great things about using GitHub to communicate is that this is a public document and you could go out and read this. Or future me could go and read this and say, what did we do in the past? I've forgotten. Let me go read through this. So it's great for communication between teams. So this example is where Amber,
09:03
one of the BCO-DMO data managers, was hoping to get some support for time zones and the datetime type. This is a common data issue that BCO-DMO data managers deal with. So she alerted us to this issue. And then here's more conversation text from this issue between
09:22
our developer and the BCO-DMO developer, where they were able to communicate and really discuss this issue in the open and get it resolved. Another reason why I'm showing this example is because we used GitHub's Kanban-style board to document our work and to keep track of
09:43
it. So for instance, we had a "ready to work on" section ordered by priority, and an item could then move to "waiting for BCO-DMO comments." And if you want to see what this board looks like, you can click on this link. This worked really well for us, so that's why I'm recommending it for other collaborators.
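To make the time zone and datetime issue mentioned above concrete, here is a minimal sketch of why a datetime field needs its time zone. It uses only Python's standard library and is not the Frictionless code or the fix discussed in the GitHub issue; the helper name `to_utc` and the sample timestamps are hypothetical.

```python
from datetime import datetime, timezone


def to_utc(value: str) -> datetime:
    """Parse an ISO 8601 timestamp and normalize it to UTC.

    Raises ValueError when the timestamp has no UTC offset, because a
    naive value cannot be placed on a shared timeline.
    """
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        raise ValueError(f"timestamp has no time zone: {value!r}")
    return parsed.astimezone(timezone.utc)


# Two hypothetical readings recorded with different local offsets.
reading_a = to_utc("2020-06-01T08:00:00-04:00")
reading_b = to_utc("2020-06-01T14:00:00+02:00")

# Once both are normalized, they can be compared directly.
same_instant = reading_a == reading_b
```

Rejecting timestamps without an offset is one possible design choice: it surfaces the ambiguity to the person submitting the data instead of silently guessing a zone.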
10:04
So overall, in this pilot, we were able to implement Frictionless code in the BCO-DMO system to create a pipeline where researchers submitted their data and that data was then described, meaning metadata and a schema were added. It was cleaned or transformed.
10:23
It was validated and then published. And now other people can access this data and understand all of these transformation steps that happened in a reproducible way. If you want to read more about that pilot or any of our other pilots, please go to our blog.
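The pipeline steps just summarized (extract, describe, validate, transform) can be sketched in a few lines of Python. This is a simplified stand-in written with the standard library only, not the Frictionless library itself: the function names are borrowed from the four functions named in the talk, while the sample CSV, the missing-value markers, and the type-inference rule are all assumptions made for illustration.

```python
import csv
import io

MISSING = {"", "NA", "nd"}  # hypothetical missing-value markers

# Hypothetical messy submission: one missing depth, one non-numeric temperature.
RAW_CSV = """station,depth_m,temp_c
A1,5,12.4
A2,nd,11.9
A3,20,warm
"""


def _is_number(value):
    try:
        float(value)
        return True
    except ValueError:
        return False


def extract(text):
    """Read CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def describe(rows):
    """Infer a field type from the first non-missing value in each column."""
    schema = {}
    for name in rows[0]:
        values = [row[name] for row in rows if row[name] not in MISSING]
        first = values[0] if values else ""
        schema[name] = "number" if _is_number(first) else "string"
    return schema


def validate(rows, schema):
    """Return (row number, field, problem) for each missing or mistyped value."""
    errors = []
    for number, row in enumerate(rows, start=1):
        for name, kind in schema.items():
            value = row[name]
            if value in MISSING:
                errors.append((number, name, "missing value"))
            elif kind == "number" and not _is_number(value):
                errors.append((number, name, "not a number"))
    return errors


def transform(rows, schema):
    """Replace missing markers with None and cast valid numbers to float."""
    cleaned = []
    for row in rows:
        new_row = {}
        for name, kind in schema.items():
            value = row[name]
            if value in MISSING:
                new_row[name] = None
            elif kind == "number" and _is_number(value):
                new_row[name] = float(value)
            else:
                new_row[name] = value  # left as-is for the submitter to fix
        cleaned.append(new_row)
    return cleaned


rows = extract(RAW_CSV)
schema = describe(rows)
report = validate(rows, schema)
cleaned = transform(rows, schema)
```

In this sketch the validation report is what a data manager could send back to the researcher, and the schema plus the cleaned rows are what would be published alongside the data so that each step stays reproducible.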
10:41
And I'm going to give a shout out that we have another pilot going on right now with the Dryad Data Repository. So stay tuned. In a few more months, we'll have a blog post about that as well. Okay, now I'm going to switch gears and talk about our fellows program, which is a different type of collaboration where we are really focused on enacting cultural change,
11:02
whereas with the pilots we're focused more on a product. What is the fellows program? The fellows program is a nine-month paid fellowship where we are teaching and training scientists. Our main goal here was to create advocates for open science and for Frictionless Data. We focus specifically
11:21
on early career researchers for this program. Our major goals were all related to the umbrella term of open science, starting with teaching people, so they were learning about open science concepts and also about open source software. We wanted to create advocates for
11:42
open science, and this is really where that cultural change aspect comes in. The fellows also write blog posts, so they do a lot of writing, and then they present workshops and talks. And importantly, we're also building community, which is really important when you're
12:00
trying to create resilient open science advocates. Who makes a good frictionless fellow? We decided to focus on early career researchers, and we tried really hard to keep in mind diversity, and I'm pointing this out because I think that it made this program more impactful
12:20
to have a diverse group of fellows. We have people from around the world, and I think one of the best things about this program is sitting and listening to all of the fellows talk together about their ideas of open science, because people have different experiences around the world, and it's really important to hear everybody's experiences. The fellows needed to
12:42
have their own data that they can use. We were not really domain specific, you know, just scientists, as long as they had some data they could use. And they needed to have some experience with programming, but not a lot. And importantly, they had to be passionate about open science.
13:02
I want to briefly introduce you to four of our fellows from our first and our second cohorts. So first we have Ouso Daniel, who lives in Kenya, and he's a molecular biologist. We have Monica Granados, who's in Canada, and her background is in ecology and open science policy.
13:21
Katerina Drakoulaki in Greece, who studies language, cognition, and music. And Daniel Alcalá López, who's in Spain and is a neuroscientist. So I just wanted you to see that we have a wide variety of people from around the world, with different scientific interests and at different career stages as well.
13:44
What did the fellows achieve? They have a lot of open science discussions, and maybe my favorite part of the program is listening to them talk about different things. For instance, here are some notes from our discussion about what open access means. And they also write open science blogs. Again, this one was on the theme of open access,
14:05
so this was during Open Access Week. They also write tutorial blogs, which are more technical writing. And these are great for the fellows because they get experience doing technical writing, but they're also great for us because now we have these blog resources that we can
14:22
show to other users. And finally, they give workshops. And again, these are great for the fellows because they get the experience of giving a workshop, but great for us because now we can give these workshop videos to other people as well. Everything we do is open, so all of this content is openly licensed. Some of the collaboration lessons that I learned
14:47
from the fellows that I want to share with you today are to meet people where they are and make it useful for them. These are tied together. This program really is about making some cultural change in academia with respect to open science, and that's never going to happen
15:05
if we can't show why open science principles are useful for researchers and if we don't meet people where they already are. I like to say that doing one act of open science is better than an entire checklist, and if somebody's sharing a paper, then that's great. Meet people where
15:24
they already are. Be flexible and open, especially during the last year. We've all had to be more flexible, so I think that's a big lesson learned. Listen and have a learning mindset. The best thing about this group of people is that I get to sit and listen to them
15:43
and hear what they have to say, and they all need to have a learning mindset to learn how to code, learn how to talk to people about open science, and that has been really helpful for them. And then I want to say reuse existing materials. There's a ton of open science lessons
16:00
and teaching materials on the web that are openly licensed and that you can use. I have some examples at the end of this slide deck that I want to share with you all, and you can take those as well. Okay, and with that, I want to summarize all of my top collaboration tips from both the
16:21
pilots and the fellows, and the first one is make it a win-win situation. This means a situation where what you want is the same as what the user wants. This might be difficult to achieve, but getting as close as you can to this situation will make everybody happier. I like to focus on early career researchers because they are the future, and I think especially if you're trying to make
16:44
some cultural change, then that is a really great place to start. Think about diversity and inclusion. I included this here because I think that open science has a long way to go to be truly equitable, and that we really need to make an effort to ask questions like
17:01
who's in the room when open science policy is being made, whose voice is being heard, and whose voice isn't being heard. You know there's a history of open science policies being very focused on Europe and the United States, and that leaves out so many researchers around the world. So I really think that open science needs to take a hard look at itself and work on
17:26
its equity. The next thing I want to say is to embrace openness, which is tied into the next bullet point: don't reinvent the wheel. There are so many existing open resources out there, and if you are creating your own resources, please license them openly so other people can
17:42
use them as well. And I also recommend having a code of conduct. Here's a link to our code of conduct if you would like to see an example. Again, it's openly licensed for you to reuse if you would like. And with that, I want to thank you all for your attention. Here again are all of the links: links to the slides, our code, our community chat, videos, our documentation,
18:07
Twitter, and you can email me. If you would like to talk about any of this in more detail, get more of a technical demo, or talk about collaborating with us in the future, then I would love for you to reach out to me. And I just want to briefly show you
18:24
the list of the resources that I used for making the fellows program. It's included in the slide deck if you would like to see that as well. And with that, I want to say thank you so much.