Learnings on encouraging open data sharing in science with open source tools

Video in TIB AV-Portal: Learnings on encouraging open data sharing in science with open source tools

CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
The Open Science Conference 2021 is the 8th international conference of the Leibniz Research Alliance Open Science. The annual conference is dedicated to the Open Science movement and provides a unique forum for researchers, librarians, practitioners, infrastructure providers, policy makers, and other important stakeholders to discuss the latest and future developments in Open Science. https://www.open-science-conference.eu/ #osc2021
Hi everybody, my name is Lilly Winfree, and I'm really excited to talk to you today about learnings on encouraging open data sharing in science with open source tools. I'm the product manager of the Frictionless Data project, which is an open source project overseen by the Open Knowledge Foundation. Today we'll be talking to you about some of the collaborations with scientists that we've been doing as part of this project, and hopefully by the end of this talk you will have some tips and tricks about how you can also set up different collaborations to promote open science. I have a lot of links within this talk, so I've published the slides here, and they're also published with the conference. Here's my contact information if you would like to reach out and ask me any questions; I'd be happy to talk.
So the theme for this talk is "let's work together for an open future", and this theme works really well with the values of the Open Knowledge Foundation, which is to work towards a fair, free and open future. It also works well with the values of the Frictionless Data for Reproducible Research project, and today we'll be going into some of the details about this Frictionless Data project.
So what is this project? We focus on removing the friction in research data to move from data to insight faster, and we are an open source project. Here's the link to our GitHub repository where you can see our code. We're also community focused, and by that I mean that we really depend on our community to give us feedback about our work and to ask us for improvements or changes as they need them. This is a picture of my co-workers, the team that works on this project with me within the Open Knowledge Foundation.
So I've been saying "frictionless data", but what are some frictions in data? If we were together in person, this is the time when I would ask you to answer this question, but hopefully you are watching this and you will pause and think about your answer. Here is my answer: when I think about friction in data, I think about things like what does this column name mean, how is the analysis done, who created the data, and things that I need to know to check the quality of the data. These are all issues that make data more difficult to work with. You can think of them like data cleaning tasks, and these tasks are oftentimes very time consuming and difficult. So the inspiration behind this project was to make all of these tasks easier, so that it's easier to understand and work with your data.
This project is a set of specifications for data and metadata interoperability, a collection of open source software libraries, and a range of best practices for data management. Importantly, it's platform agnostic, meaning it's very interoperable.
The big picture question that I want us to focus on today during this talk is: how can we collaborate to solve research data management problems? To answer that question, I'm going to be talking about two different types of collaborations that we do as part of the Frictionless Data project, and these types of collaborations are possible because we have funding from the Sloan Foundation. So we actually do three different types
of collaborations as part of this grant. The first is a fellows program, which is where we work with early career researchers to help them become advocates for open science and Frictionless Data tooling. The second is a tool fund, where we give small grants to developers to create novel open tooling based off of our open source code. And the third are the pilots, which are more intensive, hands-on collaborations where we're working with a researcher group that has an identified data management problem that they need to get resolved. Today I'll be telling you first about the pilots and give you a specific use case, and then I'll end by telling you about our fellows program.
So what does a pilot mean? A pilot is a collaboration between developers and researchers, whom I'll call users, and it uses a real-life use case: we work with a researcher group that has a use case, that is, a problem that exists for their research. This is a win-win situation: it solves a problem for the researchers, but it also gives us feedback and helps us improve our code. All of this is done in the spirit of openness, so we try to be as open and transparent as we can when we're doing these pilots.
My example use case of a pilot for you today is Frictionless data management with BCO-DMO. BCO-DMO is the Biological and Chemical Oceanography Data Management Office. They're funded by the NSF, and researchers submit data to BCO-DMO, whose data managers clean and process that data, and then they host the data and provide access for the public or for other researchers. BCO-DMO really values FAIR data, so we were going to work with them to make this data cleaning process easier and try to keep in mind the FAIR data principles.
BCO-DMO's main problem is that they get really messy data. They get really cool data as well, but it's oftentimes very messy. So what they wanted to do is take this messy data, clean it, and then host it, and very importantly, they wanted to make this process reproducible. So this was the problem that they came to us with, and we decided to do this pilot collaboration where we were going to work together and try to solve this problem using the Frictionless tools.
So our solution here was to create a Frictionless data processing pipeline that answers the questions that the data managers needed to know. The data managers need to know: what is the data? They need to know the metadata, or a description. Does the data seem valid? Are there any missing values, or values that don't make sense? Does the data need to be transformed or cleaned in some way? And how can this process be transparent? It was also really important for the data managers to be able to talk with the researchers, or give them feedback about their data, while the researchers were submitting their data, because as you can imagine, as a researcher you submit your data, then you go off and forget about it. But it's difficult to be able to get that feedback to researchers
when they're on a boat in the middle of the ocean; you can't really reach out to them and say, "Hey, did you remember to record the metadata?" So it was important for BCO-DMO to be able to communicate with the researchers during this data ingest process.
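As a rough illustration of the questions listed above (what are the columns, are values missing, do values make sense), here is a minimal Python sketch. It uses only the standard library and is not the Frictionless API or the BCO-DMO pipeline; the sample data and the names `infer_schema` and `check_rows` are hypothetical, made up for this example.

```python
import csv
import io

# Hypothetical sample data: the empty cell and the "n/a" stand in for the
# kind of problems a data manager would want flagged.
SAMPLE_CSV = """station,depth_m,temperature_c
A1,10,12.5
A2,,11.9
A3,30,n/a
"""

# Casting functions used both to guess and to check column types.
CASTS = {"integer": int, "number": float, "string": str}


def infer_type(values):
    """Guess a column type from its non-empty values."""
    non_empty = [v for v in values if v != ""]
    for type_name in ("integer", "number"):
        try:
            for value in non_empty:
                CASTS[type_name](value)
            return type_name
        except ValueError:
            continue
    return "string"


def infer_schema(text):
    """Read CSV text and return (schema, rows); the schema lists name and type per column."""
    rows = list(csv.DictReader(io.StringIO(text)))
    fields = [
        {"name": name, "type": infer_type([row[name] for row in rows])}
        for name in rows[0].keys()
    ]
    return {"fields": fields}, rows


def check_rows(rows, schema):
    """Return (line_number, column, problem) tuples for missing or mistyped cells."""
    errors = []
    for line_number, row in enumerate(rows, start=2):  # the header is line 1
        for field in schema["fields"]:
            value = row[field["name"]]
            if value == "":
                errors.append((line_number, field["name"], "missing value"))
                continue
            try:
                CASTS[field["type"]](value)
            except ValueError:
                errors.append((line_number, field["name"], "type error"))
    return errors


schema, rows = infer_schema(SAMPLE_CSV)
# The inferred schema is only a starting point: here a data manager corrects
# temperature_c to "number", so the stray "n/a" gets reported.
schema["fields"][2]["type"] = "number"
for error in check_rows(rows, schema):
    print(error)
```

The point of the sketch is the order of steps: the description is inferred first, a person corrects it, and only then is the data checked against it, which gives a report that can be sent back to the researcher.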
There are four ways that we work with BCO-DMO that make up the Frictionless Framework, which is our main Python code, and these are the four main functions that underlie our Python code. If you want to look at that code in more detail, there's a link at the bottom of the slides to our repository. The four main functions here are: describe, which is where you're able to automatically infer and edit a file's metadata, so you can put in a file and get back metadata in a schema that describes that file; extract, which reads and normalizes data; validate, which detects errors in a file and gives you a report stating what those errors are so you can fix them; and transform, which changes or cleans the file's data and metadata. All of these are based on the Frictionless specifications for data and metadata. I'm not going to go into a whole lot more technical detail here; if you would like more details, please contact me and I'd love to chat with you about this.
So I do want to talk about how we collaborated in the open, because I think this is one of our biggest wins from this collaboration. We used GitHub to asynchronously communicate between our developer team and BCO-DMO's developer team and data managers. Here's one issue that I thought exemplified this, and the link to it is here if you want to look at it later. One of the great things about using GitHub to communicate is that this is a public document: you could go and read this, or future me could go and read this and say, "What did we do in the past? I've forgotten, let me go read through this." So it's great for communication between teams. This example is where Amber, one of the BCO-DMO data management team, was hoping to get some support for time zones and the datetime type. This is a common data issue that BCO-DMO data managers deal with, so she alerted us to this issue. And then here's more conversation text from this issue between our developer and the BCO-DMO developer, where they are able to communicate
and really discuss this issue in the open and get it resolved. Another reason why I'm showing this example is because we used GitHub's kanban-style board to document our work and to keep track of it. So for instance, we had a "ready to work on, ordered by priority" section, and an item could then move to the "waiting for BCO-DMO comments" section. If you want to see what this board looks like, you can click on this link. This worked really well for us, so that's why I'm recommending it for other collaborators.
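To make the time zone issue mentioned above concrete, here is a minimal Python sketch of one way to handle date-times with offsets. It is an illustration only, not what was implemented in Frictionless or at BCO-DMO; `normalize_timestamp` is a hypothetical helper, and it assumes timestamps arrive as ISO 8601 strings.

```python
from datetime import datetime, timezone


def normalize_timestamp(text):
    """Convert an ISO 8601 date-time with an offset to UTC.

    Returns (utc_iso_string, None) on success or (None, error_message)
    when the value cannot be parsed or carries no time zone offset.
    """
    try:
        parsed = datetime.fromisoformat(text)
    except ValueError:
        return None, "not an ISO 8601 date-time"
    if parsed.tzinfo is None:
        # Without an offset there is no way to know which instant was meant.
        return None, "no time zone offset"
    return parsed.astimezone(timezone.utc).isoformat(), None


# Hypothetical values of the kind a submitted data file might contain.
for raw in ("2021-02-17T09:30:00-05:00", "2021-02-17T09:30:00", "17/02/2021"):
    print(raw, "->", normalize_timestamp(raw))
```

Rejecting a timestamp without an offset, rather than guessing one, keeps the cleaning step transparent: the data manager can ask the researcher which time zone was meant.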
So overall, in this pilot we were able to implement Frictionless code into the BCO-DMO system to create a pipeline where researchers submitted their data, and then that data was described (so metadata or a schema was added), cleaned or transformed, validated, and then published. Now other people can access this data and understand all of the transformation steps that happened, in a reproducible way. If you want to read
more about that pilot or any of our other pilots, please go to our blog. I also want to give a shout-out: we have another pilot going on right now with the Dryad data repository, so stay tuned; in a few more months we'll have a blog post about that as well.
Okay, now I'm going to switch gears and talk about our fellows program, which is a different type of collaboration where we're really focused on enacting cultural change, whereas with the pilots we're focused more on a product. What is the fellows program? The fellows program is a nine-month paid fellowship where we are teaching and training scientists. Our main goal here was to create advocates for open science and for Frictionless Data, and we focused specifically on early career researchers for this program.
Our major goals were all related to the umbrella term of open science, starting with teaching people, so they were learning about open science concepts and also about open source software. We wanted to create advocates for open science, and this is really where that cultural change aspect comes in. The fellows also write blogs, so they do a lot of writing, and then they present workshops and talks. Importantly, we're also building community, trying to create resilient open science advocates.
Who makes a good Frictionless fellow? We decided to focus on early career researchers, and we tried really hard to keep diversity in mind. I'm pointing this out because I think that having a diverse group of fellows made this program more impactful. We have people from around the world, and I think one of the best things about this program is sitting and listening to all of the fellows talk together about their ideas of open science, because people have different experiences around the world and it's really important to hear everybody's experiences. The fellows needed to have their own data that they could use. We were not really domain specific: any scientists, as long as they had some data they could use. They needed to have some experience with programming, but not a lot, and importantly, they had to be passionate about open science.
I want to briefly introduce you to four of our fellows from our first and second cohorts. First we have Ouso Daniel, who lives in Kenya and is a molecular biologist. We have Monica Granados, who's in Canada, and her background is in ecology and open science policy; Katerina Drakoulaki in Greece, who studies language, cognition and music; and Daniel Alcalá López, who's in Spain and is a neuroscientist. I just wanted you to see that we have a wide variety of people from around the world, with different scientific interests and at different career stages as well.
What do the fellows achieve? They have a lot of
open science discussions, which is maybe my favorite part of the program: listening to them talk about different things. For instance, here are some notes from our discussion about what open access means. They also write open science blogs; again, this one was on the theme of open access, so this was during Open Access Week. They also write tutorial blogs, which are more technical writing, and these are great for the fellows because they get experience doing technical writing, but they're also great for us because now we have these blog resources that we can show to other users. And finally, they give workshops, and again these are great for the fellows because they get the experience of giving a workshop, but great for us because now we can give these workshop videos to other people as well. Everything we do is open, so all of this content is openly licensed.
Some of the collaboration lessons that I learned from the fellows that I want to share with you today are to meet people where they are and to make it useful for them. These are tied together. This program really is about making some cultural change in academia with respect to open science, and that's never going to happen if we can't show why open science principles are useful for researchers, and if we don't meet people where they already are. I like to say that doing one act of open science is better than an entire checklist; if somebody's sharing a paper, then that's great. Meet people where they are already at. Be flexible and open: especially during the last year, we've all had to be more flexible, so I think that's a big lesson learned. Listen and have a learning mindset: the best thing about this group of people is that I get to sit and listen to them and hear what they have to say, and they all need to have a learning mindset to learn how to code and learn how to talk to people about open science, and that has been really helpful for them. And then I want to say: reuse
existing materials. There's a ton of open science lessons and teachings on the web that are openly licensed that you can use. I have some examples at the end of this slide deck that you can also take, which I wanted to share with you all.
Okay, and with that I want to summarize all of my top collaboration tips from both the pilots and the fellows. The first one is: make it a win-win situation. This means that what you want is the same as what the user wants. This might be difficult to achieve, but getting as close as you can to this situation will make everybody happier. I like to focus on early career researchers because they are the future, and I think especially if you're trying to make some cultural change, that is a really great place to start. Think about diversity and inclusion. I included this here because I think that open science has a long way to go to be truly equitable, and that we really need to make an effort to ask questions like: who's in the room when open science policy is being made? Whose voice is being heard, and whose voice isn't being heard? There's a history of open science policies being very focused on Europe and the United States, and that leaves out so many researchers around the world. So I really think that open science needs to take a hard look at itself and work on its equity. The next thing I want to say is to embrace openness, which is tied into the next bullet point: don't reinvent the wheel. There are so many existing open resources out there, and if you are creating your own resources, please license them openly so other people can use them as well. I also recommend having a code of conduct. Here's a link to our code of conduct if you would like to see an example; again, it's openly licensed for you to reuse if you would like.
And with that, I want to thank you all for your attention. Here again are all of the links: the link to the slides, to our code, our community chat, videos, our documentation, Twitter, and you can email
me if you would like to talk about any of this in more detail, get more of a technical demo, or talk about collaborating with us in the future; I would love for you to reach out to me. I just want to briefly show you the list of resources that I used when making the fellows program; it's included in the slide deck if you would like to see that as well. And with that, I want to say thank you so much.