Data Communities: Data Sharing from the Ground Up

Video in TIB AV-Portal: Data Communities: Data Sharing from the Ground Up

Formal Metadata

CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

The Open Science Conference 2021 is the 8th international conference of the Leibniz Research Alliance Open Science. The annual conference is dedicated to the Open Science movement and provides a unique forum for researchers, librarians, practitioners, infrastructure providers, policy makers, and other important stakeholders to discuss the latest and future developments in Open Science.
So first off, thanks so much to everybody who's been working behind the scenes to make this conference possible. It is a very sophisticated operation we have going here for this Open Science Conference, and it takes a huge team to make it happen, so I wanted to thank all of the organizers and all of the folks working on that. Today I'm going to be speaking about data communities and how to more effectively facilitate data sharing to make open science possible.
I want to start by acknowledging that COVID-19 has made us think very differently about how research can be done. We could dwell on the things that people can't do right now, but there are also the things that they can do, and I think that data sharing is a really powerful example of how technologies and scientific practices can be enabled in really dynamic ways right now. So I listed on one side simply what many researchers found themselves still able to do even when their campuses were closed and they were having to work from home: everything from still being able to conduct literature reviews, because the bulk of scientific literature is made digitally available, to analyzing pre-existing data sets and reviewing or organizing materials such as lab notebooks. You can still connect remotely with your peers, and you can work on sharing your data and also sharing your findings because of the mechanisms that are available. Now of course, it's important to acknowledge that it can be very challenging to work remotely and that we're working in very, very difficult circumstances. But I think we can also acknowledge that there is quite a bit of success here in terms of what is still possible to do, even when huge swaths of our infrastructure are compromised by the need to not be around each other in person. There is also what some researchers can't do right now. There are experiments that may not be able to be run with site-dependent equipment. When you're working with subjects such as animal species, it has been a challenge to maintain specimens that are being kept alive in labs. If you're working on a field site, you wouldn't necessarily be able to visit it right now. And for those who work with human subjects, you can't engage with them in person in many cases right now. So what you can see here is that a lot of the barriers that are presented to research in a situation
like the pandemic involve certain forms of data collection, whereas what is still enabled, even when we're in a situation like this, involves data sharing. A great example of this is how data sharing was maximized around influenza virus genetics, and how that's been important to understanding the science behind the pandemic. I highlight a really exciting data community, GISAID. I do not know how to pronounce it; someday somebody will have to explain that to me. But this is an interdisciplinary organization where genetic data related to influenza viruses is shared, and it has been tailor-made by influenza scientists. It was created as a response to the H1N1 outbreak in 2006, where it was found that the other mechanisms that were already available for sharing data were inadequate, largely because previous repositories had a stronger emphasis on anonymous deposit, whereas what was really going to motivate researchers was an opportunity to still have an acknowledgement of their IP. So a group of scientists got together to develop a better repository for data sharing following H1N1, and what GISAID is today is a public-private partnership involving a number of countries, including Germany, the US, and Singapore, as well as philanthropy and the contributions of scholars. This is a database that has been really helpful during the pandemic. This is where COVID genome sequences have been shared, and then people around the world can work with them, such as through open source apps like Nextstrain, where you can track mutations of the
virus. So GISAID is important because it is an example of a data sharing community, and I would like to argue that understanding data communities like this is really important for supporting open science, because it involves supporting the work of scholars across institutional and geographic boundaries, and it mirrors the way that scientists actually work. This is something
that we care a lot about at my organization. I work at a not-for-profit called Ithaka S+R, where we study the activities of researchers towards coming up with opportunities to improve support structures, typically in collaboration with libraries, scholarly societies, and publishers. At Ithaka S+R we have an ongoing program where we've been studying the practices of scholars and how they vary by discipline. We have done a number of studies over the last 10 years on different fields, listed here, including a number in STEM fields where issues related to data sharing are very important. Building on that work, I'm going to talk a bit more today about what a data community is, what it means to find ones that are emerging, and how those of us in positions such as in libraries, scholarly societies, or publishers can support them.
So first, just to talk a bit about identifying data communities: what they are, how they work,
and some examples of them. When we think about the data sharing landscape, the reality is that there are different ways that data sharing is currently being supported. You have repositories developed for data sharing that are more institution-driven. Then you have those that are more compliance-driven: more generalist repositories, typically responding to the needs of funders and publishers that are now increasingly requiring data sets to be deposited. And then finally you have more community-driven models of sharing data, where researchers from various communities have developed, or led initiatives to develop, platforms to share their data. These are typically represented in what we call domain-specific repositories. So here at Ithaka we really started with the question, looking across the studies we had done in the past: what really makes data sharing work? You have funders that are requiring these kinds of things more and more, and you have institutions that sometimes support it, but what really motivates researchers on the ground to share their data? Having people do this because they want to do it is likely the most effective way to ensure that we do it more. So we looked across our studies and found a series of success stories where communities were already doing this work very effectively. Just a few examples include things like the Cambridge Structural Database, where they've been successfully sharing crystallographic structures since the
60s, well before there would have been any sort of formalized requirement to do data sharing of this nature. Then you have things like FlyBase, which is what is known as a model organism database that includes genetic sequences. And another example is DesignSafe-CI, which includes data for natural hazards and has really great curation on the back end.
So something that all of these examples have in common is that they involve the work of data communities. A data community is a formal or informal network of researchers who share or reuse a certain type of data,
and it's not the same thing as a discipline. In fact, what you'll find in any of these examples is that there are scholars working on them from different disciplines, coming together because they find the data that is in the repository to be particularly useful. There are certain areas of science where this kind of work is particularly useful and effective. One example is genetics. There are a number of different repositories set up for different kinds of model organisms to share their genetic sequencing, and you have GenBank, which is arguably a large community that has multiple smaller communities in it. Another example is neuroimaging. It's a little bit less developed than the genetics world but is really growing. OpenNeuro, which used to be OpenfMRI, is a great example of where there's increasing work to share data through a scientific community. I think it's really important to acknowledge that you have to be creative if you want to find data communities. They're not the same as disciplines: their membership is more fluid, you typically don't have to have an official affiliation to join, and you can belong to multiple communities at the same time. But I do think that with the growing mandate and activity around tracking the outputs of data sets, we're going to have more ways to identify these kinds of communities, as opposed to just going to the repositories themselves and seeing who's been providing their data sets. And so I really wanted to call out the work of the FREYA project, where they used persistent identifiers to graph scholarly networks. These are persistent identifiers associated with data sets, and so by seeing who's creating data sets and how they're getting shared, you could start to map out much more effectively how different communities of scholars are relating to each other. I also want to acknowledge that not every form of scholar or
discipline can map nicely onto a data community. There are disciplines and subdisciplines where it's actually quite challenging to share data, and one example of this is economics. Reproducibility is really important to this field, but it is also very challenging to share data, because a lot of it comes from private entities and regulatory bodies. So when we're thinking about how to support communities and encourage data sharing, we also have to be mindful that there are a number of disciplines and subdisciplines where there are certain reasons why it actually is quite challenging to encourage data sharing.
So then there is the question of what actually makes a data community successful. How do we know that they can work well, and how can we support them? First and foremost is the reality that bottom-up development is really important. It's the opposite of the mentality of "if you build it, they will come." Typically, data sharing has been most successful when the impetus comes from the community itself, as opposed to just a regulation. When a data community is trying to build itself up, it will conform to its own community norms around data sharing, and respecting that is really important to ensuring that people actually want to use a platform. Going back to the example of COVID-19 and GISAID: as I mentioned earlier, this was a community where other platforms had been available in the past for data sharing, but the emphasis on anonymous deposit was a barrier to encouraging people to deposit, because there was an interest in being able to respect IP and acknowledge that. So when that platform was built, that was something they took into account. Finally, you need to have the absence or mitigation of technical barriers. Genetics has been a great case for this, because there has been enough of a technological advancement that data sharing makes sense and works, and there are other scientific fields where that may not be the case.
So the million dollar question is how you can find data communities when they're just getting started or getting off the ground. I like to call these emergent data communities. These are scholars who may be enthusiastic about data sharing, but they may not have fully established practices yet.
So I'll give a few examples of what an emergent data community may look like. One is air pollution research: we've seen that some environmental engineers who work on air pollution are very eager to share and reuse their air quality data but don't have ways to do so yet. Another is spinal cord injury research: there's a small but growing group of scholars, maybe 50 or 60, who are really interested in spinal cord research and want to facilitate data sharing there more broadly. And then finally... sorry, slide delay. So basically what you see here, when we're thinking about data communities, is that they grow over time. You start out with researchers who have a shared interest, and over time a process or an infrastructure is built up. Once that infrastructure is in place, the community can grow, and then you have to start thinking about long-term sustainability.
So as a data community is developing, there are opportunities to support it. I think it's really important for the various communities that work to support research practices to focus on how they can support emergent data communities, because data sharing really can help overcome a number of the barriers to data collection in research communities. We've seen this firsthand with the pandemic, when a number of the pieces of infrastructure have not been able to continue because we can't meet in person or do things in person. I think it's really important to attend to the strategies of successful data communities when trying to come up with ways to support data sharing more effectively, and we can also identify emergent communities in order to build up infrastructure and further encourage data sharing.
So just to summarize: when you're thinking about supporting data communities and trying to build them up, it's important to attend to what they really need. Data communities do need help building or identifying repository infrastructure. These are scholars who typically want to spend their time actually doing the research, so having others work on the infrastructure is incredibly helpful. They definitely need technical and policy advice around issues relating to metadata, preservation, and privacy. Scholars have expertise in the data itself, but they can definitely benefit from the perspective of those whose expertise is more focused on the use of data more broadly. There is always the issue of sustainability: it is very challenging to maintain an infrastructure and encourage its use, so guidance and advocacy around that is incredibly helpful. And then finally there's a need for help to get the word out about different platforms or initiatives and get more researchers involved. As an example of how this work is being done to help data communities, I wanted to give a shout-out to the RDA COVID-19 Working Group. That's the Research Data Alliance. They have an international working group of librarians and other research data management experts that was formed in response to COVID-19. They've released a very comprehensive set of recommendations for sharing COVID-19 research data and a Zotero bibliography of COVID-19-related resources. This is the kind of support work that is really helpful to those on the ground doing the work and sharing the data, so this is a really great example of how we can support data communities. So just to sum up some implications for data sharing support and how we would move forward with this: the whole concept of "build it and they will come" doesn't quite work here. We can't be too top-down in designing infrastructure or coming up with policies. We really do need to look to the communities themselves,
their norms, their needs, and be responsive to that, as opposed to being too top-down. Institutional and generalist repositories have a role here: they can provide infrastructure, and curation support especially is valuable. And finally, librarians and those in institutional support roles do have a challenge. It's hard to do this work because it's mainly cross-institutional, and I think there are certain jurisdictions that have an easier time with this than others, especially countries where there's a more nationalized approach to how universities are organized and how research support can be organized. But ultimately it's important to remember that this work happens across borders, and so when thinking about how to create supports or services or infrastructures, that needs to be respected. So just to sum things up, here at Ithaka S+R we are continuing to focus on issues relating to data sharing and how to best support open science, including data communities, because a major piece of this is just really understanding how data sharing is happening, who is doing it, and what their needs are. We have a large study in the field right now where we are working with 21 U.S. academic libraries to understand the research activities and support needs of scholars who work with big data, and the results of that will be coming out later this year. We are also really interested at Ithaka S+R in how to evaluate the actual support services that are being designed to help scholars, because a big issue around data communities is understanding what services are really helpful for them. How can institutions help scholars in different data communities, recognizing that those communities involve activities all across the world? So we are developing an assessment program where we can track and evaluate how universities are organizing their data services and what the breadth of them is. We've developed a tool to do this and have
evaluated the landscape of data support services in the US, comparing their size and their scale across different universities. We are hoping to expand this analysis to other jurisdictions, such as Canada, various countries in Europe, the UK, Australia, and beyond, and we definitely welcome expressions of interest from institutions that may be interested in evaluating their data services and seeing how they match up against institutions elsewhere.
So if you're interested in learning a bit more about what I presented today or about Ithaka's work in general, we have created an issue brief on data communities that allows you to take in our work at perhaps a better pace than me talking quickly in this presentation. We also have another issue brief where we have published our analysis of the assessment we did of data services across U.S. universities, comparing their size, their structure, and the extent to which they're centralized or decentralized within the institution. We also have a series of blog posts on different emergent data communities, where we highlight how they're growing and what they need.
I am not in a European time zone, so I will have my meet-the-speaker hour right after my talk today. But please do get in touch beyond that if that's preferable for you. I'm always happy to talk further and schedule time in our overlapping time zones, and I'm always happy to connect beyond my meet-the-speaker hour and beyond the conference, and
that's it for me today.
Thank you very much, Danielle. Really a fascinating look into data sharing communities in the US and all over the world. We have a few minutes for questions and answers. We have a little more time than we did in the morning sessions, so I'll say about 10 minutes, and I'll just go through and ask the questions; you can give a brief answer, if you will. The first question we have is: what about the role of incentives and awarding in data communities?
I just minimized my screen, sorry. So, incentives in data communities. I would say that the incentive component is less important on a certain level, because the emphasis when we're thinking about data communities is that the scientists are really coming together to make their work more effective. So the incentive is less direct than "if I do this, I'll get an acknowledgement," because they're doing their research, and there is incentive there for a lot of researchers. That being said, you can build incentive structures into these activities. Again, going back to GISAID and COVID-19 data sharing: those influenza researchers wanted to have their IP acknowledged, so without that incentive structure they weren't going to share their data. I don't want to be cruel, but it's not like they were so virtuous that anonymous was good enough, right? I think it is artificial to argue that people are only doing things out of the goodness of their heart; there are a lot of different things that come into it. But I think it's important to acknowledge that it's less linear than "this funding body now requires that we all share our data, ta-da, it's a data community." That's not the way these things work. You can probably get more scholars to share their data if they're required to do so, but what happens with
the data after it's shared is still an open question. It's the data communities that are really working with the data once it's been shared, so respecting how those configurations are created organically is really important. Ultimately we have a number of motivations for why we want more data to be shared. One is just transparency, but another is making sure people actually use it and do things with it. So there's not a one-size-fits-all solution to that, I would say.
Super, thank you very much. Our next question: do you have any advice or guidance on how publishers might be able to encourage researchers to share data, via policies or any other means?
Well, publisher is not a monolithic category, right? We have all different kinds of publisher configurations, so my advice to Elsevier would be different from my advice to, I don't know, a specific society. But I would say that the places that really seem to get it right are the ones that are very closely connected to the researchers: really listening to those who are your editors and on your editorial board, and making policies that are going to make sense for the community. I'll give an example, going back to economics, which is a field where it's actually really, really hard to share data. This is something that you can't solve overnight just by making a journal policy. I heard recently that there had been an evaluation of a suite of journals in economics where 20 to 40 percent of them had data availability statements that literally just said, "I can't share my data because it's restricted by a corporation or a regulatory body." Obviously the journal can't solve that on its own; there's a lot that comes into play here. But at the very least it was considered a win on a certain level that they even had these statements. So it's about meeting researchers where
they're at and being very transparent about the fact that not every field or topic is going to be able to share data in the same way.
We talked about transparency a lot this morning; that seems to be a common theme. Absolutely correct. I have time for two or three more quick questions here. Regarding the top-down approach not working for data communities: would you please give specific examples of one that has worked and one that hasn't worked, elaborating on or contrasting why that is so?
Oh wow, okay. Well, that's a big question.
But for time constraints, as briefly as you can.
Well, for example, going back to data sharing and COVID: the pre-existing repositories for sharing influenza data prior to H1N1 didn't have great uptake, because the way that these repositories were configured did not really make sense for how those working on influenza wanted to share information about genetics. So a group of 77 scientists got together and created GISAID, and when they were making that platform, they did it very specifically to ensure that researchers like themselves would want to share their influenza data. So fast forward 12 years or so, and we actually have a platform that was nicely configured so that researchers could share similar data around COVID-19.
Super, thank you very much. And we have time for one more question here: "Thank you for a great talk." I can only second that emotion there. "You have presented an analysis of data communities which form around sharing data and data repositories. Have you also thought about data communities which form around reusing data?"
So to me, I would have to know more about the distinction, because reusing data happens on the same platform. The people who are sharing their own data are typically in communities where they rely very much on other people's data at the same time. Going back to genetics, this is an area of research where you're not going to sequence every genetic strain yourself. You really do rely on the work of many people in many different places. So if you're thinking about virus tracking, you would be reusing tons of people's data. I would say that's part and parcel of the same thing, and it's a really important part of the question, because again, transparency is important, but data sharing isn't just for that. It's for people to actually work on the data and use it. Another example is reproducibility, right? You may not necessarily be using somebody else's data for some component of your own project, but it's not just about transparency: you need the data to be shareable in a way that actually works, so that when you're testing things out to make sure that something was accurate, you can actually run something or do something with it. So the emphasis on use is really key here. I agree entirely.