Data: Changing Paradigms for Data Management, Publication and Sharing #1 - 23 April 2015

Video in TIB AV-Portal: Data: Changing Paradigms for Data Management, Publication and Sharing #1 - 23 April 2015

Formal Metadata

Title
Data: Changing Paradigms for Data Management, Publication and Sharing #1 - 23 April 2015
Author
Michener, William
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
2015
Language
English

Content Metadata

Subject Area
Abstract
Addressing today’s environmental challenges requires that we change the ways that we do science, harness the enormity of existing data, and develop new approaches to managing, publishing and sharing data. In this webinar, Professor William Michener will:
-- provide a historical overview of data management and data sharing, focusing on lessons learned from past and emerging large ecological and environmental research programs (i.e., “big ecology”)
-- review some of the current impediments to data management, publication and sharing
-- discuss solutions to these challenges, including various tools that support management of data throughout the data life cycle, from planning through analysis
-- explore new approaches to publishing and sharing data, such as the Dryad digital repository and DataONE
-- glimpse into a future vision for how informatics can better enable science, highlighting some of the activities that are underway with respect to changing the scientific culture (e.g., altmetrics, semantic annotation and provenance tracking).

Professor William (Bill) Michener is Project Director for DataONE — a US National Science Foundation project that supports cyberinfrastructure development and community engagement for the biological, environmental, and Earth sciences. He presently serves as Editor for the Ecological Society of America’s Ecological Archives, Associate Editor for Ecological Informatics, Board chair for the Dryad Digital Repository, and Board member for the Cornell Lab of Ornithology, the Organization for Tropical Studies, and the Federation of Earth Science Information Partners (ESIP).
I'll be talking about some of the changing paradigms for data management, publication and sharing, which ultimately is leading to what I would refer to as open science. In the next few minutes I'll be covering several things.
First, I'll start off with some brief definitions and talk about some of the benefits of data sharing, which will probably be intuitive to most of you. I'll then focus on some examples of data sharing in ecology in the United States and provide a brief history of what's happened there over the last few decades. Then I'll talk about some of the challenges and solutions to open science, and conclude with some best practices for promoting open science and what we have in store for the future. So, without further ado: data sharing is pretty obvious in terms of its definition. This one comes from Wikipedia, but I think we'd all agree it's a fair definition; it's basically making data available for use by other investigators. That's a simple definition. If we look at a more
comprehensive and useful definition today, the one put forth by the Open Knowledge Open Definition Advisory Council in October of 2014 is a really good starting point. They define an open work that supports open science as following three key principles. The first is that there's an open license, such as a Creative Commons license, associated with the data product, and this includes the freedom to use, build on, modify and share the data product. The second refers to accessibility: the data product should ideally be available for download from the internet without any financial charge. And lastly, and I think an important one for us to think about, is the open format side of it. This refers to the fact that data should, again ideally, be machine readable, available in bulk, and provided in an open format, or at the very least be processable with some kind of open-source software tool. Again, this promotes an open-science type of work environment. So with respect to data
sharing, I think we're all probably familiar with many of the benefits. The most commonly cited one is that it's for the public good: data are valuable products of the scientific enterprise and should be treated as such. The second is public trust. We've seen a lot of examples in the literature over the last few years, focused on things like Climategate and other real challenges to the scientific environment involving misuse, misinterpretation or, in some cases, fraudulent data, and this creates a need to enhance the public trust in science. A third key component, one of the benefits documented in several publications including some by Heather Piwowar, is the increased credit that scientists get from sharing their data products: if you make your data available as a product, then your publication is more likely to be cited than publications that don't make the underlying data available. Lastly, there's one that's been appearing on the international radar screen recently, which is the association with human rights: the sharing or availability of scientific data is considered a human right by the UN and other international bodies. But from my
perspective, with an environmental scientist's background, by sharing data we can more easily and readily tackle some of the grand environmental challenges that we face today. That's exemplified by all these different magazine covers, from Time, The Economist, Scientist, Science and others, that focus on many of the challenges, like climate change and energy usage, that we're facing now and will be for probably many decades. So if we step back
and think through how data sharing has evolved in ecology, I'm going to use the United States as an example, first because I'm more familiar with it, and second because there have been some advances there that have been adopted internationally, which I'll touch on briefly. First of all, if we go back to the International Biological Program (IBP): this was the first decade-long, large-scale international program focused on ecosystem science, and it was carried out in a number of biomes around the world, forest and grassland areas internationally. You can see from the example on the lower right that the Government of Canada created a stamp recognizing the importance of the program, which was implemented in many countries around the world at a different pace. David Coleman wrote a book called Big Ecology that focuses on the IBP and many of the subsequent programs that evolved from it. The interesting thing about the IBP was that it was geared from its inception to facilitate modeling and synthesis of data across all of these different biomes internationally. Given that goal, it's notable that John Porter and Tom Callahan, in a 1994 analysis, reported that data policies and protocols were never elaborated, nor even agreed to in principle, under the IBP. There were some major successes coming out of the IBP, largely due to a number of smaller working groups and synthesis efforts to which individuals contributed their data, but that was done in more of an ad hoc fashion, and I think arguably most of the data collected under the IBP are no longer accessible for use by scientists. So that was a first stepping stone toward this whole concept of open science and how to support that type of effort, and it was not necessarily a big success. If we fast forward a
couple of decades in the US, from the mid-60s and 70s up to the 80s, we had the inception of the Long-Term Ecological Research (LTER) program in the United States. This started in 1980 with, I think, six initial sites that were funded; there are now roughly two dozen sites. In the first decade of the LTER program there were no real specific guidelines with respect to how data were managed within and across sites, and that created some real challenges that were recognized both by the National Science Foundation, which funded the program, and by many of the researchers involved in the projects themselves. This led, in 1990, to the LTER guidelines for site data management policies. The challenge was that, while the guidelines were laid down, every site was given lots of leeway in how it implemented those recommendations, so there was again a bit of a lack of consistency in how data were managed and shared across the network. By 2005 this was recognized as a problem, and all of the site principal investigators got together and came up with a much stronger policy that required data standards and data requirements to be standardized across the entire network; that was approved by the LTER Coordinating Committee in 2005. As you can see in the figure caption, since then about 20,000 data packages created by the LTER program have become readily available and freely downloadable through the LTER portal. That has been a huge success that's led to lots of subsequent synthesis efforts. There
have been some external factors as well that have influenced data sharing and data management policies in the U.S. The first goes back to the National Science Foundation, which in 2001 released its policies for data sharing, with the expectation that investigators would share data and other results of the scientific process within a reasonable time and cost. Under the Bush administration, in 2007, we had the America COMPETES Act, which required procedures to be put in place to facilitate data exchange across all the different federal agencies in the US. That was recently strengthened, in fact about a month ago, with the NSF public access plan, which describes the implementation schedule for sharing both publications and data in public repositories; these are all expected to go into full effect by 2016-2017. In addition, under the Obama
administration in the US, there has been a bigger focus over the last few years on what's called the Open Government Initiative, which was also updated just a week ago with formal guidelines requiring different levels of access and openness for both data and publications. There are two
major studies that have looked at data sharing across the scientific community. One is from Wiley, the publisher, released fairly recently, and about four years ago we had another study completed by one of my colleagues in the DataONE project, Carol Tenopir, and her colleagues; that one focused on the environmental science community in particular. I'm going to share some results from both of those studies. First of all, with respect to the Wiley study, I think one of the really interesting results was that, if we look internationally at data sharing, we've passed a sort of tipping point where most scientists now agree with the statement that they are quite happy and interested in sharing their data. Ten years ago this would not have been the case, and I think five years from now an even higher percentage will agree with that sentiment. The study reports some differences across countries and some differences across disciplines as well. Some of the reasons that researchers are hesitant to share their data are highlighted on the right side of the chart; I'm going to go into a little
bit more detail in this slide, where I've grouped some of those challenges together. There are really four major impediments to data sharing. One is that researchers want to make sure they receive proper credit and attribution for the data products they create. The second, and I think a challenge that is probably still with us, is the fact that many of the tools investigators have access to for managing data, such as metadata management tools, have not been particularly user-friendly or necessarily readily available; I will highlight one particular exception to that later in my talk. Education has been another key area, where I think most researchers would argue that they need better education about several things. One is best practices for managing data. Secondly, and I think this is universal, probably all of us on this webinar would agree that it's very difficult to fully understand legal responsibilities and other issues associated with intellectual property rights, confidentiality and ethical aspects. The legal jargon can be quite convoluted, and there are real challenges when we cross international boundaries: what may be legal and appropriate in one country may not be in an adjacent country, so we definitely need much better education with respect to that. A third example under education is perception: clearly, in the past, a lot of scientists felt that if they shared their data they would be scooped, or their data would be misinterpreted or misused. One response that really got me was that about ten or fifteen percent of the Wiley respondents said they felt their data were not relevant; I think if I were seeking additional research funds from a sponsor, I would probably not admit that my data are not relevant. In any case, I think education has gone a long way toward flipping the tipping point on perceptions about data sharing. Lastly, incentives and disincentives to encourage data sharing: there's clearly recognition that, for things like the tenure and promotion process, we need to make sure the incentives are there to support researchers in sharing their data. I wanted to
highlight a couple of figures here, just to emphasize some of my prior comments and amplify them a bit. If we look at the upper left panel of the quad chart, we see something referred to as the long tail of orphan data. One of my colleagues, Bryan Heidorn, proposed this several years ago, and I think it really makes sense: most investigators, when they try to deposit or manage their data, recognize that there are some big, well-known repositories out there. Probably the most commonly known are GenBank and the Protein Data Bank, for sequence data and protein structure data in particular; communities have rallied around those, and it's now status quo to deposit your data in those particular repositories. Many researchers, though, have not had access to similar repositories (although that is changing), and in many cases they have archived or attempted to preserve their data on their own laptops or desktop machines, or possibly at a university or some other location. In many cases those sources are not secure for the long term, and data end up orphaned and potentially lost over time. This is also amplified by the figure in the upper right, from a paper by Tim Vines, another of my colleagues, published in 2013. It's a really great story about how data undergo entropy, how they're lost over time. He and his colleagues surveyed a large number of published journal articles to determine whether or not the underlying data were still available, and they found a fairly rapid drop-off: over a 20-year time frame a large percentage of the data were totally unavailable, and there's a fairly steep gradient in the loss of information over time. I think another point that really amplifies the need for education is illustrated in the lower right bar graph, from a colleague of mine, Carol Tenopir. It was
published in PLoS ONE about three years ago, and in it she was able to document the fact that most researchers did not use a metadata standard when creating their metadata. The second highest response was researchers saying they used a metadata standard they had created in their own laboratory, which arguably is not a community-wide standard. Only a much smaller fraction used some of the main community-wide standards, like ISO 19115 for geospatial data, EML (the Ecological Metadata Language), and several others. Lastly, the picture of the smokestacks from London's industrial period highlights the fact that most scientists who are interested in discovering data have very little idea about where to go. There are many, many repositories out there; a lot of them are small, as indicated by the tiny smokestacks, and a few of them are quite large, like maybe the GenBank smokestack and a couple of others, but it really is difficult to find data that have been archived or preserved in the many smaller repositories. In the next few minutes I'm going to talk
about some of the solutions to these challenges and first of all highlight
the fact that, with respect to credit and attribution, many scientific journals, especially the big-name ones like PLOS, Nature, Science, Ecological Monographs and others, now require authors to share the data that underlie articles in those specific journals. Also importantly, quite a number of new journals are emerging that are called data journals; some examples include the Geoscience Data Journal, GigaScience (for exploring large data sets), Nature Publishing Group's Scientific Data, and one that I've been involved with for the last decade or thereabouts, Ecological Archives, for publishing ecology data papers. In
addition to data journals, there are some important data repository solutions out there, one of which many of you have probably heard of: the Dryad Digital Repository. This is geared toward publishing data that underlie scientific publications. There are roughly 75 or so major journals that are now members of the Dryad consortium, plus roughly another hundred journals, I think, that have had their data published in Dryad by the authors. This provides a mechanism for providing access to the data for the long term and then linking that to the
publication, and I'm going to show you how that works in the next couple of slides. In Dryad, when an author is submitting a manuscript to a journal, they are requested by that journal to also submit the data to Dryad, and Dryad then makes the data available to the reviewers of the manuscript, so that reviewers can look not only at the manuscript, to see whether or not the findings are well described, but also at the underlying data via Dryad. If the paper is in fact accepted for publication, then the data are made available at the same time, and importantly there's a recommended citation for both the data and the paper. So what that looks like is this:
in the upper left gray box you can see there's a journal paper in Systematic Biology, and the data and metadata files are in a package below it, which you can easily access. Importantly, Dryad, the repository, links back to the journal as well, so that someone who looks at the data can also go back and read the journal article the data came from and get more information that way. And importantly, Dryad provides that recommended citation for both the paper and the Dryad data package, so that authors are essentially getting credit both for the papers they produce and for the underlying data products. This is what it looks
like in the literature: here's a paper by Joseph Skaro and colleagues, and they're citing a data product in the Dryad digital repository, by San et al., which is accessible through the digital object identifier in the Dryad citation. So
in the next little bit I'm going to cover some of the tools that I think have been instrumental in helping promote open science more broadly, and I'm going to cover a few elements of the data life cycle illustrated here, from data management planning through collection, assurance, preservation, analysis and so on. With respect to planning, there's one tool in particular that's proven extremely valuable in both the US and the United Kingdom, and that is the DMP
tool. In the UK it's a web-accessible version; in the US it's a downloadable package that you can access. This is what the front-end web page for the DMP tool looks like, and what it does
is step you through the process of creating a data management plan. I've signed in as myself through the University of New Mexico here; you don't need to belong to any particular university in order to do this. Such plans are now required in the US and UK by many funding agencies, including private foundations such as the Wellcome Trust, the Gordon and Betty Moore Foundation and others. In this case I've shown the National Science Foundation's requirements for the Biological Sciences Directorate, and you can see in the lower right there's an open blank space where an individual creates their response to a set of questions about data collection, formats and standards; above that, the University of New Mexico provides some guidance with respect to the answers for this particular template. So the DMP tool steps you through all the basic requirements for a good data management plan that would satisfy a large number of different funding agencies. The plan can then be published, and you can in fact share it with your colleagues and others as well. So
some other tools that are very important
include Morpho, for supporting metadata creation and management. This is a package that can be downloaded and used by anyone, and it's great for dealing with ecological, environmental and many other types of observational data. What it
looks like is this: this is just one example of the Morpho screens. You can type in the name of the submitter and the creator of the data set, plus other information such as an abstract, keywords using a thesaurus (such as NASA's Global Change Master Directory), the temporal coverage of the data set, the spatial coverage and so on, and then you can upload the data file as well. In addition, Morpho provides access to a number of other screens where you can go into much more detail about the data as part of the metadata, and it can all be easily updated and revised over time, so it's a very useful package in that respect. Under the
Under the preservation umbrella, I wanted to highlight one thing in particular to start off with: re3data.org. Again, many researchers are not sure what public repositories exist, and this is a great resource for that. It's constantly growing; this screenshot is from a few weeks, maybe a month, before I uploaded this slide, and at that time there were about a thousand reviewed repositories. I'm sure it's much higher now. You can do a quick search here under a variety of keywords you might enter, and it will point out the repositories that meet those particular needs, or you can just browse through the entire catalog. So this is a great way to discover what data repositories exist for different scientific domains and fields.
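The idea behind a registry like re3data.org can be illustrated with a toy keyword search over a small local catalog. The entries below are made up for the example; the real registry is queried through its website (or API):

```python
# Toy repository catalog; entries are illustrative, not real re3data records.
CATALOG = [
    {"name": "EcoRepo", "subjects": ["ecology", "biodiversity"]},
    {"name": "OceanStore", "subjects": ["oceanography", "climate"]},
    {"name": "GeneBank-X", "subjects": ["genomics"]},
]

def find_repositories(keyword: str):
    """Return names of repositories whose subject tags match the keyword."""
    keyword = keyword.lower()
    return [r["name"] for r in CATALOG
            if any(keyword in subject for subject in r["subjects"])]
```

So a search for "ocean" surfaces only the repositories tagged with oceanographic subjects, which is exactly the browse-or-search discovery pattern the registry supports.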
With respect to discovery, this is where I want to point out a couple of things. There are clearly a number of approaches one can follow in discovering data, such as Google or Bing or other search engines, but quite often they don't lead to the types of data you're looking for. So I want to introduce DataONE. This is a project I'm associated with in the US; it's an international program to federate across data repositories, and there are three components to the DataONE infrastructure. The first is what we call coordinating nodes: these provide a lot of the broad services, such as replication and other network-wide services, and the indexing and search tools are available through the coordinating nodes, so essentially the metadata from all the associated data repositories in the federation are searched as part of DataONE services. What we call member nodes are all of those different data repositories worldwide; we now have a couple from Australia that will soon be made available through the DataONE infrastructure as well, including the AEKOS web portal. These repositories actually host the data, but they have shared their metadata with the DataONE catalog and index service. The third component is what we refer to as the investigator toolkit: a variety of different tools where in most cases we've provided a direct linkage to the DataONE data resources. So a tool like the DataONE R client connects DataONE with the R statistical analysis program, allowing researchers to easily access data held in any of the DataONE-affiliated repositories, do their analyses, and possibly generate new data that might then be uploaded to one of the affiliated repositories. So this is what the website looks like.
Feel free to check it out; it's dataone.org. Basically, you might click the search button at the very top, and that would lead to a more advanced search tool. In this case I've just typed in the word "tree"; I could have specified a narrow range of dates or specific countries, or typed in a bounding box or a state within the US, for example, and it would have narrowed down the search based on those particular criteria. Running that search, I got back this next screen, which listed a number of data sets; at the very bottom is Condit et al. on growth and mortality of tropical tree species in India, among lots of others. Using the faceted search tool above, I could do some additional constraining of the responses by focusing on data from a particular repository or by a particular author, or even add in some additional keywords.
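Behind the search page, the coordinating nodes expose the federated metadata index through a query API. A sketch of composing such a query URL against a Solr-style endpoint (the endpoint path and parameters here are assumptions patterned on DataONE's public coordinating-node address; check the current API documentation before relying on them):

```python
from urllib.parse import urlencode

# Assumed coordinating-node query endpoint; verify against DataONE docs.
CN_BASE = "https://cn.dataone.org/cn/v2/query/solr/"

def build_search_url(keyword, rows=5):
    """Compose a Solr-style query URL for the federated metadata index."""
    params = {
        "q": keyword,              # free-text search term, e.g. "tree"
        "fl": "identifier,title",  # fields to return for each hit
        "rows": rows,              # maximum number of results
        "wt": "json",              # response format
    }
    return CN_BASE + "?" + urlencode(params)

url = build_search_url("tree")
```

The web interface's facets (repository, author, extra keywords) would map onto additional filter parameters on the same kind of query.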
It's a very effective way to identify and get access to scientific data. In this case we're looking at the metadata for the Condit et al. data set I mentioned previously, on tropical tree species in India, and if we scroll down and look at the entire metadata record here, we may decide we want to download the data. We can do so by clicking the download button, which downloads the data and metadata associated with the data files. So this is what it looks like: these are the various data files and metadata records associated with that particular data set. Again, we can download the entire package and have ready access to it on our laptop or desktop machine.
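Each object in the federation is addressed by a persistent identifier, and the coordinating nodes can resolve an identifier to the member repositories that hold it. A sketch of building such a resolve URL (the endpoint path is an assumption patterned on the query example; consult the DataONE API reference for the authoritative form, and the identifier is a made-up example):

```python
from urllib.parse import quote

# Assumed coordinating-node resolve endpoint; verify against DataONE docs.
CN_RESOLVE = "https://cn.dataone.org/cn/v2/resolve/"

def resolve_url(pid: str) -> str:
    """Build the URL asking the coordinating node where a data object lives.

    The identifier is percent-encoded because persistent identifiers
    often contain characters like ':' and '/' (e.g. DOIs).
    """
    return CN_RESOLVE + quote(pid, safe="")

url = resolve_url("doi:10.5063/EXAMPLE/1")  # hypothetical identifier
```

Resolving rather than downloading directly is what lets the same identifier keep working even when a data package is replicated across several member nodes.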
With respect to DataONE, there are now about thirty large national and international repositories that are part of this, and we're now at roughly half a million data products associated with the DataONE repositories. Another area that is really
helping facilitate open science is analysis and visualization, and I wanted to provide a couple of examples here. There are a number of tools, like Kepler, Taverna, and VisTrails, that make it possible to create scientific workflows that string together lots of complex analyses. We can then share a workflow with others; people can repeat the same set of analyses we've done, or modify those workflows to suit their particular needs, and possibly upload the workflows to another site where they can be downloaded and reused. So this is one of the workflows; I just wanted to mention it in case you're interested in some advanced visualization tools. This one is called VisTrails; it's an open-source package that you can easily download and use to create some quite sophisticated visualizations, as depicted on the right side of the panel here. In addition, VisTrails provides some really nice add-on services: it collects provenance data for how the data products, or in this case the graphics, were generated, so it's easy to look back and see the sources of the data that went into them, and so on.
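The provenance idea these tools implement can be sketched in a few lines: wrap each analysis step so that, alongside its result, a log records what ran, with which inputs, producing which output. This is a minimal illustration of the concept, not how VisTrails itself stores provenance:

```python
# Minimal provenance capture: each wrapped step logs its name,
# inputs, and output, so results can be traced back to sources.
PROVENANCE_LOG = []

def tracked(step):
    """Decorator that records every call to an analysis step."""
    def wrapper(*args):
        result = step(*args)
        PROVENANCE_LOG.append({
            "step": step.__name__,
            "inputs": args,
            "output": result,
        })
        return result
    return wrapper

@tracked
def clean(values):
    """Drop missing observations."""
    return [v for v in values if v is not None]

@tracked
def mean(values):
    return sum(values) / len(values)

# A tiny two-step workflow; the log now explains how the answer was derived.
result = mean(clean([2.0, None, 4.0]))
```

After the run, `PROVENANCE_LOG` shows that `clean` fed `mean`, which is exactly the "look back and see the sources" capability described above, in miniature.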
And then there's a nice tool created by Carole Goble and David De Roure in the United Kingdom. This is called myExperiment, and it's a great way to upload your workflows from a whole variety of packages, including Taverna, Kepler, VisTrails, and others. In this case we see a workflow created by Paul Fisher; there's an abstract about the workflow, there are ratings by the community (it's rated 4.6 out of 5), and you can see how many times that workflow has been viewed. It's also been downloaded 1,600 times, and if you care to, you can click the green arrow on the right side to download the workflow, rerun it, or again modify it and upload it back to myExperiment for others to use. Training
has been really key as we've been moving into this open science framework. We've done a lot of training through DataONE, and lots of other groups have been involved in supporting training as well. One of the things we've done in DataONE, in addition to the hands-on training we do at various professional society meetings in the US and elsewhere, is create what we call a best practices database. You can click on one of the elements of the data life cycle that shows up in orange at the bottom of the screen; so if you were interested in what some QA/QC mechanisms are, you can click on that and it will bring up some best practices on the topic. You can also search for various best practices by entering your own keywords. Then, importantly, in the center there's a thing called the best practices primer. We created this in response to the community's request for a very simple data management guide that could be easily read and digested, so one could immediately start managing data better.
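QA/QC best practices typically boil down to simple, automatable checks: look for missing values, verify that values fall in a plausible range, and flag anything suspect for review. A small sketch of that idea (the column and range below are invented for the example):

```python
def qc_report(values, lo, hi):
    """Run two basic QA/QC checks over a column of observations.

    Returns indices of missing entries and of out-of-range entries
    so a data manager can inspect the flagged rows.
    """
    missing = [i for i, v in enumerate(values) if v is None]
    out_of_range = [i for i, v in enumerate(values)
                    if v is not None and not (lo <= v <= hi)]
    return {"missing": missing, "out_of_range": out_of_range}

# Hypothetical air-temperature column in degrees Celsius.
report = qc_report([12.5, None, 14.1, 99.9, 13.0], lo=-40.0, hi=50.0)
```

Running checks like these before deposit is what turns a raw spreadsheet into a data product others can trust.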
So we've created this primer on data management. It's only a few pages long, very short; you can download it, share it, and provide it to your students in your classrooms and so on. It goes through all the best practices with respect to the data life cycle, and there are numerous links in the document to additional tutorials and other information about managing data that you can access as well. So I want to
conclude with a couple of things, starting with some basic rules, or best practices, for data sharing and contributing to open science. The first: I think it's very important to create a data management plan, and if you have access to a tool that allows you to publish that plan, do so. This really helps you formulate good, solid practices for managing the data before a project gets underway; what you're doing is basically stating how you're going to manage data during the project and then after the project is completed. The second is to use some of the tools
like Morpho to document your data to the maximum extent possible. This means creating descriptions of the data so that someone who's not familiar with them can understand, interpret, and use the data correctly. That requires lots of details about the methods employed, where the data are located, formats, and much other information; Morpho is a great tool for helping you step through those requirements to develop a good, solid description of your data. Next is to preserve the data, ideally in a community repository. If you follow all these steps, you'll have created a data product that should be ready for data sharing, discovery, and reuse by others. The third
recommendation is to publish your data and metadata, either in a data journal or in something like Dryad, which is an open digital repository, so that you and others can easily go back to the data associated with publications, reuse them, and continue to build science on that earlier data. Fourth, in addition to data, it's quite
important that you and others be able to understand the methods that went into creating a data set, and possibly into analyzing and interpreting it. This is the basis of the fourth recommendation, which is to publish your analytical workflows and software. Workflows can easily be published in myExperiment, and many scientists now use additional repositories like GitHub to archive their software code for the long term and share it with the community as well. And then lastly, I
would argue it's important to publish results in open journals. There are lots of them out there, such as PLOS ONE and Ecosphere, that provide free access to the publications scientists create. So
where are we heading in the future? I'll highlight just a couple of future directions in the open science movement. The first one
is this slide, and I think it really encapsulates a lot of information, so let's step through it from the top down. At the very top we see the generation of ideas; that's the first step in the scientific process, and a lot of scientists nowadays are getting ideas from places like science blogs, Twitter feeds, and others. On the right side it shows that, as you're generating ideas, we want to share them; you can in fact do so via something like an open laboratory notebook, of which there are many on the market, including free and open-source solutions. You might do this as you're developing a research project: create a lab notebook and share that information with the folks working in your laboratory, with close collaborators, or with others. The second step is planning your research and writing proposals. Again, you can get lots of great ideas from literature discovered via Mendeley, ResearchGate, or other locations, and as you develop your proposal you may write it in something you can open up to your colleagues, like Google Docs, which is what I've used for the last five or six proposals I've written. Then, in terms of undertaking research and going through that whole data life cycle, there are a lot of tools you can take advantage of, and lots of places where you can deposit and share the products as well. On the right side, for example, we see the DMPTool; GitHub for archiving code associated with organizing, managing, and quality-assuring the data; the KNB as a repository for archiving the metadata and the data products; and workflows, whether derived from R, Kepler, VisTrails, or the other workflow tools on the left side, can again be archived in myExperiment. Then, when we get to disseminating results, there are lots and lots of places to do that nowadays: we can share posters via things like figshare, PowerPoint presentations via SlideShare, code via GitHub, and preprints via PeerJ, for example, among others; publications go out through open-access mechanisms like PLOS, and data and metadata through a variety of different public repositories.
Now in DataONE we're doing a couple of things to help promote open science in the future. One is a provenance tracking system we're working on now, and the way it works is this: we're looking at historical CTD data from an oceanographic cruise; these were data collected in 2014. We can actually look at and download the data via DataONE, and on the left side it shows the sources that might have gone into creating that particular data set. This is a product that will be released in another year or so; it's not finalized yet, and this is part of the usability testing we're doing, but this is what it might look like. So we see the sources for that particular data set on the left side, and then the data set itself, the stored CTD data from the Gulf of Alaska, may subsequently be used by two publications, and those will be highlighted and clickable on the right side. So this is one example of being able to improve the reproducibility of science: documenting where data sets were derived from and how they were subsequently used.
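Provenance like this is naturally modeled as a graph of derivation relationships, in the spirit of the W3C PROV "wasDerivedFrom" relation. A minimal sketch (the identifiers below are made up for the example):

```python
# Derivation edges: (derived_thing, source_thing), in the spirit of
# W3C PROV's wasDerivedFrom. Identifiers are illustrative only.
DERIVATIONS = [
    ("ctd_gulf_of_alaska_2014", "raw_cast_001"),
    ("ctd_gulf_of_alaska_2014", "raw_cast_002"),
    ("publication_A", "ctd_gulf_of_alaska_2014"),
    ("publication_B", "ctd_gulf_of_alaska_2014"),
]

def sources_of(item):
    """What went into making this item? (the left side of the display)"""
    return sorted(src for dst, src in DERIVATIONS if dst == item)

def uses_of(item):
    """What was later built from this item? (the right side of the display)"""
    return sorted(dst for dst, src in DERIVATIONS if src == item)
```

Querying the same edge list in both directions is what lets one screen show a data set's upstream sources on the left and its downstream uses, such as publications, on the right.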
And then lastly, I think another key activity we're involved in is creating a semantic annotation tool. This is for data originators, as well as for others who may have used a particular data set and want to come back and add some notes to it. In this case we see an example where several people are adding different comments about the data set; this may help amplify some of the methods used by the data creator, or there may be a couple of red flags and questions identified by users that could then be responded to by the data originator, for example. So this is the equivalent of adding Post-it notes to data products, so that they continue to be used and gain value over time.
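The "Post-it notes" idea amounts to attaching attributed comments to a data set's identifier without touching the data itself. A minimal sketch (identifiers and notes are invented for the example):

```python
# Annotations live beside the data, keyed by data-set identifier,
# so the original files are never modified. Identifiers are illustrative.
ANNOTATIONS = {}

def annotate(dataset_id, author, note):
    """Attach an attributed note to a data set."""
    ANNOTATIONS.setdefault(dataset_id, []).append(
        {"author": author, "note": note}
    )

def notes_for(dataset_id):
    """All notes attached to a data set, in the order they were added."""
    return ANNOTATIONS.get(dataset_id, [])

annotate("ctd_2014", "originator", "Sensor recalibrated mid-cruise.")
annotate("ctd_2014", "user_17", "Cast 12 salinity values look suspect.")
```

Keeping annotations in a separate, attributed store is what makes the back-and-forth possible: users raise red flags, and the originator can answer them on the same record.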
The last slide I had here was on this whole topic of altmetrics. Again, I think this is really helping lead the trend towards open science. There are mechanisms now like Impactstory, a creation of Heather Piwowar and colleagues; it's an enterprise that tracks the contributions of researchers across a whole variety of areas. It will highlight, for example, the number of papers you've had, the number of downloads of those papers, the number of citations of those papers, the number of tweets on your Twitter account, and so on; there are lots of other ways of documenting scientific productivity through this altmetrics approach. So again, this is something I think we'll see more of a focus on in the future. And I'll conclude with just this:
this is our website, dataone.org. These photos represent a lot of the communities we've tried to work with over the last several years of creating DataONE. At the top left there's a senior scientist associated with a global change research program; at the bottom left is a librarian associated with the University of California who's interested in providing education resources for faculty members associated with the library; and on the right side is a young investigator associated with the Lake Baikal research program who's interested in reproducibility, as well as in providing tools to her colleagues on the project. Thank you for your attention.