
4. Data Management Planning


Formal Metadata

Title
4. Data Management Planning
Title of Series
Part Number
4
Number of Parts
6
Author
Contributors
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language
Producer
Production Year
2013
Production Place
Oxford

Content Metadata

Subject Area
Genre
Abstract
What is a Data Management Plan? Why do we need them and how do they relate to our day-to-day research work? This short lecture introduces the concept of the DMP, set within the context of the scope and scale of data produced in modern scientific research, and breaks the how-to process down into short-, medium- and long-term project management stages. Please note that, as a result of student feedback and the analysis undertaken for the Post-Pilot Report, this lecture will be offered as a mini-workshop in the official materials release.
Transcript: English (auto-generated)
Right, so I'm going to talk to you for about ten minutes today, but once I've said a few words about the data management plan component of your phase one, I'm actually going to be handing over to our guests for today. So here at the front we have Jun Xiao and Graham Klein, who are both from the Department of Zoology. Once I've talked about data management, they'll be telling you a little bit about workflows, what we actually use them for in research, and they'll provide a bit of a software demo for you.

If you think back to the lectures we saw last week, they were all themed around the idea of reproducibility in scientific research. We looked at why that is actually a problem at the moment, and a growing one, and obviously through your assessments you're starting to think about delivering an array of different research outputs as one coherent research story, rather than just aiming for a single report or scientific paper. The lectures this week are going to start looking at slightly smaller components of that. Today we're going to think about the data outputs you're going to deliver over your research career, and how you can integrate the idea of data management into the way you're actually working. Tomorrow we're going to start looking at the publication process and how that's changing, so that you're adequately prepared for it when the time comes.

Now, I will say of data management: it's easy, when you're looking back in the later stages of your DPhil or the early stages of a postdoc, to realise there are certain things you wish people had told you to do, or told you more about, before you started. One of the things most consistently flagged up by a lot of scientists, and it's true for myself and for a lot of my colleagues and many other people you will meet, is: I really wish someone had actually got me to do a proper data management plan at the beginning of my DPhil. Because of course, when your research is initially developing, it doesn't really feel imminent or urgent, and then later on, once you're churning out huge amounts of data, you realise it would have been great to have a proper idea of how you were going to approach this. We're talking not just about the short term, but about how you're going to deal with long-term issues surrounding your research and the associated data.

Okay, so I'm going to keep this one as short as possible. We'll start off with this bit called the data wave, just to give you some idea of the actual scale and scope of scientific research data. Now, smaller-scale projects like the ones you're working on this week will produce reasonable amounts of data, but even that is a drop in the ocean compared to the scale and amount of research data that you, and others worldwide, are going to produce. This is actually a report from the European Commission from 2010, and in light of what is often referred to as the age of big data, they're starting to plough more funding not just into how we deal with data and how we educate people to deal with it, but into making sure we have the infrastructure that can actually handle storing that data and people accessing it.
So as a bit of an example, this report takes you through why policy surrounding scientific research data needs to change, and what the benefits of changing it are. If we just take this example for starters: in one day, a high-throughput DNA sequencer can read about 26 billion characters of the human genetic code. Over the course of one year, that's going to be around 9 trillion data units, roughly 9 terabytes (at one byte per character, 26 billion characters a day comes to about 9.5 trillion a year). Now if you start scaling up and imagine how many such sequencing machines we actually have worldwide, you can see very rapidly how this adds up. Even if, as a researcher, you look at your own amount of data and think, oh well, it's a lot of data for me, but I can deal with that, you need to start thinking about your own data in relation to that integrated whole. For example, ten years down the line, if you produced some data now and people wanted to pool several different data sets, of this size or many orders of magnitude bigger, they would need to understand your data, be able to access it, and potentially mix data sets so that they can data-mine and find links between different kinds of data. So you might be saying, well, okay, I'm never going to use a DNA sequencing machine. But this issue is by no means confined to that kind of problem.
No matter what area or what department you end up in for your DPhil, you're probably going to have to think about large-scale data storage at some point. Take medical image analysis: a lot of people from the DTC end up going into the biomedical research units over in Headington, and it's estimated that medical imaging may ultimately account for about 30% of all data storage. You're going to be looking at really big files if, for example, you're working with brain scans and trying to analyse those images for an Alzheimer's study; the report also gives examples of the sheer amount of mammogram data produced in the US across medical research studies.
It's just an absolutely huge amount. Then there's astronomy. You may or may not have heard of the Square Kilometre Array. This is a big project, aiming to be completed by 2020: a vast array of radio telescopes spread across a huge area of land, with a combined collecting area of around a square kilometre, looking up to the heavens and pulling in as much data as it can about the universe.
And that's going to be absolutely incredible; the possibilities for scientific exploration are immense. But they estimate it will produce around one petabyte of data every 20 seconds, so we need to start growing the infrastructure and the methods we use to deal with our data now. Similarly, there are things like systems biology, and the worldwide 1000 Genomes Project, where they're taking a thousand or more different individuals and sequencing their genomes, trying to draw inferences from that about different populations. And this affects even computationalists, because you're probably going to be working on some form of high-performance computing. If you do the kind of work I do, I've got to define huge domains of cells that each have lots and lots of little equations ticking away inside them, and for a particular numerical study I might want to run that population of cells over a vast range of different parameters. I need the infrastructure in place not just to run that kind of code: I need to be able to store the data, and to make it clear how that data was produced, and so on.
So you're going to need to work out how you're going to manage your data over the course of your DPhil. We've already seen some of the pieces: you've been trained a lot in the last couple of weeks in MATLAB programming, where you're actually creating the data, and last week we looked at licensing, which is your means of legally enabling the reuse of that data for particular purposes.
Today we're going to look at the idea of preserving data, and also metadata. I know we've spoken within your groups a bit about this in some cases. Metadata is basically the data about your data: if you had a raw set of data and you provide metadata, the metadata would typically include the parameter set you used to create that data, the initial conditions you used, and any details of the model it was put into. Ultimately you want to be able to put all of these into some sort of central, available reserve, so that either you or other scientists can look at your data set, or at your data set within a larger pool of data sets, and potentially mine it for connections, analyse it in different ways and reuse it.

So what's in the data management plan? We've got to start thinking about what's happening in the short term and what's happening in the long term.
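As a minimal sketch of what that "data about data" might look like in practice, the snippet below writes a metadata sidecar file next to a result file. All of the file names, parameter names and field choices here are invented for illustration; the point is simply that the parameters, initial conditions and model details travel with the data they describe.

```python
import json
from datetime import datetime, timezone

# Hypothetical example: a simulation has just written its results to
# "cell_population_run_042.csv"; we record how those results were produced
# in a small "sidecar" file stored next to the data itself.
metadata = {
    "data_file": "cell_population_run_042.csv",
    "created": datetime.now(timezone.utc).isoformat(),
    "model": "coupled-cell ODE model, version 1.3",  # invented model name
    "parameters": {"coupling_strength": 0.8, "num_cells": 500},
    "initial_conditions": "all cells start at the rest state",
    "produced_by": "run_simulation.py",
}

# Store the metadata alongside the data it describes.
with open("cell_population_run_042.meta.json", "w") as f:
    json.dump(metadata, f, indent=2)
```

With a file like that in place, someone pooling your data ten years down the line can see how it was produced without having to track you down.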
As you're producing stuff on a day-to-day basis, you're going to need to think about what file formats you're producing that data in, and, as we said, the metadata. Then there are local backups, something that loads and loads of DPhil students just don't think about in the short term. If your data gets hit and you lose six to twelve months' worth of work in one go during your DPhil, there's really no way back from that. If you've got a plan in place from the beginning about regularly backing up, whether that's daily or on a different timescale, it's going to help you be a much more efficient researcher. And at this stage, the research phase, in the short term you also want to think about the ownership of your data. Are you the data owner? Are you the sole data owner, or one of many data owners as part of a research group? Because, as we've seen with the licensing, that will affect how you're able to use your data, and the different constraints you need to take into account when you're working out, long term, where your data is headed and how you handle that.
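To make the backup point concrete, here is a minimal sketch of a dated snapshot routine, with placeholder paths; a real setup would put the copy on a separate physical drive or a remote machine and schedule the script (with cron, for instance) rather than running it by hand.

```python
import shutil
from datetime import date
from pathlib import Path

# Placeholder locations: project data in ./data, backups on a separate
# drive mounted at /media/backup (adjust both for your own setup).
source = Path("data")
backup_root = Path("/media/backup/my_project")

# One dated snapshot per day; rerunning on the same day does nothing.
snapshot = backup_root / f"data-{date.today().isoformat()}"
if not snapshot.exists():
    shutil.copytree(source, snapshot)
    print(f"backed up {source} -> {snapshot}")
```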
This middle bit you might term the dissemination phase. Is that a question at the back? No? Okay. The dissemination phase is that point where you've moved away from just working on your research each day at a local level, and you're starting to work out how you're going to release that data. Now, in your case, for this assessment, the dissemination phase is pretty obvious: it's the point where you start uploading things to GitHub and releasing them to your successors. But imagine you were working in a research group during your DPhil: you're going to have to work out what materials you're going to release, when you're going to release them, and how.
For example, for the purposes of protecting and safeguarding your own research, or if you're working with industry, would you need to embargo that data for a certain period before you release it to the public? These are all things you're going to have to take into account. Then, in the long term, you need to start thinking about preserving your data.
You might not always see which elements of your data are useful. It's often the case that many, many years down the line, somebody finds that they want your data sets. You'll notice this if you try to access the data behind a particular paper you need, say a computational study from 1995: there will often have been data behind it, but you just can't get at it, and sometimes that's a real stumbling block. So, as part of trying to make science more efficient as a whole, if you adequately plan for this right at the beginning of your research, and keep reappraising that plan as you go, it will make you a much, much more efficient scientist, because it's automatically helping you to structure your research output. It's going to enable, as we said, data mining that involves your sets of data, and in particular it increases the longevity of your research. You'll see this more and more as you continue through your DPhil. So what do I want you to do for today?
Well, this document is up online. You don't have to read it right now, and it's not essential for your assessment, but you will find it a useful resource as you carry on with your research. The Digital Curation Centre exists specifically to provide resources and advice regarding data management, so I've put this up; it will be on WebLearn this afternoon, if you want to start thinking more about how you're going to manage your data. And I know what you ultimately want to know is: how does this relate to the assessment? Now, we're not going to ask you for a large-scale data management plan. Obviously, on the timescale of your project, you've got an awful lot to get through.
But what I would like you to think about is: how are you managing your data for this project? Is there one person in your group who's going to be largely responsible for the data, or are you going to own that data collectively? If I had a question about that data, to whom would I need to go in your group?
That sort of thing. So here's what we're going to do. Many of you have already hunted down David Shotton's 20-questions online tool for data management plans; I will send you an email link for it. It was having some server-side errors over the weekend, so some of you may have accessed it and thought: oh, this is going wrong, what on earth do I do? I will email it out to you now.
So what you'll need to do this afternoon is get together in your groups. It should only take about five to ten minutes; it's really, really speedy. I want you to answer these 20 questions as they relate to your work at the moment. And as you'll see, I want you to save this data management plan as an XML file and just pop that in your GitHub repository.
And that means that when you arrive in the office tomorrow and I switch you over onto a new project, your successors will be able to look at your data management plan, and they'll be able to understand why you've structured your data the way you have and what file formats they should be expecting. It will also help summarise things like any licences you've chosen to apply.
I will send you this link, but it will look a little bit like this. David Shotton has designed this as a series of questions to assist early-stage, early-career researchers in thinking about how they're going to manage the data side of their research. So you can see, to begin with, just pop in either your name or your research group name for the purposes of this exercise.
You don't have to worry about any of the studentship numbers it asks for; that's not necessary for this particular assessment. But what you will have to do is start thinking about the nature of the data you're producing.
So you can see here, it's often really handy, once you're working with much larger-scale projects, to understand what file formats are being dealt with, because that can cause no end of problems. But, interestingly: who owns the data? When we talked about licensing, we said you need to make sure you're fully aware of who owns any of the content or data you're choosing to license, because you can't license something you don't already have rights to. This is why I want you to complete these data management plans as a group. Then there's descriptive metadata, for example those parameter sets or initial conditions: how will you create that metadata, and how will you incorporate it into the files you're delivering? And then, in tabs three and four, we start moving on to those intermediate and later stages: away from just producing the data, to thinking about how you're sharing it in the short term, and then to, if I told you that you had to preserve the data from this project in the long term, how would you choose to do it? I will be around all afternoon to answer any questions on this, and I realise this is a very small-scale project, so some of these questions may not seem as relevant now as they would be if I were asking you to do this for, say, your summer projects for the DTC later this year. But hopefully what it's designed to do is get you thinking about certain questions you should be asking yourself at the start of a research project, if you're going to deliver it right through to the end in a way that's useful for other people.
What you'll need to do, once you've filled in the answers to those 20 questions, is literally just go to Save here, pop in a file name, and click save as XML: it will pull all of the information you've put into the form and save it as an XML file. That just needs to go up on your GitHub repository. For phase two, when you open up another group's repository tomorrow morning and find their XML file, you'll be able to download it, come to this Load tab, and, there we go, load it in. So you'll be able to understand the wishes of the group you've inherited the project from, how they structured their data, and how they would recommend you manage it.
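If you ever want to inspect one of these saved plans outside the tool itself, a few lines of Python will do. The sketch below simply walks whatever elements the XML file contains and prints them; the file name is a placeholder, and since the tool's actual schema isn't shown here, the code deliberately avoids assuming any particular element names.

```python
import xml.etree.ElementTree as ET

# Placeholder file name; use whatever your predecessor group committed.
tree = ET.parse("data_management_plan.xml")

# Print every element that carries text, whatever the schema looks like,
# so you can skim what the group recorded (owners, formats, licences, ...).
for element in tree.getroot().iter():
    if element.text and element.text.strip():
        print(f"{element.tag}: {element.text.strip()}")
```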
I realise that sounds quite formulaic, but once you integrate it into the way that you're doing your research, it will prove really, really useful. So now I'll just skip to the very end; the last slide is here, so we'll stay on that one. I'll put these up on WebLearn. That should provide some info about the Digital Curation Centre, and there are also a couple of really useful resources, including a whole leaflet about how Oxford University provides facilities for you to manage, archive and store your data. So if, as a DPhil student, you find that you need large-scale data storage, those leaflets will point you to lots of people you can contact about how to achieve that.