Michael Wilson, STFC at the DataCite summer meeting 2012

Video in TIB AV-Portal: Michael Wilson, STFC at the DataCite summer meeting 2012

Formal Metadata

Michael Wilson, STFC at the DataCite summer meeting 2012
Meeting a scientific facility provider's duty to maximise the value of data
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
10.5446/6566 (DOI)

Yes, I'm Michael Wilson, from the Science and Technology Facilities Council, which is a research council. We fund research like the other research councils, but we also have facilities, and a lot of what we do is run big science facilities. Big facilities cost money and produce data, and with that comes an obligation to get the best use out of that data. I'll describe what we do with data and then raise some issues. So what do we do? The facilities we help run: in big science, in particle physics, we pay for and manage the UK subscription to CERN, a
billion-euro facility. We do the same thing for the European Space Agency, which runs at about a billion euros. We also build the cameras in most of the satellites watching you; they operate at about minus 270 degrees, to try to keep the atoms as stationary as possible, so they're as sensitive as possible. That's a speciality, and we get lots of data from that. We do various things in astronomy with the European Southern Observatory and in other areas. Those are the big facilities. The next one coming along is the European Extremely Large Telescope, which had its funding agreed this week; that will probably be a 10 billion euro facility. And the Square Kilometre Array, which will cover Australia and southern Africa up to the equator, will be in the order of another 10 billion euros. That's big science. At the smaller size, small science, we manage the UK subscription to the ILL, a neutron source. We also have our own national facility, another neutron source, at our laboratory, and an X-ray source called Diamond. These national facilities tend to come in at about half a billion euros. Generally we can persuade a minister to write a cheque for about a billion, so we can either get one national facility or a contribution to an international one. To get him to write bigger cheques, and to keep writing them, we need to show that he's getting value for his money. So I'm going to focus on the last lot, the small science, and talk a little bit about what these devices do for data. The neutron sources and the X-ray sources are big microscopes. You're watching me: light is coming in the windows, bouncing off me and going into your eyes, and your eyes are the detectors. These things have particle accelerators that whirl round protons or whirl round electrons to produce X-rays or produce neutrons. That is the light coming in the window. Somebody brings along a little crystal, the light bounces off the crystal and goes into a detector, and the detector produces data. In your case the data goes
right to your brain; in our case the data goes from the detectors to our computers, and each facility archives its data. We've got about 3 million files
from 20 years; each file can hold hundreds or thousands of records. From Diamond, over its last few years, we've got 100 million files. But we also take the data from CERN: in the last 3 years, about 11 petabytes. That may sound a lot, but it's only a fraction of what they produce. We've also got computers that people can use to analyse this data. For a few months at least, we currently have the UK's most powerful computer and also the UK's most powerful graphics-processor-based computer, which are used to analyse all sorts of data. We've got a large commodity computing cluster with 7 thousand processing cores, which is used to analyse CERN data. We've got a new high-throughput super data cluster, which transfers data at one terabit a second and is used to analyse the earth observation data and build models of climate change. Altogether we currently have a tape robot system with a hundred-petabyte capacity for storage, and we're doubling our data every year. So a lot of big money has gone in, and it is producing a lot of data. We have to maximise the value of that. So why do researchers use the data? This is a list of what we think we want to do to get the most out of that data. First, we've got the researchers' own access: the data comes off these machines and
they want it the next day. Actually, they also want it back 10 years on, when they've forgotten where they put their own copy or what they labelled it as, or when they find some new result they want to combine it with. So we've got to preserve it, we've got to keep it, and we've got to keep it available to them. But we also need other researchers to validate those results: the results of the analysis and the conclusions drawn from the data. They can't afford to redo experiments on these big pieces of equipment, and you can't rerun events in history that were observed. So they need to get hold of that data. In published papers we want, exactly as was said in the previous talk, DOIs that reference the precise data that was used, and that do so precisely. We want to promote people doing meta-studies. Yes, these are big pieces of equipment and they're expensive, but they still have a certain sensitivity. Obviously people can combine multiple datasets and often find effects
that couldn't be found in any one dataset. Again, they have to reference those through DOIs. The big machines, the big computers, are used to build models. Climate change is an obvious example; so are things like how galaxies formed and how the universe was formed. Those models need a lot of parameters set, and they also need datasets to test against. These users need access from a different point of view: they are people who don't know the experiments, what the equipment was or what its settings were. They need a whole lot of extra information to help them understand what the data means before it is useful to them. Then there are completely new uses of the data: we get data used for experiments or science we didn't know about. As was mentioned this morning, there is a dataset that is a continuous record of 300 years, collected 6 times a day. It started off being collected so that people could get ships up rivers without running aground. It's really important in climate change studies to have that sort of long-term dataset, and nobody thought about using it for that until about 30 years ago. With other datasets, again, when people come in with a completely new use, they don't know a lot about how the data was originally collected or what's been done with it. They need
a lot more information on the data. A lot of the stuff we're doing is about fundamental science, but a lot of it is also about how we get that new drug to cure cancer, or how we get that new material to make carbon-based computers now that silicon chips have limited growth potential. Those sorts of activities have an enormous financial payoff, so they are covered by patents, with whole issues of rights, control and management. We've got issues at the moment with US courts demanding electronically stored information in the defence of patent court cases. In November last year a judge demanded that all the electronically stored information over 18 years that was relevant be disclosed in a patent case. There was a similar financial case, where a similar judgment was made about 6 years ago; the company said it had disposed of the data: a 20 billion fine. So it's not just a matter of having the data and saying "here it is" or not. If we don't produce it, the fines are enormous. So we've got to keep it on a 20-year scale. We also talk about evidence-based policy-making. A lot of the policy coming out of climate change is based on climate change models, which are based on datasets that can be decades or centuries old. So we have to preserve data for a long time, we have to be able to identify it, and people have to be able to find it. They have to trace back through the literature to where that data has been referred to, even if it was referred to in the Proceedings of the Royal Society in 1797. They have to be able to follow it. As I go down this list, the users involved know less and less about the data. As I go down, the payoff is further and further into the future, away from when the data was originally collected. And going down the list, the probability of any single dataset producing the payoff is lower and lower. So you have to make an investment on a low probability of a potentially high impact. We have to try to understand that space of data reuse and its benefits. We try; we understand why we're doing it, but we don't know the economics well enough
yet.
But we know that DOIs sit in there, because they're an important way of maintaining references to data that is reused. There are various models of data use and various life cycles that we go through, but essentially, in this cycle, we've got scientific publications and we've got various sorts of data being preserved in archives. People have to discover that data, and it has to be available to them. The way we promote that data is through a particular tool, ICAT, a metadata catalogue, which contains access to the data from those facilities, the big microscope sources. It also contains access to the proposals. When people say "can we have some time on the facility?", they write a proposal, and we keep it. When they are scheduled, we tell them: yes, you can have that time on the facility, and you can use it for what you asked for. When they turn up they run the experiment, and the data gets put in the archive.
That is the data coming off the machine with its calibration settings; the calibrated, cleansed data goes into the archive too. They can download it, they can analyse it and do what they like with it. After that they will have some results, which they will put into a publication. That publication will then use a DOI to cite the original source data. The publication goes into the archive with its DOI, so we can follow the chain, and no single human being has touched anything to do with metadata so far. This is a big contrast with what's happening in the social science area. To get the data to the scientists who wanted it, and to the people who validate the results, we do this automatically. As we go further down the list I showed, we've got to add more and more metadata by hand, and that's what we need to understand. We have lots of computers involved in this process, which of course includes access to the DOI service; the DataCite system is an integral part of the chain, which is entirely automatic, with automated metadata collection. We schedule a proposal: that tells us who's doing what, who the funder is and what they're doing. The instrument tells us what the data is and what the settings are. The publication tells us what the analysis was and what the results were. The DOIs are the addresses that link these things together. Except that a percentage of the people who use the facilities don't actually do what they said in the proposal. They said they were going to look at a piece of gold, and they end up looking at something similar with a bit of glass in it. Say they bring along a crystal and put it in the big microscope to find its structure. It might take 3 or 4 years just to grow that crystal to a few microns in size. They might be looking at it as a single large crystal, and they want to get the final structure; that's the particular result that wins the prize. That's what we want DOIs for, because people want to know where the data behind the prize is. In other cases people swap in other things, and then we get a
paper published about something outside the cycle. Which data do we want DOIs on? The system produces raw data, there is calibration, and there are various calibrated versions. That's automatic, and that's all we've got. Then they do some analysis, which means they run computer programs.
Which computer programs? They're producing various bits of derived data. Which ones? We don't have that: they've gone away, back to their laboratories. So we want it deposited with us, and we encourage them to send it back. It's the same problem that other people have: we're waiting for the researcher to give us something where they don't necessarily see the immediate benefit. We also have issues that immediately broaden the problem. Which software did they use? Do we have to retain and preserve that piece of software? Do we want to put a DOI on the piece of software, so that we can say: this is where the data came out of the experiment, this is where it was calibrated, this is where the user took it away and ran this particular piece of software? That means DOIs for the software as well as for the data, so that somebody could repeat the analysis. How do we wish to preserve these pieces? We've got software from the 1960s that used to run on IBM 360 mainframes. We've actually got emulators of IBM 360 mainframes that will run on laptop PCs today, so we can still run the 40-year-old software. This sort of replication can be done. How would we reference the DOIs? Which data should actually be cited? At our facility we allocate beam time, anywhere between 5 minutes and 2 weeks depending on the type of experiment, and the moment we allocate that beam time we allocate a DOI. We say: this is where your data is going. The investigation can be cited even before the data actually exists. At what level should the DOI be assigned? We assign time; we say you've got a day, or four days. At our facilities we assign DOIs on the basis of the experiment only, because that's the unit that is obviously managed. In that time they do an experiment; the experiment produces one or more datasets; each dataset has hundreds of files. So we have a metadata structure which relates the facility and the experiment to
multiple datasets, and multiple datasets to multiple files. These can be grouped into logical studies, with one proposal but multiple experiments, and, further, on top of that come access control issues and related material: proposals and publications.
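The structure described in the talk (a proposal leading to experiments, each experiment producing datasets, each dataset holding many files, with the DOI assigned at the experiment level) can be sketched roughly as follows. This is an illustration only, not the facility's actual schema: every class name, field name, the proposal reference and the example DOI are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataFile:
    name: str


@dataclass
class Dataset:
    name: str
    files: List[DataFile] = field(default_factory=list)


@dataclass
class Experiment:
    """One allocation of beam time. In this sketch the DOI lives here,
    following the talk's statement that DOIs are assigned per experiment."""
    title: str
    doi: Optional[str] = None  # may be allocated before any data exists
    datasets: List[Dataset] = field(default_factory=list)

    def file_count(self) -> int:
        return sum(len(ds.files) for ds in self.datasets)


@dataclass
class Proposal:
    reference: str  # hypothetical proposal identifier
    experiments: List[Experiment] = field(default_factory=list)


def doi_for_file(proposal: Proposal, file_name: str) -> Optional[str]:
    """Return the experiment-level DOI that covers a given file, or None."""
    for exp in proposal.experiments:
        for ds in exp.datasets:
            if any(f.name == file_name for f in ds.files):
                return exp.doi
    return None


if __name__ == "__main__":
    # Hypothetical example values, for illustration only.
    run = Dataset("run1", [DataFile("a.raw"), DataFile("b.raw")])
    exp = Experiment("crystal study", "10.0000/example.exp1", [run])
    prop = Proposal("PROPOSAL-0001", [exp])
    print(doi_for_file(prop, "b.raw"))
```

With the DOI held only on the experiment, every file resolves to the same identifier. Citing an individual dataset, file or record would need an identifier at those levels too.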
There are other issues. We know who the investigators are, because we're giving them time. But if they get married and change their names, or change university, we don't manage those individual identifiers. So when those individual persistent identifiers break, identifiers in this structure break. At the moment we're allocating DOIs at the experiment level; that's what's being cited. We will get to the point where we have to restructure the namespace of the DOIs, and the management for it, with provision that allows us to cite datasets, individual files or indeed records, so that we can say which is the prize-winning one. When it's the dataset that we expect to hear about by the end of the year, the one that shows the money was well spent, 18-year-old physics students in Kazakhstan or wherever will want to find that particular data. So we are going to have to identify data at a finer granularity. We've got a system that will support that within the current system. However, when do we publish? About 1% of our data is commercial, and we don't publish it. That isn't an issue for us: commercial users just collect their data and take it away.
But we've got different sorts of facilities, and they do different types of science. Different types of science have different policies about how long scientists should have exclusive access to their data. Normally there is some embargo period, which says the scientist has the data for his own use first. We set that normally at around 3 years; that's a PhD period. Often universities have PhD regulations: if we were to release the data early, it might put a PhD in doubt as to whether it was original research. But what's really important to everybody is that we record who accesses data, so that later on we know when somebody has accessed it. It is quite common to talk about embargoing data. We also have a real problem with the idea of embargoing metadata. There was an example from 2004 that is well known in astronomy: a Spanish research team announced the discovery of a new planet.
This was greeted with all the response the press would give somebody saying "we've discovered a new planet". It turned out, actually, that there was a group of people in California who had also found it, and who had previously announced a workshop where they were going to announce the discovery. The Spanish people had actually got in and announced first. They had looked at the record of where those Californians had been and which data they'd been looking at, and said, in effect: they're thinking of announcing a new planet; this is the data they were looking at to announce a new planet; we'll look at that; yes, we can see a planet; let's announce it. So, by being able to get access to the record of which astronomers had accessed the data, they jumped in and took the first announcement of a major discovery. So scientists are really keen on knowing exactly who has accessed their data, and yet really wary about letting people know who has accessed the data and at what stage. Sometimes it is the same with metadata. Even knowing that a group of cancer researchers in a particular university is being funded by a particular drug company to look at a particular chemical is enough for another drug company to think: obviously that chemical is relevant to particular types of cancer. They look into it, and then they think it could be a potential new drug to kill cancer in that area.
Just giving away the title could give away commercially valuable information. So we've got issues about how we embargo the metadata, how we embargo access to the logs, and for how long we embargo.
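The embargo period and the recording of who accessed the data, as described in the talk, can be sketched roughly as below. This is a minimal illustration under stated assumptions, not a description of any facility's real system: the class, its fields and the user names are hypothetical, and the 3-year default only reflects the "around 3 years" mentioned in the talk. Policies differ between facilities and types of science, and as the talk notes, the access log itself may also need to be embargoed.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Tuple


@dataclass
class EmbargoedDataset:
    owner: str                 # the scientist who ran the experiment
    collected: date            # when the data was taken
    embargo_years: int = 3     # assumed default, roughly a PhD period
    # Every request is recorded as (user, date, granted).
    access_log: List[Tuple[str, date, bool]] = field(default_factory=list)

    def embargo_ends(self) -> date:
        """Same calendar day, embargo_years later (29 Feb falls back to 28 Feb)."""
        target_year = self.collected.year + self.embargo_years
        try:
            return self.collected.replace(year=target_year)
        except ValueError:
            return self.collected.replace(year=target_year, day=28)

    def request_access(self, user: str, today: date) -> bool:
        """Grant access to the owner at any time, to others only after the
        embargo ends, and log the request either way."""
        granted = user == self.owner or today >= self.embargo_ends()
        self.access_log.append((user, today, granted))
        return granted


if __name__ == "__main__":
    ds = EmbargoedDataset(owner="alice", collected=date(2012, 6, 1))
    print(ds.request_access("bob", date(2013, 1, 1)))   # during the embargo
    print(ds.request_access("bob", date(2015, 6, 1)))   # embargo has ended
    print(ds.access_log)
```

Logging refused requests as well as granted ones is a design choice in this sketch. It keeps a record of interest in the data without releasing the data itself.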
As for the data itself, we want to avoid data misuse, and not just in those ways. We've got lots of satellite earth observation data. That's a problem, because it includes very detailed observation data of events such as the aeroplanes crashing into New York in 2001, and the originals are wanted for reasons we would rather not deal with. So we have to decide how to deal with those case by case. We've also got a catalogue of climate change data, which reports draw on and which we want to make available to the public so they understand the issue. But we can only make it available in such a way that the conspiracy theorists out there can't jump on one part of it and make something up. With CERN data, we don't want people suddenly saying "they're making black holes and they're going to destroy the universe". So do we not publish? But if we do that, if we stop publishing at all, you stop science: science has to be verifiable. So we've got these real political risks as well. And lastly I'd like to come back to a point made earlier. A lot of money is being invested in these experiments, and a lot of money in the data collection. We need to justify it and get the best use out of that. DOIs are a vital part of that process: for the data, they might be necessary for the software, and we need them to link publications into the cycle. We also need to determine what the return on investment is. How do we calculate it? Different people want it in different ways. The European Space Agency has a financial model based on space missions as long-term infrastructure; some research councils model infrastructure; some people believe in different ways of doing these evaluations and the costs involved. We heard a little bit about this earlier: the evaluation of the ESDS at the UK Data Archive, by the ESRC, was mentioned. Several things are going on, with people looking at what the costs are and what the return is. There are lots of different approaches; it's not quite a toolbox yet.
Another one is a European project called ENSURE, which is looking at return on investment, also in the commercial area, for preservation-based systems. But above all, we can do many things to reap the benefit of DOIs and of preserving data. What are your objectives in evaluating the activity involved? As I said, I put up a list: there are different objectives, different groups and different life cycles.
Those are three areas that I think would be of interest to go into.