Matthew Woollard, UK Data Archive at the DataCite summer meeting 2012


Formal Metadata

Title: Persistent identifiers in practice: the UK Data Archive's approach
Speaker: Matthew Woollard, UK Data Archive, at the DataCite summer meeting 2012
DOI: 10.5446/8392
License: CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

My name is Matthew Woollard and I am director of the UK Data Archive. The UK Data Archive is a subject-specific digital archive: we deal with social and economic data in the main, and the organisation has been running for around 45 years; it was set up in 1967. This talk is not particularly about legal or ethical issues. It is really a use case of how we implemented digital object identifiers within the UK Data Archive, how we interacted with DataCite, and some ideas about how we might be able to move some of these initiatives forward in the future with members of DataCite. I should say that this is a presentation I have given once before, in a very near variant, at a workshop on persistent identifiers in Berlin last month. It is up to date, and I have tweaked it for this conference, but there are still some broad messages in here which I am pretty certain everyone is going to be aware of.

I like to think about the reasons behind citing data. It goes back to the "why": why do you try to measure impact, and why do you need to cite data? I think there are six reasons here, including helping to track the impact and use of data collections. Within the social sciences, and I think in every presentation you hear about data citation and digital object identifiers, these six reasons are usually brought to the fore, in different orders and with different rankings. But I think it is really important that we continue to recognise some of these principles. It is about scholarly credit, it is about reputation, and it is about impact. It is about reliable information about the data. But also, as Andrew pointed out very clearly this morning, it really helps us to find and access data.
Now, this is our approach to citation as it was in the past, and you are not supposed to be able to read it. We have an end-user agreement with all users: when users come to us and take up data, they have to click "I agree" at the bottom of it. It is not as big as the agreement when you update your Mac operating system, but it is quite long, and tucked in here it says: acknowledge in any publication, whether printed, electronic or broadcast (lawyers have been involved in this; the Archaeology Data Service once had some of their data used in a performance, so broadcast is possible) based wholly or in part on the data collections. And underneath that it says: supply the relevant data service provider (that is us) with bibliographic details of any published work based on the data collections. This is the old style, but it is still within our end-user agreement. Our approach to citation is pretty straightforward: it should provide enough information to ensure the exact version can be found. You can see it says here "2nd edition", and this has been the way in which we have asked people to cite data since the mid-1970s. And a citation is not an acknowledgment. People who use data might like to thank "my friend Professor Expert for letting me use his data", and I do have examples of this, though I have forgotten to bring them with me; but that is an acknowledgment, and we want people to cite data. Obviously we think so, or we would not be here. We think that the use of persistent identifiers for data should be similar to their use for other research outputs.
We take these two terms, "persistent" and "identifier", and break them down, and we want to make sure that we are actually doing what those terms require. The systems must be enduring, perhaps not forever, but for a very long time, and to some stage in the future that we cannot necessarily comprehend; and identifiers must be unique. It sounds pretty straightforward, but it is not always put into practice when you are thinking about citations and the allocation of persistent identifiers to things. The other area we have to think about is that the digital object, whatever it is that is being cited, needs to be clearly defined in order to ensure the appropriate granularity of the identifier being given to it. And as we heard this morning, this is not just about citing data or citing research outputs: persistent identifiers of all sorts can be used to identify individual researchers, taxonomies, and all sorts of other things as well. It is wrong to think this is only going to be of use for research outputs of one sort or another.

We went through a reasonably long process in order to implement digital object identifiers, because our data collections are not in themselves single digital objects: they are collections of digital objects, and many of the older ones still include physical manifestations, codebooks on paper that have not all been digitised. Also, because we are an organisation that makes data available in a usable form to the end user, we make changes to data, and this can happen quite frequently, not just in the ingest process. If we discovered that a statistical organisation within the UK had given us some personally identifying information that should not be in the data, we would remove it. The data has changed, so the persistent identifier is no longer persistent. We have to think about the way in which data is versioned, and we need to try to do this in a commonly understood manner which deals not only with editorial changes but also with changes in longitudinal data collection exercises, where data is added in waves to data sets over time. We wanted our understanding of these changes to be rule-based but human-mediated, so that we could have a little bit of flexibility over what counts as a significant or high-impact change. And we also wanted to make sure that we could implement some of these things in a machine-actionable way, because the huge cost of digital preservation and data archiving is human effort, and where we can get humans out of most of the workflow, we reduce the manual work for the better.

We wanted to integrate the processes with our digital preservation activities, but we also wanted to make sure they worked within our current infrastructure and workflows, and we wanted to get it right first time. This is important: of the things we ingest in any given year, about 200 data collections, around 15 per cent have a change within the first year. Some of these are new editions; there are other types of changes too.
This slide shows some of these changes. Some of them are new editions, but there are also changes to the underlying metadata, which are slightly lower-impact changes. So what we have done is to say that we have high-impact changes and we have low-impact changes. We also recognise that the majority of social science users want the most recent version; they do not want the old version. Most social science users are not using data to replicate or validate other people's research; they are using it for their own research. So we made the decision that older versions would be made available to users on demand, and that information about those older versions should be available, but users have to ask for them. In most cases we can go back through our systems and retrieve older versions, and we make them available on user demand.

Then we have this roll call of low-impact changes: a change in a reference, the spelling of a variable, the removal of administrative information or metadata, spelling corrections, adding index terms, adding documentation, or making a change to access conditions. We are getting towards 10 per cent of our collection being in some way restricted access, sensitive data; about 60 per cent of it has another access condition, which means that you cannot just come along and take it away; and about 30 per cent of our collection is open on registration. So a considerable proportion has an access condition that means you cannot just come along and take it.

Then we codified some of the high-impact changes. You will see these include new variables, new codes, new waves, data which was miscoded, changes in file formats, and significant changes in documentation. Again, "significant" is a problematic word: what do we mean, and can we measure it? A change in access conditions also counts: if it is a change from closed to open, that is a minor change; if it is a change to restricted, that is a major change.
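The high- and low-impact rules above could be sketched roughly as follows. This is a minimal illustration, assuming hypothetical change-type labels and function names; it is not the Archive's actual implementation.

```python
from typing import Optional

# Illustrative labels for the change categories described in the talk.
HIGH_IMPACT = {
    "new_variables", "new_codes", "new_wave",
    "miscoded_data_fixed", "file_format_change",
    "significant_documentation_change",
}

LOW_IMPACT = {
    "reference_change", "variable_spelling", "admin_metadata_removal",
    "spelling_correction", "index_terms_added", "documentation_added",
}

def classify_change(change_type: str,
                    old_access: Optional[str] = None,
                    new_access: Optional[str] = None) -> str:
    """Return 'high' or 'low' impact for a proposed change."""
    if change_type == "access_condition_change":
        # Opening data up (e.g. closed -> open) is minor;
        # restricting it further is major.
        return "high" if new_access == "restricted" else "low"
    if change_type in HIGH_IMPACT:
        return "high"
    if change_type in LOW_IMPACT:
        return "low"
    # Rule-based but human-mediated: anything unrecognised goes to a curator.
    return "needs_review"

print(classify_change("new_wave"))             # high
print(classify_change("spelling_correction"))  # low
print(classify_change("access_condition_change",
                      old_access="open", new_access="restricted"))  # high
```

The fall-through to `needs_review` reflects the point about flexibility: the rules decide the easy cases, and a human decides the rest.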
So we started to take these ideas and turn them into a straightforward workflow, and we decided to think in terms of instances. You can see that we have three different types of change. If there is an internal change, something we do in processing that nobody knows about and that is not released publicly, that is an internal instance. If there is a low-impact change, we have a new external instance with the same persistent identifier. And if there is a high-impact change, we have a new external instance and a new persistent identifier. That is the methodology.

Then, last year, we started work with the British Library and DataCite to try to allocate digital object identifiers to our six thousand odd data collections. In discussion with DataCite and the British Library, we came to the opinion that we had better allocate the identifier to the metadata. We had originally thought it would be better to allocate it to all of the metadata, but even the titles of some of our studies change: adding a new wave to a longitudinal survey gives it a new title, for goodness' sake, so that is not persistent enough. So we worked on the basis of allocating a DOI to the metadata which relates to each external instance and which describes the data collection, and the digital object identifier resolves to a jump page pointing to all of the external instances.

It should look like this. This is time period one: a user comes in looking at a survey, waves 1 to 13. It has a digital object identifier which contains the study number, which is our collection's study number, and that points to instance-specific data and metadata. The user comes in at time period two: the DOI has changed, the title has changed, and there is a pointer to the new current instance-specific data and metadata. If a user arrives with a DOI from an earlier time period, the instance-specific metadata for that period is still linked, but only for that period: the user sees the older material, but they are not able to get the data from the earlier versions.

This is what it looks like in our catalogue, and it is reasonably straightforward. Here is the citation, and this says version 2 of a data set. In fact it is not version 2, because it is version 12; but it is the second new version since DOIs were introduced, and there is version 3 above. On the right-hand side there is a pointer to the catalogue record where the user can get hold of the data.

As for the process we went through (there was a question this morning about how you actually do this), it is these six or seven steps: we mint a new DOI through DataCite, we update the change within our systems, we create a new citation file, that syncs with the catalogue, and Bob's your uncle. If we have to update the catalogue record it is a little bit trickier, because we have to keep the old catalogue record as a record of what was present in the past.

Since these things are not always entirely clear to everybody, there is a lot to be said about what you should be putting after your publisher slash. We thought that we would keep a human-readable identifier, "UKDA", within the DOI. The suffix we are putting in means something: to me it means the study number, and it means that we can use very similar numbers to define things at, say, the series or collection level.
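The instance rules and the minting steps above could be sketched together as follows. This is a hedged illustration: the function names, record fields, DOI prefix and suffix layout are all assumptions loosely modelled on the talk, not the Archive's actual system, and the real workflow also involves calls to the DataCite minting service.

```python
# Sketch of the update workflow: a high-impact change mints a new DOI;
# a low-impact change keeps the same DOI but creates a new external
# instance; in both cases the citation file and catalogue are refreshed.

def next_doi(study_number: int, instance: int) -> str:
    # Hypothetical suffix scheme with a human-readable "UKDA" element.
    return f"10.5255/UKDA-SN-{study_number}-{instance}"

def release_new_instance(record: dict, high_impact: bool) -> dict:
    """Apply the instance rules to a collection record and return the new state."""
    record = dict(record)
    record["instance"] += 1
    if high_impact:
        # High impact: new external instance AND new persistent identifier.
        record["doi"] = next_doi(record["study_number"], record["instance"])
    # Remaining steps (update internal systems, write a new citation file,
    # sync the catalogue) are represented here as a log entry.
    record.setdefault("log", []).append("citation file regenerated; catalogue synced")
    return record

rec = {"study_number": 66, "instance": 1, "doi": "10.5255/UKDA-SN-66-1"}
low = release_new_instance(rec, high_impact=False)
high = release_new_instance(low, high_impact=True)
print(low["doi"])   # 10.5255/UKDA-SN-66-1 (unchanged)
print(high["doi"])  # 10.5255/UKDA-SN-66-3 (new identifier)
```

Keeping the old catalogue record, as the talk notes, is the extra step this sketch omits: each superseded state must remain retrievable as a record of what was present in the past.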
A study number is pointing to a metadata instance; when we get to the stage of using digital object identifiers to point to versions of datasets, we can still use this sequence and this hierarchy. And again, as we have seen this morning, the statistics DataCite provides are hugely helpful and hugely useful: they allow us to see a little bit about how people are coming to us and whether they are coming to us through DataCite, which is about one third.

Measuring impact, and the impact of research, was the topic of the keynote this morning, and what really interests me is how we manage to assess in any way the impact of the service that we are running. How are people using the data to which we are providing access? We are not really carrying out research; we are infrastructure. But our research councils treat us a little bit as if we were research, because they are not always sure precisely what the difference is. This is a relatively straightforward example of just using Google to look at how our data has been cited by others on the open internet. But Google is a bit of a blunt tool, and I would like to find much better ways of assessing whether the DOIs that we issue, which cite data that we hold, are used, because then we can start to look at some of the impact that we are having as a service and start running some metrics on this. This is really the only evidence that we have at the moment. As I said, if we expect researchers to cite data properly, we should be able to mine these DOIs not just from the open internet but also from some of the deeper and more closed parts of it, and this is another initiative where we would like to work with publishers.

So, some challenges for the future. One is looking at lower levels of granularity of data, especially subsets of individual files. At current we are only pointing to metadata; we do not even point to the dataset. But subsets of quantitative data are increasingly important. The figure here is the GDP of Guyana in 1976: that is an identifiable data point held in macro data, and you should be able to cite it. It should not just be "point six" or whatever the number is; you should be able to cite it down at cell level in a database.

We also want to make sure there are clearer relationships between different types of object, and I am really pleased to see the announcement from DataCite which includes movement towards some of these issues: better relationships between research articles, which are held by publishers or institutional repositories, and research outputs which are data, because the data is research output as well. We need to make sure that research outputs which are data are related to other research outputs which are data, especially as we move more towards a reuse culture, where people should be rewarded for reuse rather than just for creation. The relationship between two data sets matters because data sets are so easy to manipulate: I can take somebody else's data and add another variable to it, but I should not get credit for the whole thing. So we need to make sure that the relationships between owner, creator and distributor are a little more sorted out. And, also for our research councils, we want to find a way of making sure that outputs and researchers are better linked together, because there are far too many researchers on this planet to disambiguate them effectively by hand. That is one of the challenges for the future, and it touches on the ideas that Andrew was talking about this morning.
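The cell-level citation idea above (the GDP of Guyana in 1976 as an identifiable data point) could be sketched as a fragment identifier appended to a dataset DOI. There was no standard for this at the time of the talk, so the syntax below is purely invented for illustration, as are the DOI and variable names.

```python
# Hypothetical sketch: compose a citation for a single cell of a macro
# data set by appending a fragment to the dataset's DOI URL. The fragment
# syntax, DOI and variable names are invented for illustration only.

def cite_cell(doi: str, variable: str, **dims: str) -> str:
    """Build a URL naming one data point: a variable plus its dimension values."""
    selector = ";".join(f"{k}={v}" for k, v in sorted(dims.items()))
    return f"https://doi.org/{doi}#{variable};{selector}"

ref = cite_cell("10.5255/UKDA-SN-0000", "GDP", country="Guyana", year="1976")
print(ref)
# https://doi.org/10.5255/UKDA-SN-0000#GDP;country=Guyana;year=1976
```

The design choice here is that the dataset keeps one persistent identifier and the cell is addressed relative to it, so minting a DOI per cell is never needed.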
As we move in the UK into a culture where austerity is important, the cost of looking after data is increasing because there is more human effort involved. We can still look after tons and tons of stuff: it is not the volume in terms of the size of the bits of data, it is the human activity of checking. So we are trying to move towards a model, and it is a bit contradictory to the one presented in the previous session. We want to see institutional repositories doing more. We know that they are not very good at the moment, but they will improve over time, and we think that not only can they look after journals, they can also look after data. I am not saying that institutional repositories should be looking after data for 50 or a hundred years, but they can be the ground where data is placed to be looked after, curated and backed up for 10 or 15 years, so that something like the data usage index we heard about this morning can be applied to it. Then we could say: here are the top 50 or 100 files and data collections which were used in the last decade; let's move those into a specialist repository, using institutional repositories as a form of selection process. But that is the future, and in the meantime, as a specialist data archive, we still need to be able to help users find data.

One solution, again, is what the British Library provides: harvesting through DataCite, and then possibly we could go to the Australian National Library's services as well. But I do not like the idea of users being presented with 150 things going down the side of the page, and I think that using a single API to interact with each metadata store, or with each institutional repository, probably is not going to be effective. We need to try to drag some of the metadata out of these things, put it centrally, and then talk about it later. Doing that in real time is going to be really, really difficult, but it is not an impossibility; it depends how frequently the data changes, and I think users may be happy with results that are slightly out of date, but we never know.
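Pulling metadata centrally, as suggested above, is commonly done over OAI-PMH, the harvesting protocol that DataCite and many institutional repositories expose. The sketch below parses an OAI-PMH `ListRecords` response; the response is an inline sample so that it runs offline, and the record identifier and title in it are invented for illustration. A real harvester would fetch the XML from a repository's OAI endpoint and follow resumption tokens.

```python
# Sketch of one step of a central harvester: extracting identifiers and
# titles from an OAI-PMH ListRecords response (Dublin Core metadata).
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
OAI_DC = "{http://www.openarchives.org/OAI/2.0/oai_dc/}"
DC = "{http://purl.org/dc/elements/1.1/}"

def parse_list_records(xml_text: str) -> list:
    """Return a list of {'identifier': ..., 'title': ...} dicts from a response."""
    root = ET.fromstring(xml_text)
    out = []
    for rec in root.iter(f"{OAI}record"):
        header = rec.find(f"{OAI}header")
        meta = rec.find(f"{OAI}metadata/{OAI_DC}dc")
        out.append({
            "identifier": header.findtext(f"{OAI}identifier"),
            "title": meta.findtext(f"{DC}title") if meta is not None else None,
        })
    return out

SAMPLE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:ukda-sn-0001</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example survey, waves 1-13</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

records = parse_list_records(SAMPLE)
print(records[0]["identifier"])  # oai:example.org:ukda-sn-0001
print(records[0]["title"])       # Example survey, waves 1-13
```

Harvesting like this and indexing the results centrally is what makes a single cross-repository search feasible without querying every repository's API live.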
I think that digital object identifiers can provide some of the glue which holds this whole model together, because they express the relationship between research and data stored across all of the different repositories. The point here is that at some stage in the future we need a way of referring one digital object identifier, one URL, to another in a reasonably permanent way. The other thing that we might need to think about in the future: a speaker who gave a presentation at that meeting last month said that a program is as likely to follow a URL as a person is, and I do not think he is right; I think a program is much more likely to follow a URL. He also asked, and we had a discussion but did not get very far with it, whether there are specific properties missing from the DataCite schema. If there were one specific property missing at the moment which we, as DataCite users, would like to be able to add, it is a pointer to rich metadata, together with a description of the scheme of that rich metadata. In my case this would be a URL, or in fact even a persistent identifier, which would go to my catalogue record, the OAI version of my catalogue record, and the rich metadata scheme would say "DDI". A machine would then be able to whip through DataCite's catalogue, and rather than searching on title, as Andrew's solution this morning did, it could search the whole metadata record. All our metadata records have provenance; they take days to produce, and some of them run to tens of thousands of words. It would be nice to search all of those through a single point.

One last word on raising awareness. We are trying to raise awareness within the UK on this, and we have put on your tables a brochure which is called "Data citation: what you need to know" (what else would it be called?), and it is very short. It has nice pictures, and it is not branded by my organisation; it is branded by the Research Councils, and they did a very fine job, though we will not be using some of the illustrations, because the same illustration is used on the front of their big glossy impact brochure. So they have been really good in helping us with this, and DataCite, the British Library and others have been really good in making sure that it is endorsed.
So hopefully it is correct, and it is what you need to know. I just want to finish by acknowledging some people who have had input into this presentation, and I also want to point out, as I did last week, that IASSIST, the International Association for Social Science Information Services and Technology, has also just published a short guide to data citation which is even shorter than ours and tells you almost as much, so I recommend it to you as a useful adjunct to your data citation outreach activity. Thank you.