
Matthew Woollard, UK Data Archive at the DataCite summer meeting 2012


Formal Metadata

Title
Matthew Woollard, UK Data Archive at the DataCite summer meeting 2012
Subtitle
Persistent identifiers in practice. The UK Data Archive's approach
Series Title
Part
5
Number of Parts
10
Author
License
CC Attribution 3.0 Germany:
You may use, modify, and copy, distribute, and make the work or its content publicly available in unchanged or changed form for any legal purpose, provided that the author/rights holder is credited in the manner they specify.
Identifiers
Publisher
Year of Publication
Language
Producer

Content Metadata

Subject Area
Genre
Transcript: English (automatically generated)
Okay, well, I'm Matthew Woollard, as Brigitte said. I'm the Director of the UK Data Archive, and the UK Data Archive is a subject-specific digital archive. We deal with social and economic data in the main, and the organisation has been running now for 45 years; we were set up in 1967.
And this is really not particularly about legal or ethical issues, it is really a use case of how we've implemented digital object identifiers within the UK Data Archive,
how we've interacted with DataCite, and some ideas about how we might be able to move some of these initiatives forward in the future, and how we might do that with members of DataCite. And I have to say that this is a presentation which I have given once before, in a very near variant to this, at a workshop on persistent identifiers in Berlin last month. It is up to date, and I have tweaked it for this conference, but there are still some broad messages in here which I'm pretty certain everyone is going to be aware of.
I like to think about the reasons behind citing data; it goes back to why you try and measure impact, why you need to cite data. And I think that these six reasons here, including helping to track the impact and use of data collections, are the key drivers, at least within the social sciences. And I think in every presentation you hear about data citation or DataCite or digital object identifiers, these six reasons are usually brought to the fore in
different orders, different rankings, but I think that it's really important that we do continue to recognize some of these principles. It's about the sources, it's about credit, it's about replication, it's about impact,
it's about reliable information about the data. But also, as Andrew pointed out very clearly this morning, it really helps to find and access data. So it's another way of finding it.
Now this is our approach to citation in the past. You're not supposed to be able to read it; if you can read it, your glasses or your contact lenses are much too well focused. We have an end user agreement with our users: when users come to us and they take our data, they have to click at the bottom of this. It's not as big as when you're updating your Mac operating system, I agree, but it is quite long, and tucked in here it says to acknowledge in any publication, whether printed, electronic or broadcast (you can see a lawyer has been involved in this; has anyone ever broadcast some data? The Archaeology Data Service did once have some of their data used in a performance piece, so it is possible), anything based wholly or in part on data collections. And then underneath that it says to supply the relevant data service provider, that's us, with bibliographic details of any published work based wholly or in part on the data collections. This is the old style, but this is still within our end user agreement. And our approach to citation is pretty straightforward: it should provide enough information to ensure that the exact version can be found. You can see it says in here 'second edition', and this has been the way in which we've asked people to cite data since the mid-seventies or so. And it's not an acknowledgement; this is a citation.
We don't like people who cite data as 'I'd like to thank my friend Professor X for letting me use his data', and I do have examples of this, but I've forgotten to bring them with me. It's not an acknowledgement; we want people to cite data.
So obviously we think, and we wouldn't be here otherwise, I wouldn't be here otherwise, we think that the use of persistent identifiers for data should be similar to those that are used for other research outputs. And we take these two terms and break them down, persistent and identifiers.
And we want to make sure that we're actually doing what it is that those things highlight. Persistence must mean enduring, perhaps not for all eternity, but for a very long time and at some stage in the future that we can't necessarily comprehend. And identifiers must be unique.
In the case of a persistent identifier, it should be unique. It sounds pretty straightforward, but it's not actually always put into practice when you're thinking about the citation and the allocation of persistent identifiers to stuff. And I think the other area that we've had to think about a lot is the fact that the digital object, whatever it is that is being cited, needs to be clearly defined, in order to ensure the appropriate granularity of the identifier being given to it.
And as we heard this morning, this is not just about citing data or citing research outputs; persistent identifiers of a sort can be used to identify individuals, researchers, taxonomies, and all sorts of other things as well. And it's wrong to think this is only going to be of use for research outputs of one sort or another. So we went through a reasonably long process in order to try and implement digital object identifiers, because our data collections in themselves aren't just digital objects, they're collections of digital objects, and many of the older ones still include physical manifestations, codebooks on paper. Not all of them have been digitized. Also, because we run an organization that is making data available in a usable
form to the end user, we make changes to data. And this can happen quite frequently. It's not just on the ingest process when we discover that a statistical organization within the UK has given us some personally identifying information that we should leave
out of these data. So we would remove it. So the data has changed. The persistent identifier is no longer persistent. We have to think about the way in which data is versioned, and we need to try and do this in a commonly understood manner, which deals not only with editorial changes, but
also with changes in terms of a longitudinal data collection exercise, where data is added in waves to data sets over time. We wanted to make sure that our understanding of versioning was rule-based, but human-mediated,
so that we could have a little bit of flexibility over what a significant or a high-impact change was. And we also wanted to make sure that we were able to implement some of these things in a machine-actionable way, because the huge cost of digital preservation of data
archiving is human, and wherever we can get humans out of the workflow or reduce them in the workflow, the better it is. So we wanted to integrate the processes with digital preservation activities, but we also wanted to make sure that they worked within our current infrastructure and workflows, and we wanted to get it right first time.
So this is important. Around 15% of the things we ingest in any given year (and that's about 200 data collections every year) are changed within the first year. Some of them are new editions, where there are these types of changes, but there are also changes to underlying metadata, which are slightly lower-impact changes.
So what we've done is we've said we've got high-impact changes and we've got low-impact changes. And we also recognize that the majority, I say the majority, of social science users want the most recent version; they don't want the old version. Most social scientists are not using data to replicate or validate other people's research.
They're using it for their own research. So we made the decision that users will have older versions made available to them, and information about those older versions should be available, but they have to ask for it. In most cases we can go back through our systems and retrieve older versions, but we can't make them available to the user instantly. And then we've got this raft of low-impact changes: a change in a reference, the spelling of a variable, the removal of administrative information, metadata spelling corrections, adding index terms, adding documentation, or making a change to access conditions. I should say that nearly 10 percent of our collection is now in some way restricted at a sensitive level. We're holding sensitive data for government, and about 60 percent of it has another access condition on it, which means that you can't just come along and take it away. So about 30 percent of our collection is open to anyone on registration, but a considerable
proportion has an access condition that means you can't just come along and take it. And then we decided to codify some of the high-impact changes, and you'll see that these are new variables, new codes, new weighting, data which was miscoded, changes in file formats, and significant changes in access conditions. And again, 'significant' is a problematic word: what do we mean, can we measure significance? If it's a change in access conditions from closed to open, then that's a minor change. If it's a change from open to restricted, that's a major change. And what we did is we started to take these ideas and turn them into a straightforward workflow, and we decided that we would think about instances. You can see that we have three different types of change. We have an internal change during the ingest process, which is something that we do, nobody knows about, and it's not released publicly; that's an internal instance. If there's a low-impact change, then we have a new external instance with the same persistent identifier, and if we have a high-impact change, we have a new external instance and a new persistent identifier. So that's the methodology.
So then, you all know about DataCite, but last year we started working with the British Library and DataCite to try and allocate digital object identifiers to our 6,000-odd collections, and you know the rest of that. In discussion with DataCite and the British Library, we came to the opinion that it would be better to allocate the identifier to core metadata. We originally thought it would be better to allocate it to the title, but even the titles of some of our studies change. Adding a new wave to a longitudinal survey, there's a new title, for goodness sake, so that's not persistent enough. So we did work on this basis of allocating a DOI to the metadata which relates to each external instance (I described those a moment ago) of the data collection. And the digital object identifiers resolve to a jump page, which points to all of the
external instances, and that should look like this. So this is time period 1: the user comes in, they're looking at the survey waves 1 to 13, and it has a digital object identifier which shows the study number (our collection study number), and that points to instance-specific data and metadata. The user comes in at time period 2: the DOI has changed, the title has changed, and there's still a pointer to the new current instance-specific data and metadata. And if we just go to the current time, and a user comes in at time period 3, then the instance-specific data and metadata is live, but only for that time period. So the user sees the other stuff, but they're not able to get the data from the earlier
versions. And that's what it looks like in our catalogue. Reasonably straightforward: there's the citation, and this is version 2 of a data set. In fact it's not version 2, it's version 12, but it's the second new version since DOIs were introduced, and then there's version 3 above. And on the right-hand side, there would be a pointer to the catalogue record where the user would be able to get hold of the data.
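To make the jump-page idea concrete, here is a toy model with hypothetical names and DOIs; the real catalogue pages are, of course, richer than this:

```python
# Toy model of the jump page: every external instance keeps its own DOI and
# title, but data is served only for the current instance; earlier instances
# stay listed so that citations to them still resolve to something.
from dataclasses import dataclass

@dataclass
class ExternalInstance:
    doi: str
    title: str
    current: bool  # only the current instance has live data

def render_jump_page(instances: list[ExternalInstance]) -> str:
    lines = ["External instances of this data collection:"]
    for inst in instances:
        access = "data + metadata" if inst.current else "metadata only"
        lines.append(f"  {inst.doi}  {inst.title}  [{access}]")
    return "\n".join(lines)

print(render_jump_page([
    ExternalInstance("10.5255/UKDA-SN-1234-1", "Survey, waves 1-13", False),
    ExternalInstance("10.5255/UKDA-SN-1234-2", "Survey, waves 1-14", True),
]))
```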
So, the process that we went through: there was a question this morning about how you actually do this. Well, it's these six or seven steps that are needed. We mint a new DOI through DataCite, we update the changelog within our systems, we create a new citation file that zips into the catalogue, and Bob's your uncle. If we have to update the catalogue record, it's a little bit trickier, because we have to keep the old catalogue record as a record of what was present in the past.
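For flavour, here is a minimal sketch of the minting step against DataCite's current REST API (the 2012 workflow used the earlier MDS service); the credentials and metadata values are placeholders:

```python
# Sketch of minting a DOI via the DataCite REST API. All values below are
# placeholders, and error handling is kept to the bare minimum.
import json
import requests

def mint_doi(doi: str, jump_page_url: str, title: str, year: int,
             auth=("REPOSITORY_ID", "PASSWORD")) -> dict:
    payload = {
        "data": {
            "type": "dois",
            "attributes": {
                "doi": doi,
                "event": "publish",            # register and make findable
                "url": jump_page_url,          # where the DOI resolves to
                "titles": [{"title": title}],
                "creators": [{"name": "UK Data Archive"}],
                "publisher": "UK Data Archive",
                "publicationYear": year,
                "types": {"resourceTypeGeneral": "Dataset"},
            },
        }
    }
    response = requests.post(
        "https://api.datacite.org/dois",
        data=json.dumps(payload),
        headers={"Content-Type": "application/vnd.api+json"},
        auth=auth,
    )
    response.raise_for_status()
    return response.json()

# The remaining steps (updating the changelog, writing the citation file
# into the catalogue) happen in the archive's own systems.
```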
And since these things are not always entirely clear to everyone, and there isn't actually a lot to be said about what you should be putting after your publisher's slash, we thought that we would keep an archive-readable identifier, the UKDA, within the DOI. We're also putting in SN, and for me, I know that means study number, but it also means that we could reuse or use very similar numbers to define things at a sub-collection level. So the study number is pointing to a metadata instance. When we get to the stage of using digital object identifiers to point to the data set, or to versions of the data set, we can still use this sequence and this hierarchy of format.
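As an illustration of that suffix hierarchy, here is a small hypothetical helper; the prefix and exact pattern are assumptions for the sketch, not a specification:

```python
# Illustrative composition of an archive-readable DOI suffix of the form
# PREFIX/UKDA-SN-<study>[-<sub-collection>]-<version>. Hypothetical helper.
from typing import Optional

def make_doi(prefix: str, study_number: int, version: int,
             sub_collection: Optional[str] = None) -> str:
    suffix = f"UKDA-SN-{study_number}"
    if sub_collection is not None:
        # Room to grow: the same scheme can identify sub-collection objects.
        suffix += f"-{sub_collection}"
    return f"{prefix}/{suffix}-{version}"

print(make_doi("10.5255", 1234, 2))           # 10.5255/UKDA-SN-1234-2
print(make_doi("10.5255", 1234, 1, "file3"))  # 10.5255/UKDA-SN-1234-file3-1
```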
And, well, again, we've seen that this morning. This is a hugely helpful and hugely useful thing that DataCite have provided.
It allows us to be able to see a little bit about how people are coming to us and whether they're coming to us through DataCite. So measuring impact and the impact of research
was the topic of the keynote this morning. And what really interested me, or what really interests me, is how do we manage to assess in any way the impact of the service that we're running? How are people using the data that we are
providing access to? Because we're not really carrying out research. We are infrastructure. But our research councils treat us a little bit as though we are research, because they're not always sure precisely what the difference is. And this is a relatively straightforward example just of using Google to be able to look at how our data has been cited by others
and available on the Internet. But Google is a bit of a blunt tool. And I'd like to find much better ways of assessing how the DOIs that we issue, which cite data which we hold, are used, because then we can start to have a look at some of the impact that we're having as a service. We can start running some bibliometrics on this, because this is the only evidence that we have.
Really, now, the only evidence that we have. As I said, if we expect researchers to cite data properly, we should be able to mine these DOIs not just from the open Internet, but also from some of the deeper and more closed parts of the Internet.
This is another initiative where we'd like to work with some publishers. So our challenges for the future are looking at lower levels of granularity of data, especially about subsets of individual files. Again, we're currently only pointing to metadata.
We don't even point to a dataset. But subsets of quantitative data are increasingly important. For example, this figure here is the GDP of Guyana in 1976. That's an identifiable data point held in macro data.
You should be able to cite it. It shouldn't just be 4.6 or whatever the number is. You should be able to cite down at a cell level in a database. We also want to make sure that there are clearer relationships between different types of object. Again, I'm really pleased to see the announcement from DataCite, which includes movements towards some of these issues: having better relationships between research articles, which are held by publishers or institutional repositories, and research inputs, the data. But data is also a research output as well.
And we need to make sure that research outputs, which are data, are related to other research outputs, which are data, especially as we've moved more to a reuse culture where people should be rewarded for reuse rather than just rewarded for creation.
The relationship between two datasets matters because they're so easy to manipulate: I can take somebody else's data, add another variable to it, and publish it. Well, I shouldn't get credit for the whole thing. So we need to make sure that the interrelationship between owner and creator and distributor
is a little bit more sorted out. And also for our research councils, we want to try and find a way of making sure that outputs and researchers are better linked together because there are too many researchers on this planet to effectively disambiguate them manually. So one of the challenges for the future,
and this is touching on the idea that Andrew, or the implementation that Andrew was talking about this morning, and we feel that as we're moving in the UK into a culture where austerity is important and the cost of looking after data is increasing because there is
much more human effort involved, we can still look after tons and tons of stuff. It's not the volume in terms of the size of the bits of data, it is the human activity for checking. So we're trying to move towards a model,
and it's a bit contradictory to the one that Efka presented in the previous session. We want to see institutional repositories doing more. We know that they're not very good at it at the moment, but they will improve. They will improve over time. And we think that not only can they look
after the journals, but they can also look after the data. Now, I'm not saying that the institutional repositories should be looking after data for 50 or 100 years, but they should be the ground where data can be placed to be looked after (not curated, but looked after and backed up) for 10 or 15 years, so that something like the data use index we heard about this morning can be applied to it. And we can say, well, these are the top 50 or 100 thousand data collections which were used in the last decade; let's move those into the specialist data archives' repositories, and use the institutional repositories as some form of selection process. But that's for the future. And in the meantime, as a specialist data archive, we still need to be able to find, or let our users find, more data.
And there's the solution, again, that Andrew provides of going through DataCite, and then possibly we could go to the Australian National Library's digital, and possibly, well, I don't like the idea, and neither does he, of the 150 things going down the side of the page.
So I do think that using a single API to interact with either a metadata store or with an institutional repository probably isn't going to be effective. We need to try and drag some of the data out of these things, put it centrally, and then attack that data. And that's going to, in real time, that's
going to be really, really difficult. But it's not an impossibility, and it depends how frequently data changes. And I think that users are probably going to be happy with today rather than this second. But we never know.
And I think, sorry, I just think that digital object identifiers can provide some of the glue which holds this model together, because there's the relationship between research outputs and data being stored in all of the different repositories.
So the point here is that at some stage in the future, we need a way of referring one digital object identifier or one URL to another in a reasonably permanent way. I think the other thing that we might need to think about in the future: Achim Wackerow gave a presentation at this meeting last month, and he said a program is as likely to follow a URL as a person. And I don't think he's right. I think a program is much more likely to follow a URL than a person. So it posed the question (he asked it, and we did have a discussion, but we didn't get much towards an answer): is there a specific property missing from DataCite? And if there were one specific property missing at the moment which we as data archivists would like to be able to add, it is a pointer to rich metadata and a description of the rich metadata scheme. So in my case, this would have a URL, or in fact even a persistent identifier, which would go to my catalogue record or an OAI version of my catalogue record; the rich metadata scheme would say DDI, and a machine would be able to whip through DataCite's catalogue and, rather than searching on title as Andrew's solution this morning did, search on the whole of the metadata records.
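Here is a rough sketch of what that machine path could look like, assuming the repository exposes its records over OAI-PMH; the endpoint and identifier are placeholders:

```python
# Hypothetical harvest of a rich (DDI) metadata record over OAI-PMH, the kind
# of machine-actionable pointer proposed above. The endpoint and identifier
# are made up for illustration.
import requests

def fetch_rich_metadata(endpoint: str, identifier: str,
                        scheme: str = "ddi") -> str:
    params = {
        "verb": "GetRecord",       # standard OAI-PMH request
        "identifier": identifier,
        "metadataPrefix": scheme,  # the scheme named in the DataCite record
    }
    response = requests.get(endpoint, params=params)
    response.raise_for_status()
    return response.text           # the full catalogue record as XML

# e.g. fetch_rich_metadata("https://example.org/oai", "oai:example.org:sn-1234")
```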
Now, our metadata records are handcrafted; they take days to produce, and there are tens of thousands of words in some of them. It would be nice to search all of those through a single portal. One minute on raising awareness: we've been trying to do a lot of awareness-raising within the UK on this. I've put on your tables a brochure which is called Data Citation (what else would it be called?): What You Need to Know. It's very short, and, well, it has sort of pictures. It's not branded by my organisation, it's branded by the Research Council, and they did a very fine job of reusing some illustrations, because this is the same illustration they use on the front of their big glossy how-to-do-impact brochure. So the ESRC have been really good in helping us with this, and DataCite and the British Library and others have been really good in making sure that they've endorsed it, so hopefully
it is all correct, and it is what you need to know. I just want to finish up by acknowledging some people who've had input into the presentation, and I also want to point out, because I saw it last week, that IASSIST, which is the International Association of Social Science something, something, something, something,
have also just published a short and pithy guide to data citation, which is even shorter than this, and tells you almost as much, so I'd recommend that to you as well.
As a useful adjunct to your outreach activities. Thank you.