
Assessing the FAIRness of Data Sets in Trustworthy Data Repositories


Formal Metadata

Title
Assessing the FAIRness of Data Sets in Trustworthy Data Repositories
Series Title
Number of Parts
14
Author
License
CC Attribution 3.0 Germany:
You may use, modify, copy, distribute and make the work or its content publicly accessible, in unchanged or modified form, for any legal purpose, provided you credit the author/rights holder in the manner specified by them.
Identifiers
Publisher
Year of Publication
Language

Content Metadata

Subject Area
Genre
Abstract
Research funding in recent years often comes with the condition to make the resulting data openly available. Just opening up research data is not enough: the data should also be of sufficient quality. In this presentation, I will propose criteria and methods to assess the quality of data sets. Part of the quality assurance can be guaranteed by digital repositories that archive and provide access to data. I will argue that the certification criteria for digital archives and the FAIR data principles for data sets provide a good basis for guaranteeing the quality or "fitness for use" of research data sets. The core certification offered by the Data Seal of Approval (DSA) and the World Data System (WDS) for data repositories, in combination with the FAIR data principles, gets as close as possible to providing quality criteria for research data. It does this not by trying to make value judgements about the content of datasets, but by qualifying their fitness for reuse in an impartial and measurable way. By bringing the ideas of the DSA/WDS and FAIR together, we will be able to offer an operationalization that can be implemented in any certified Trustworthy Digital Repository. In 2014 the FAIR Guiding Principles (Findable, Accessible, Interoperable and Reusable) were formulated. The well-chosen FAIR acronym is highly attractive: it is one of those ideas that almost automatically gets stuck in your mind once you have heard it. Within a relatively short time, the FAIR data principles have been adopted by many stakeholder groups, including research funders. The FAIR principles are remarkably similar to the underlying principles of the DSA, which date back to 2005: these specify that the data can be found on the internet, are accessible (having clear rights and licenses), are in a usable format, are reliable, and are identified in a unique and persistent way so that they can be referred to. Essentially, the DSA presents quality criteria for digital repositories, whereas the FAIR principles target individual datasets.
Transcript: English (automatically generated)
Thank you. My name is Peter Doorn. I'm Director at DANS. Yesterday I already briefly introduced my poster, and just a while ago I was talking about it with a couple of people. My presentation and my poster are actually very closely connected.
So if you didn't have the chance to hear about the poster, you still cannot escape hearing something about FAIR data. So first of all, what is FAIR data and why is it important? Well, probably many of you are aware of the fact that open science has become a priority,
and that not only publications need to be openly accessible, but also other research outputs, especially research data. We at DANS have the principle that data should be as open as possible, but there can also be good reasons not to make data completely open.
And perhaps this is part of the reason why FAIR became so popular in such a brief time: many researchers have some reservations about the openness of the research data that they worked hard to collect
and that they want to publish on. Data has sometimes been called the gold or the oil of science, and for many researchers it really is like their gold: it is the stuff that they work on, that they extract knowledge from and that they want to publish about.
So openness by itself is not enough. Data must also be FAIR: findable, accessible, interoperable and reusable. And part of the question that I try to answer is: do the FAIR principles apply only to research data?
Well, grey literature seems to incorporate data as well, even video, as one of the other posters was claiming. Grey literature seems to embody almost the whole universe of the internet.
We will see whether the FAIR principles can also apply to grey literature. What happened after the very well chosen acronym was selected and the FAIR principles were defined, about two and a half years ago, is that everybody started to love FAIR, actually much more than open.
Many funding organizations started to use it and to require that researchers make their data not only open but also FAIR. But although the FAIR principles were defined quite well, what was not defined is how we could measure FAIR.
And actually there are a couple of initiatives that try to operationalize those FAIR criteria: they sound so good, but what do we actually mean by them, and how can we measure them? One of the groups that I'm personally involved in is the GO FAIR metrics group.
They try to develop a broadly usable framework and tool for the assessment of FAIR data: how can we measure which data are FAIR and which are not? There is no time to go into the details here, but this is ongoing work.
At DANS we started thinking about how we could apply those FAIR criteria to the roughly 40,000 datasets that we have in our archive. Would it be possible to use the FAIR criteria as a kind of proxy for data quality?
Well, quality is a very difficult subject, whether we are talking about data or about coffee or cookies or whatever. If we talk about cookies, it is ultimately about the taste of the cookie, and probably how fresh it is, that kind of thing.
But when you try to measure the quality of a cookie, the taste is very difficult to measure. So instead we measure how the production process took place. And something similar is going on with the FAIR principles: they do not measure data quality in themselves, but they say something about how the data was produced, kept and made available.
And now I must say that about ten years before the FAIR principles, DANS was the cradle of the Data Seal of Approval. When we were set up in 2005, we were required to give criteria for the quality of a data repository.
That is also very hard to do. But we specified a couple of characteristics or principles: it should be possible to find the data, they should be accessible, they should be in a usable format, they should be reliable,
and it should be possible to refer to them in a stable way. There is a remarkable similarity between those principles of the Data Seal of Approval and the FAIR principles. The DSA principles (DSA is short for Data Seal of Approval) apply primarily to data repositories, whereas the FAIR principles were set up to apply to datasets, to data objects.
The match is not perfect, and I will not go into that, but they are fairly similar nevertheless. Note, by the way, that citeability is actually not in FAIR. Well, it is not a separate letter.
But I suspect that if the inventors of FAIR had included the C as well, they would never have had such a nice acronym. So what they did is put the C of citeable under the F of findable. To a certain degree, all datasets that are in a trusted digital repository with the Data Seal of Approval
are already basically FAIR, because the repository takes care of the fact that there is metadata, that there is an identifier, that there is a license, and a couple of other things.
But still, there are differences in the way individual datasets are described within a repository. One has more documentation than another; one has a more open license than another; and so on. So there is also a difference in their rating of FAIRness.
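To make that baseline concrete, here is a minimal sketch, not DANS code, of the "basically FAIR" floor that a certified repository guarantees for every dataset; the record fields and example values are assumptions for illustration:

```python
# A minimal sketch (not DANS code) of the baseline a certified repository
# guarantees for every dataset: a persistent identifier, some descriptive
# metadata and a license. Field names and values are invented.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DatasetRecord:
    pid: Optional[str] = None        # persistent identifier, e.g. a DOI or URN
    metadata: dict = field(default_factory=dict)
    license: Optional[str] = None    # e.g. "CC0", "CC-BY", "restricted"

def meets_fair_baseline(record: DatasetRecord) -> bool:
    """True if the record has the minimum that certification guarantees."""
    return bool(record.pid) and bool(record.metadata) and bool(record.license)

record = DatasetRecord(pid="urn:nbn:nl:ui:13-example",
                       metadata={"title": "Excavation photos"},
                       license="CC-BY")
print(meets_fair_baseline(record))   # True; the FAIRness rating can still vary
```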
Well, Dominique asked me to say something about the CoreTrustSeal as well, which is on the right-hand side. But as I'm giving the short version of my slides, I'm afraid there is not a separate slide on the CoreTrustSeal. I will simply say, for the sake of Dominique at least, that the Data Seal of Approval
and the accreditation system of the World Data System, which was another already existing certification system for a group of repositories, mainly in earth and climate science, climatology, sea science and so on, decided to work together.
And the joint requirements of the Data Seal of Approval and of the World Data System are the CoreTrustSeal. So that is in a way the successor of both. Together they are much stronger than individually, and they cover many more repositories.
There are now a couple of hundred repositories around the world that are certified either by the DSA or by the World Data System, and in the future they will all get the CoreTrustSeal. Now back to the FAIR scheme. The idea was that you should be able to rate datasets.
And we would like to see the FAIR criteria as a proxy for quality. But in order to rate datasets according to the F, the A, the I and the R, those criteria should be independent. Well, there were a couple of difficulties: as we saw, in the original definition of the FAIR principles
they were not entirely independent. And we also had some difficulty, especially with the R of reusable, because reuse is very hard to define in any objective way.
Reuse ultimately depends on what you want to do with the data. Whether a dataset is useful to me depends on what I want to do; if you want to do something else, the same dataset may not be useful to you at all. So ultimately we dropped the R and shifted some of its characteristics
to the other letters, so that we only had to rate the F, the A and the I. The number of stars indicates the level of compliance. There is no time to go into the details of the criteria that we defined, but the slides are available in the brochure, so you can read them for yourself.
And I must say they are not final yet. They are a prototype, and we are still working on refining the exact specifications as I'm talking. But in the meantime, what is available is a prototype of a FAIR data assessment tool that anyone can try out.
The link is over there. It works like an online questionnaire: you get a couple of questions concerning the findability, the accessibility and the interoperability of the data, and for every little set of questions you get a score.
So, for instance, four stars for findable here. It will ultimately result in this badge. The FAIR data assessment tool is an independent website, and repositories will be able (this is just a mock-up, it hasn't been realized yet)
to extract the badge and display it next to the list of holdings that they have, so that any user can immediately see: oh, this dataset is probably better documented than that one, because the F score is higher; or: there are limitations to the accessibility here, because the A score is lower; and so on.
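To make the scoring idea concrete, here is a minimal sketch of how yes/no questionnaire answers per letter could become star scores and a badge. The questions, the equal weighting and the five-star scale are assumptions for illustration; the actual criteria are, as said above, still being refined:

```python
# Hypothetical sketch of a questionnaire-style assessment: each of F, A and I
# gets a handful of yes/no questions, and the share of "yes" answers is mapped
# to a 0-5 star score. R is deliberately not scored, as explained in the talk.
def stars(answers: list) -> int:
    """Map the fraction of positive answers to a 0-5 star score."""
    return round(5 * sum(answers) / len(answers))

assessment = {                        # invented example answers
    "F": [True, True, True, True],    # e.g. PID present, rich metadata, ...
    "A": [True, True, False],         # e.g. open license, direct access, ...
    "I": [True, False, False],        # e.g. preferred format, standard vocabulary, ...
}

scores = {letter: stars(answers) for letter, answers in assessment.items()}
badge = "  ".join(f"{letter}: {'★' * n}{'☆' * (5 - n)}" for letter, n in scores.items())
print(badge)   # F: ★★★★★  A: ★★★☆☆  I: ★★☆☆☆
```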
So we are currently in the process of testing the prototype and refining the system. And now one of the last slides (I think I have two more), about a question that has come up earlier:
what actually is data, and what is grey literature? For this I took a very pragmatic approach and simply looked at the characteristics of the roughly four million data files that we have in our repository. You see here a distribution of the top 20 file formats, and it is actually quite striking that the largest category is 0.6 million images;
many of those are from archaeological datasets that have images of the finds: photos, but also drawings and so on. The second category is PDF files; many of those are reports and the like.
The third is TIFF files, another image format. And only the fourth is CSV, comma-separated values, a very common, let us say software-independent, format for tabular data.
So that is perhaps the first format that you would consider to be real data. But I must say, an archaeologist who doesn't have his images, or who doesn't have his GIS file with the plan of the site that he has been excavating,
would be nowhere. That is the central data for him. But then the question arises, of course: what is the interoperability of a PDF file? How do we measure the interoperability of an image? Is that even possible? And wouldn't that be completely different from the interoperability of a structured data file?
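One crude way to start on that question is to treat the file format itself as a proxy for interoperability. Below is a minimal sketch along those lines; the list of preferred, software-independent formats is invented for illustration and is not the DANS list:

```python
# Hypothetical sketch: tally the file formats in a set of holdings and flag
# which ones are on a (made-up) list of software-independent formats, as a
# crude proxy for the interoperability of the files.
from collections import Counter
from pathlib import PurePath

PREFERRED = {".csv", ".txt", ".xml", ".json"}   # assumption, not the DANS list

files = ["find_0001.jpg", "report.pdf", "plan.tif",
         "measurements.csv", "site.gml", "notes.txt"]

counts = Counter(PurePath(name).suffix.lower() for name in files)
for ext, n in counts.most_common():
    label = "preferred" if ext in PREFERRED else "format-dependent"
    print(f"{ext:8} {n:3}  {label}")
```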
I do not yet have an answer to this question, but perhaps you can help me. As for grey literature, I think a lot of the material will also be in the non-data category,
and especially interoperability will be quite a difficult thing to measure. Now, as I said, I'm really at the end of my talk, except for the thank-you. Because the FAIR concept has become so popular in the data world, there are a couple of different approaches going on at the moment, around the globe and in different communities, on how to measure it.
And we don't all agree. The DANS approach is very pragmatic, simple and straightforward, I would say. But the original inventors of FAIR, now united in the GO FAIR metrics group, have a much higher ambition.
So what I would like to propose is that, instead of starting to fight in silos and disagreeing about the nitty-gritty of the details, we allow for some kind of a FAIR framework, in which we have at the highest level the GO FAIR metrics, which really try to take a fully automatic, machine-based approach to measuring FAIRness,
and alongside it the rather straightforward, simple DANS approach of a questionnaire, asking users, but also data specialists working in repositories, to give their scores.
And that was my talk. Thank you very much.