Assessing the FAIRness of Data Sets in Trustworthy Data Repositories
Formal Metadata
Title: Assessing the FAIRness of Data Sets in Trustworthy Data Repositories
Number of Parts: 14
License: CC Attribution 3.0 Germany: You may use and modify the work or its content for any legal purpose, and reproduce, distribute and make it publicly accessible in unchanged or changed form, provided that you name the author/rights holder in the manner specified by them.
Identifiers: 10.5446/37255 (DOI)
Language: English
Transcript: English (automatically generated)
00:03
Thank you. My name is Peter Doorn. I'm director at DANS. Yesterday I already briefly introduced my poster, and just a while ago I was talking about it with a couple of people. My presentation and my poster are actually very closely connected.
00:23
So if you didn't have the chance to hear about the poster, you still cannot escape hearing something about FAIR data. First of all, what is FAIR data and why is it important? Probably many of you are aware that open science has become a priority
00:46
and that not only publications need to be openly accessible, but also other research outputs, especially research data. We at DANS have the principle that data should be as open as possible, but there can also be good reasons not to make data completely open.
01:06
And perhaps this is part of the reason why FAIR became so popular in such a brief time: many researchers also have some reservations about the openness of the research data that they worked hard to collect
01:21
and that they want to publish on. Data has sometimes been called the gold or the oil of science, and for many researchers it really is like their gold: it is the stuff that they work on, that they extract knowledge from, and that they want to publish about.
01:42
So openness itself is not enough. Data must also be FAIR: findable, accessible, interoperable and reusable. And part of the question that I tried to answer is: can the FAIR principles, which apply to research data, also apply to grey literature?
02:01
Well, grey literature seems to incorporate data as well, even video, as one of the other posters was also claiming. Grey literature seems to embody almost the whole universe of the internet.
02:20
We will see. What happened after the very well chosen acronym was selected and the FAIR principles were defined, about two or two and a half years ago, is that everybody started to love FAIR, actually much more than open.
02:43
Many funding organizations started to use it and to require that researchers make their data not only open but also FAIR. But although the FAIR principles were defined quite well, what was not defined is how we could measure FAIRness.
03:08
And actually there are a couple of initiatives trying to operationalize those FAIR criteria. They sound very good, but what do we actually mean by them, and how can we measure them? One of the groups that I'm personally involved in is the GO FAIR metrics group.
03:25
They try to develop a broadly usable framework and tool for the assessment of FAIR data: how can we measure which data are FAIR and which are not? There is no time to go into the details here, but this is ongoing work.
03:44
At DANS we started thinking about how we can use those FAIR criteria for all of the roughly 40,000 datasets that we have in our archive. Would it be possible to use the FAIR criteria as a kind of proxy for data quality?
04:03
Well, quality is a very difficult subject, whether we are talking about data or about coffee or cookies or whatever. If we talk about cookies, it is ultimately about the taste of the cookie and probably how fresh it is, that kind of thing.
04:21
But when you try to measure the quality of a cookie, the taste is very difficult to measure, so instead we measure how the production process took place. Something similar is going on with the FAIR principles: they do not measure data quality in themselves, but they say something about how the data was produced, kept and made available.
04:43
Now I must say that about 10 years before the FAIR principles, DANS was the cradle of the Data Seal of Approval. When we were set up in 2005, we were required to give criteria for the quality of a data repository.
05:03
Also very hard to do. But we specified a couple of characteristics or principles: you should be able to find the data, the data should be accessible, they should be in a usable format, they should be reliable,
05:20
and it should be possible to refer to them in a stable way. There is a remarkable similarity between those principles of the Data Seal of Approval and the FAIR principles. The DSA principles, DSA being short for Data Seal of Approval, apply primarily to data repositories, whereas the FAIR principles were set up to apply to data sets, data objects.
05:45
The match is not perfect, and I will not go into that either, but they are fairly similar nevertheless. Note, by the way, that citability is actually not in FAIR. Well, it's not a separate letter.
06:00
But I suspect the inventors of FAIR realized that if you had the C as well, you would never have such a nice acronym. So what they did is put the C of citable under the F of findable. Well, to a certain degree, all data sets that are in a trusted digital repository with the Data Seal of Approval
06:27
are already basically FAIR, because the repository takes care that there is metadata, that there is an identifier, that there is a license, and a couple of other things more.
06:41
But still, there are differences in the way in which individual data sets are described within a repository. One has more documentation than another; one has a more open license than another; and so on. So there is also a difference in the rating of FAIRness.
07:03
Well, Dominique asked me to also say something about the CoreTrustSeal, which is on the right-hand side. But as I now have the short version of my slides, I'm afraid there is no separate slide on the CoreTrustSeal. I will simply say, for the sake of Dominique at least, that the Data Seal of Approval
07:24
and the accreditation system of the World Data System, which was another already existing system for a club of repositories, mainly in earth and climate science, climatology, sea science and so on, decided to work together.
07:44
The joint requirements of the Data Seal of Approval and of the World Data System are the CoreTrustSeal. So that is in a way the successor of both. Together they are much stronger than individually, and they cover many more repositories.
08:01
There are now a couple of hundred repositories around the world that are certified either by the DSA or by the World Data System, and in the future they will all get the CoreTrustSeal. Now back to the FAIR scheme. The idea was that you should be able to rate data sets.
08:25
We would like to see the FAIR criteria as a proxy for quality. But in order to rate data sets according to the F, the A, the I and the R, these should be independent. Well, there were a couple of difficulties, as we saw, in the original definition of the FAIR principles.
08:46
They were not entirely independent. And we also had some difficulty especially with the R of reusable, because reuse is very hard to define in any objective way.
09:02
Reuse ultimately depends on what you want to do with the data. Whether a data set is useful for me depends on what I want to do; if you want to do something else, the same data set may not be useful to you at all. So ultimately we dropped the R and shifted some of the characteristics from the R
09:21
to the other letters, so that we only had to rate the F, the A and the I. The number of stars indicates the level of compliance. There is no time to go into the details of the criteria that we defined, but the slides are available in the brochure, so you can read them for yourself.
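(To make the idea concrete, here is a minimal Python sketch of how such a star-based rating per letter could be computed. The individual checks and their grouping under F, A and I are invented for illustration; they are not the actual DANS criteria, which were still being refined at the time of the talk.)

    def stars(checks):
        """Each satisfied check earns one star for its letter."""
        return sum(checks)

    # Invented example answers for one hypothetical dataset.
    f_checks = [True,   # has a persistent identifier (e.g. a DOI)
                True,   # has rich descriptive metadata
                True,   # metadata is in a searchable catalogue
                True]   # has citation information (the C folded under F)
    a_checks = [True,   # has an explicit license
                True,   # access conditions are documented
                False]  # openly downloadable without restrictions
    i_checks = [True,   # uses preferred, software-independent file formats
                False]  # uses community vocabularies or standards

    print(f"F: {stars(f_checks)}/4  A: {stars(a_checks)}/3  I: {stars(i_checks)}/2")
    # prints: F: 4/4  A: 2/3  I: 1/2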
09:44
And I must say they are not final yet; they are a prototype, and we are still working on refining the exact specifications as I'm talking. But in the meantime, what is available is the prototype of a FAIR data assessment tool that anyone can try out.
10:05
The link is over there. It works like an online questionnaire: you get a couple of questions concerning the findability, the accessibility and the interoperability of the data, and for every little set of questions you get a score.
10:21
So, for instance, four stars for findable here. This will ultimately result in a badge. The FAIR data assessment tool will be an independent website, and repositories, these are just mock-ups, this hasn't been realized yet, will be
10:41
able to extract the badge and display it next to their list of holdings, so that any user can immediately see: oh, this dataset is probably better documented than that one because the F score is higher, or there are limitations to the accessibility here because the A score is lower, and so on.
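(A small hypothetical mock-up in Python of what such a per-letter badge could look like as text; the five-star scale and the rendering are invented for illustration, not taken from the actual tool.)

    def badge(f, a, i, scale=5):
        """Render per-letter star scores as a one-line text badge."""
        def bar(score):
            return "★" * score + "☆" * (scale - score)
        return f"F {bar(f)}  A {bar(a)}  I {bar(i)}"

    print(badge(4, 2, 3))
    # prints: F ★★★★☆  A ★★☆☆☆  I ★★★☆☆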
11:06
So we're currently in the process of testing the prototype and refining the system. And now one of the last slides, I think I have two more, about a question that has come up earlier.
11:22
What is actually data, and also, what is grey literature? For this I took a very pragmatic approach and just looked at the characteristics of the roughly four million data files that we have in our repository. You see here a distribution of the top 20 file formats, and it is actually quite striking that images top the list with 0.6 million files;
11:49
many of those are from archaeological datasets that have images of the finds: photos, but also drawings and so on. The second format is PDF files; many of those are reports and the like.
12:03
The third is TIFF files, another image format. And only the fourth is CSV, comma-separated values, a very common, let us say software-independent, format for tabular data.
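(As an aside, tallying such a format distribution is straightforward. Here is a minimal Python sketch, assuming the archive's files are reachable under a local directory; "/data/archive" is an invented path, and this is of course a simplification of how a repository like DANS would gather these statistics.)

    from collections import Counter
    from pathlib import Path

    def format_distribution(root, top=20):
        """Count files per lower-cased extension under `root`."""
        counts = Counter(
            p.suffix.lower() or "(no extension)"
            for p in Path(root).rglob("*") if p.is_file()
        )
        return counts.most_common(top)

    # Hypothetical usage: print the top-20 format counts.
    for ext, n in format_distribution("/data/archive"):
        print(f"{ext:>16}  {n}")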
12:24
So that is perhaps the format that you would think of as real data. But I must say, for an archaeologist, if he doesn't have his images or his GIS file with the plan of the site that he has been excavating,
12:42
he would be nowhere; that is the central data for him. But then the question arises, of course: what is the interoperability of a PDF file? How do we measure the interoperability of an image? Is that even possible? And wouldn't that be completely different from the interoperability of a structured data file?
13:03
I do not yet have an answer to this question, but perhaps you can help me. As for grey literature, I think a lot of the material will also be in the non-data category,
13:21
and especially interoperability will be quite a difficult thing to measure. Now, as I said, I'm really at the end of my talk, except for the thank you. Because the FAIR concept has become so popular in the data world, there are a couple of different approaches going on at the moment, in different communities around the globe, about how to measure it.
13:46
And we don't all agree. The DANS approach is very pragmatic, simple, straightforward, I would say. But the original inventors of FAIR, now united in the GO FAIR metrics group, have a much higher ambition.
14:03
So what I would like to propose is that, instead of starting to fight in silos and disagreeing about the nitty-gritty of the details, we allow for some kind of FAIR framework, in which we have, perhaps at the highest level, the GO FAIR metrics, which really try to take a fully automatic, machine-based approach to measuring FAIRness,
14:30
and the rather straightforward, simple DANS approach of a questionnaire asking users, but also data specialists working in repositories, to give their scores.
14:43
And that was my talk. Thank you very much.