
Data Quality Assurance at ARCHE


Formal Metadata

Title
Data Quality Assurance at ARCHE
Number of Parts
4
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
The talk was given at the re3data COREF / CoreTrustSeal Workshop on Data Quality Management at Repositories that took place on October 5, 2022, 12:00-15:30 UTC.
Transcript (auto-generated)
So, I'm here to talk about ARCHE, that's how we pronounce it, but say it as you please. Just a few words about it: it's a repository, obviously, and its main purpose is long-term archiving and dissemination.
Our scope is basically digital humanities research data that is somehow related to Austria, whether geographically or historically, or because it was produced by researchers working at an Austrian institution, but we don't categorically decline collections that are not directly linked to Austria.
ARCHE is a rather new repository; it has been around since 2017 and, as already mentioned, it is hosted by the Austrian Academy of Sciences. At the moment we have 32 different collections, ranging from archaeological to linguistic and historical ones.
It is a very highly curated repository. There's quite a lot of manual work involved, and we always work very closely with the depositors.
So there is always personal contact, and data gets into ARCHE only through us, the data curators: users cannot change anything without us doing it for them.
I tried to follow your categories for quality monitoring, but that was before I heard all of your wonderful presentations, so bear with me if I didn't categorize everything exactly right. About the definition: as mentioned, we have our collection policy, which specifies which collections we accept.
Mostly that is about the scope, but also about how endangered the data is. A big topic right now is copyright, because in order to deposit things in ARCHE, the depositor must sign a deposition agreement stating that the copyright has been cleared.
This is a real issue for us, because we often archive projects that were digitization projects. They created digital retro copies of older material, so the authors of the originals are not the ones who made the copies.
It is therefore always a question whether the copyrights were cleared when we decide whether to accept the data. We also publish our deposition process, together with a list of which data formats we prefer and which ones we merely accept; accepted formats are those we are okay with, but we still have to convert them to a preferred format for archiving. Finally, we require a minimum set of metadata for each resource as well as for the whole collection.
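To make the accepted-versus-preferred distinction concrete, here is a minimal sketch of what such a format policy could look like in code. The specific format pairs are illustrative assumptions drawn from common archival practice, not ARCHE's actual policy list.

```python
# Illustrative sketch: accepted formats mapped to the preferred archival
# format they would be converted to before archiving. The pairs below are
# assumptions based on common archival practice, not ARCHE's real policy.
PREFERRED = {".tif", ".xml", ".csv", ".txt", ".pdf"}  # archived as-is
ACCEPTED_TO_PREFERRED = {
    ".jpg": ".tif",    # lossy image -> uncompressed TIFF
    ".docx": ".pdf",   # office document -> PDF
    ".xlsx": ".csv",   # spreadsheet -> plain CSV
}

def archival_target(extension: str) -> str:
    """Return the extension a file would be archived under."""
    ext = extension.lower()
    if ext in PREFERRED:
        return ext
    if ext in ACCEPTED_TO_PREFERRED:
        return ACCEPTED_TO_PREFERRED[ext]
    raise ValueError(f"format {ext!r} is neither preferred nor accepted")

print(archival_target(".docx"))  # -> .pdf
```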
Now for how we make sure the incoming data is good, that is, how we maintain quality within our group: for instance, we track our tasks.
We have a Redmine system, a project management system where you can create tasks and subtasks, and we have a specific curation template there, so that we make sure we don't skip a step when we are curating a collection.
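As a rough illustration of such a template, here is a minimal sketch that creates a curation task with one subtask per step through Redmine's REST API. The server URL, API key, project identifier, and step names are made-up placeholders; only the endpoint and fields (POST /issues.json, project_id, subject, parent_issue_id) are standard Redmine.

```python
import requests

# Hypothetical placeholders: server URL, API key, project id, step names.
REDMINE = "https://redmine.example.org"
HEADERS = {"X-Redmine-API-Key": "YOUR_API_KEY"}
CURATION_STEPS = ["Check file formats", "Validate metadata",
                  "Second-curator review", "Publish to production"]

def create_issue(subject: str, parent_id: int | None = None) -> int:
    """Create a Redmine issue and return its id."""
    issue = {"project_id": "curation", "subject": subject}
    if parent_id is not None:
        issue["parent_issue_id"] = parent_id  # makes it a subtask
    resp = requests.post(f"{REDMINE}/issues.json",
                         json={"issue": issue}, headers=HEADERS)
    resp.raise_for_status()
    return resp.json()["issue"]["id"]

# One parent task per collection, one subtask per curation step,
# so no step can silently be skipped.
parent = create_issue("Curate collection: example-collection")
for step in CURATION_STEPS:
    create_issue(step, parent_id=parent)
```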
We also have regular team meetings, and since May we have been sitting in new offices, which is very nice, because all the curators and developers are just across the hall in another room, and it's very easy to talk to them about how we can solve certain issues.
We also participate in seminars here at our institute, because very often the collections we get come from the Academy of Sciences itself and its different institutes. We have a research lunch where we present ARCHE, so that the researchers at the institute become familiar with
the form in which they have to deliver the data to us, so that we can then archive it. How do we do quality control? For the files themselves, we use dedicated software tools. We have our own script,
which we call the repo-file-checker. It checks whether the filenames are okay, checks the file formats, and also searches for duplicates and things like this.
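As a rough sketch of the kinds of checks such a tool performs, here is a small Python example covering filename rules, an allowed-format list, and duplicate detection by content hash. The concrete rules and extensions are assumptions for illustration, not the actual repo-file-checker logic.

```python
import hashlib
import re
from pathlib import Path

# Illustrative rules; the real repo-file-checker rules may differ.
FILENAME_RE = re.compile(r"^[A-Za-z0-9._-]+$")   # no spaces or special chars
ALLOWED_EXTENSIONS = {".tif", ".xml", ".csv", ".txt", ".pdf"}

def check_collection(root: Path) -> list[str]:
    """Return a list of problems found in a collection directory."""
    problems = []
    seen_hashes: dict[str, Path] = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        if not FILENAME_RE.match(path.name):
            problems.append(f"bad filename: {path}")
        if path.suffix.lower() not in ALLOWED_EXTENSIONS:
            problems.append(f"unsupported format: {path}")
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen_hashes:
            problems.append(f"duplicate of {seen_hashes[digest]}: {path}")
        else:
            seen_hashes[digest] = path
    return problems

for problem in check_collection(Path("collection/")):
    print(problem)
```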
Then, on the ingestion level, when we have the metadata as well, we have the so-called Doorkeeper, a system that checks your ingestion files. It has become very strict, so getting through on the first attempt is cause for celebration.
It checks the metadata, because we have an ARCHE schema that the metadata has to follow: it checks whether all the mandatory fields are filled out and whether the values have the correct data type, in the sense of whether a link was used where one was required, or free text. It even checks whether language tags were applied where necessary, so as I said, it's very strict.
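Here is a minimal sketch of that kind of schema-driven metadata validation: mandatory fields, link-versus-text data types, and language tags. The field names and rules are invented for illustration and do not reproduce the actual ARCHE schema or the Doorkeeper implementation.

```python
from urllib.parse import urlparse

# Hypothetical mini-schema: field -> (mandatory?, expected kind).
# "uri" means the value must be a link, "langstr" a language-tagged text.
SCHEMA = {
    "hasTitle":   (True,  "langstr"),
    "hasLicense": (True,  "uri"),
    "hasCreator": (False, "text"),
}

def validate(record: dict) -> list[str]:
    """Return a list of schema violations for one metadata record."""
    errors = []
    for field, (mandatory, kind) in SCHEMA.items():
        value = record.get(field)
        if value is None:
            if mandatory:
                errors.append(f"{field}: mandatory field missing")
            continue
        if kind == "uri" and not urlparse(str(value)).scheme:
            errors.append(f"{field}: expected a link, got free text")
        if kind == "langstr" and not isinstance(value, dict):
            errors.append(f"{field}: missing language tag, expected e.g. {{'en': ...}}")
    return errors

record = {"hasTitle": {"en": "Sample collection"},
          "hasLicense": "not a link"}
print(validate(record))  # -> ['hasLicense: expected a link, got free text']
```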
What we also do is always let a second curator check our work, because when you are dealing with one collection for a long time, there are things that you simply overlook. So it's always nice if someone else goes through it to check whether the structure makes sense, whether there is enough information for the data to be reused, and whether all the legal things are okay.
This is also done manually: we have now created a checklist that the second curator goes through to see if everything is as it should be. And we normally work very closely with the depositors, which means that we first prepare the collection.
We put it on our curation instance so the depositor can have a look, and once both sides are okay with it, we move everything to the production instance and finally publish it.
So yes, as I said, we work closely with the depositors. We always inform them if we have to make changes or if information is lacking somewhere. We also constantly develop the ARCHE metadata schema, which means that twice a year we hold clean-up weeks, where we look at all the collections, try to detect anomalies or anything that went wrong, and correct it during these metadata clean-up weeks.
Now for the evaluation: we don't have specific surveys yet, but we do get user feedback, often from the depositors while they are checking their own collections as they are being published. We also have some analytics with Matomo, and we are going to implement OpenAIRE usage tracking as well, because ARCHE is harvested by
OpenAIRE and also by other portals such as Europeana, and we are now in the process of joining the ARIADNE portal, which is an archaeological portal.
Yes, and about the documentation: all the reports generated by the software are saved in a dedicated place, and we also keep a curation log in our Redmine system, where
we manually write down all the important information, the decisions we take, and the changes we make. We also keep the ingestion logs, because the Doorkeeper provides us
with a list of everything that was wrong or right with the collection. So yes, I don't know if I managed the time, but that was it, very briefly and very quickly.