TA3 - Repositories

Formal Metadata

Title
TA3 - Repositories
Title of Series
Number of Parts
5
Author
License
CC Attribution - ShareAlike 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this

Transcript: English(auto-generated)
I want to present to you a summary of what we did in our task area since our last consortium meeting. And I will also say a few words and show some slides about the broader picture and our vision and objectives and so on.
Because I think that it is also quite important to have these in mind in our current situation. We are now in a very productive phase of the project, and we should recap them.
Here you see TA3 in the overall picture, you could say, and note the highlights in yellow. In TA3 we are dealing with research data repositories, which involves many aspects, of
which the two most prominent ones are data publication and long-term archiving of data. And I'm leading TA3 together with Matthias, Matthias Razum, who can't be here today. And as you see on this image, like Christoph already mentioned, I was working at KIT
before and then I switched to FIZ Karlsruhe and I'm heading the research data department there now. And here you just see another representation of the same content, which you will hopefully then recognize if you go to the poster session.
Because we suggested adding this image to all of the posters, just to be able to easily identify the corner of NFDI4Chem in which the poster is situated.
Our vision was and still is to make research data FAIR, and wherever possible we want to make it open as well. And to enable this we need to keep the data in a safe place, right?
So we need to offer storage, we need to offer long-term archival, and then we need easy possibilities for the chemists to publish their data together with contextual metadata. And to get a DOI to make references possible like citations and so on.
And other chemists or scientists from other fields as well, they need to be able to discover these data sets then and to get access to them, to reuse them or to reproduce research outcomes.
And we want to make sure that the data and also the services are interoperable. And of course, these services need to be reliable and sustainable. And the repositories should also be connected.
What does that mean? There should be connections between the individual repositories, and our smart lab environments should also be able to transfer data directly to the repositories to ingest data. And at the beginning of NFDI4Chem, we thought that
if we could achieve all these things, we could realize FAIR chemistry research data. And I think these are still the main things we should focus on. And that also fits our objectives.
Data publishing and archiving in trusted repositories must be easy for chemists and long-term data access must be ensured. Researchers should be able to ingest, to search and to annotate and also to exchange research data or metadata across distributed data sources by means of a virtual environment of federated services.
And in this case, federated repositories. And we will go into the details of this federation in a dedicated session tomorrow.
But how are we getting FAIR? What are the concrete measures and what is their status? Let's have a look at this. So what we always planned was to adapt metadata schemas. Oh, sorry, I'm on the wrong slide.
I'm sorry. To adapt the metadata schemas of each selected chemistry repository. In the next talk about TA4, standards, Stefan will talk about that, I guess. And what is already decided is, sorry,
I'm totally on the wrong slide. I don't know why. I'm really sorry. What is already decided is that a good candidate for general metadata will be the DataCite format.
And we have ongoing discussions and work on the chemistry-specific metadata and how the two fit together. And as a good candidate, we identified Bioschemas.org. And we are now working on technical solutions in close cooperation with the other task areas, in this case task areas two, four and six.
And we wanted to design and implement application programming interfaces, APIs for short. And the concrete work on that is currently limited to some of the core repositories.
And we plan a workshop on interfaces with all the interested repositories, together with all other involved task areas. This will be next year, hopefully. And we want to integrate an authentication and authorization infrastructure, AAI for short.
And here we are working together with the IAM working group, which stands for Identity Access Management. That is dealing with this topic on an NFDI-wide level in the NFDI section, Common Infrastructure.
But since it will take much too long to agree upon a one-size-fits-all solution and to realize that solution as a service, we are connecting the services to the DFN-AAI for now,
just in the meantime, and we hope that the DFN will later implement and operate an NFDI-wide solution so we can connect to it easily. We also have some chemistry-specific ideas of what we need here.
And Michael Glick will talk about that a bit in tomorrow's Federation of Services session. And we have already started with setting up this Federation of Interoperable Services.
And we will present more on that. I will present an overview of that tomorrow. And as a final measure, we planned to migrate all data with respect to the new metadata schema. This is something that will have to wait until we have finalized MIChI. This stands for Minimum Information for Chemical Investigation.
Stefan will say a few more words about that. And MIChI, and also metadata in general, is one of our current topics of focus, as well as the APIs and the interfaces between ELN and repositories.
Another one is the publication of distributed data sets and their linkage to publications and to each other using DOIs and other persistent identifiers.
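As a minimal sketch of how such links between the parts of a distributed dataset could be recorded, the following Python snippet builds DataCite-style `relatedIdentifiers` entries. The DOIs and the helper names are hypothetical, and the concept the task area finally adopts may look different. `HasPart` and `IsPartOf` are relation types from the DataCite Metadata Schema.

```python
def related(doi, relation_type):
    """One entry for the relatedIdentifiers list of a DataCite record."""
    return {
        "relatedIdentifier": doi,
        "relatedIdentifierType": "DOI",
        "relationType": relation_type,
    }


def link_distributed_dataset(parent_doi, part_dois):
    """Map every DOI to the links it should carry (illustrative helper)."""
    links = {parent_doi: [related(part, "HasPart") for part in part_dois]}
    for part in part_dois:
        links[part] = [related(parent_doi, "IsPartOf")]
    return links


# Hypothetical DOIs: one parent record and two parts held in other repositories.
PARENT = "10.1234/parent-dataset"
PARTS = ["10.1234/nmr-part", "10.1234/remaining-files"]

links = link_distributed_dataset(PARENT, PARTS)
for doi, entries in links.items():
    for entry in entries:
        print(doi, entry["relationType"], entry["relatedIdentifier"])
```

Because every part points back to the parent and the parent lists every part, a reader who lands on any one record can find the others.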
I already mentioned the AAI, and we are working on technical solutions to make the repositories' metadata harvestable using the OAI-PMH protocol. And these are the core repositories in NFDI4Chem.
We have Chemotion, nmrXiv, RADAR4Chem, MassBank, SupraBank and STRENDA DB, which cover the main sub-disciplines of chemistry and also the data formats they deal with. And we have NOMAD, which provides sort of a bridge to the world of materials science.
Let's go through these repositories in short. One of our flagships is of course the Chemotion repository, which was created as early as 2014 at KIT. It is closely connected to the Chemotion ELN and uses the same data structures, which makes it really easy for ELN users to publish their work.
And it concentrates on processes and analytics, molecules, reactions and research data in general, additional research data.
And since last year, we see that the publications in CHEMotion are increasing. We have 536 more samples now and we have 475 more reactions and 2871 more analyses in it.
And the development is done in close cooperation with the ELN development team. The core developers are Pei-Chi Huang and Chia-Lin Lin. Chemotion allows single sign-on, just like the ELN, as Nicole mentioned, with ORCID and GitHub accounts, as well as with KIT accounts for now.
But a Shibboleth integration is also in the works. It will be integrated soon to allow federated logins.
Chemotion also already provides metadata harvesting over OAI-PMH. It publishes its records in the Dublin Core format and now also supports the OAI DataCite metadata prefix.
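To make the harvesting idea more concrete, here is a small Python sketch of an OAI-PMH `ListRecords` request and of how a harvester might read the response. The endpoint URL and the sample response are made up for illustration and are not the interface of any particular repository; only the `verb`, `metadataPrefix` and `resumptionToken` arguments and the XML namespaces come from the OAI-PMH 2.0 and Dublin Core specifications.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical endpoint; substitute the repository's real OAI-PMH base URL.
BASE_URL = "https://repository.example.org/oai"

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL."""
    if resumption_token:
        # resumptionToken is an exclusive argument: only the verb goes with it.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


def parse_records(xml_text):
    """Return (identifier, title) pairs and the resumption token, if any."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        identifier = rec.findtext("oai:header/oai:identifier", default="", namespaces=NS)
        title = rec.findtext(".//dc:title", default="", namespaces=NS)
        records.append((identifier, title))
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    return records, token or None


# Made-up response so the sketch runs without network access.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example NMR dataset</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

records, token = parse_records(SAMPLE_RESPONSE)
print(list_records_url(BASE_URL))
print(records, token)
```

A real harvester would fetch each URL, call `parse_records` on the body, and repeat with the returned resumption token until none is left. Requesting `oai_datacite` instead of `oai_dc` as the metadata prefix would ask for DataCite records where a repository supports that prefix.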
And published datasets of course get a DOI and DataCite metadata. It is also connected to the storage of KIT's data center, so the data is safe. And it provides several interfaces: of course a web-based user interface, which is open to
the public for searching, and registered users may also publish or attach characterization data in it. It also has a REST API, and a bulk download functionality over FTP is planned.
Then we have the rather fresh nmrXiv, an NMR spectroscopy data repository and also an analysis platform, you could call it. Its three developers are Chandu Nainala, Nisha Sharma and Noura Rayya.
And regarding AAI, nmrXiv allows single sign-on with Twitter and GitHub and in theory also with ORCID, which is currently disabled because ORCID does not provide the email address of the user by default, which makes it a bit complicated for the users.
And other identity providers can easily be added, that's not a big thing. And metadata harvesting over OAI-PMH is currently not available here but planned.
Current work also includes a Bioschemas and DataCite mapping, you could say, and hopefully there will soon be converters available between the two. And DOIs for datasets that are published in nmrXiv will soon be available.
And archiving is also planned via the Google archive storage and additionally in a local storage at Jena. And nmrXiv can be used via a browser interface.
You can find it under nmrxiv.org and it also has a REST API. And it supports, of course, several standards in the field of NMR data. In MassBank EU, which started in 2006, users can find mass spectra of known, unknown and provisionally identified substances.
And we see some increased numbers here since 2021. We have 287 more compounds in here and we have 3,617 more spectra, which is really an impressive number.
All development and also the operation is done by the two main developers, René Meier and Tobias Schulze. Of course, always with support by the community.
And metadata harvesting via OAI-PMH is currently not implemented, but individual records are described in JSON-LD in the Bioschemas.org format. And MassBank does not use DOIs for now.
I think we could make that happen quite soon, but let's see. Regarding data archiving, it relies on Zenodo and GitHub, where it is also hosted, you could say, which is not so bad.
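To illustrate what a record described in JSON-LD can look like, here is a minimal Python sketch that builds and serializes a schema.org `MolecularEntity` description, the type that the Bioschemas MolecularEntity profile builds on. This is not the actual markup of any repository: the record URL is hypothetical and the chosen properties are only an assumption about what a minimal description might contain.

```python
import json


def molecular_entity_jsonld(name, molecular_formula, url):
    """Build a minimal schema.org MolecularEntity description as a dict."""
    return {
        "@context": "https://schema.org",
        "@type": "MolecularEntity",
        "name": name,
        "molecularFormula": molecular_formula,
        "url": url,
    }


# Hypothetical record URL; a real one would point to the record's landing page.
record = molecular_entity_jsonld(
    name="Caffeine",
    molecular_formula="C8H10N4O2",
    url="https://repository.example.org/records/EXAMPLE-0001",
)

# JSON-LD is usually embedded in the landing page inside a
# <script type="application/ld+json"> element; here it is only serialized.
serialized = json.dumps(record, indent=2)
print(serialized)

# A harvester or search engine can read it back with any JSON parser.
parsed = json.loads(serialized)
print(parsed["@type"], parsed["molecularFormula"])
```

Since the description is plain JSON with a shared vocabulary, it can be indexed even where no OAI-PMH interface exists yet.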
And users can access it using a web-based user interface and can also interact with it using the Git protocol because it's hosted on GitHub. And soon a REST API will be realized. Then we have SupraBank, which started in 2019.
It's meant for supramolecular interaction data, mostly from the literature. The development is currently rather slow since only one developer is working on it, as I heard, and in part-time.
And the hiring process is still ongoing. Still, it has nearly 400 datasets in it and nearly 40,000 interactions and 1,700 compounds. There's no single sign-on method currently provided, and no metadata harvesting is implemented here.
Maybe we can help with that soon. But data publication is realized with DataCite DOIs, so we could harvest the metadata via DataCite.
There's no archiving implemented yet. Maybe we can help with that as well. And the interfaces it provides are a web-based user interface for ingest and metadata input, and the data can also be exported in several formats.
Then there is STRENDA DB. It was and is still developed for functional enzymology data by currently one developer. And it's operated at the Beilstein-Institut. It has no single sign-on AAI yet.
Let's get together and make this happen, if you wish to have that. That could help a lot in integrating it into the federation. We also have metadata here at DataCite, because DataCite DOIs are used for publishing between 20 and 30 datasets per year that are then curated.
And it also has a browser-based user interface, and an API is also planned here. And besides the DataCite metadata and InChI, it will hopefully soon use EnzymeML, which is currently under development.
Then we have NOMAD. The NOMAD repository was created in 2006 and concentrates on raw data from materials science that is then automatically processed after uploading.
And then several analysis tools can be used, which could be useful for some chemists, maybe in theoretical chemistry or computational chemistry. And it contains over a thousand datasets now and more than 12,000 entries.
And it's developed in the FAIRmat consortium, so it has some funding there. It uses the Keycloak software to realize an AAI, which is quite promising. So if you're looking for a solution, maybe you should also look at Keycloak.
And metadata harvesting is also planned here. And selected datasets can be published and then get a DataCite DOI. And the data is archived on tape in the long term. And users can, of course, use a graphical user interface and also several APIs, which are based on BCAT.
Then we have RADAR4Chem. RADAR4Chem is rather new. We started it in March this year. And since it is based on the generic RADAR repository, it has no limitations regarding its content.
So you can just place any file there. We offer data publication for chemists free of charge. And like RADAR, RADAR4Chem is developed in my team at FIZ Karlsruhe by seven developers and one product manager.
And it's also operated by our data center and also partially at KIT. Shibboleth is used to offer federated single sign-on logins via the DFN-AAI.
But additionally, users without such an account can always just register with the email address. And we have a fine-grained rights and roles concept that is quite useful. And we have special processes for reviewing data before their publication.
And metadata harvesting is offered as well, using our own custom and very scalable OAI-PMH provider that other repositories could, of course, also use. So if you are interested in using it, just get in touch with us and we can provide the software.
We use several persistent identifiers in it. Of course, the DataCite DOIs for published datasets, but also Handles, and ORCID iDs for persons.
The Crossref Funder Registry is included and ROR is used for institutions. And data is archived on tape with two copies at KIT, at the SCC, at two different locations. And we have one additional copy in Dresden.
Both institutions take over the costs for their storage offers as NFDI4Chem co-applicants. And the same applies, of course, for the whole operation that we do at FIZ Karlsruhe. And data is kept here for a minimum of 25 years, which is quite some time.
And RADAR4Chem has, like all the other repositories, a web-based user interface for data ingest and annotation with metadata. It has a REST API and also, like I said, an OAI-PMH interface for harvesting of metadata. So we are able to make the rather generic RADAR metadata schema harvestable.
That is based on DataCite, Dublin Core and schema.org. And at the moment, we are also starting the development of an editor for more discipline-specific metadata schemas.
So you can create your own schema then with it. And not only the schema as an XML file or so, but also the input form in one step. So you can create your own form for the input of metadata.
And we will announce it when it is available, of course. And I hope that many users will then test it. RADAR4Chem has many additional features that I won't talk about in detail now.
But here you see a screenshot of the metadata input form that it has right now. So you see it's rather generic. If you want to try it, you can just fill in the form you see here, which you will also find on our NFDI4Chem website.
And of course, you will get support from us if you need it. If you are now a bit overwhelmed by all these repositories, and you ask yourself which is the right one for you,
you can find a lot of information about the accepted data formats and the standards used, and also about the sub-disciplines of chemistry, in our knowledge base. And we created this graphic you see here, which will help you without much reading if you want to decide quickly.
And it is also on our repositories poster, so check it out there and we can discuss the repositories. There is one very important thing that I already mentioned, and I want to show an example here,
which is the distribution of datasets across different repositories, which we are currently working on. And here you see an example for an NMR dataset that is created in the lab. So you have the data in the Chemotion ELN in the first place, and then you want to publish it in the Chemotion repository.
And now the NMR part of the data could be used in nmrXiv, which may provide special analysis functionality and so on that you want to use.
And the rest of the data could, for example, go to RADAR4Chem or somewhere else. And now we have a distributed dataset, and there should be links between all of its parts to keep its creation history and to be able to find the other parts,
for example, from nmrXiv in this example. And we are currently developing a concept for that, and Nicole and I will describe it a bit in tomorrow's Federation session. And what are the big challenges we have?
So there's, of course, staff recruitment that took really, really long in the beginning of the project. And it is really hard to find good personnel that have experience in software development, in research data management and ideally also a background in chemistry.
So this is nearly impossible. So the staff recruitment is still ongoing, for example, at KIT SCC and KIT INT. And also there's a lot of diversity of programming languages in the repositories. Different frameworks are used and different operation environments are involved.
And another challenge is that some repositories just don't have the flexibility for fast developments that the others have. So the question for us is how can we support the development from TA3?
As I said, we can, for example, offer the OAI-PMH software and help the repositories to use it to enable harvesting, just as an example. And if you need any other help, be it expertise or programming or anything else, please just contact us.
We'll try to help you to join the Federation of Services and provide the needed interfaces for it. And here you see a list of things that TA3 has achieved recently.
We have the new RADAR4Chem up and running, as I said. There was a press release about this in March. We interviewed all the repository operators and we created profiles of each repository. And you will find these in our NFDI4Chem knowledge base, as well as our article
on how to choose the right repository, together with the decision tree I showed. And we identified the interfaces that are needed to set up this Federation of Services that we are currently building.
And we are offering our FIZ Karlsruhe OAI-PMH provider software and, of course, its documentation. And we can also offer a workshop on its setup for interested parties.
There are several working groups and ongoing interaction, of course, with all the other task areas. And we are also involved in Editors4Chem, for example, where we talk with the journals. And, yes, this is just a list of the participants in TA3.
However, as you can guess, many more people are working in TA3. All the developers and the data stewards and so on. And the task area borders are also beginning to dissolve more and more now, which is a really good thing, I think.
So much for TA3. And if you're interested in the repositories and our work, visit our posters and discuss with us. And please join our Rocket.Chat if you haven't done so yet. Because there we are always available for questions and support.
Thanks a lot.