
ZENODO & OpenAIRE


Formal Metadata

Title
ZENODO & OpenAIRE
Series Title
Number of Parts
24
Author
License
CC Attribution 3.0 Germany:
You may use, modify, and reproduce the work or its content in original or modified form for any legal purpose, distribute it, and make it publicly available, provided you credit the author/rights holder in the manner specified by them.
Identifiers
Publisher
Publication Year
Language
Production Year
2014
Production Place
Nancy, France

Content Metadata

Subject Area
Genre
Transcript: English (automatically generated)
Today, lots of research data is being lost. It can look like this: a stolen laptop with crucial scientific data plus many years of research work inside. Or it can look like this; I'm not sure you can spot the USB pen drive lying here in the parking lot. Or it can look like this: a researcher cleaning up his drive because he needs a bit more space. Or, in the more dramatic area, there's a building burning down with the offices of about 800 researchers because of a short circuit in a coffee machine. So the fact is that there are lots of ways to lose your research data. The problem we're trying to address with Zenodo is to make archiving of your research output very easy, especially for the long tail of science: all the research outputs that don't really fit into the existing subject-based data repositories. There's the publish-or-perish paradigm stating the need to publish in order to advance your career, and while that holds for publications, I think we all know that it's really not so for data. A study from 2010 showed that only around 20% of researchers actually store their data in a proper digital archive. And this is a massive amount of data: we are talking petabytes to exabytes of long-tail research data that is not being properly stored today. Here is a small comparison between that and the entire CERN data archive. So in my world, that basically means that all this research data is lost.
If it's not in a proper digital repository, then all this data is lost to the future. There are many reasons why researchers don't share, and I'm sure you've all heard your share of excuses. I think two important ones are that it's hard, so you have to spend a lot of time on it, and if you spend a lot of time on it you don't really get credit for it. So these days a lot of data journals are popping up. They allow researchers to properly describe their datasets, which method they were collected with, what their attributes are and so forth, and to put a publication on the publication list that they can then use to get a better job in the future. So they're trying to address this need. What we, especially with Zenodo, try to address is to make it as easy as possible to put your dataset into a proper digital repository. We also believe that we're not the final solution. This is just the first step, but we really believe that without capturing the content we can't improve the way we capture it. So it's really just a first step, a first pragmatic approach we are taking here. So what does it look like for a researcher?
Well, you basically just upload your files, you describe them, and you publish them. It's very simple. If we look into the upload process, we support files of up to roughly two gigabytes per file. You can upload as many files as you want; we don't have any upper limit at the moment, and it's not been a problem so far. We also allow you to use your existing GitHub or ORCID account to sign in to Zenodo or register for an account, lowering the barrier so that you don't have to create a local Zenodo account. Also, if you have your files in a Dropbox somewhere, we can grab them straight from that Dropbox. Once you've uploaded the files, we go on to describing the dataset.
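To make the upload flow concrete, here is a minimal sketch against Zenodo's present-day REST deposit API. The API itself is not covered in the talk, so the endpoint details and the access token are assumptions to be checked against the current Zenodo documentation:

```python
import requests

BASE = "https://zenodo.org/api"
TOKEN = "..."  # hypothetical personal access token

# Create an empty deposition to attach files and metadata to.
r = requests.post(f"{BASE}/deposit/depositions",
                  params={"access_token": TOKEN}, json={})
r.raise_for_status()
dep = r.json()

# Upload one file into the new deposition.
with open("results.csv", "rb") as fp:
    r = requests.post(f"{BASE}/deposit/depositions/{dep['id']}/files",
                      params={"access_token": TOKEN},
                      data={"name": "results.csv"},
                      files={"file": fp})
    r.raise_for_status()
```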
We support basically any kind of research output: publications, presentations, posters, datasets, software, images, videos; you name it and we support it. You also have to specify the access rights to your research output, which means you can deposit it as either open access or closed access, because we believe that if we can capture the data even though it's closed access, we can perhaps later convince the author to switch it to open access. Another feature we have is that you can pre-reserve a DOI, which is basically a promise from us that once we register a DOI for your dataset, it's going to be this DOI. This allows people to include the DOI in a readme file, in a presentation, or wherever they need to know the DOI they are eventually going to get. Besides that, it's basically the minimal DataCite requirements: you have to specify authors, title, and description, and that's all. We do support more metadata, and it's definitely something we are planning to expand on in the future, namely with automatic extraction, but the barrier for a researcher to deposit a dataset is very low.
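Continuing the sketch above, the describe-and-publish steps might look as follows; `prereserve_doi` mirrors the pre-reserved DOI feature described here, but again the exact field names are assumptions based on the current API documentation rather than the talk:

```python
# (Continues the sketch above: requests, BASE, TOKEN, dep.)

# Attach the minimal metadata: type, title, authors, description.
metadata = {
    "metadata": {
        "upload_type": "dataset",
        "title": "My measurement series",          # illustrative
        "creators": [{"name": "Doe, Jane"}],       # illustrative
        "description": "Raw measurements, CSV.",
        "access_right": "open",
        "prereserve_doi": True,  # ask Zenodo to reserve the DOI up front
    }
}
r = requests.put(f"{BASE}/deposit/depositions/{dep['id']}",
                 params={"access_token": TOKEN}, json=metadata)
r.raise_for_status()
# The reserved DOI can now be copied into a README before publishing.
print(r.json()["metadata"]["prereserve_doi"]["doi"])

# Publish: the upload goes online and the DOI is registered.
r = requests.post(f"{BASE}/deposit/depositions/{dep['id']}/actions/publish",
                  params={"access_token": TOKEN})
r.raise_for_status()
```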
Then you hit submit, your upload goes online immediately, and we register a DOI straight away. Once it's online, we have integrated article-level metrics from altmetric.com, and it's not so much about the number it provides; it's more that it actually allows you to discover what is going on in social media around your uploads. We also mint a DataCite DOI, and we allow you to link your upload with funding information. Currently we only support FP7-funded research, but that is something we are planning to expand on in the future.
This is also something I'd like to talk a little bit more about, since Zenodo started its life as part of the OpenAIRE project, which is specifically about linking research output with funding information. So when you specify this kind of information, the funding information, we export your records into OpenAIRE. It's a big pan-European project with a lot of partners, and it was set up to support the European Commission's mandate for open access. In a nutshell, it tries to answer the question: what is the output of EU-funded research? That's the basic question I think OpenAIRE tries to answer. How much of it is open access? How much of it is closed access? To do that, it requires authors to deposit in a repository, which can be an institutional or a subject-based repository, and it then harvests all these repositories. At the moment we're harvesting something like 460 repositories, and for the data repositories DataCite actually plays quite a crucial role, since we harvest the entire DataCite metadata store and try to link it with publications.
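Harvesting like this is typically done over OAI-PMH. The talk doesn't show code, but a minimal sketch, assuming the third-party Sickle client and Zenodo's public OAI endpoint, could look like this:

```python
from sickle import Sickle  # pip install sickle

# Zenodo exposes its records over OAI-PMH; OpenAIRE-style harvesters
# iterate over such endpoints repository by repository.
harvester = Sickle("https://zenodo.org/oai2d")
for record in harvester.ListRecords(metadataPrefix="oai_datacite"):
    # Each record carries DataCite metadata that can be mined for
    # access rights and funding links.
    print(record.header.identifier)
    break  # just show the first record in this sketch
```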
To do the linking, you basically need two crucial pieces of information. First, we use the DataCite metadata schema to specify access rights: there's a vocabulary for saying that a record is open access. If you happen to know that it's under a Creative Commons license, you can also stick in the Creative Commons license, and then we also know that it's open access. That's one piece of information. The other piece we need is the funding, and when we started this there was nothing called FundRef. So we put a scheme on top of the DataCite metadata schema to be able to link down to a specific grant. It's also pretty simple: it just uses the contributor field and adds a layer and a vocabulary for specifying the actual grant. In the future, once FundRef is more mature, it's definitely something we're going to look at more. One of the big reasons we adopted the DataCite metadata schema for exporting data is that it is very strong on linking. Under the hood, OpenAIRE basically tries to link publications, data, grants, people, and organizations, and the DataCite schema covered all the requirements we had, which was a big plus for selecting it.
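As a concrete illustration of those two pieces of information, here is a hedged sketch of how an open-access record with an FP7 grant link can be encoded on top of the DataCite schema, following the style of the OpenAIRE guidelines; the grant number is invented, and the exact element details should be checked against the guidelines themselves:

```python
# Sketch of the DataCite fragments an OpenAIRE-style harvester looks
# for: an access-rights vocabulary term, plus a grant reference layered
# onto the contributor field. Grant number 123456 is purely illustrative.
datacite_fragment = """
<rights rightsURI="info:eu-repo/semantics/openAccess"/>
<contributors>
  <contributor contributorType="Funder">
    <contributorName>European Commission</contributorName>
    <nameIdentifier nameIdentifierScheme="info">
      info:eu-repo/grantAgreement/EC/FP7/123456
    </nameIdentifier>
  </contributor>
</contributors>
"""
```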
So in the end, this allows us to create pages like this. We're not completely there yet, but basically it's a dataset with related publications down here, and the small part we're missing now is finding what actually funded it. But we basically have a big network of all these research outputs. In Horizon 2020 there's now an open data pilot, which means that for around 20 percent of the 2015 budget, I think, you're now required to come up with a data management plan. You're also required to deposit your data. There are all sorts of ways to opt out of it, but OpenAIRE is the infrastructure that is going to support this, and in Zenodo you can deposit your dataset if you have nowhere else to go.
The last thing I'd like to talk a bit about is software. We've been talking a lot about data, how hard it is to get researchers to deposit data, and how data is not cited, but the same goes for the software that is actually used to analyze the data. Just one example here, from astronomy, is Astropy. It's a community library developed by a lot of people, and I think two years ago they wrote a paper and asked people to cite it if they were using the software. Two years later it looks like this: it's pretty hard to see from below, but they had around 40 citations, which is pretty good for a two-year-old publication. However, they still think that's quite an underestimate. If you look at the GitHub repository where the source code is hosted, they have 336 people starring their project and 219 people working on forks of it, and if you look at the Python Package Index, where you can download the package, they have something like 60,000 downloads a month. So they think the number of citations somehow understates the real usage. On top of that, if you start looking into ADS, which is not the Archaeological Data Service but the NASA Astrophysics Data System, then from 2008 to 2014 there were only around 70 papers mentioning Astropy, while there were around 4,000 papers actually mentioning Python. So it's just an example showing that software is also heavily undercited: the people who actually spend a lot of time creating these things are not really getting credit for what they're doing.
Going further with software, you might have progressive researchers who stick their code on GitHub, a big source code collaboration platform, and then stick a link to the GitHub repository in a repository. And while this is great, because they open up their source code and make it publicly available, GitHub is not a digital archive. They have their danger zone: you can delete your repository, you can rewrite history, you can transfer it, you can make it private. So it's really not good for long-term archiving. What we have done is basically to allow people to sign in with their existing GitHub account, get a list of their repositories, and then flick a switch, and we'll automatically start preserving all the software they have in GitHub: every time they make a new release, we automatically grab it from GitHub and stick a DOI on it. We also have a badge that allows them to advertise inside GitHub that they have preserved this software and that it has a DOI. On top of that, of course, the software is only interesting if it's actually visible in the places where people expect to find it. So what we have also done is to be able to take a piece of software from GitHub, put it in Zenodo, and then export it again to the INSPIRE high-energy physics repository, where, via the DataCite linking, they are able to link up the paper with the software. So if a researcher comes to look at the paper, they can then find the associated software that has been preserved in Zenodo, and they can go further to GitHub to actually get the latest development version of it.
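The talk doesn't show the mechanics, but the flick-a-switch integration is essentially a webhook on GitHub release events. A minimal sketch of the receiving side, using Flask, might look like this; the route, the `archive_snapshot` helper, and the deposit step are illustrative assumptions, not Zenodo's actual implementation:

```python
import requests
from flask import Flask, request

app = Flask(__name__)

@app.route("/hooks/github", methods=["POST"])
def on_release():
    # GitHub posts a JSON payload for each "release" event once the
    # repository's switch has been flicked on.
    event = request.get_json()
    tarball_url = event["release"]["tarball_url"]

    # Take a snapshot of the released source tree ...
    snapshot = requests.get(tarball_url).content

    # ... and hand it to the archive, which mints a DOI for it.
    archive_snapshot(event["repository"]["full_name"], snapshot)
    return "", 204

def archive_snapshot(repo_name, tarball_bytes):
    # Illustrative placeholder: in Zenodo this would be a deposit like
    # the one sketched earlier, one per release.
    ...
```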
Finally, behind all of this is the CERN data infrastructure. Zenodo is a way for CERN to expose some of the existing things we're doing for our own community to the long tail of science. We are based on the Invenio repository system, which is basically running INSPIRE, the CERN Document Server, and some other very large digital repositories. We do bit-level preservation, all those sorts of things. So to sum up: we think we have a very easy-to-use interface; we really lower the barriers, and we've heard earlier today how important that is, because otherwise scientists don't do it. We have very low restrictions on the types of files you can upload, on format, and on license. One thing I haven't mentioned is the distributed community curation; if you want to know more about it I can tell you, but it's a way to allow you to help curate the content in Zenodo. And most of all, we also provide longevity: we are a very big organization with very large-scale operations, capable of absorbing some of this content. Thank you.

We do have time for some small questions. Does anyone have any?
Yeah, I saw you.

One simple question: software gets a DOI, but software has a lot of versions, so in the GitHub case, for instance, how do you handle versioning?

We're basically going with a snapshot model: we take a snapshot of the software on GitHub and put it in Zenodo, which basically means the source control history is lost. It's done for simplification. It might be that we could take the source repository and harvest that, but then you might get into problems in ten years, where you can't read this git repository with the then-current git version, and things like that. That's an open question. This is just the infancy, and there are still lots of issues in it. One is how you actually cite software; another is how you actually run it in ten years, which is an even bigger question. We have some projects at CERN where we're trying to put software together with a virtual machine and then be able to run that virtual machine in ten years. But yeah, it's not easy.

Thanks. My question is also about how you cite things. The idea being that your step two was describing, and in this description I can tell that this is a data table concerning this and that. How much of this description information, apart from the format, which I expect would be in the citation, is actually cited, or would need to be cited? Because you created a line that looked like a bibliographical reference, with creator, year, publisher, and identifier. Why not put the format beside those pieces of information, so that at the end of an article, in 20 years, when I find a reference to some dataset, I can also see whether that dataset is video or pure numbers or things like that?

Basically, what we lean on is what DataCite recommends for a basic data citation or software citation, and that doesn't prevent you from adding extra information to the citation. I think the important point here is to stick in the identifier, so that you can actually go back: with the identifier you can then look in the DataCite metadata, and there you get the resource type of what it actually is, for instance text, dataset, or software. So the extended information is available through the identifier.

Okay. Any more questions?
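That last answer, the identifier as the gateway to extended metadata, can be demonstrated with DOI content negotiation. This sketch, with an invented DOI, is an assumption based on the DataCite content negotiation service rather than something shown in the talk:

```python
import requests

# Resolve a DOI (invented for illustration) straight to its DataCite
# metadata instead of the landing page, via content negotiation.
doi = "10.5281/zenodo.12345"
r = requests.get(f"https://doi.org/{doi}",
                 headers={"Accept": "application/vnd.datacite.datacite+json"})
r.raise_for_status()

# The resource type tells a future reader whether the record is text,
# a dataset, software, and so on.
print(r.json().get("types", {}).get("resourceTypeGeneral"))
```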
About the DOI syntax: the suffix is always "zenodo" followed by a number of digits. Is that right, independently of the content? Is it always the same standard?

Yeah, and it's kept like that to keep it pretty simple, since you might add, remove, or change the metadata, but the identifier just stays the same.