Eefke Smit, STM Association at the DataCite summer meeting 2012
Formal Metadata
Title |
| |
Subtitle |
| |
Series title | ||
Part | 3 | |
Number of parts | 10 | |
Author | ||
License | CC Attribution 3.0 Germany: You may use, adapt, copy, distribute and make publicly available the work or its content, in unchanged or adapted form, for any legal purpose, provided you credit the author/rights holder in the manner they specify. | |
Identifiers | 10.5446/10501 (DOI) | |
Publisher | ||
Publication year | ||
Language | ||
Producer |
Content Metadata
Subject area | |
Genre |
Transcript: English (automatically generated)
00:00
Thank you very much, Jan. Indeed, I think on behalf of the publishing community I can say we're all very positive about what DataCite is doing, and it's always a pleasure to be here. I'll try to entertain you until lunch. I know that's a bit of an extra challenge, and nobody can be even half as entertaining as Andrew, so that's an extra
00:24
challenge there. But I wanted to give you some perspective, from a publisher's view, on how data and publications belong together. Since there's such a data explosion in the research community, life for researchers has become quite different, but life for publishers has also become quite
00:45
different. I want to start with a few examples of that, which Nature was so kind to let me borrow from them. For example, here is a paper roughly 60 years old. It was a very famous paper, often quoted, because it was the paper that revealed the
01:04
structure of DNA. 60 years ago such a groundbreaking paper counted one page, two authors, one figure, no data. Then we move on to roughly 10 years ago,
01:24
and that was when the human genome was completely unraveled. Nature dedicated a special issue to that, which counted 62 pages, 49 figures and 27 tables, and here you see
01:42
one page that lists all the people who contributed. You know, the first empty pages were already listing all those people. You can also see that we were really reaching the boundaries of what was possible on paper, because the issue has fold-outs and
02:00
everything, and could hardly capture this important step in science. Then we move on another 10 years, and that was when the knowledge of the human genome was celebrated after 10 years. Nature then put it in an iPad edition, not just because it was very cool; it is of
02:27
course very cool to do that, but there were also extra reasons. The reasons were that more than a thousand genomes were described and the raw data was enormous. Genomes are
02:40
captured in so-called SRAs, Short Read Archive elements, and this simply would never have fitted on paper anymore. Also, if you then zoom in on the electronic version of the paper, the main bit still looks very much like a traditional paper, but there's
03:01
so many extras to it: links that you can jump to, related information, figures, figure previewers, collapsible sections. The paper has really become very different, with so much digital information available. There are other examples too. For example, a lot of work has
03:27
been done in Manchester on the so-called Utopia Documents, and they are good examples where you can jump around within their interactive PDFs. Within that PDF, if you find a figure, you
03:40
can actually click through to the data underneath, and you can also find all kinds of cool tools to represent those data differently, to play with them, to turn them around. So that is another example of how data have become integrated into articles. The Biochemical Journal of
04:00
Portland Press was the first one to launch the Utopia Documents, but I've already seen announcements of more journals and more publishers who are adopting this. Elsevier, the big publisher, has also included all kinds of apps in their digital publications, and one example here
04:22
is the gene and protein viewers that you can use from within the article. While you're reading the article you can open up the gene and protein viewer, and it actually takes you to the data in the Worldwide Protein Data Bank or in GenBank. So the data doesn't have to be with the article;
04:41
it can be anywhere in the world in official data archives, but the viewer in the article will help you to immediately pull up that data, to play with it, to see it in different viewers, etc. So all of this looks really cool, and you'd say there's a lot of opportunity and hardly any problem.
05:03
Well, actually there is a bit of a problem, and that is the sheer volume of all that data. This is a graph that you've probably seen before. It comes from the Biochemical Journal, and it indicates with the red line how, over the years since 1965, the number of journals has grown
05:22
in percentage. Sorry, I said it wrong: the number of publications has grown. You see in the other colored lines how different types of data sets have grown, and you see that it depends a bit on what kind of data you talk about. Some of them started earlier, some of them started later, but all of them show a steep, steep rise after the turn of the century. There is so much available.
05:49
So what does that mean for journals? Some publishers are really struggling with what they call the data problem, and I'm picking out two examples, perhaps a bit anecdotal,
06:05
but I think they're very illustrative. Many journals in the past 10 years have started to include data in their journal supplements, because anything that couldn't be published in the article itself could now be made available online, etc. So that became
06:25
a very popular way. But, for example, the Journal of Neuroscience announced two years ago that they would stop accepting these supplements because they were drowning in them. There were simply too many. You see their volume expressed in megabytes: the volume of what
06:46
they publish is the red line, and you can see in the blue line how there was an explosion in the amount of material that authors submitted to them in supplements. At a certain point
07:01
they said: stop, we're not going to do it anymore, simply because it was too much of a burden on the peer review system and they couldn't guarantee the quality; they couldn't really look into it. So they said no more, with a few small exceptions like multimedia and things like that. The journal Cell had a similar problem, but they chose another way out:
07:26
they set limitations on the kind of supplementary material that authors can submit, because they felt they were turning more and more into a dumping ground for data, where authors, whatever they had at the end of their project, would
07:44
send it to the publisher and expect it to be on their website forever. I'll come back to that, but I'll jump to a little conclusion that I personally have: I think the reason why authors send it to publishers is that they have too few alternatives elsewhere.
08:02
If there were many, many more good data archives, they would send it there. I'll come to that, but I just wanted to give you a little preview of what I personally think about it. Okay, so the general message here is that publishers cannot guarantee the proper handling and
08:21
curation of that data, but of course they want to serve the authors, and that is where the friction occurs. STM was involved, together with a few more organizations here in the audience, in the EU project called PARSE.Insight, and as part of that
08:42
project we did a survey in 2009 where we asked researchers where they currently store their data. There were more than a thousand respondents, and it gave the following picture: a lot stays on the computer at work, but also on portable carriers, and, for example, also a
09:04
large percentage on computers at home, whereas digital archives, for example, get very little. I find that a bit worrying, especially if you look at the responses to the following question: where would you be willing to submit your research data? Then suddenly the digital
09:24
archives, which got such a low percentage in the previous slide, rate very high. Publishers rate lower, compared with what they get nowadays. So I think, coming back to that
09:41
little conclusion I already gave, there's a lot of demand in the research community for much better and many more data archives. Another EU project in which STM is involved, and on which I want to give you a preview of some of the outcomes, is project ODE, Opportunities for Data Exchange. CERN and
10:04
STFC are also in it; you'll hear presentations from them later. STM was in a work package together with the British Library, LIBER and the Deutsche Nationalbibliothek, because we were zooming in on how data and publications are being handled nowadays, with the objective to see what
10:25
kind of impact and incentives we could think of if those are combined better, especially if you integrate data sets and publications in a more useful way.
10:40
What we came to was, first of all, a definition: if we talk about data, what kind of data are we talking about? Because there's data, data and data. We put them in what we call the data publication pyramid, which has four layers. The lowest layer is really the raw data as generated
11:05
and produced in the research project. The next layer is probably the layer that Andrew Treloar just described: the data collections as you find them in data archives, like in Australia. A level further up is really the processed data and the data representations, so you make the step
11:24
from the data into, for example, graphs that show a condensed form of the data. And then on top of the pyramid are the data as you can find them in publications. Now why do we structure the pyramid like this?
11:42
Because it gives a nice way to show all the different ways in which data and publications are nowadays connected. To start, this time, at the top of the pyramid: of course publications always had data. It's not new that data and publications should
12:00
do something together. It's very hard to find a publication that does not have any data in it, but usually in a publication it is very condensed, very processed, very aggregated into perhaps one figure or one little table with the most noteworthy processed outcomes of the data.
12:21
But of course it is a very important way, because it's the most traditional way in which data has always been presented. The second layer is what you see happening nowadays in supplementary information to journal articles, and that is the overflow of all the material that did not fit into the article but that authors still want to present to the readers in one way
12:45
or another. So the second one is really journal or article supplements. If we go to the third layer, that is an increasing practice nowadays, and one that I think has a lot of promise
13:01
for the future: data that is held in data centers and repositories and is referred to from the articles. There really is no need to have the data and the publication in one place; they can be in different places. And especially from my own publishing background, I think that data centers are much more professional and have much more
13:25
expertise in curating that data well than publishers can ever have in the supplements. The fourth way was already described earlier this morning by Vishwas, and that is data publications
13:42
or data articles. That is really a new thing, although spreading quite fast now, and it also makes use of data in the repositories, but the articles are really about describing the data: their quality, the way they can be applied by others, etc. And it is interesting to see:
14:08
a few years ago Herbert Gruttemeier wrote an article together with me, and we made a little list of all the examples there were. The very first data journals were then appearing, but
14:21
since then I also get all kinds of examples sent to me of journals that were actually doing this already much longer and had data articles as a separate article type within the journal. So it's interesting to see that there's so much more enthusiasm for it. And then the
14:40
fifth form there is of course the most desolate form, and one that is a bit of a pity: all the data in drawers that remains on disks in the institute. That is a big percentage of the researchers, who say that they keep the data on their own computer or on a hard disk or whatever. So if this sort of structures the way, you know, if this
15:06
is a typification of the way things are being done now, then of course the next obvious question is: what is the situation, and how bad is it? Well, the pyramid's likely short-term reality
15:21
is this: that horrible fifth category is probably far too large now. If you look at different studies, the current estimate is that two-thirds or seventy percent of the data is never shared or never made available. Data archives
15:41
are springing up here and there and taking a lot of new initiatives, but many disciplines are without any. That comes back to my message before: I think a lot of researchers would like to see more, and more trustworthy, data archives.
16:02
Then we come to the area with the supplements. A lot of data ends up in journal supplements now, which is not always the best way to make them reusable and accessible. And the top of the pyramid is fairly stable: those are the data that always
16:20
got into publications and will continue to be there. So if this is the likely short-term reality, of course in our utopia we can also make the ideal pyramid, and I think it would look like this. You always have raw data that is perhaps not yet shared or not yet ready to be
16:42
shared, so you'll always have that layer underneath, but hopefully it would not be as big as in the previous pyramid. My hope is that data archives will get a much bigger share of this and that supplements will shrink. I really think that with supplements to journals, if we
17:00
look back 10 or 15 years from now, we will say: ah yes, that was a temporary thing, because people thought that it didn't fit on paper and we weren't yet sophisticated enough to really integrate the data with the articles. But of course, as the examples at the start of my talk showed, there's a lot happening there. So I would hope that data
17:24
in publications actually grows and that we will get much more sophisticated ways, like the Utopia Documents, like viewers into data that are elsewhere in data archives, etc. So this would be my ideal pyramid. And of course we as publishers then ask ourselves: how can publishers help to make
17:44
things better? Well, I listed a few obvious things that partly repeat what we heard in the other two talks before. It's about persistent identifiers, it's about bi-directional linking, it's about partnering with data archives, etc. It's about better integration of underlying data
18:05
in the articles. And I think I can also use this platform to announce that today DataCite and STM are actually issuing a statement together, right after my talk. It
18:20
builds on the Brussels Declaration that we had in 2007, where STM and some 50 or 60 publishers said that raw research data should be made freely available as much as we can. The statement that we announce today is, first of all, that we encourage authors of research papers to deposit
18:43
their research-validated data in trustworthy data archives; secondly, that we should have bi-directional linking in place between the data sets and the publications; thirdly, that there should be visibility of these links for people who start at the publication. Likewise, people who
19:02
start at the archive should have the links visible to show them the related articles. A fifth element of the statement is that we want to work on best practice recommendations for the citation of data sets, and the work that we're doing today, tomorrow and the day
19:22
after for CODATA is exactly on this. I think this is now very important, because personally I find data citation a bit messy right now. There are a lot of differences between disciplines. Also, some data archives give very good instructions
19:47
on how they want authors to cite the data in their repository, but others don't give any instructions. I very much like the paper that the Digital Curation Centre
20:01
issued last year about best practice guidelines for data citation, because at least it gives some how-to-do-it information. From the perspective of publishers, I think publishers are eager to do this right, but then let's agree with all the communities on how we actually want to do it, especially with the data archives: how do you want
20:24
it done? And then, last but not least, DataCite and STM invite other organizations to also co-sign this statement, so that we can help to disseminate it more and better.
20:42
That was my talk. I think I'm just in time, and perhaps you have questions.
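Editorial illustration, not part of the talk: two points of the statement described above, bi-directional linking between data sets and publications and a consistent way of citing data sets, can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `Dataset` and `Article` classes, the `link`, `is_bidirectional` and `cite` helpers, and the `10.1234/...` DOIs are hypothetical, and the citation pattern (creator, year, title, publisher, identifier) is only one commonly recommended layout, not a format the speaker prescribes.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """A data set held in a data archive (hypothetical model)."""
    doi: str
    creators: list[str]
    year: int
    title: str
    publisher: str  # the archive that holds the data set
    cited_by: set[str] = field(default_factory=set)  # DOIs of related articles


@dataclass
class Article:
    """A journal article (hypothetical model)."""
    doi: str
    title: str
    datasets: set[str] = field(default_factory=set)  # DOIs of underlying data sets


def link(article: Article, dataset: Dataset) -> None:
    # Record the link on both sides, so a reader starting at the article
    # and a reader starting at the archive can each see the other end.
    article.datasets.add(dataset.doi)
    dataset.cited_by.add(article.doi)


def is_bidirectional(article: Article, dataset: Dataset) -> bool:
    # True only if each side points at the other.
    return dataset.doi in article.datasets and article.doi in dataset.cited_by


def cite(dataset: Dataset) -> str:
    # Assumed pattern: Creator (Year): Title. Publisher. Identifier
    creators = "; ".join(dataset.creators)
    return (
        f"{creators} ({dataset.year}): {dataset.title}. "
        f"{dataset.publisher}. https://doi.org/{dataset.doi}"
    )


if __name__ == "__main__":
    data = Dataset("10.1234/example-data", ["Doe, J."], 2012,
                   "Example dataset", "Example Archive")
    paper = Article("10.1234/example-article", "Example article")
    link(paper, data)
    print(is_bidirectional(paper, data))
    print(cite(data))
```

Keeping the link on both records is a deliberate choice in this sketch: a one-sided reference from the article would satisfy citation, but not the visibility from the archive side that the statement asks for.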