
Revolutionizing the Journal through Big Data Computational Research


Formal Metadata

Title: Revolutionizing the Journal through Big Data Computational Research
Number of Parts: 24
License: CC Attribution 3.0 Germany. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Production Year: 2014
Production Place: Nancy, France

Transcript: English (auto-generated)
Yes, so I am in charge of a group of open data journals which are largely bioinformatics and genomics journals. I also push forward open data policy and our relationship with data repositories at
BioMed Central. Just to tell you a bit about us, we are an open access only publisher. We were founded in 2000. We have been using the Creative Commons license since 2002.
We publish over 260 open access journals. Those are largely genomics, bioinformatics, as well as public health and global health journals. All of our articles are licensed under CC BY, like I said, and since around this time last
year we now license our content with the CC0 waiver as well, with an understanding that the content we are publishing is also going to be mined by machines, and so it needs
to be as reusable by them as possible. So just to tell you a little bit about what we are doing in terms of data reuse at BioMed Central, for all of our journals we strongly encourage data deposition. If a reader asks the author, we require that the author share his or her data, as
well as if a reviewer asks. For a select number of journals, mainly our genomics journals, we actually require data deposition, and we are currently looking at mandating it for all of our journals. So like I said, we use the combined CC BY and CC0 license.
We also include an 'Availability of supporting data' section so that a reader can easily find the data or tools behind the paper. We also encourage the use of a metadata standard called ISA-Tab, which I'll talk
more about later. In the works, we are currently working on providing interactive tabular data in our articles, DOIs for all additional files, and better searchability of our additional files,
and we are working with the FORCE11 Data Citation Implementation Group to tag data citations in the JATS XML, which I think Nigel will be happy about.
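To make that concrete, here is a minimal sketch in Python of how tagged data citations could be pulled out of a JATS reference list. The element names (publication-type="data", data-title, pub-id with pub-id-type="doi") follow my reading of the JATS4R/FORCE11 conventions and, like the placeholder DOI, should be treated as assumptions rather than the group's final specification.

```python
# Minimal sketch: extract data citations from a JATS <ref-list> fragment.
# Element names and the DOI below are illustrative assumptions.
import xml.etree.ElementTree as ET

jats_fragment = """
<ref-list>
  <ref id="bib12">
    <element-citation publication-type="data">
      <data-title>Supporting data for the assembly pipeline</data-title>
      <source>GigaDB</source>
      <pub-id pub-id-type="doi">10.1234/example.dataset</pub-id>
    </element-citation>
  </ref>
</ref-list>
"""

root = ET.fromstring(jats_fragment)
for citation in root.iter("element-citation"):
    # Only references explicitly tagged as data citations are of interest here.
    if citation.get("publication-type") == "data":
        title = citation.findtext("data-title")
        doi = citation.findtext("pub-id[@pub-id-type='doi']")
        print(f"Data citation: {title} -> https://doi.org/{doi}")
```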
So I'm mainly going to focus on our journal GigaScience. This is a life science journal. It is run in partnership with the BGI, the Beijing Genomics Institute. Unlike any other journal, this journal has a cloud platform connected to the publication
platform. So for every single paper, authors submit their data set. They also submit the tools behind their study. Everything receives a DOI and is archived in the GigaDB database. So for example, this is just the submission workflow for how an author would upload
his or her data. It's a big data journal, so normally we're talking about anywhere from 10 gigabytes to terabytes of data.
So once the author submits to GigaScience, our data curator and our data scientists will work with him or her to create a metadata file. For a recent paper, our data scientists actually worked with an author to create a virtual
machine to make all of the tools and data much more reusable. So another thing we do, like I said, is we have an 'Availability of supporting data' section. Here we cite the data, which then links to the
reference list and links out to the GigaDB repository. So I'm just going to talk about one case study. So GigaScience is really kind of our experimental journal.
Like I said, it's in partnership with BGI. So in addition to myself, we have a data scientist and a bioinformatician as well as a data curator working on the journal. And so this article is essentially a software paper on the SOAPdenovo software,
which is a widely used genome assembler. So we wanted to see what we would have to do to make this paper essentially
reproducible with a few clicks of a button. So to do that, first we have to collect the actual metadata behind the studies. So we worked with people from the ISA-Tab organization, or BioSharing,
which is out of Oxford. You might also know them because they work with Scientific Data to create ISA-Tab files for their data articles. So when I say metadata file, I'm not talking about bibliographic data.
I never know how much people know, so I'm just going to give you all the details. So I'm not talking about a bibliographic file. I'm talking about essentially pages and pages of tabular files so that you would be
able to essentially recreate the experiment if you wanted to. So it's structured in terms of three sections. So first is the investigation. This gives the investigator, the funder, the protocols, very high level information.
Next, every investigation can have several studies. The study includes information on the subject under study and the characteristics of the treatments applied. Every study will also have several assays. And for a software paper like this, of course, it's going to have lots of assays.
It's going to need to be tested in many ways. So the assay is the test performed and the quantitative or qualitative measures from the experiment. So this is essentially the ISA-Tab structure.
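As a rough illustration of that investigation/study/assay hierarchy, here is a small Python sketch. It models the structure described above with plain dataclasses; it is not the ISA-Tab tabular file format itself, and all field names and example values are assumptions made for illustration.

```python
# Illustrative sketch of the investigation -> study -> assay hierarchy.
# Not the ISA-Tab file format; field names and values are assumptions.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Assay:
    test_performed: str        # the test that was run
    measures: List[str]        # quantitative or qualitative measures

@dataclass
class Study:
    subject: str               # the subject under study
    treatments: List[str]      # characteristics of the treatments applied
    assays: List[Assay] = field(default_factory=list)

@dataclass
class Investigation:
    investigator: str          # very high-level information
    funder: str
    protocols: List[str]
    studies: List[Study] = field(default_factory=list)

# A software paper typically needs many assays, one per test performed.
benchmark = Study(
    subject="SOAPdenovo assembler",
    treatments=["default parameters", "large k-mer size"],
    assays=[Assay("de novo assembly of a test genome", ["N50", "runtime"])],
)
investigation = Investigation(
    investigator="(investigator)",
    funder="(funder)",
    protocols=["assembly protocol"],
    studies=[benchmark],
)
```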
And with this structure, there's a sort of spectrum in the level of detail. So this is something, for example, we could think about incorporating into the peer review system. It looks like it would be a lot of questions, but you could do it at
a level where it would still be doable. However, that's not going to provide enough information to make most studies reproducible. And so we did this for our SOAPdenovo paper.
It took us about two months to do, working with people from BioSharing, who created this ISA-Tab structure, and our three data scientists, essentially. So at the end of these two months, we were able to essentially press a few buttons and recreate most of the experiments.
So that's something we learned: you can do this. It does, however, cost a lot of money. Given the time it took us to do it, there's no way we could ever really work that into a publication workflow, because an APC could never cover it.
But we did learn that there's a huge amount you learn when capturing the metadata behind a study. So this needs to happen before publication.
So now I'm going to talk about a similar project we are doing. So GigaScience, like I said, is our big data journal. We host everything in the cloud and it's highly, highly computational.
So in thinking about computational research, it should technically be easier to replicate. However, it's not, because of the activation energy it actually takes to port
the software, to rerun everything, to make sure the data is up to scratch. So we've been thinking about what we can do as data becomes bigger and bigger in some sciences, and it's not really just genomics and the neurosciences.
Imaging data is quite big, but also the Human Brain Project is a highly neuroinformatics-based project, and the data coming out of the Human Brain Project are going to be in the yottabytes.
So as big data grows, obviously universities are not going to be able to house this hardware themselves. So we're looking more and more towards the cloud. So with that in mind, is there a way we could lower this activation energy to make research more reproducible?
So we have been talking to several partners. There's also a recognition that there are already a lot of dynamic tools out there to make things more reproducible. For example, with GigaScience, we see a lot of people submitting IPython notebooks,
as well as knitr documents, and a lot of people using things like Galaxy. These are very easy-to-use tools that make all of the research objects behind a study at least more examinable
than if one were just to submit a manuscript. So there are tools available, there are partners. We've been thinking, can we put all of these together to lower that activation energy?
So this is a clip from our storyboard. So say you came across an article in a journal, and you saw a figure and thought, actually, I would like to learn a little bit
more about that, or, if I tweaked the parameters a bit, would I get a different result? You can click on the pipeline, it would be dynamic. You could look at the data, and you could then sign on to a cloud if you wanted to and rerun it a bit.
We've talked to several commercial cloud providers who would be keen to let readers rerun things to a certain extent for free, as a kind of freemium model. And then if you wanted to reuse it, you simply copy that instance with a click of a button. So you can easily see a future of sort of a virtual
machine infrastructure, assigning DOIs to virtual machines and then archiving them in repositories.
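As a sketch of what assigning DOIs to virtual machines might involve, the snippet below bundles a VM snapshot identifier with descriptive metadata and hands it to a registration step. The register_doi helper, the field names and the identifiers are hypothetical stand-ins, not a real registration agency API.

```python
# Hypothetical sketch: archive a VM snapshot record and mint an identifier.
# register_doi() stands in for a real registration client; it is not a real API.
import hashlib
import json

def register_doi(metadata: dict) -> str:
    """Stand-in for a DOI registration call; returns a fake identifier."""
    digest = hashlib.sha1(json.dumps(metadata, sort_keys=True).encode()).hexdigest()[:8]
    return f"10.1234/vm.{digest}"  # placeholder prefix, not a registered DOI

vm_record = {
    "snapshot_id": "vm-snapshot-0001",                  # the cloud provider's image ID
    "title": "Virtual machine reproducing the assembly pipeline",
    "related_article_doi": "10.1234/example.article",   # placeholder
    "external_datasets": ["10.1234/example.dataset"],   # data pulled in, not bundled
    "software": ["SOAPdenovo", "IPython notebook"],
}

doi = register_doi(vm_record)
print(f"Archived {vm_record['snapshot_id']} as {doi}")
```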
So in thinking about this, we think that publishers, commercial cloud providers and researchers have a very complementary role. I think publishers, one, they hold a lot of this content, and we need to start thinking about how we can make it more than an advertisement, as Graham still says.
But also, publishers typically do an OK job at enforcing community standards. So I think with data deposition, that's going to have to come from publishers, as it did with the genomics community. Academic databases can provide a sort
of credible long-term archiving, as well as working on curation and metadata standards. Commercial cloud providers then, well, there are good computing infrastructures, but
the commercial cloud provider has this way of sort of democratizing it: all you need is some money and you can quickly fire up an instance in the cloud. So if we're moving towards reproducibility through things like virtual machines, what specific challenges does this pose for data?
There's the question of to what extent data should be included in that virtual machine or pulled in externally. How can we avoid the cost of moving data around?
To what extent are cross-domain standards for referring to and pulling in underlying data sets feasible? And multiple versions of data sets: to what extent is it practical, when dealing with evolving data sets, to make them available as reproducible snapshots?
But mostly the culture of data sharing, how to get authors to want to share all of their work so that it can be reused. Conclusions. So with big data and computational tools, one thing we're seeing at GigaScience is that research can be reproducible, it can be reusable.
The infrastructure is out there and we just need to do a better job of using it. What authors need to communicate their research is changing and publishers need to start responding to that.
And I think they also need to step up and play a better role in setting community standards or enforcing them. Finally, I think we're getting to a point where publishing is actually exciting. I'm very happy that we're moving away from the dead print model.
And that's me. I like the idea that it's taken a couple of hundred years but publishing is only now just getting exciting. I have a pile of questions but let me see if people on the floor have questions that they would like to start with.
Jonathan. Hi, Jonathan Tedds, University of Leicester. Thanks for touching on many critical issues for many researchers I
know and work with in terms of their use of computationally intensive resources, data and so on. This isn't so much a question as a suggestion, something I'm beginning to see from some researchers that I work with. Which is that if you can encourage researchers to share data, perhaps hosted in some kind of cloud resource,
that ultimately becomes the only way people can easily interact with that data, reproduce experiments and so forth. In other words, what people have to do is move the computation to the data rather than
the other way around, instead of everyone trying to download, re-host and mirror it in various different places.
I'm not quite sure how that's going to work in practice yet, but that seems to be the way that people are moving. Do you have a comment on that? Yeah, I completely agree. I mean, it's not only costly to move data, it's very time-consuming. It takes days for people to upload things to GigaDB sometimes, let alone the frustration of doing it.
But also in talking to the different commercial cloud providers, that's just very clear. There's no reason to be wasting money moving data around. We just need to figure out how to move the computation.
And just to comment on that, I think the critical thing has been that the size of the data sets has crossed over with the size of the VMs. The VM images, particularly if you're talking about deltas on top of standard VMs, are now smaller than the data. So it absolutely makes sense, just in efficiency terms, to be moving the VMs around.
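A quick back-of-envelope sketch of that crossover, in Python; the dataset size, VM delta size and bandwidth are illustrative assumptions, not measurements.

```python
# Rough comparison: moving the data versus moving a VM delta.
# All numbers below are assumptions for illustration only.
def transfer_hours(size_gb: float, bandwidth_mbps: float) -> float:
    """Hours needed to move size_gb over a link of bandwidth_mbps (megabits/s)."""
    seconds = (size_gb * 8 * 1000) / bandwidth_mbps
    return seconds / 3600

dataset_gb = 2_000   # a modest "big data" submission, around 2 TB
vm_delta_gb = 5      # a delta on top of a standard base image
link_mbps = 500      # assumed sustained bandwidth

print(f"Dataset:  {transfer_hours(dataset_gb, link_mbps):.1f} hours")
print(f"VM delta: {transfer_hours(vm_delta_gb, link_mbps):.2f} hours")
```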
Other questions? So I have a question about your comment on the role of public and academic databases providing long-term archiving. And yet I note that with GigaScience, the publisher is actually holding the data.
Can you kind of comment on that? Yes, okay. So GigaScience is a special case. And one reason that is because the database actually was started before the journal.
So the BGI hosts the database, and the BGI are a research institute in themselves, so all of their data was already going there. And so when we started talking about partnering with them, we thought that even where there were subject-specific repositories,
so for example we have a genomics journal called Genome Biology, and even though we have in-house editors checking and linking the data, it's still
often not perfect, and they miss things. And we, as a publisher, put an incredible amount of money into that journal. So if people who are paid to do that all day long mess it up, then academic editors are going to mess it up. But anyway, with GigaScience the idea was, how can we put everything in one place,
so that when someone's looking for the tools, or someone's looking for the data, it's all right there. But also, in some cases the subject repositories will only take a specific type of data. So with GigaDB, you can host everything.
So if it's a new species, for example, you can put your sequence data there, but you can also put, say, some cool fMRI data there. The idea isn't to replace the subject repositories, so if someone comes with sequence data, we tell them to put it in GenBank.
You cannot publish with us unless you put it in the specific subject repository.