E-Biogenouest: a Regional Life Sciences Initiative for Data Integration
Formal Metadata
Number of Parts | 24
License | CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers | 10.5446/15292 (DOI)
Production Year | 2014
Production Place | Nancy, France
Transcript: English (auto-generated)
00:00
First I would like to thank all the organizers for inviting me to this session. I'm definitely not an information specialist, so I am learning quite a lot, and I will come back to my lab with many good ideas, I guess. So we have heard during the previous talks about national
00:26
and international initiatives, and in my talk I will explain what happens at a more local level, at a regional level. So I will talk first about our network
00:43
of core facilities, about what's happening in biology, and after that I will talk about our project called e-Biogenouest, whose aim is to bridge data, metadata and computation. So let's go. Biogenouest, as we pronounce it,
01:04
is a network of core facilities in two regions, Brittany and Pays de la Loire. This network was created quite a few years ago now, and it coordinates 31
01:23
technological core facilities in the two regions, Brittany and Pays de la Loire. It's funny to see that Pays de la Loire cannot be translated into English but Brittany can be. The aim of this network is to federate and organize the actions of the
01:42
different institutes, because it also federates more than 70 research units, and that for many scientific topics, mostly marine science, agri-food, health and bioinformatics. So GenOuest, you will
02:05
hear many "ouest" things, the Far West syndrome, I guess. GenOuest is a bioinformatics core facility. We're like a bunch of Boy Scouts. We like to collect badges and
02:20
labels. So we're obviously a member of the Biogenouest network. We're also a member of the IFB, the French Institute of Bioinformatics. We're also recognized at the national level by IBiSA. We're also recognized by our colleagues from INRA as a regional strategic facility, and we're also ISO
02:48
certified. As a platform, as a core facility, we provide the typical things that any core facility should provide, such as computing infrastructure,
03:04
storage, software development and expertise, and we are also carrying out many different projects around the different challenges that we are facing right now. Among our projects, we have many different focuses. Among them,
03:33
we find computation, workflows, portals, collaboration, data, metadata, and ontologies. For example, in computation, we are trying to find the
03:44
best way to bring computational power to solve the problems that biology is encountering. So we worked on grid computing. We are now working
04:03
also on cloud computing. We are operating one of the two academic clouds dedicated to bioinformatics. We are also heavily involved in the use of portals such as Mobyle or Galaxy, mentioned earlier. And
04:27
with all these projects, we were able to launch what is called e-Biogenouest. It's an e-science initiative for the life sciences in the
04:41
West of France. So the context is the following. If you look at the cost of sequencing, it has been falling since 2008. And this is a great thing for biology because we
05:07
scientists are able to address problems that could not be addressed before. But it has a very strong implication, because the decrease of the cost allows this community to launch more and more sequencing programs. And
05:29
it's very common to meet a biologist who comes with his hard drive and says, okay, I had some remaining credits and I was able to sequence my
05:42
organism, or many of my organisms. What can I do now? The problem is here. They are able to generate massive amounts of data, but very often the infrastructure is not there. The IT infrastructure is not there. They have
06:01
not enough people to analyze the data. And typically this is a real bottleneck. And the situation can be worsened by the fact that the speed at which the data are generated is now outpacing the traditional Moore's law that
06:28
describes the progress in computational power. So in this case you can see the two curves describing the speed of the microprocessor and the storage capacity. And you can see clearly that the curve of the
06:46
sequencing capabilities is completely crushing everything. And in this case, we are just focusing on the genomic field, but there will be another wave
07:04
of data, such as for proteomics and for bioimaging, that will also cause very dramatic changes in the way biologists are doing their research. Typically,
07:21
we can consider that biology has now turned into a digital science, generating huge amounts of data, heterogeneous data. And this situation can be critical for some laboratories because this evolution was not clearly anticipated. It appeared very recently: the first next-generation sequencing
07:44
machines appeared in 2005. So in less than 10 years, the discipline has had to restructure its whole organization. It was focused mostly on the generation of data. Now they must reorganize themselves to be able to analyze the huge amounts of
08:05
data they are generating. And that's where we come in. Our project was started in May 2012, and it's funded for three years. It's quite a modest
08:26
funding, because we have one person dedicated to this project, funded by the two regions. And what we're trying to do is some community building around the challenges that biology will have to face. So we're
08:47
providing many training workshops on this issue. And we're also preparing a roadmap to help the whole network build a coherent infrastructure to deal
09:00
with this amount of data. And in this project, we're also building a pilot project that can become our prototype virtual research environment. How have we built this system? Just by building a
09:23
system of systems. Because the resources were quite limited, we chose to aggregate different tools to be able to cope with the data integration and analysis problem. So we combined different tools such as
09:43
Galaxy, which was mentioned earlier. We also used ISA tools, the ISA tools suite. And also a collaborative portal, HUBzero. Of course, we had to use additional software to make the whole thing work, using some APIs,
10:07
doing some developments, or using some dedicated tools. Galaxy is a web portal for biomedical data analysis, and it's widely used now. There's a
10:22
tremendous success of this tool, mostly because of its intuitive interface, and also because it allows the user to build workflows, and these workflows can be exchanged between users. So there's a very strong
10:42
momentum around this tool. And we obviously also have an instance of Galaxy at GenOuest. In it, we have integrated 800 tools covering different topics of biological data analysis to help
11:08
our users. We also use the ISA tools suite. It's a tool dedicated to the management of experimental metadata. It was created by people at
11:27
Oxford, and it allows people to describe their experiments from a technical point of view. It can be considered as an electronic lab notebook,
11:43
for example. When the description of the experiment is done, there's a strong encouragement through the interface to use ontologies, to have a better vocabulary. And it will also help people to create local
12:06
repositories where they can centralize all their data and share their experiments. And of course, it also allows publication to public
12:21
repositories. It's one of the main advantages. There's some cost in taking the time to describe the experiment, and the reward is that the publication steps will be easier after that. And we
12:41
also at GenOuest have used the ISA tools suite, and we have adapted this suite to our needs. We also made some additional developments for a better adaptation. So what's happening exactly? The wet lab experiment generates data, such as metagenomics data,
13:07
and it can be described with metadata. With ISA tools, you will have the ISA-Tab files that can link to the raw data. And with ISA tools,
13:21
you are also able to create an archive, the ISA archive. For this archive, we developed different tools to import the ISA archive into the Galaxy environment. In this case, the data are imported,
13:44
and the Galaxy environment is now aware of the location of the raw data. And in this case, people are able, in a very simple way, to do their analysis almost
14:02
immediately. So once they have described the experiment, they explain where the raw data are located, and then the next step, the Galaxy import, allows them to
14:21
launch their own analysis. The next tool is HUBzero. It's a scientific web portal that is widely used in the nanoscience community, for example. It provides many
14:44
functionalities for collaboration, such as wikis and blogs, and everything is built around resources, such as results, articles, presentations. There are also some modules to do some lightweight project management. And by combining these three tools,
15:09
we are able to provide a single integrated environment for the management and the analysis of biological data, together with collaborative resources. That creates something that could be
15:27
called a virtual research environment, with the web portal. Everything is built upon a computing infrastructure, hosting, of course, a data infrastructure. There are the functionalities provided by HUBzero that help the scientists to
15:48
deal with the different steps of the life science project, of the project life cycle: project management, collaboration, dissemination. And so with the data,
16:06
the scientist is able to launch his workflows very easily to analyze the data. So what we're trying to do here, and it's exactly what's happening everywhere, I guess,
16:21
is that we're changing the paradigm, because formerly biologists came with their data, and we had to say to them, okay, put it there on the Linux box, and you will find some software, and try to parallelize your computation, learn to use a grid manager and everything.
16:50
So now, since the data are becoming larger and larger, and the size of the virtual machine,
17:01
as said earlier, is, not becoming smaller, but rather small compared to the size of the data, we're now plugging the IT tools, the IT environment, onto the data. And it's a very important step in my mind, because on the first
17:24
side, the older way of working, we were dispossessing biologists of their own data. When someone has to find the right software in a huge list, when someone has to deal with
17:46
file formats, when someone has to deal with a new environment, when someone has to deal with the parallelization of some tools, he can easily lose sight of the real purpose of his work. The biologist is here to answer
18:08
biological questions, not to waste his time doing some fancy computer science stuff. And in this case, I guess we could have a solution, a viable solution, to help biologists
18:23
to regain their data. So the next step, what we learned: when you propose a new tool, some new methodologies, there are always acceptance and adoption issues.
18:46
It's incredible. Sometimes people say, okay, that's fine, I understand the benefits, but it will interfere too much with my way of working. So it's a rejection.
19:03
It can happen very often. It's better to deal with people who start new projects with no established rules. In this case, they will adopt the tools more easily. For the moment, on the hub, we have more than 150 people trying it, and we're hosting 50 projects.
19:35
The benefits are more obvious when the tools are used by different communities,
19:44
for example, biologists and computer scientists. And on this hub, they find the perfect tool to exchange very easily. So what will we do next? We obviously have to switch to a production
20:00
environment, because it's only a prototype for the moment. We will also investigate identity federation to cope with the multiple organizational affiliations of the people using our tools. And we're also working on the metadata for
20:26
bioinformatics workflows. A workflow can itself be a protocol, and we would like to capture and collect information about this type of in silico experiment. So what we need to do:
20:47
we have to connect to other initiatives, because we started at a very local level, and when I see and hear about all the very nice projects during this conference, I know that
21:02
I have to reach out to other people. What we also need to do is to define the perimeter. For a bioinformatics facility, it's not an easy task. We are focusing mostly on computation and high-performance computing, and we must add some extra functionalities
21:29
such as data management. And that's not what we do best, obviously. So as a conclusion, biology has become a digital science. It's a new event. It happened very rapidly,
21:48
and the changes are very profound, and the practices must also evolve very rapidly. But for the moment, it could be a dangerous situation. If there were many
22:03
bioinformaticians in France, it would not be so dangerous. But for the moment, maybe there are not enough people. So we tried to build a system of systems, using recognized tools to build a better tool, and we tried to create a continuum
22:28
of data to bring back biology to the biologists.
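The ISA-to-Galaxy step described in the talk (ISA-Tab files that link to raw data, so that Galaxy becomes aware of where the raw data are located) can be illustrated with a minimal Python sketch. This is not the GenOuest import tool, which the talk does not detail. The table contents, the sample and file names, and the function name `raw_data_locations` are assumptions made for illustration; the only thing taken from the ISA-Tab format is that an assay table is tab-delimited and carries a `Raw Data File` column.

```python
import csv
import io


def raw_data_locations(assay_table, column="Raw Data File"):
    """Return the raw data paths listed in a tab-delimited assay table.

    `assay_table` is any iterable of text lines (an open file or StringIO).
    Paths are returned in file order, without duplicates.
    """
    reader = csv.DictReader(assay_table, delimiter="\t")
    paths = []
    for row in reader:
        # Missing or empty cells are skipped rather than treated as errors.
        path = (row.get(column) or "").strip()
        if path and path not in paths:
            paths.append(path)
    return paths


# Hypothetical assay table, inlined so the example runs on its own.
EXAMPLE_ASSAY = (
    "Sample Name\tAssay Name\tRaw Data File\n"
    "sample_1\tassay_1\traw/sample_1.fastq\n"
    "sample_2\tassay_2\traw/sample_2.fastq\n"
    "sample_2\tassay_2b\traw/sample_2.fastq\n"
)

if __name__ == "__main__":
    for path in raw_data_locations(io.StringIO(EXAMPLE_ASSAY)):
        print(path)
```

In a real import step, each listed path would presumably be registered with the analysis environment instead of being printed; how that registration is done at GenOuest is not stated in the talk.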