[Molecular] Bioscience DEVL + RDC Projects #2 - July 18
Formal Metadata
Number of Parts | 19 | |
License | CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor. | |
Identifiers | 10.5446/42924 (DOI) | |
Tech Talk 15 / 19
Transcript: English(auto-generated)
00:00
Um, okay, you can see my screen? Yes. Okay, great. So I'm going to talk about the bioscience DEVL and RDC projects. I should point out that this is molecular bioscience, not imaging bioscience or other types of bioscience. So I'm based at the University of Queensland. I work at QCIF
00:21
and the University of Queensland RCC, and I'm also at EMBL-ABR. There's quite a few development partners on these projects. So QCIF and QFAB and RCC here in Queensland, Melbourne Bioinformatics in Victoria and the Centre for Comparative Genomics in Perth. And the projects are supported in one way or another by Bioplatforms Australia, the EMBL Australia Bioinformatics Resource, the
00:43
Atlas of Living Australia, and with funding from ARDC. So just to provide a little bit of context, I guess, molecular bioscience may be different to some other disciplines in that I guess its birth as a data science has happened in a relatively short period. You've probably
01:01
all seen graphs like this. This shows the, in blue, the number of sequences that are held in a large international repository in the States called GenBank. And just to provide, I guess, context in a timeline, that arrow is when I finished my PhD. So basically, we've gone from a
01:22
non-data science to a data science in a relatively short time. So that's all well and good, but how many life scientists are really taking full advantage of, and sort of being part of, what analysis of this data allows you to do? So this is just a graph showing estimates that have been undertaken from a working group
01:45
looking at an Australian Bioscience data capability. And it classifies biologists into four broad groups. And on the right, we have biology focused Bioscience researchers. So these are people that are working in the lab at the wet, you know, doing wet experiments at the
02:01
bench, and this still forms the large majority of biologists in this group. So those people might go and use a web service once a month or something like that, and go to the NCBI and look up a sequence or do a BLAST, but they really don't necessarily engage more than that. So this next group here is what we call data
02:22
intensive Bioscience researchers. And these are the group that is growing. So these are people who may have decided to do a gene sequencing experiment. They've sent their sample away to a sequencing facility, they've got the data, and now they need to know what to do with it. Then we have a couple of other groups here, which are
02:44
smaller. So the bioinformatics intensive Bioscience research group effectively use bioinformatics on a day to day basis in their research and probably don't really do many wet experiments. And then we have bioinformaticians here on the left, and this is the group
03:02
that are developing new algorithms. So what I'm really going to talk to you about today is that the audience we're focusing on for the DEVL and RDC projects are these groups at the right. So they're the groups that are not, you know, they don't have it all sorted out and
03:21
they may or may not, or they might be easing into this data science world and need some help and tools to be able to do that. So the DEVL project, it builds on the previous Genomics Virtual Lab work. So that was a NeCTAR-funded project. And what the GVL is essentially
03:42
is a server image. So this server image contains a number of standard tools. So Galaxy, I'll talk quite a bit about that, but it also has RStudio, JupyterHub, Command Line Access, Virtual Desktop, and some administrative tools on it. And then there's a number of optional
04:04
bioinformatics pipelines and analysis tools that can also be launched when a server is, a GVL virtual machine is launched. So the other part of it is a virtual machine that's running the server image and
04:21
it's been built so that it can run on an OpenStack cloud such as NeCTAR or on an EC2 cloud such as Amazon. So that effectively is the GVL. So the option that previously existed, or still exists, for people to use this is to fire up and self-manage your own GVL instance; there's a URL
04:43
here. But the steps for doing that, if one was to use the research cloud here in Australia, are effectively that the user needs to access the NeCTAR dashboard, the user has to get their NeCTAR allocation, it's probably just worth mentioning that a
05:01
trial project is sufficient for launching a GVL, then the user has to obtain their cloud credentials, launch their personal GVL instance, access it, manage it and use it and then shut it down. So that actually is still quite a technically challenging set of tasks for our audience. So the second option for
05:24
using a GVL has been to use a public managed GVL service. And until the beginning of this year, there were actually a few of these. So there was an RStudio service, and three Galaxy services. So one hosted by RCC at the University of Queensland, one
05:42
hosted by Melbourne Bioinformatics and a training instance. So the DEVL project has a few broad aims. We're not going to talk about all the aims, but the main ones are to rationalise and rearchitect the public managed GVL services. So that's effectively taking these four public services and
06:04
developing one single service, which is called Galaxy Australia. And that includes RStudio and JupyterHub. That's now public, the URL is there, it's usegalaxy.org.au. The proposed architecture that underlies this federated, I
06:24
guess, model is that there's a head node that resides here at the University of Queensland in the Research Computing Centre. And there's a SLURM queuing system that submits jobs to worker nodes. So that's pretty much what
06:41
the Galaxy instances were like previously. What's being done to, I guess, speed things up and make the service more efficient is separating off the database service. But then there's also this new, I guess, component of
07:02
Galaxy, and they're called interactive environments. And it's possible to just launch up a single-use virtual machine for a number of these interactive environments, such as RStudio or Jupyter. So they run on this Docker Swarm. So the idea of that is that it's just a VM that's fired up for a
07:21
particular session, and then it's shut down again. So that's what's here at UQ RCC. But obviously, we need to think about how we can increase the computational resources that sit under a national service. So one of these is submitting jobs not just to worker nodes on
07:43
the NeCTAR cloud, but also submitting jobs to HPC. So Galaxy is pretty good in that it can submit jobs over a SLURM queue, a PBS queue and some other queuing methods. So one thing we're working on now is also submitting jobs to the HPC machines here at UQ. Now, we're also going to be
08:04
submitting jobs from the head node using a Condor queue to a Condor head node sitting at the University of Melbourne. And because we can submit jobs from the head node to either cloud or HPC via PBS, SLURM or Condor, we can also
08:22
submit jobs to other sites. And we're currently having some discussions with the University of Sydney who are interested in supporting Galaxy Australia as well. The second main aim is to harmonise the look and feel with other global Galaxy services. So Galaxy Australia is not the only one. There's actually over 90 Galaxy
08:40
servers around the world. But the two other main ones are usegalaxy.eu, which is hosted in Freiburg, and usegalaxy.org, which is hosted in the US. Novojtek talked about a consistent user experience. So this project between the three Galaxies listed here is all about having a consistent user experience with
09:02
similar tools, with similar training material, with similar look and feel and layout and reference data sets as well. I should say, so there's a global Galaxy Tool Shed. This is like an app store for Galaxy. So when command line
09:21
tools are wrapped and enabled to be used in Galaxy, they can then be downloaded from the Galaxy Tool Shed. And we have a policy now on Galaxy Australia that all of the tools that are installed have to be installed from the Galaxy Tool Shed. And to get into the Galaxy Tool Shed, there's a core set of tools that have
09:41
undergone extensive QC. The third aim of the DEVL is to rationalise and expand our existing training efforts. So one of the environments I mentioned previously was Galaxy-tut, so that was just for training. It will be eventually
10:01
decommissioned over the next few months, and we'll use the Galaxy Australia service for all of the training. We have previously developed training material here in Australia, so that's being rationalised. And there's also a global Galaxy training material registry. So all
10:22
of the Australian material is going into that particular training registry as well. So again, the idea of all of this is that it should be possible to go to any of these global Galaxy resources and be able to use that material on our Australian instance. The project's also
10:42
establishing a national network of trainers in Galaxy. So this is happening through the EMBL-ABR network. There's a train-the-facilitator two-day workshop that's happening in Melbourne. There's about 10 people going from around the country to be given the same training, and so they'll be able to go away and deliver Galaxy
11:02
training locally. And then we'll be undertaking at least three, we call them virtual-physical, national training events. So we have a lead trainer that's based in one location and then simultaneously around the country, we can hold training events in multiple places. So we'll
11:23
be holding three of those on different topics. So I also wanted to talk about the sister RDC project, so the Research Data Cloud, and this one is extending Galaxy Australia so that it actually can support other national infrastructures that require a bioinformatics
11:40
analysis functionality. And in this particular project, we're going to be supporting BPA's data portal. So BPA's data portal, it's used to store and share framework datasets during the period where the team are working on them. It's based
12:00
on a CKAN framework and it's accessible to consortium members. So it's primarily a data repository and storage and sharing mechanism. It doesn't have raw data, I should say. It doesn't have an analysis functionality. So what we are doing in this particular project is linking up the
12:20
two so that Galaxy Australia can perform that analysis functionality. So we're using it to support a group of researchers that are interested in metagenomics. So metagenomics is a methodology where you can get a sample, so it might
12:42
be something from the environment like soil or water, and then you can extract DNA from that mixed population. You sequence that DNA and then you kind of work backwards to identify what species were present in
13:01
that original sample. So to do that, we are installing a couple of tools onto Galaxy Australia. These are kind of, I guess, commonly used in this metagenomics analysis. These are called QIIME and mothur. So they're both being installed onto the system. Now, the
13:24
second part, I guess, of a metagenomics analysis is not just determining what microbes are present in one particular place, but doing statistical analyses across different places. So saying, okay, so what are the commonalities between site A and B and C? And there's a whole
13:42
plethora of types of analyses that people want to do. So see things on a map, do clustering, look at boxplots, so on and so forth, ordination. And so there are a number of R packages out there that do this. So two of
14:01
them are called phyloseq and Rhea, so they're well known in the community. And what we're doing in the project is wrapping these so that these are available for use through the Galaxy graphical user interface. Obviously, QA-ing those wrapped tools, depositing them into the Tool Shed, and then installing them from the Tool Shed onto Galaxy Australia. There's also a
14:23
training component. So we're developing material for primary and secondary metagenomic analysis and depositing this material both into the global Galaxy training portal I talked about, but also into the EcoEd training portal, which is part of the ecoscience DEVL project that's
14:44
occurring, and then also delivering workshops across the EMBL-ABR nodes. So I guess the intended outcomes over time, we would like to see that we're enabling people, I guess, to move from one of these groups to the next
15:00
one to the left. So over time, we would expect that there'll be more biology focused bioscience researchers that are moving into the group that are enabled to actually undertake analysis, and then the group moving,
15:22
or this group also moving across. So I should say that they're both pretty short projects. So March to December 2018, both are on track, going pretty well, and there's a lot of people involved, and they are listed here. Okay, thank
15:41
you.
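
Editor's sketch: the job submission paths described between roughly 07:43 and 08:22 (head node to cloud worker nodes over SLURM, to HPC, and to a Condor head node in Melbourne) can be illustrated with a small routing function. This is a minimal sketch and not Galaxy Australia's actual configuration. The destination names, the resource thresholds, the `cloud_busy` flag, and the pairing of PBS with the UQ HPC machines are all assumptions made for illustration; the talk only states that SLURM, PBS and Condor queues are available.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Destination:
    name: str       # hypothetical label, not a real Galaxy destination id
    scheduler: str  # "slurm", "pbs" or "condor"
    site: str       # where the jobs would run


# Hypothetical destinations mirroring the three paths mentioned in the talk.
DESTINATIONS = {
    "cloud_workers": Destination("cloud_workers", "slurm", "NeCTAR cloud worker nodes"),
    # Assumption: PBS is used for the UQ HPC machines; the talk does not say which queue.
    "uq_hpc": Destination("uq_hpc", "pbs", "UQ HPC"),
    "melbourne_pool": Destination("melbourne_pool", "condor", "University of Melbourne"),
}


def choose_destination(cores: int, memory_gb: int, cloud_busy: bool = False) -> Destination:
    """Pick where the head node would send a job.

    The thresholds (16 cores, 64 GB) are illustrative assumptions only.
    """
    if cores > 16 or memory_gb > 64:
        # Large jobs go to HPC rather than cloud worker nodes.
        return DESTINATIONS["uq_hpc"]
    if cloud_busy:
        # Overflow small jobs to the remote Condor pool.
        return DESTINATIONS["melbourne_pool"]
    return DESTINATIONS["cloud_workers"]


if __name__ == "__main__":
    for cores, mem, busy in [(2, 8, False), (32, 128, False), (4, 16, True)]:
        dest = choose_destination(cores, mem, busy)
        print(f"{cores} cores, {mem} GB -> {dest.name} via {dest.scheduler} at {dest.site}")
```

The point of the sketch is that one head node keeps a table of destinations and decides per job, so supporting a new site amounts to adding another destination entry.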