[Molecular] Bioscience DEVL + RDC Projects #2 - July 18

Formal Metadata

Title
[Molecular] Bioscience DEVL + RDC Projects #2 - July 18
Title of Series
Number of Parts
19
Author
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
06.07.18 - Jeff Christiansen, Bioscience DEVL (Data enhanced Virtual Lab). In June & July's TechTalk events, representatives from each of the DEVLs will introduce their projects from a developers' perspective: the problem the project is trying to solve, the tech stacks deployed and to be developed, and their approaches to software development and community engagement while developing tools and applications. Demonstrations and discussions of the 8 projects are scheduled as follows: DEVL #1 June (1st): Astronomy (Robert Shen), Marine Science (Roger Proctor), Terrestrial Ecology (Gerhard Weis), and Climate (Clare Richards). DEVL #2 July (6th), we welcome speakers from: GeoScience (Carsten Friedrich), Characterisation (Lance Wilson), BioScience (Jeff Christiansen), and Culture and Communities (Sarah Nisbet, Alexis Tindall).
Transcript: English(auto-generated)
Um, okay, you can see my screen? Yes. Okay, great. So I'm going to talk about the bioscience DEVL and RDC projects. I should point out that this is molecular bioscience, not imaging bioscience or other types of bioscience. So I'm based at the University of Queensland. I work at QCIF
and the University of Queensland RCC, and I'm also at EMBL-ABR. There are quite a few development partners on these projects: QCIF, QFAB and RCC here in Queensland, Melbourne Bioinformatics in Victoria and the Centre for Comparative Genomics in Perth. And the projects are supported in one way or another by Bioplatforms Australia, the EMBL Australia Bioinformatics Resource, the
Atlas of Living Australia, and with funding from ARDC. So just to provide a little bit of context, I guess, molecular bioscience may be different to some other disciplines in that its birth as a data science has happened in a relatively short period. You've probably
all seen graphs like this. This shows, in blue, the number of sequences that are held in a large international repository in the States called GenBank. And just to provide, I guess, context in a timeline, that arrow is when I finished my PhD. So basically, we've gone from a
non-data science to a data science in a relatively short time. So that's all well and good, but how many life scientists are really taking full advantage of this and being part of what analysis of this data allows you to do? So this is just a graph showing estimates that have been undertaken by a working group
looking at an Australian bioscience data capability. And it classifies biologists into four broad groups. On the right, we have biology-focused bioscience researchers. So these are people that are working in the lab, you know, doing wet experiments at the
bench, and this group still forms the large majority of biologists. So those people might go and use a web service once a month or something like that, go to the NCBI and look up a sequence or do a BLAST, but they don't necessarily engage more than that. So this next group here is what we call data-
intensive bioscience researchers. And this is the group that is growing. So these are people who may have decided to do a gene sequencing experiment. They've sent their sample away to a sequencing facility, they've got the data, and now they need to know what to do with it. Then we have a couple of other groups here, which are
smaller. So the bioinformatics-intensive bioscience research group effectively use bioinformatics on a day-to-day basis in their research and probably don't really do many wet experiments. And then we have bioinformaticians here on the left, and this is the group
that are developing new algorithms. So what I'm really going to talk to you about today is that the audience we're focusing on for the DEVL and RDC projects are these groups at the right. So they're the groups that, you know, don't have it all sorted out and
they may or may not be, or they might be, easing into this data science world and need some help and tools to be able to do that. So the DEVL project builds on the previous Genomics Virtual Lab work. That was a Nectar-funded project. And what the GVL is essentially
is a server image. So this server image contains a number of standard tools. So Galaxy, and I'll talk quite a bit about that, but it also has RStudio, JupyterHub, command-line access, a virtual desktop, and some administrative tools on it. And then there's a number of optional
bioinformatics pipelines and analysis tools that can also be launched when a GVL virtual machine is launched. So the other part of it is a virtual machine that's running the server image and
it's been built so that it can run on an OpenStack cloud such as Nectar or on an EC2 cloud such as Amazon's. So that effectively is the GVL. So the option that previously existed, and still exists, for people to use this is to fire up and self-manage your own GVL instance, and there's a URL
here. But the steps for doing that, if one were to use the research cloud here in Australia, are effectively these: the user needs to access the Nectar dashboard, the user has to get their Nectar allocation (it's probably just worth mentioning that a
trial project is sufficient for launching a GVL), then the user has to obtain their cloud credentials, launch their personal GVL instance, access it, manage it and use it, and then shut it down. So that actually is still quite a technically challenging set of tasks for our audience. So the second option for
using a GVL has been to use a public managed GVL service. And until the beginning of this year, there were actually a few of these. So there was an RStudio service and three Galaxy services: one hosted by RCC at the University of Queensland, one
hosted by Melbourne Bioinformatics, and a training instance. So the DEVL project has a few broad aims. We're not going to talk about all of them, but the main ones are to rationalise and rearchitect the public managed GVL services. So that's effectively taking these four public services and
developing one single service, which is called Galaxy Australia. And that includes RStudio and JupyterHub. That's now public, the URL is there, it's usegalaxy.org.au. The proposed architecture that underlies this federated, I
guess, model is that there's a head node that resides here at the University of Queensland in the Research Computing Centre. And there's a SLURM queuing system that submits jobs to worker nodes. So that's pretty much what
the Galaxy instances were like previously. What's being done to, I guess, speed things up and make the service more efficient is separating off the database service. But then there's also this new, I guess, component of
Galaxy, and they're called interactive environments. And it's possible to just launch a single-use virtual machine for these interactive environments, such as RStudio or Jupyter. So they run on this Docker swarm. So the idea of that is that it's just a VM that's fired up for a
particular session, and then it's shut down again. So that's what's here at UQ RCC. But obviously, we need to think about how we can increase the computational resources that sit under a national service. So one of these is submitting jobs not just to worker nodes on
the Nectar cloud, but also submitting jobs to HPC. So Galaxy is pretty good in that it can submit jobs over a SLURM queue, a PBS queue and some other queuing methods. So one thing we're working on now is also submitting jobs to the HPC machines here at UQ. Now, we're also going to be
submitting jobs from the head node using a Condor queue to a Condor head node sitting at the University of Melbourne. And because we can submit jobs from the head node to either cloud or HPC via PBS, SLURM or Condor, we can also
submit jobs to other sites. And we're currently having some discussions with the University of Sydney, who are interested in supporting Galaxy Australia as well. The second main aim is to harmonise the look and feel with other global Galaxy services. So Galaxy Australia is not the only one. There are actually over 90 Galaxy
servers around the world. But the two other main ones are usegalaxy.eu, which is hosted in Freiburg, and usegalaxy.org, which is hosted in the US. Novojtek talked about a consistent user experience. So this project between the three Galaxies listed here is all about having a consistent user experience with
similar tools, with similar training material, with similar look and feel and layout, and similar reference data sets as well. I should say, there's a global Galaxy Tool Shed. This is like an app store for Galaxy. So when command-line
tools are wrapped and enabled for use in Galaxy, they can then be downloaded from the Galaxy Tool Shed. And we have a policy now on Galaxy Australia that all of the tools that are installed have to be installed from the Galaxy Tool Shed. And to get into the Galaxy Tool Shed, there's a core set of tools that have
undergone extensive QC. The third aim of the DEVL is to rationalise and expand our existing training efforts. So one of the environments I mentioned previously was Galaxy-tut, so that was just for training. It will be eventually
decommissioned over the next few months, and we'll use the Galaxy Australia service for all of the training. We have previously developed training material here in Australia, so that's being rationalised. And there's also a global Galaxy training material registry. So all
of the Australian material is going into that particular training registry as well. So again, the idea of all of this is that it should be possible to go to any of these global Galaxy resources and be able to use that material on our Australian instance. The project's also
establishing a national network of Galaxy trainers. So this is happening through the EMBL-ABR network. There's a two-day train-the-facilitator workshop that's happening in Melbourne. There's about 10 people going from around the country to be given the same training, and so they'll be able to go away and deliver Galaxy
training locally. And then we'll be undertaking at least three of what we call virtual physical national training events. So we have a lead trainer that's based in one location, and then simultaneously around the country we can hold training events in multiple places. So I'll
be holding three of those on different topics. So I also wanted to talk about the sister RDC project, so the Research Data Cloud, and this one is extending Galaxy Australia so that it can actually support other national infrastructures that require a bioinformatics
analysis functionality. And in this particular project, we're going to be supporting BPA's data portal. So BPA's data portal, it's used to store and share framework datasets during the period where the team are working on them. It's based
on a CKAN framework and it's accessible to consortium members. So it's primarily a data repository and storage and sharing mechanism. It doesn't have raw data, I should say. It doesn't have an analysis functionality. So what we are doing in this particular project is linking up the
two so that Galaxy Australia can perform that analysis functionality. So we're using it to support a group of researchers that are interested in metagenomics. So metagenomics is a methodology where you can get a sample, so it might
be something from the environment like soil or water, and then you can extract DNA from that mixed population. You sequence that DNA and then you kind of work backwards to identify what species were present in
that original sample. So to do that, we are installing a couple of tools onto Galaxy Australia. These are, I guess, commonly used in this kind of metagenomics analysis. They're called QIIME and mothur. So they're both being installed onto the system. Now, the
second part, I guess, of a metagenomics analysis is not just determining what microbes are present in one particular place, but doing statistical analyses across different places. So saying, okay, so what are the commonalities between site A and B and C? And there's a whole
plethora of types of analyses that people want to do. So see things on a map, do clustering, look at boxplots, so on and so forth, ordination. And so there are a number of R packages out there that do this. So two of
them are called phyloseq and Rea, so they're well known in the community. And what we're doing in the project is wrapping these so that they are available for use through the Galaxy graphical user interface. Then, obviously, QA-ing those wrapped tools, depositing them into the Tool Shed, and then installing them from the Tool Shed onto Galaxy Australia. There's also a
training component. So we're developing material for primary and secondary metagenomic analysis and depositing this material both into the global Galaxy training portal I talked about and into the EcoEd training portal, which is part of the EcoScience DEVL project that's
occurring, and then also delivering workshops across the EMBL-ABR nodes. So I guess for the intended outcomes, over time we would like to see that we're enabling people, I guess, to move from one of these groups to the next
one to the left. So over time we would expect that there'll be more biology-focused bioscience researchers moving into the group that are enabled to actually undertake analysis, and then that group moving,
or this group also moving across. So I should say that they're both pretty short projects, March to December 2018. Both are on track, going pretty well, and there's a lot of people involved, and they are listed here. Okay, thank
you.
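
The job-dispatch architecture described in the talk (a head node submitting jobs over SLURM, PBS or Condor queues to cloud worker nodes, HPC and other sites) can be sketched roughly as follows. This is a minimal illustration in Python, not Galaxy's actual job configuration. The destination names, the core limits, the pairing of PBS with the UQ HPC machines and the "smallest destination that fits" routing rule are all assumptions made for the example; the talk only names the scheduler types and the sites.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Destination:
    """A place the head node can send a job (all values are hypothetical)."""
    name: str
    scheduler: str  # "slurm", "pbs" or "condor"
    site: str
    max_cores: int


@dataclass(frozen=True)
class Job:
    tool: str
    cores: int


# Hypothetical destinations mirroring the sites mentioned in the talk.
DESTINATIONS = [
    Destination("uq_cloud_workers", "slurm", "UQ RCC (Nectar cloud)", 16),
    Destination("uq_hpc", "pbs", "UQ HPC", 64),
    Destination("melbourne_pool", "condor", "University of Melbourne", 32),
]


def route(job, destinations=DESTINATIONS):
    """Pick the smallest destination whose core limit fits the job."""
    candidates = [d for d in destinations if d.max_cores >= job.cores]
    if not candidates:
        raise ValueError(
            f"no destination can run {job.tool} with {job.cores} cores"
        )
    return min(candidates, key=lambda d: d.max_cores)


if __name__ == "__main__":
    for job in (Job("small_qc", 4), Job("mid_align", 24), Job("big_assembly", 48)):
        dest = route(job)
        print(f"{job.tool}: {dest.name} via {dest.scheduler} at {dest.site}")
```

The point of the sketch is only that the head node stays the single entry point while the choice of queue and site is a per-job decision, which is what allows further sites to be added later without changing what users see.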