[Molecular] Bioscience DEVL + RDC Projects #2 - July 18 - TIB AV-Portal

Related Material

Australian Research Data Commons (ARDC)

Christiansen, Jeff

Formal Metadata

Title

[Molecular] Bioscience DEVL + RDC Projects #2 - July 18

Title of Series

Number of Parts

19

Author

Christiansen, Jeff

License

CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Identifiers

10.5446/42924 (DOI)

Publisher

Australian Research Data Commons (ARDC)

Release Date

Language

Content Metadata

Subject Area

Computer Science

Genre

Webinar/Tutorial

Abstract

06.07.18 - Jeff Christiansen, Bioscience DEVL (Data Enhanced Virtual Lab). In June and July's TechTalk events, representatives from each of the DEVLs introduce their projects from a developer's perspective: the problem the project is trying to solve, the tech stacks deployed and to be developed, and their approaches to software development and community engagement while developing tools and applications. Demonstrations and discussions of the 8 projects are scheduled as follows. DEVL #1, June (1st): Astronomy (Robert Shen), Marine Science (Roger Proctor), Terrestrial Ecology (Gerhard Weis), and Climate (Clare Richards). DEVL #2, July (6th): GeoScience (Carsten Friedrich), Characterisation (Lance Wilson), BioScience (Jeff Christiansen), and Culture and Communities (Sarah Nisbet, Alexis Tindall).

Transcript: English (auto-generated)

00:00

Okay, you can see my screen? Yes. Okay, great. So I'm going to talk about the Bioscience DEVL and RDC projects. I should point out that this is molecular bioscience, not imaging bioscience or other types of bioscience. So I'm based at the University of Queensland. I work at QCIF

00:21

and the University of Queensland RCC, and I'm also at EMBL-ABR. There are quite a few development partners on these projects: QCIF, QFAB and RCC here in Queensland, Melbourne Bioinformatics in Victoria, and the Centre for Comparative Genomics in Perth. And the projects are supported in one way or another by Bioplatforms Australia, the EMBL Australia Bioinformatics Resource, the

00:43

Atlas of Living Australia, and with funding from ARDC. So just to provide a little bit of context, molecular bioscience may be different to some other disciplines in that its birth as a data science has happened in a relatively short period. You've probably

01:01

all seen graphs like this. This shows, in blue, the number of sequences that are held in a large international repository in the States called GenBank. And just to provide some context on the timeline, that arrow is when I finished my PhD. So basically, we've gone from a

01:22

non-data science to a data science in a relatively short time. So that's all well and good, but how many life scientists are really taking full advantage of what analysis of this data allows you to do? So this is just a graph showing estimates made by a working group

01:45

looking at an Australian bioscience data capability. And it classifies biologists into four broad groups. On the right, we have biology-focused bioscience researchers. So these are people that are working in the lab, doing wet experiments at the

02:01

bench, and this group still forms the large majority of biologists. So those people might use a web service once a month or something like that, go to the NCBI and look up a sequence or do a BLAST, but they don't necessarily engage more than that. So this next group here is what we call data

02:22

intensive bioscience researchers. And this is the group that is growing. So these are people who may have decided to do a gene sequencing experiment. They've sent their sample away to a sequencing facility, they've got the data, and now they need to know what to do with it. Then we have a couple of other groups here, which are

02:44

smaller. So the bioinformatics-intensive bioscience researchers effectively use bioinformatics on a day-to-day basis in their research and probably don't do many wet experiments. And then we have bioinformaticians here on the left, and this is the group

03:02

that is developing new algorithms. So what I really want to point out today is that the audience we're focusing on for the DEVL and RDC projects is these groups on the right. So they're the groups that don't have it all sorted out, and

03:21

they might be easing into this data science world and need some help and tools to be able to do that. So the DEVL project builds on the previous Genomics Virtual Lab (GVL) work. That was a Nectar-funded project. And what the GVL is, essentially,

03:42

is a server image. This server image contains a number of standard tools: Galaxy, which I'll talk quite a bit about, but it also has RStudio, JupyterHub, command-line access, a virtual desktop, and some administrative tools on it. And then there's a number of optional

04:04

bioinformatics pipelines and analysis tools that can also be launched when a GVL virtual machine is launched. So the other part of it is a virtual machine that's running the server image, and

04:21

it's been built so that it can run on an OpenStack cloud such as Nectar or on an EC2 cloud such as Amazon. So that effectively is the GVL. So one option that previously existed, and still exists, for people to use this is to fire up and self-manage your own GVL instance. There's a URL

04:43

here. But the steps for doing that, if one was to use the research cloud here in Australia, are effectively these: the user needs to access the Nectar dashboard and get their Nectar allocation, and it's probably worth mentioning that a

05:01

trial project is sufficient for launching a GVL. Then the user has to obtain their cloud credentials, launch their personal GVL instance, access it, manage it, use it, and then shut it down. So that actually is still quite a technically challenging set of tasks for our audience. So the second option for

05:24

using a GVL has been to use a public managed GVL service. And until the beginning of this year, there were actually a few of these: an RStudio service and three Galaxy services, one hosted by RCC at the University of Queensland, one

05:42

hosted by Melbourne Bioinformatics, and a training instance. So the DEVL project has a few broad aims. We're going to talk about all the aims, but the main ones are to rationalise and rearchitect the public managed GVL services. So that's effectively taking these four public services and

06:04

developing one single service, which is called Galaxy Australia. And that includes RStudio and JupyterHub. That's now public; the URL is there, it's usegalaxy.org.au. The proposed architecture that underlies this federated, I

06:24

guess, model is that there's a head node that resides here at the University of Queensland in the Research Computing Centre. And there's a SLURM queuing system that submits jobs to worker nodes. So that's pretty much what

06:41

the Galaxy instances were like previously. What's being done to speed things up and make the service more efficient is separating off the database service. But then there's also this new, I guess, component of

07:02

Galaxy called interactive environments. And it's possible to just launch a single-use virtual machine for a number of these interactive environments, such as RStudio or Jupyter. They run on this Docker swarm. So the idea is that it's just a VM that's fired up for a

07:21

particular session, and then it's shut down again. So that's what's here at UQ RCC. But obviously, we need to think about how we can increase the computational resources that sit under a national service. So one way is submitting jobs not just to worker nodes on

07:43

the Nectar cloud, but also submitting jobs to HPC. Galaxy is pretty good in that it can submit jobs over a SLURM queue, a PBS queue and some other queuing methods. So one thing we're working on now is also submitting jobs to the HPC machines here at UQ. Now, we're also going to be

08:04

submitting jobs from the head node using a Condor queue to a Condor head node sitting at the University of Melbourne. And because we can submit jobs from the head node to either cloud or HPC via PBS, SLURM or Condor, we can also

08:22

submit jobs to other sites. And we're currently having some discussions with the University of Sydney, who are interested in supporting Galaxy Australia as well. The second main aim is to harmonise the look and feel with other global Galaxy services. So Galaxy Australia is not the only one. There are actually over 90 Galaxy

08:40

servers around the world. But the two other main ones are usegalaxy.eu, which is hosted in Freiburg, and usegalaxy.org, which is hosted in the US. Novojtek talked about a consistent user experience. So this project between the three Galaxies listed here is all about having a consistent user experience with

09:02

similar tools, similar training material, a similar look and feel and layout, and similar reference data sets as well. I should say, there's a global Galaxy Tool Shed. This is like an app store for Galaxy. So when command-line

09:21

tools are wrapped and enabled for use in Galaxy, they can then be downloaded from the Galaxy Tool Shed. And we have a policy now on Galaxy Australia that all of the tools that are installed have to be installed from the Galaxy Tool Shed. And to get into the Galaxy Tool Shed, there's a core set of tools that have

09:41

undergone extensive QC. The third aim of the DEVL is to rationalise and expand our existing training efforts. So one of the environments I mentioned previously was Galaxy-tut, which was just for training. It will eventually be

10:01

decommissioned over the next few months, and we'll use the Galaxy Australia service for all of the training. We have previously developed training material here in Australia, so that's being rationalised. And there's also a global Galaxy training material registry. So all

10:22

of the Australian material is going into that particular training registry as well. So again, the idea of all of this is that it should be possible to go to any of these global Galaxy resources and be able to use that material on our Australian instance. The project's also

10:42

establishing a national network of Galaxy trainers. This is happening through the EMBL-ABR network. There's a two-day train-the-facilitator workshop that's happening in Melbourne. There are about 10 people going from around the country to be given the same training, so they'll be able to go away and deliver Galaxy

11:02

training locally. And then we'll be undertaking at least three of what we call virtual-physical national training events. So we have a lead trainer that's based in one location, and then simultaneously around the country we can hold training events in multiple places. So I'll

11:23

be holding three of those on different topics. I also wanted to talk about the sister RDC (Research Data Cloud) project. This one is extending Galaxy Australia so that it can support other national infrastructures that require a bioinformatics

11:40

analysis functionality. And in this particular project, we're going to be supporting BPA's data portal. BPA's data portal is used to store and share framework datasets during the period when the teams are working on them. It's based

12:00

on a CKAN framework and it's accessible to consortium members. So it's primarily a data repository and a storage and sharing mechanism. It doesn't have raw data, I should say. It doesn't have an analysis functionality. So what we are doing in this particular project is linking up the

12:20

two so that Galaxy Australia can provide that analysis functionality. We're using it to support a group of researchers that are interested in metagenomics. Metagenomics is a methodology where you take a sample, so it might

12:42

be something from the environment like soil or water, and then you can extract DNA from that mixed population. You sequence that DNA and then you kind of work backwards to identify what species were present in

13:01

that original sample. So to do that, we are installing a couple of tools onto Galaxy Australia that are commonly used in metagenomics analysis. These are called QIIME and mothur. They're both being installed onto the system. Now, the

13:24

second part of a metagenomics analysis is not just determining what microbes are present in one particular place, but doing statistical analyses across different places. So saying, okay, what are the commonalities between sites A, B and C? And there's a whole

13:42

plethora of types of analyses that people want to do: see things on a map, do clustering, look at boxplots, ordination, and so on and so forth. And there are a number of R packages out there that do this. So two of

14:01

them are called phyloseq and Rea, and they're well known in the community. And what we're doing in the project is wrapping these so that they are available for use through the Galaxy graphical user interface, then obviously QA-ing those wrapped tools, depositing them into the Tool Shed, and then installing them from the Tool Shed onto Galaxy Australia. There's also a

14:23

training component. So we're developing material for primary and secondary metagenomic analysis and depositing this material both into the global Galaxy training portal I talked about, but also into the EcoEd training portal, which is part of the EcoScience DEVL project that's

14:44

occurring, and then also delivering workshops across the EMBL-ABR nodes. So for the intended outcomes over time, we would like to see that we're enabling people to move from one of these groups to the next

15:00

one to the left. So over time, we would expect that there'll be more biology-focused bioscience researchers moving into the group that is enabled to actually undertake analysis, and then that group moving,

15:22

or this group also moving across. I should say that they're both pretty short projects, March to December 2018. Both are on track and going pretty well, and there are a lot of people involved, and they are listed here. Okay, thank

15:41

you.
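
Editor's illustration (not part of the talk): the job routing described around 07:21 to 08:22, where the head node sends work to cloud worker nodes over SLURM, to HPC over PBS, or to a remote pool over Condor, might be sketched roughly as below. This is a minimal sketch under stated assumptions, not Galaxy's actual job configuration or API. The destination names, core limits and the first-fit rule are all hypothetical and were chosen only to make the idea concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Destination:
    name: str        # hypothetical label, not a real Galaxy Australia destination id
    scheduler: str   # "slurm", "pbs" or "condor", the queue types named in the talk
    max_cores: int   # illustrative capacity limit, invented for this sketch
    available: bool = True


@dataclass(frozen=True)
class Job:
    tool: str
    cores: int


def route(job, destinations):
    """Return the first available destination whose core limit fits the job."""
    for dest in destinations:
        if dest.available and job.cores <= dest.max_cores:
            return dest
    raise RuntimeError(f"no destination can run {job.tool} with {job.cores} cores")


# Order expresses preference: local cloud workers first, then HPC, then the remote pool.
DESTINATIONS = [
    Destination("uq-cloud-workers", "slurm", max_cores=8),
    Destination("uq-hpc", "pbs", max_cores=24),
    Destination("melbourne-pool", "condor", max_cores=16),
]

if __name__ == "__main__":
    for job in (Job("small_alignment", 4), Job("large_assembly", 20)):
        dest = route(job, DESTINATIONS)
        print(f"{job.tool} -> {dest.name} ({dest.scheduler})")
```

The point of the sketch is only that a single head node can hold an ordered list of back ends with different schedulers, so adding another site means adding another entry rather than another service.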