Cancer Genomics Linkage Project - Making data-intensive biology research easier by utilizing Australia's e-Science Infrastructure

Video in TIB AV-Portal: Cancer Genomics Linkage Project - Making data-intensive biology research easier by utilizing Australia's e-Science Infrastructure

Formal Metadata

Cancer Genomics Linkage Project - Making data-intensive biology research easier by utilizing Australia's e-Science Infrastructure
Alternative Title
Cancer Genomics Linkage Project 2012
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Release Date

Content Metadata

Subject Area
A compelling short illustration of how important data access under a collaborative framework can be for others in areas of similar discipline by Jeff Christiansen, Senior Business Analyst, ANDS.
So, thank you. What I'm going to talk to you about today is a project that I think quite nicely illustrates how the e-Science infrastructure being developed by ANDS and other organizations such as NeCTAR is having a real impact on making data-intensive research easier for researchers. I'm
Jeff Christiansen, as was probably mentioned. I'm a senior business analyst at ANDS, and I have a background in biology, in data science and biology. So what I'm actually going to start the talk with is not data science but biology. So this project
is based around cancer and research on cancer. Cancer is caused, effectively, by unregulated cell growth: it's where a cell has lost its inhibition and grows out of control, and will end up forming a tumor. What actually causes this unregulated cell growth is the build-up of many, many DNA mutations in the chromosomes of a single cell. So this shows here some DNA, and there are these little A, C, G, T bases here; some of them get mutated. This DNA is packed into chromosomes, and they're shown here on the left; those are really structures that hold the DNA and pack it into the nucleus. So, as I said, there could be many, many mutations across all of these chromosomes; we have 23 pairs in humans. What the next slide shows is that with whole-genome
DNA sequencing, well, it's basically become cost-effective to be able to sequence the entire DNA content of a single cell or a group of cells, such as those in a cancer. What this here is showing is quite a nice visualization of effectively all the mutations in a type of cancer. Here it's actually data from a breast cancer cell line, so it's not actually from a tumor; it's from a cell line that behaves like a tumor. The plot here is called a Circos plot. Here on the left we have those chromosomes all laid out nicely, 1, 2, 3, 4 and so on; here they're just represented going around the circle in clockwise fashion, so we have 1, 2, 3, 4 and so on around the circle. What whole-genome DNA sequencing has shown is that there are effectively a lot of mutations in these tumors. There can be single-base changes, as I described just before; that's where, let's say, an A is changed to a T, or a G is changed to a C. But there are also variations in copy number of genes, and also structural rearrangements. Some of these structural rearrangements can be quite extreme, and they're shown here by all of these red lines; this line here, for instance, represents that part of chromosome 1 has been attached onto chromosome 8 in this particular cell. So basically, whole-genome DNA sequencing has shown that these tumors are actually much more mutated than we thought they were at a DNA level. So the
project I'm going to talk about is part of an international cancer genome sequencing effort. Around the world there are 47 projects currently on the go across 15 countries; a lot of them are in the USA and various European countries, but there are other countries in there such as Mexico and Saudi Arabia and so on. In this effort Australia is involved in sequencing two types of cancer, and those are ovarian cancer and pancreatic cancer. This is really a very large effort: they're going to sequence 21,000 tumors across a whole range of organs, or cancers from different organs, and you can see a variety of those: liver, lung, bladder, blood and so on, effectively a lot of types of cancers. What they also need to do is compare the DNA sequence of the tumorous tissue to that of non-tumor tissue, so they also need to sequence DNA from non-tumor tissue, and that's a lot of data, effectively. So, as I said,
the aim is to sequence DNA across all those types of tumors. What's shown here is another of these Circos plots, and this is some data from a melanoma versus a lung cancer. You can see just at this level that there are a lot of differences in the mutations between this particular cancer here and this particular cancer here. There may also be some common ones; for instance, this line across here might be the same as that one. So by looking broadly across a lot of different tumor types, the aim is to identify, well, what are the common mutations across all cancers: are there mutations in there that will effectively make a cell a cancer cell? But then, because there will be a lot of information about tissue-specific cancers, maybe it's also possible to identify mutations that are specific to a particular type of tissue. And then ultimately what all of this information is being generated for is to inform therapeutic treatment, so hopefully better and more directed cancer treatments will be able to be derived from this information.
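The distinction drawn here, between mutations common to all cancers and mutations specific to one tissue, can be sketched with plain set operations. This is only an illustration of the idea, not the project's method: the tumor types are taken from the talk, but the mutation labels are invented placeholders and no real data is used.

```python
# Hypothetical mutation calls per tumor type; the labels are invented placeholders.
calls = {
    "melanoma": {"mutA", "mutB", "mutC"},
    "lung": {"mutA", "mutC", "mutD"},
    "pancreatic": {"mutA", "mutE"},
}


def common_mutations(calls):
    """Return the mutations seen in every tumor type."""
    return set.intersection(*calls.values())


def tissue_specific(calls):
    """For each tumor type, return the mutations seen in no other type."""
    specific = {}
    for tissue, muts in calls.items():
        # Union of the mutations called in all the other tumor types.
        others = set().union(*(m for t, m in calls.items() if t != tissue))
        specific[tissue] = muts - others
    return specific


common = common_mutations(calls)
specific = tissue_specific(calls)
```

With real call sets, mutations would need a stable identity (for example chromosome, position and change) before they could be compared this way; that keying is assumed away here.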
Back to the Australian component. There are two groups that are responsible for sourcing the particular tumors. The pancreatic cancers are coming from Sydney, from the Garvan Institute of Medical Research, and Professor Andrew Biankin is responsible for that; he's very much a clinician who is also doing biology research. The ovarian cancer tumors are coming from Melbourne, from the Peter Mac Cancer Centre, and Professor David Bowtell is responsible for that. Now, the other part of the Australian component, which is very important, is actually doing the DNA sequencing and performing bioinformatics, so gleaning information from the DNA sequence, and Professor Sean Grimmond at the Queensland Centre for Medical Genomics at the University of Queensland in Brisbane is responsible for that. So in this Australian component
the derived data, which is effectively the mutation information (it's not the raw DNA sequence, it's derived information), is subject to a requirement of the international effort that it be released through a data portal that's internationally accessible. There's just a screenshot here showing some of the data for the pancreatic cancer. Here, as you can see, is the Queensland Centre for Medical Genomics, and there's some derived data here. This is showing a particular gene, it doesn't really matter what gene it is, but it's saying that 4 out of 67 tumors sequenced have a copy number alteration or variation in this particular gene. If one goes in here and sees this information and clicks on this for further information, it lists the ID of the donor and various information about the mutation. So here we can see that in this particular donor, in this particular tumor, there's a copy number of 2, for instance. Once again, if you click on the donor, there is actually some information about the person that this came from: this is a 69-year-old male from New South Wales, and there's a little bit of information there about the type of patient. So all of that derived data is released through the international data portal, but access to the raw data is
controlled, and that's because of privacy issues. Apart from things like somebody's name, almost the ultimate in identifying a person is their entire DNA sequence, so access to the raw data is controlled, and only bona fide researchers who are doing collaborative research with these groups are permitted to have access to the raw data. Okay, so the sort of broad
mutations that I showed a couple of slides back, these ones here, these
are called by an algorithm that runs through the sequence and predicts
certain mutations of certain known types. But what is really needed by the scientists working on this is to be able to actually analyze the raw data themselves, to be able to identify maybe other mutations and various rearrangements and so on. And really it's the wet lab scientists or the clinicians who would like to be able to analyze that raw data. Traditionally this would have definitely required a bioinformatician, because a lot of the analysis of the raw DNA sequence needs to be done with scripts and various things like that, but this is not really going to be the case anymore if a virtual laboratory is used. One of the other research infrastructure developers in Australia, NeCTAR, is developing a Virtual Genomics Lab, and this ANDS project with the DNA sequence of the tumors is very closely aligned with the NeCTAR Virtual Genomics Laboratory project. So what is the NeCTAR Virtual Genomics Lab? Well, it's basically a system that's going to allow DNA sequence analysis software to be stored on the research cloud in Australia and data to be analyzed on the research cloud. So there's one aspect, and that's a data integration aspect, so the data is accessible for analysis. The data integrated through the Virtual Genomics Lab, or VGL, is going to include the latest human genome reference sequence. There's another very large data set called the 1000 Genomes data set, and that's where the full genetic information from a thousand individuals has been sequenced, so part of that information is also going to be held in the VGL. And then Bioplatforms Australia, which is an NCRIS capability, is generating data for various framework data sets of reference, or of importance, to Australia. These are wine yeast, because wine is obviously a large export; melanoma, which is a serious health problem in this country; soil, because Australia's soil is not the best for agriculture and forestry
and things like that, so a better understanding of what microorganisms live in the soil is very important; and the other data set that Bioplatforms Australia is generating is for wheat. Again, wheat of course is a significant cash crop for Australia, and again, to grow in the poor soils of Australia, a better understanding of the types of wheat and the genetic makeup of them can help to increase yields. So that's the Virtual Genomics Lab data integration side of things. The other part is the visualization and analysis. The software for visualization that will be running in the VGL is something
called the UCSC Genome Browser, so this is the University of California Santa Cruz Genome Browser. It's very widely used across the biology arena for visualizing genetic information. It may be slightly too small to see here, but this is actually just showing one chromosome here, and this little red line here is blown up into this here. Effectively what you can do is have many, many tracks associated with this; these can be selected here, and this goes way down the page, but you can literally have thousands of tracks aligned to the reference sequence. So whilst those Circos plots I showed are very, very good for showing the differences in variation within one particular type of tumor, the UCSC Genome Browser is great for showing variation across many, many different samples or tumors. So the UCSC Genome Browser will be running on the virtual lab, and the
other aspect that's very important for the VGL is an analysis package, and that's called Galaxy. Galaxy is
widely used, again, in biology; one generally would have a local instance of it. It's effectively a workflow generator of web services, and all the web services do very small and specific, I guess, bioinformatics tasks. What a user can do is make these incredibly complex workflows out of them by saying do this process first, and then do this process, and so on and so forth. So these very complicated workflows can be generated to take input data, run it through many, many processes, and generate output data.
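The "do this process first, and then do this process" idea can be sketched in a few lines of plain Python. This is a minimal illustration of chaining small steps into a workflow, not Galaxy's own interface: the step functions and the `run_workflow` helper are hypothetical names invented for this sketch.

```python
from functools import reduce


# Each step does one very small, specific task (all hypothetical examples).
def strip_whitespace(seq):
    """Remove spaces and line breaks from a raw sequence string."""
    return "".join(seq.split())


def to_upper(seq):
    """Normalize base letters to upper case."""
    return seq.upper()


def count_bases(seq):
    """Count how often each of the four DNA bases occurs."""
    return {base: seq.count(base) for base in "ACGT"}


def run_workflow(steps, data):
    """Feed the output of each step into the next one, in order."""
    return reduce(lambda value, step: step(value), steps, data)


# A workflow is just an ordered list of steps: input data in, output data out.
workflow = [strip_whitespace, to_upper, count_bases]
result = run_workflow(workflow, "ac gt\nAC g")
```

Because the workflow is an ordinary ordered list, the same sequence of steps can be saved and rerun unchanged on different input data, which is the property the talk relies on when it discusses reusable workflows.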
So that was, I guess, a brief introduction to the Virtual Genomics Lab, and we're planning another project, which is called the Cancer Genomics Linkage Project. Effectively what it is going to do is allow wet lab scientists and clinicians to analyze complex cancer genomics data using the VGL, so this is wet lab scientists like Andrew Biankin, whom we discussed before. The software development for this particular ANDS project will be done by the Queensland Facility for Advanced Bioinformatics; they're based at the University of Queensland. The development of this software will be closely aligned with that of the NeCTAR VGL, and the NeCTAR VGL is being developed by Dr. Mike Pheasant at the University of Queensland as well. The data integration aspect of this project is going to be to include the very large raw pancreatic data set in the NeCTAR VGL, and this is specifically so Andrew Biankin and groups like his at the Garvan can access the data. The project is also going to develop those Galaxy workflows that we saw, so these will be reusable by clinicians, and what that will allow is much easier mutation searching and analysis of the raw sequence by the wet lab scientists and clinicians. What we'll also be doing is minting digital object identifiers for the Galaxy workflows. A lot of the Galaxy workflows are very, very complex, so it's a great idea to have those reusable, and to have them citable as well, to be rerun on the NeCTAR Virtual Genomics Lab. What it also does, I think, is allow the workflow to be properly identified and rerun in a standardized way, and allow users to make sure that they're using exactly the same version of the workflow that was described, according to the digital object identifier. And then ultimately the software that's developed through the pancreatic data set will be reusable for other groups, and
obviously the first one will be the group studying ovarian cancer at the Peter Mac in Melbourne. So that's really, I guess, what I wanted to discuss today, and I think it's a very nice example of a lot of the e-Research infrastructure that's been put in place coming to fruition. There are a lot of, I guess, connections and people and institutions that are involved in this project. Obviously there are the funders, the Queensland Government and the national government among them, as they have funded generation of this data or generation of the infrastructure that is being developed. Both NeCTAR and ANDS have been taking a role in ensuring that the VGL will be up and running. There are a lot of institutions that are generating data: the Garvan Institute of Medical Research, which is affiliated with the University of New South Wales; the Peter Mac, which is affiliated with the University of Melbourne; and then we have UQ, which is doing the bioinformatics development, along with QFAB and QCIF as well. What all this is ultimately allowing is for data to be generated; from that data, research is conducted, and from that, knowledge is produced. And of course bioinformaticians and clinicians are involved in all this, but ultimately it's a very nice example of this infrastructure actually being used to help people, so patients, and ultimately for potential therapies. So thank you.