
Bioinformatics pipeline for revealing tumour heterogeneity


Formal Metadata

Title
Bioinformatics pipeline for revealing tumour heterogeneity
Subtitle
Bioinformatics pipeline for revealing tumour heterogeneity from single cells
Title of Series
Number of Parts
118
Author
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor, and the work or content is shared, also in adapted form, only under the conditions of this license.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Reproducibility of research is a common issue in science, especially in computationally expensive research fields such as cancer research. A comprehensive picture of the genomic aberrations that occur during tumour progression, and of the resulting intra-tumour heterogeneity, is essential for personalised and precise cancer therapies. Because the tumour environment changes under treatment, heterogeneity gives the tumour additional ways to evolve resistance, so intra-tumour genomic diversity is a cause of relapse and treatment failure. Earlier bulk sequencing technologies were incapable of determining this diversity within the tumour. Single-cell DNA sequencing, a recent sequencing technology, offers resolution down to the level of individual cells and is playing an increasingly important role in this field. We present a reproducible and scalable Python data analysis pipeline that employs a statistical model and an MCMC algorithm to infer the evolutionary history of copy number alterations of a tumour from single cells. The pipeline is built using Python, the Conda environment management system and the Snakemake workflow management system. It starts from the raw sequencing files and a settings file for parameter configurations. After the data analysis has run, the pipeline produces a report and figures to inform the treatment decision for the cancer patient.
Keywords
Transcript: English (auto-generated)
Thank you very much. So, this talk is about a bioinformatics pipeline we are developing at ETH Zurich. First, a short introduction: here are some of my research interests. We are working on data analysis pipelines, bioinformatics and machine learning, and we are also developing new methods to explain certain biological data sets. Previously I did some work with recommender systems, so that is also one of my research interests. My GitHub, LinkedIn and Twitter information is on the slide. The outline of this talk: I will first introduce the problem and give you some basic biological background.
Then I will talk about so-called DNA mutation trees, our model for representing mutations on the DNA, and after that about the pipeline and the bioinformatics work we are doing to address cancer research. I'll start with the biology background. I am not a biologist; both my master's and bachelor's degrees were in computer science, so don't worry, I can't go into much detail when it comes to biology. We are working on cancer research.
The data we have comes from a hospital: patients go there, get their tissue sequenced, and then we analyze it and provide a report to the clinician as a basis for treatment decisions. Let me explain what a cell is. The cell is the smallest living unit in the human body. When cells come together they form tissues; a muscle tissue, for example, consists of different muscle cells. The tissues then come together to form organs.
Organs become systems and systems become the organism; that is all the high-school biology required for this talk. When we want to analyze a cancer tissue, the previous technologies could only retrieve information at the tissue level. The operation is called sequencing: the biological sample arrives at the lab, it gets sequenced, and as a result we obtain digital information that we can run our analysis on. Technologies that sequence whole tissues were not able to detect the heterogeneity among different cells, because in a tissue you only get the average over the cells. You therefore see the most dominant mutation but ignore all the other mutations, and as a result, when you treat the most dominant mutation, the other ones sometimes pop up. This is why single-cell sequencing is important now: the new technology allows us to go down to single-cell resolution.
From the single cells we obtain the DNA inside the cell, which contains the genetic information. DNA sometimes has mutations, and some of those mutations are known to be associated with certain diseases.
In this talk I will describe our efforts to model those mutations. The mutation we are considering belongs to the family of structural mutations and is called copy number variation. In the figure there is a blue segment on one genome, and in the second figure it has been duplicated. That is a mutation: a variation in the copy number. On the left there was just one copy of the blue segment; on the right there are two copies. Copy number variations can be either duplications or deletions, and we are going to analyze them at single-cell resolution. These mutations have a family-like relationship: when a mutation happens, other mutations can descend from the parent one, so there are child mutations and sibling mutations.
They have ancestors as well, and therefore we model them in a tree fashion: not necessarily binary, but a tree that represents the mutation information. The genome is divided into different regions; those regions are meaningful parts of the DNA associated with certain functionality, in the functioning of life, let's say. The root node has no mutations. The other nodes, I'm not sure how readable they are, represent the copy number profiles. This one says r1: +1, so in region 1 there is one extra copy, and all the single cells represented by this pink circle will have one extra copy of that region, and it goes on like this. So it's a tree where each node has a dictionary of regions holding the extra or missing copy number information, and this is what we are trying to learn.
We have a machine learning model to learn the best-suited tree for a given cancer sample, and the way we learn it is by using a Markov chain Monte Carlo (MCMC) scheme. For each tree we have a means of scoring it using a Dirichlet-multinomial model; I won't bore you with the formulation, but we can discuss it after the talk if you are interested. For a given tree and the sample data, the score tells how likely it is to observe this data given this tree. Using the MCMC scheme we can move from one tree to another; we then score the next tree as well and see how good the second tree is. If it is significantly better than the first one, we discard the previous tree and update the model with the next tree, and we continue like this. On real data it usually takes lots of iterations, in the millions. The bullet points are the MCMC moves we define; I will talk about them now.
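Before going through the individual moves: the score-and-accept loop just described follows the general shape of a Metropolis-style search, which can be sketched like this (a minimal illustration only; the real model's Dirichlet-multinomial score, tree representation and move proposals are not shown, and the placeholders here are invented):

```python
import math
import random

def mcmc_search(initial_tree, score, propose, n_iters=10000):
    """Generic Metropolis-style search sketch.

    score(tree)   -> log-likelihood of the data given the tree (placeholder)
    propose(tree) -> a neighbouring tree produced by a random move (placeholder)
    """
    current = initial_tree
    current_score = score(current)
    best, best_score = current, current_score
    for _ in range(n_iters):
        candidate = propose(current)
        candidate_score = score(candidate)
        # Always accept a better tree; accept a worse one with a
        # probability that shrinks with how much worse it scores.
        if math.log(random.random()) < candidate_score - current_score:
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = current, current_score
    return best, best_score
```

On a toy search space (integers with a score peaking at 5) the loop drifts toward the highest-scoring state, which is the same behaviour described for the tree search, only millions of iterations are needed on real data.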
The first one is prune and reattach. We randomly pick one node from the tree; we pick the brown one, and it happens to have two children. We prune it and then reattach it somewhere else in the tree at random. Afterwards we score the new tree as well, and if it is significantly better than the previous one, we keep it, discard the other, and continue. Another move we have is called add and remove node: we pick a node, randomly generate another node as a child of that node, and then see if this tree is better than the other. Another one is condense and split. We pick one node along with its parent and condense them into just one: the second one gets swallowed into the parent and they become a single node. Then we test whether this tree represents the data better than the other. In fact, this move could be expressed through separate add and remove node moves, but the reason we have it is to help with convergence: via insertion and deletion we would need too many iterations, whereas this is simpler.
Each move we have in the Markov chain is reversible: from the tree on the right-hand side we are able to go back to the tree on the left, and we explicitly make sure that it is equally probable to move from the left tree to the right tree as from the right tree to the left. Otherwise the Markov chain would not be in balance and that would cause certain biases.
This is a tree plot we learned from real data; the data was from a mouse brain tumour. We again started with a random tree, and after millions of iterations this tree happened to be the one that explains the evolution of the mouse brain tumour best. On the right-hand side, below, is the original data matrix we get after the sequencing experiment, and the figure above shows how we can reconstruct it from the evolutionary tree we learned.
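To make the model concrete: a tree whose nodes carry a dictionary of region-level copy number changes, as described above, could be represented roughly like this (an illustrative data structure, not the project's actual implementation; region names are invented):

```python
class MutationNode:
    """Node in a copy-number mutation tree.

    `events` maps region name -> copy-number change relative to the parent,
    e.g. {"r1": +1} for one extra copy of region 1.
    """
    def __init__(self, events, parent=None):
        self.events = dict(events)
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def profile(self):
        """Accumulated copy-number changes from the root down to this node."""
        base = {} if self.parent is None else self.parent.profile()
        for region, change in self.events.items():
            base[region] = base.get(region, 0) + change
        return base

root = MutationNode({})                                  # root: no mutations
child = MutationNode({"r1": +1}, parent=root)            # one extra copy of r1
grandchild = MutationNode({"r1": +1, "r2": -1}, parent=child)
print(grandchild.profile())  # → {'r1': 2, 'r2': -1}
```

The profile of a node is the sum of all events on the path from the root, which is exactly how the data matrix can be reconstructed from the learned tree.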
To continue: this was how we model the heterogeneity in the tumour, but in real life we have many more requirements in addition to it. The first one is reproducibility of the research: we want any other research institute to be able to just read our paper and reproduce the results. The second requirement is scalability, because in genomics, over the past ten years, the cost of sequencing has been decreasing faster than Moore's law. As a result, more genomic data is produced every day, and the growth rate is exponential; with so much genomic data being produced, the informatics methods have to be able to scale, and this is a standard requirement.
We often use multiple programming languages. The MCMC part of the tree model was built in C++, since it requires many iterations and therefore performance, as with many of the machine learning frameworks out there. Many other parts are written in Python, and sometimes we even need to use R, because certain statistical methods are only implemented there. Then multiprocessing: this data is too big, so I cannot run the experiments on my local machine. We are using computational clusters, and since we have multiple nodes, we try to make heavy use of multiprocessing. We need cluster execution for two reasons: there is not enough memory on my local machine, and there is not enough time, because we need to run things in parallel. And resource management: for each bioinformatics task we need to define the memory, time and disk space requirements in order to better utilize the cluster, and we often need to look at statistics about resource usage in order to tune the cluster execution. To achieve this we are using, among many things, a workflow management system.
The one we are using is called Snakemake. It is similar to GNU Make and follows its paradigm, but it has a Pythonic syntax. I like this figure; I took it from one of the previous Snakemake talks somebody gave. It explains the idea in a small diagram: Snakemake is a workflow management system in Python, and a workflow consists of different programs with dependencies between each other. Some of them provide output to the others, some of them run in parallel without depending on each other, and in the end they get merged and connected.
Some of them are running in parallel not depending on each other and then they are getting merged connected in the end and Yeah, so this Snake make is a so it is provided as a Python package
You can just pick install snake make and it has exactly the same Python syntax with a few Extensions over it and it follows the G and you make paradigm which which is well established And yeah, so so the workflows are defined in rules and those rules are trying to create the output
Given the input file and the workflow management system is automatically defining the dependencies between different rules And by using snake make we can make use of all the existing Python libraries and So unlike other workflow management systems out there
Like when I need to use some Python functionality inside the workflow, I mean to write a Python script and I need to Make it executable So that's I can access it from from the shell but in snake make you can you can just use All the functionality of Python as it is. You don't need to wrap them into different different scripts
Then there is automated logging of the status. Since workflows consist of multiple programs, sometimes even implemented in different languages, when something crashes you need to know which one crashed and why it crashed. If possible you may want to continue with the rest of the workflow, or you may want to stop there, so logging is very important here, and Snakemake provides fully automated logging of the errors, warnings and status of each rule.
Snakemake came out of the bioinformatics domain, but it is a general-purpose workflow definition language, so it can be used in any domain; it is not domain-specific. I will show you some example syntax here. A rule is basically a task that needs to be done. Depending on the rule, it may use shell or Python code itself, and I believe there is support for R scripts as well. This rule takes two inputs: one is called genome.fa, the other is a fastq file. Once these two inputs are provided, the rule automatically executes the shell command and then provides the output; if it fails to provide the output, it crashes, otherwise the rule is successful and the next rule may begin. In the shell command there are these curly brackets: they are the way of communicating between the shell command and the input files, because otherwise, if you wanted to invoke the same command from the shell, you would need to do some extra work. Likewise, the output placeholders serve the same purpose.
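A rule of the kind just described might look like this workflow fragment (a sketch based on the standard Snakemake tutorial example, not necessarily the exact slide contents):

```snakemake
rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/A.fastq"
    output:
        "mapped/A.bam"
    # {input} and {output} below are substituted by Snakemake from the
    # declarations above, keeping the command and the dependencies in sync.
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"
```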
One extra feature of Snakemake is wildcards: the {sample} between the curly brackets here is known to be a wildcard. I like this feature very much, because without a workflow management system this is very hard to achieve. Let me explain what it does. In the second line of the input it looks at the data directory: go to data/samples and find all the fastq files that match certain criteria. This can be a regular expression; it can be just anything, like A.fastq, B.fastq, C.fastq. Then for each of those input files it creates the output that contains the same wildcard, and for each input the shell command gets executed. So without changing any line of code we can basically make it scale, just by using those wildcards. We can even use them across different rules: whatever was created by this Java tool, use the exact same wildcard in the Python rule, and the outputs are produced and the dependencies handled automatically.
Speaking of dependencies: before Snakemake runs, it creates a directed acyclic graph of the jobs. This one was from one of our simulation studies. The first four rules are executed in order: the first one finishes, the second one begins, and so on. But at some point there are multiple jobs, because these do not have dependencies between them and can run in parallel. Likewise, the last row of rules may also run in parallel, but each one depends on the previous one, and afterwards we have an aggregation at the end. It is similar to the MPI paradigm: you can run things distributed and then aggregate them. This directed acyclic graph of jobs is created automatically. We don't need to say: first do this rule, then the second rule, then the third. There is a way of forcing the workflow management system to do that, but by using the inputs and outputs it automatically detects the directed acyclic graph of the job execution.
Here I will show you a more realistic Snakefile: a complete Snakefile for a basic example. I want to show you how similar it is to Python syntax; it basically is Python, it's a Python library in fact. In the first lines we import some Python modules; they can be built-in modules or custom modules, like the secondary analysis here. This is just regular Python. Later we have this config. In a program there are often parameters, and a workflow is a set of programs, so there are even more parameters; therefore it is common practice to keep the configuration file separate from the main workflow. In Snakemake there is a built-in dictionary called config: when you invoke Snakemake you can just specify a config file and it will be parsed automatically. This way you separate the workflow and the config. The rest is just Python: there is a Python function here, and you can have any Python functions, list comprehensions, all the Python syntactic constructs. The difference is the rule. There is this rule with input and output, like in the previous example, but here, instead of shell, there is this run directive, which just accepts Python code. We have some Python code here to go through the files in one directory, do some work, and in the end create the file which happens to be the output. That is how simple it is compared to other workflow languages: you stay within the scope of Python and you can make use of it.
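A Snakefile combining a config file, plain Python and a run block, as just described, might look like this (a workflow-fragment sketch; the config keys and file names are invented for illustration):

```snakemake
configfile: "config.json"  # parsed into the built-in `config` dictionary

rule aggregate:
    input:
        expand("results/{sample}.txt", sample=config["samples"])
    output:
        "results/summary.txt"
    run:
        # `run:` accepts plain Python instead of a shell command
        with open(output[0], "w") as out:
            for path in input:
                with open(path) as f:
                    out.write(f.read())
```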
[In response to an audience question] Yes, yes, you can mix those two, exactly. [Another audience question] Good question. Personally I usually use the shell syntax; I usually call everything from the shell, even if it is Python. I may have Python classes, but I will write an executable in Python and call it from the shell. That way I can better manage the outputs, the logs, the standard error and the warnings, because if the call crashes, the Snakefile will terminate there, and this way it is much easier to deal with.
Here is an example config file. The configs defined in Snakemake support two formats: one is JSON, the other is YAML. YAML has the advantage of allowing comments, but JSON has the advantage of being easily serializable; I often create the JSONs from some Python dictionary to automate certain tasks, which is why in this example I use JSON. But you can use any config file and any other Python parser for your config; this is just one of the two formats supported by Snakemake.
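Generating such a JSON config from a Python dictionary, as mentioned, could look like this (the keys are hypothetical, chosen only for illustration):

```python
import json

# Hypothetical pipeline parameters; any JSON-serializable dict works.
config = {
    "samples": ["A", "B", "C"],
    "mcmc_iterations": 1000000,
    "bin_size": 20000,
}

# Write the config file that Snakemake would later be pointed at,
# e.g. with: snakemake --configfile config.json
with open("config.json", "w") as f:
    json.dump(config, f, indent=2)
```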
There is also the cluster execution. Snakemake is automatically configurable with the LSF scheduler; you can just pass it, and given that you define the resources for each job in the config file, this much memory for this job, that much memory for the second job, Snakemake will automatically create sub-jobs on the cluster. That way you can specify how much memory, how much runtime or how much disk space you want to give to each job. I will be very quick now. Another technology we are using is HDF5.
HDF5 is the hierarchical data format, a binary format that comes in very handy in genomics, because we usually make use of metadata, and HDF5 stores the data alongside its metadata. Also, since in the pipeline we use C++, sometimes Python and sometimes R, we need a common serializable format: we cannot import pickle in C++, or vice versa, so we use these binary files instead. That is one use of HDF5: we can write the exact same data in one language and load it in another. HDF5 also allows us to connect to a data set and perform operations on a subset of it without having to load everything into memory, which is also quite handy.
This is the last slide, or the one before the last. In Python there are two wrappers for HDF5: one is PyTables, a very nice high-level wrapper that interacts with pandas, and the other is h5py, which is very similar to the C++ API.
So this was the outline: the problem at hand, our statistical model, and the bioinformatics part, the pipeline and the tools we are using to deal with it. As for future work, we will publish the statistical method first; we will compare it to other methods, with evaluations on simulated data, and we will show it on real data. Later this pipeline is going to be wrapped up: we will provide the bindings to other languages and publish it on GitHub as open source. This concludes the talk; thank you very much for your attention.
Q&A
Host: Does anyone have any questions? Please go to the microphone there, the one closest to you.
Audience: If I understand correctly, Snakemake is compatible with Singularity. Is your project using Singularity or not? I'm just wondering.
Speaker: I don't know what Singularity is.
Audience: It's a kind of container technology. But maybe that's off topic; we can discuss it later. Thanks.
Host: Okay, no one else? Lunch will be served soon; I think everyone's hungry. So let's give a round of applause.