Using Machine Learning to Distinguish Infected from Non-Infected Subjects at an Early Stage Based on Viral Inoculation.
Formal Metadata
Title |
| |
Title of Series | ||
Number of Parts | 12 | |
Author | ||
License | CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor. | |
Identifiers | 10.5446/38611 (DOI) | |
Publisher | ||
Release Date | ||
Language |
Content Metadata
Subject Area | ||
Genre | ||
Abstract |
| |
Keywords |
Transcript: English (auto-generated)
00:00
Hello, everyone. My name is Ghansham Verma. I'm from the National University of Ireland, Galway. The title of my presentation is Using Machine Learning to Distinguish Infected from Non-infected Subjects at an Early Stage Based on Viral Inoculation. The important thing here is that we are trying to predict the state of infection at an early stage
00:26
using gene expression data. Here is the outline of my presentation. First I will give the introduction, then I'll talk about the data set I am working on, then my research questions, the experimental design, results, and conclusion. So I'll start
00:46
with the introduction. Respiratory viral infection is the most common cause of physician visits in the U.S. Approximately 120 million people seek health care annually. It can lead to large-scale outbreaks and periodic epidemics. And, you know, it
01:08
involves huge health care costs. If we are able to predict the disease, which here is respiratory viral infection, at an early stage using gene expression data, then
01:23
it can help in dealing with these kinds of issues. Now I'll talk about the data set. It is tricky to get data with samples at different time points, from which we can extract the early time points. So I was exploring and looking for data sets,
01:45
and then I got to know that DARPA funded a project called PHD. It's not Doctor of Philosophy; it's Predicting Health and Disease. DARPA funded this project, and with this funding, researchers at three different locations collected gene expression data
02:05
and conducted seven studies. You can see those seven studies here. Across the studies we have four viruses in total: one is RSV, another
02:21
is H3N2, then H1N1 and HRV. So there are four viruses involved in these studies. And these are the numbers of subjects who participated in each study. For each subject, blood samples were taken at different time points. And here you can
02:41
see the time points for each study at which those blood samples were taken. The whole gene expression data set was released on GEO, the Gene Expression Omnibus, which is an open repository for these kinds of data sets. This data set was released in 2016. But the data is not ready to use. You need
03:05
to do some cleaning. When I was exploring this data set, as I explained here, a total of 151 participants were involved in these seven studies. Out of these 151 participants,
03:24
we have to exclude 47 participants. Among those 47, 44 subjects' data was faulty, and the other three subjects have no Affymetrix data collected, so they don't have gene expression data. So we need to exclude these 47 subjects.
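The exclusion step can be sketched as a simple filter. This is only an illustration, not the speaker's released code: the `Subject` record, its field names, and the toy rows are hypothetical, and the second label is assumed to be a viral shedding status. A subject is kept only when expression data exists and the two infection labels agree, which mirrors the two exclusion reasons given in the talk.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Subject:
    subject_id: str                    # hypothetical identifier
    symptomatic: bool                  # declared symptomatic status
    shedding: bool                     # second infection label (assumed: viral shedding)
    expression: Optional[List[float]]  # gene expression values, None if never collected


def is_usable(subject: Subject) -> bool:
    """Keep a subject only if expression data exists and both labels agree."""
    if subject.expression is None:
        return False
    return subject.symptomatic == subject.shedding


# Toy records for illustration only; these are not rows from the real data set.
subjects = [
    Subject("S01", True, True, [0.1, 0.2]),
    Subject("S02", True, False, [0.3, 0.1]),   # labels disagree, so excluded
    Subject("S03", False, False, None),        # no expression data, so excluded
    Subject("S04", False, False, [0.2, 0.2]),
]

clean = [s for s in subjects if is_usable(s)]
print([s.subject_id for s in clean])
```

Applied to the full cohort, a filter of this kind would be expected to reduce the 151 participants to the 104 used in the rest of the talk.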
03:43
And why were these 44 subjects faulty? Because they have inconsistencies between their declared symptomatic status and their shedding status. Here you can see, for each study, the subject IDs we identified. In total we identified 44 such subjects. And we removed these subjects
04:06
from the data set. This ready-to-use data set is now available at this link, together with my code. Here I will talk about the layout of the data, with a visualization
04:20
of the characteristics of the data. In total, we have 104 subjects after excluding those faulty subjects. For each subject, there are blood samples taken at different time points. Here we can see them from minus 23 hours, then 0 hour, 5 hour, 12 hour, and up to the nth
04:45
hour. Each cell represents the blood sample taken at that particular time point. From that blood sample, using an Affymetrix device, the gene expression
05:03
data was collected. Each cell represents the gene expression of 12,000 genes. So we have a huge number of dimensions here relative to the number of subjects, because we have only 104 subjects. When the subjects participated, they
05:26
were examined to confirm that they were all healthy. After their blood sample was taken, each subject was given one of those four viruses. And then again, their blood samples were
05:41
taken at different time points. It happened that at some point in time, 64 of these 104 subjects got infected. I was curious to know how well we can predict at early time points, so I was interested in early time points. So
06:01
I selected all those samples at 0 hour and at an early hour, around 48 hours. I also took another time point, onset time; I'll talk about that on the next slide. So here are my research questions, and both of them deal with issues at an early stage.
06:21
My first question is, how accurately can we predict the state of infection at an early stage by analyzing gene expression data? The next question, which also deals with the early stage, is, which genes contribute most to discriminating infected samples from non-infected samples? So these are my
06:45
research questions, and this is my assumption: after either 48 hours or at onset time, the effect of the virus should be visible in the gene expression data. In the biomedical domain, doctors and practitioners believe that around the 48 hour time point,
07:06
the effect of the virus should be visible. So I took this 48 hour point. And onset time is the time when someone starts showing symptoms. So I'm interested in these two time points. These are early time points. So here is my
07:29
research questions. We have 104 subjects, of whom 64 got infected. We also have different time points, 48 hours and onset time. So based
07:41
on these, I designed four experiments, based on the number of subjects and the blood samples taken. These are the four experiments; I'll talk about them in detail on the next slide. For each experiment, I divided the data into training and test sets. After dividing the data into training and test sets using stratified
08:05
sampling, for each experiment, the training data is used to train these four well-known machine learning algorithms: KNN, Random Forest, linear SVM, and non-linear SVM, which are well known in machine learning and give very good results. I used tenfold
08:21
cross-validation to train these models. Then I predicted the state of infection, and I used the test data to evaluate the performance of the models. And then I went one step further and identified the most important genes at an early stage.
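The data-splitting part of this pipeline (a stratified train/test split followed by tenfold cross-validation on the training portion) can be sketched with the standard library alone. This is a minimal sketch under stated assumptions: the 25% test fraction, the seeds, and both helper names are illustrative choices that the talk does not specify, and the classifier training itself is left out.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Split sample indices into train/test, preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


def stratified_folds(indices, labels, k=10, seed=0):
    """Deal the given indices into k folds, class by class, round-robin."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in indices:
        by_class[labels[idx]].append(idx)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)
        for pos, idx in enumerate(members):
            folds[pos % k].append(idx)
    return folds


# Class balance of an experiment with 64 negative and 64 positive samples.
labels = [0] * 64 + [1] * 64
train_idx, test_idx = stratified_split(labels)
folds = stratified_folds(train_idx, labels, k=10)

print("train:", len(train_idx), "test:", len(test_idx))
print("fold sizes:", [len(fold) for fold in folds])
```

In each cross-validation round, one fold would serve as the validation set and the other nine as training data; the held-out test indices are touched only once, for the final accuracy.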
08:42
Now let's see these four experiments in detail. I identified these six states: A, B, B dash, C, D, and D dash. Using these states, I designed the four experiments. For example, experiment number one is designed using these two
09:03
states, A plus B. A has those 64 subjects who are not infected at zero hours but turn out to be infected at 48 hours. The next experiment is A plus D: the same 64 subjects, but at a different time point, onset time.
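As a rough illustration of how experiments one and two are assembled from states, the sketch below builds labeled sample lists for A, B, and D and combines them. The subject IDs, the tuple layout, and the helper names are hypothetical, and labeling D as infected is an assumption based on it being the onset-time sample of subjects who became infected.

```python
from collections import Counter

N_INFECTED = 64  # subjects who became infected, as stated in the talk


def make_state(subject_ids, time_point, label):
    """One (subject, time point, label) sample per subject; label 1 = infected."""
    return [(sid, time_point, label) for sid in subject_ids]


infected_ids = [f"subj{i:03d}" for i in range(N_INFECTED)]  # hypothetical IDs

states = {
    "A": make_state(infected_ids, "0h", 0),     # baseline, not yet infected
    "B": make_state(infected_ids, "48h", 1),    # same subjects at 48 hours
    "D": make_state(infected_ids, "onset", 1),  # same subjects at onset time (assumed label)
}


def build_experiment(state_names):
    """Concatenate the samples of the named states into one data set."""
    samples = []
    for name in state_names:
        samples.extend(states[name])
    return samples


experiment_1 = build_experiment(["A", "B"])
experiment_2 = build_experiment(["A", "D"])

for name, samples in [("experiment 1", experiment_1), ("experiment 2", experiment_2)]:
    counts = Counter(label for _, _, label in samples)
    print(name, "negative:", counts[0], "positive:", counts[1])
```

The same `build_experiment` helper would extend to the remaining experiments once the other states are added to the `states` dictionary.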
09:28
Similarly, experiment number three is the combination of these states: A, B, B dash, and C. So it's all the subjects at zero hour, and then at 48 hours, these
09:48
64 subjects plus the 40 subjects who never got infected. There are 40 subjects who, even though they were given the virus, never got infected throughout the study. So these are
10:03
the important subjects. Then experiment number four is the combination of states A, C, D, and D dash. So here we have all 104 subjects, at zero hour and at onset time. So this is all about these four experiments
10:28
and for each experiment, here you can see the number of negative and positive samples. So, for example, for experiment one, at zero hour, you can see 64 subjects were there
10:40
and they were all not infected, so negative. And at 48 hours, they became infected. Similarly, for each experiment, you can see how many subjects were infected and non-infected at each hour. So this is all about the experiment design. And here you can see the methodology and the algorithms I used. For predicting health and disease of those
11:05
subjects, I'm exploiting the capability of machine learning algorithms to learn the pattern. There are four experiments, and for each experiment I am using these four machine learning algorithms to predict the state of infection. One is K-nearest
11:25
neighbor, another is random forest, then linear SVM, and non-linear SVM with an RBF kernel. Then I'm predicting whether a given blood sample is infected or not infected. So this is a binary classification problem. Here
11:42
I have two classes: one is not infected and the other is infected. Now let's talk about the results. I have results for all four experiments. These are the results of experiment one, and these are the results of experiment two. Here, you
12:01
can see that for each experiment I have trained four machine learning algorithms, and then their model parameters, their accuracy in tenfold cross-validation, and their accuracy on the holdout test set. For experiment one, where we have 64 subjects, and for experiment two, where we have 64 subjects as well but at a different time point, we are able
12:26
to get similar accuracy, in the range of about 67 to 75 percent. I did a t-test to compare them, and I found that there is no significant difference between experiment one and experiment two. However, when we go to experiment three and experiment
12:43
four, which involve 104 subjects at different time points, 48 hours and onset time, we are able to get good accuracy. Previously, we were able to get around 67 to 75 percent, and here, for experiment three, we get around
13:03
78 to 81 percent, and here around 81 to 85 percent. Here the t-statistic shows that there is a significant improvement in accuracy. So why is there an improvement?
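The kind of comparison behind these statements can be sketched as follows. The talk does not say which variant of the t-test was used, so Welch's unequal-variance form is assumed here, and the accuracy lists are invented placeholders rather than the reported results. The function returns only the t-statistic and the degrees of freedom; turning them into a p-value would need a t-distribution table or a statistics library.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t, df) of Welch's t-test for mean(sample_b) - mean(sample_a)."""
    n_a, n_b = len(sample_a), len(sample_b)
    term_a = variance(sample_a) / n_a   # squared standard error of mean A
    term_b = variance(sample_b) / n_b   # squared standard error of mean B
    t = (mean(sample_b) - mean(sample_a)) / sqrt(term_a + term_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (term_a + term_b) ** 2 / (term_a ** 2 / (n_a - 1) + term_b ** 2 / (n_b - 1))
    return t, df


# Placeholder per-fold accuracies, NOT values from the talk.
placeholder_small = [0.67, 0.70, 0.72, 0.69, 0.75, 0.68, 0.71, 0.73, 0.70, 0.66]
placeholder_large = [0.81, 0.84, 0.85, 0.82, 0.83, 0.81, 0.85, 0.84, 0.82, 0.83]

t_stat, dof = welch_t(placeholder_small, placeholder_large)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

A large positive t relative to the degrees of freedom indicates that the second set of accuracies is higher than the first by more than fold-to-fold noise would explain.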
13:23
In experiment three and experiment four, we have 104 subjects, whereas in experiment one and experiment two, we have 64 subjects. The inclusion of these 40 subjects, who were exposed to the virus but never got infected, gives the machine learning classifier an extra
13:43
capability to learn the differences between what happened to those 40 subjects and what did not happen to the subjects that got infected. Another observation was that, at the highest, we are able to achieve 85% accuracy with experiment four,
14:07
which is at onset time, around 84% and 85% using SVM and random forest. But the t-test says that there is no significant difference between them. However, random forest has
14:22
an extra capability to assign a rank, or say an importance score, to the features, and here the genes are the features. So I used random forest to assign an importance score to each gene.
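Once a trained random forest has produced one raw importance value per gene, ranking the genes is a small post-processing step. The sketch below assumes the scores are rescaled so that the top gene gets 100, which is one common convention and an assumption here, not something the talk confirms. The gene names and raw values are placeholders, and fitting the forest itself is not shown.

```python
def scale_importance(raw_scores):
    """Rescale raw importances so the most important feature scores 100."""
    top = max(raw_scores.values())
    return {gene: 100.0 * score / top for gene, score in raw_scores.items()}


def top_genes(scaled_scores, n=10):
    """Return the n highest-scoring (gene, score) pairs, best first."""
    ranked = sorted(scaled_scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Placeholder raw importances; real values would come from the fitted forest.
raw = {"GENE_A": 0.042, "GENE_B": 0.021, "GENE_C": 0.0105, "GENE_D": 0.0}

scaled = scale_importance(raw)
for gene, score in top_genes(scaled, n=3):
    print(f"{gene}\t{score:.1f}")
```

With scores on a common 0 to 100 scale, a cutoff can then be read off the sorted list to separate a short set of top genes from the long tail.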
14:40
Here I am identifying the most important genes at an early stage. Here you can see the genes, and here the overall importance score. We can clearly see that there is a threshold around 53, and these top 10 genes are the most important
15:03
genes. Here you can see the details of those top genes. These are the gene symbols and the overall importance scores assigned by random forest, which is a well-known machine learning algorithm. Then I plotted box plots for those top 10 genes at different
15:24
time points, 0 hour, 48 hours, and onset time, and you can see that these top 10 genes are actually differentially expressed genes. For example, for this gene, IFIT1, which has an importance score of 100, you can see that there is a very, very
15:44
significant difference between 0 hour and 48 hours. So it's actually a highly differentially expressed gene, and it supports our finding. Then we went one step further and did GSEA, gene set enrichment analysis. This is a kind of analysis
16:07
which is done by domain experts to find out whether a particular set of genes is associated with some kind of disease or not. Here we used this
16:24
analysis to find out whether our set of genes is significantly associated with respiratory viral infection or not. And here you can see the p-value showing that yes, these top 10 genes are strongly associated with respiratory viral infection. So
16:43
in conclusion, we answered those two important questions, which address the early stage. Question one was how accurately we can predict the state of infection at an early stage by analyzing gene expression data. And yes, we can do that with 82% accuracy near 48 hours and with 85% accuracy near onset time. And the second question,
17:06
which also addresses the early stage, was which genes contribute most to discriminating infected samples from non-infected samples. We identified the top 10 important genes, and I believe that these genes can help in
17:23
early treatment and drug discovery at an early stage. Thank you very much.