Using Machine Learning to Distinguish Infected from Non-Infected Subjects at an Early Stage Based on Viral Inoculation.

Technische Informationsbibliothek (TIB)

Verma, Ghanshyam; Jha, Alokkumar; Rebholz-Schuhmann, Dietrich; Madden, Michael G.

Formal Metadata

Title

Using Machine Learning to Distinguish Infected from Non-Infected Subjects at an Early Stage Based on Viral Inoculation.

Title of Series

Data Integration in the Life Sciences (DILS2018)

Number of Parts

12

Author

Verma, Ghanshyam

Rebholz-Schuhmann, Dietrich

Madden, Michael G.

License

CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Identifiers

10.5446/38611 (DOI)

Publisher

Technische Informationsbibliothek (TIB)

0000-0002-5190-1867 (ORCID)

1080328793 (GND)

04aj4c181 (ROR)

Release Date

Language

Content Metadata

Subject Area

Life Sciences Computer Science

Genre

Conference/Talk

Abstract

Gene expression profiles help to capture the functional state in the body and to determine dysfunctional conditions in individuals. In principle, respiratory and other viral infections can be judged from blood samples; however, it has not yet been determined which genetic expression levels are predictive, in particular for the early transition states of the disease onset. For these reasons, we analyse the expression levels of infected and non-infected individuals to determine genes (potential biomarkers) which are active during the progression of the disease. We use machine learning (ML) classification algorithms to determine the state of respiratory viral infections in humans exploiting time-dependent gene expression measurements; the study comprises four respiratory viruses (H1N1, H3N2, RSV, and HRV), seven distinct clinical studies and 104 healthy test candidates involved overall. From the overall set of 12,023 genes, we identified the 10 top-ranked genes which proved to be most discriminatory with regard to prediction of the infection state. Our two models focus on the time stamp nearest to t = 48 hours and nearest to t = "Onset Time", denoting the symptom onset (at different time points) according to the candidate's specific immune system response to the viral infection. We evaluated algorithms including k-Nearest Neighbour (k-NN), Random Forest, linear Support Vector Machine (SVM), and SVM with radial basis function (RBF) kernel, in order to classify whether the gene expression sample collected at early time point t is infected or not infected. The "Onset Time" appears to play a vital role in prediction and identification of the ten most discriminatory genes.

Keywords

Machine learning

Respiratory viral infection Prediction

Differentially expressed genes

Data Integration in the Life Sciences (DILS2018) 3 / 12

1

18:25

Converting Alzheimer’s disease map into a heavyweight ontology: a formal network to integrate data.

2

23:24

Interactive Visualization for large-scale multi-factorial Research Designs.

3

17:37

Using Machine Learning to Distinguish Infected from Non-Infected Subjects at an Early Stage Based on Viral Inoculation.

4

36:02

The Hannover Medical School Enterprise Clinical Research Data Warehouse: 5 years of experience.

5

22:24

Automated Coding of Medical Diagnostics from Free-Text: the Role of Parameters Optimization and Imbalanced Classes.

6

19:21

A learning-based approach to combine medical annotation results.

7

52:04

Invited Talk on "The de.NBI network – a Bioinformatics Infrastructure in Germany for Handling Big Data in Life Sciences.”

8

18:37

Lung Cancer Concept Annotation from Spanish Clinical Narratives.

9

22:00

Linked Data based Multi-Omics Integration and Visualization for Cancer Decision Networks.

10

31:17

Leaving no stone unturned: Using machine learning based approaches for information extraction from full texts of a research data warehouse.

11

22:37

Construction and Visualization of Dynamic Biological Networks: Benchmarking the Neo4J Graph Database.

12

24:10

Towards research infrastructures that curate scientific information: A use case in life sciences.



Transcript: English (auto-generated)

00:00

Hello, everyone. My name is Ghanshyam Verma. I'm from the National University of Ireland, Galway. And here's the title of my presentation: Using Machine Learning to Distinguish Infected from Non-Infected Subjects at an Early Stage Based on Viral Inoculation. Here, the important thing is we are trying to predict the state of infection at an early stage

00:26

using gene expression data. So here is the outline of my presentation. First I will give the introduction, and then I'll talk about the data set I am working on, then my research questions, then the experimental design, results and conclusion. So I'll start

00:46

with the introduction. Respiratory viral infection is the most common cause of physician visits in the U.S. Approximately 120 million people seek health care annually. And it can lead to large-scale outbreaks and periodic epidemics. And, you know, it

01:08

involves huge health care costs. If you are able to predict the disease, which here is respiratory viral infection, at an early stage using gene expression data, then

01:23

it can help in dealing with these kinds of issues. So now I'll talk about the data set. It is tricky to get data that has different timestamps and from which we can extract the early timestamp. So I was exploring data sets and I was finding some data sets

01:45

and then I got to know that DARPA funded a project called PHD. It's not Doctor of Philosophy. It's Predicting Health and Disease. So DARPA funded this project, and using this funded project, researchers at three different locations collected gene expression data

02:05

and they conducted seven studies. You can see here those seven studies. And we have in total four viruses across the different studies: one is RSV, another

02:21

is H3N2, then H1N1 and HRV. So there are four viruses involved in these studies. And these are the numbers of subjects who participated in each study. Then for each subject, there are blood samples taken at different time points. And here you can

02:41

see the time points for each study at which those blood samples were taken. And the whole data set, this gene expression data, was released on GEO, the Gene Expression Omnibus, which is an open repository for these kinds of data sets. And this data set was released in 2016. But this data is not ready to use. You need

03:05

to do some cleaning. So when I was exploring this data set, as I explained here, in total 151 participants were involved in these seven studies. So out of these 151 participants,

03:24

we have to exclude 47 participants. Among those 47 participants, 44 subjects' data was faulty. And the other three subjects don't have Affymetrix data collected, so they don't have gene expression data. So we need to exclude these 47 subjects.
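
The exclusion step described here can be sketched as a simple filter. This is only an illustration under stated assumptions: the subject IDs and the two per-subject flags (`has_expression_data`, `labels_consistent`) are hypothetical and are not taken from the released data set or the speaker's code.

```python
# Hypothetical subject records; IDs and flag names are placeholders, not study data.
subjects = [
    {"id": "S001", "has_expression_data": True, "labels_consistent": True},
    {"id": "S002", "has_expression_data": True, "labels_consistent": False},   # faulty record
    {"id": "S003", "has_expression_data": False, "labels_consistent": True},   # no array data
    {"id": "S004", "has_expression_data": True, "labels_consistent": True},
]


def is_usable(subject):
    """Keep a subject only if expression data exist and the record is consistent."""
    return subject["has_expression_data"] and subject["labels_consistent"]


kept = [s["id"] for s in subjects if is_usable(s)]
excluded = [s["id"] for s in subjects if not is_usable(s)]

# Bookkeeping stated in the talk: 151 enrolled, 44 faulty, 3 without expression data.
remaining = 151 - (44 + 3)
print(kept, excluded, remaining)
```

The last line only reproduces the arithmetic from the talk (151 minus 47 leaves 104 subjects); the toy list above is there to show the shape of the filter.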

03:43

And why were these subjects faulty, the 44 subjects? Because they have inconsistencies between their declared symptomatic status and their shedding status. So here you can see, for each study, the subject IDs we identified. And there are in total 44 subjects we identified. And we removed these subjects

04:06

from the data set. And now this ready-to-use data set is available at this link with my code. So here I will talk about the layout of the data, the visualizations

04:20

of the characteristics of the data. So in total, we have 104 subjects after excluding those faulty subjects. Now, for each subject, there are blood samples taken at different time points. Here we can see from minus 23 hours, 0 hours, 5 hours, 12 hours, and up to the nth

04:45

hour. So each cell represents the blood sample taken at that particular time point. And on that blood sample they used an Affymetrix device, and using that Affymetrix device, the gene expression

05:03

data was collected. And each cell represents the gene expression of 12,000 genes. So we have a huge number of dimensions here with respect to the number of subjects, because we have only 104 subjects. So when the subjects enrolled, they

05:26

were examined to confirm that they were all healthy. And after taking their blood sample, each subject was given one of those four viruses. And then again, their blood samples were

05:41

taken at different time points. So it happened that at some point in time, 64 subjects out of these 104 subjects got infected. So here, I was curious to know how well we can predict at early time points. So I was interested in early time points. So

06:01

I selected all those samples at 0 hours and at an early hour, around 48 hours. I took another time point, onset time; I'll talk about that in the next slide. So here are my research questions, and both research questions deal with issues at an early stage.
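
Because the sampling schedules differ between studies, "around 48 hours" in practice means picking, for each subject, the sample whose time stamp is nearest to the target. A minimal sketch of that selection, assuming a hypothetical schedule and hypothetical sample IDs (none of these values come from the data set):

```python
def nearest_sample(samples, target_hour):
    """Return the (hour, sample_id) pair whose hour is closest to target_hour."""
    return min(samples, key=lambda s: abs(s[0] - target_hour))


# Hypothetical schedule for one subject: (hours relative to inoculation, sample id).
samples = [(-23, "a"), (0, "b"), (5, "c"), (12, "d"), (45.5, "e"), (53, "f"), (69.5, "g")]

baseline = nearest_sample(samples, 0)    # the 0-hour sample
early = nearest_sample(samples, 48)      # the sample nearest to 48 hours

# Onset time differs per subject; 60 hours is a made-up value for illustration.
onset_hour = 60
at_onset = nearest_sample(samples, onset_hour)
print(baseline, early, at_onset)
```

The same helper serves both time points of interest; only the target hour changes, and for onset time it would be looked up per subject.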

06:21

So my first question is, how accurately can we predict the state of infection at an early stage by analyzing gene expression data? And the next question, which also deals with the early stage, is: what are the genes that contribute most to discriminating infected samples from non-infected samples? So these are my

06:45

research questions, and this is my assumption: after either 48 hours or at onset time, the effect of the virus should be visible in the gene expression data. In the biomedical domain, doctors and practitioners believe that around the 48-hour time point,

07:06

the effect of the virus should be visible. So I took this 48 hours. And onset time is the time when someone starts showing symptoms. So I'm interested in these two time points. These are early time points. So here is my

07:29

research questions. So I took 104 subjects, and then 64 subjects got infected. So we have different time points as well, 48 hours and onset time. So based

07:41

on these, I designed four experiments based on the number of subjects and the blood samples taken. So these are the four experiments; I'll talk about them in detail in the next slide. For each experiment, I divided the data into training and test sets. So after dividing the data into training and test sets using stratified

08:05

sampling, for each experiment, the training data is used to train these four well-known machine learning algorithms: k-NN, random forest, linear SVM and non-linear SVM, which are well known in machine learning and give very good results. And I used tenfold

08:21

cross-validation to train these models. And then I predicted the state of infection, and I used the test data to evaluate the performance of the models. And then I went one step further, and I identified the most important genes at an early stage.
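
The workflow just outlined (stratified split, tenfold cross-validation on the training part, evaluation on the held-out part) can be sketched in plain Python. This is a hedged illustration, not the speaker's code: the data are synthetic, the 25% hold-out fraction and `k=3` are assumptions, all function names are invented for this example, and only k-NN is shown out of the four algorithms.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction, rng):
    """Return (train_idx, test_idx), keeping the class ratio in both parts."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train, test = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)


def stratified_folds(indices, labels, k, rng):
    """Deal the indices of each class round-robin into k folds."""
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    folds = [[] for _ in range(k)]
    position = 0
    for idx in by_class.values():
        rng.shuffle(idx)
        for i in idx:
            folds[position % k].append(i)
            position += 1
    return folds


def knn_predict(X, y, fit_idx, x, k=3):
    """Majority vote among the k fitted samples closest to x (squared Euclidean)."""
    def dist(i):
        return sum((a - b) ** 2 for a, b in zip(X[i], x))
    nearest = sorted(fit_idx, key=dist)[:k]
    return Counter(y[i] for i in nearest).most_common(1)[0][0]


def accuracy(X, y, fit_idx, eval_idx):
    """Fraction of eval_idx samples that k-NN labels correctly."""
    hits = sum(knn_predict(X, y, fit_idx, X[i]) == y[i] for i in eval_idx)
    return hits / len(eval_idx)


rng = random.Random(0)
# Synthetic two-class data: 0 = not infected, 1 = infected. Sizes and features are placeholders.
y = [0] * 120 + [1] * 80
X = [[rng.gauss(2.0 * label, 1.0) for _ in range(5)] for label in y]

train_idx, test_idx = stratified_split(y, test_fraction=0.25, rng=rng)
folds = stratified_folds(train_idx, y, k=10, rng=rng)

fold_scores = []
for f in range(10):
    fit_idx = [i for g in range(10) if g != f for i in folds[g]]
    fold_scores.append(accuracy(X, y, fit_idx, folds[f]))

cv_accuracy = sum(fold_scores) / len(fold_scores)
holdout_accuracy = accuracy(X, y, train_idx, test_idx)
print(round(cv_accuracy, 3), round(holdout_accuracy, 3))
```

Keeping the hold-out set out of the cross-validation loop is the point of the design: the tenfold score guides model choice, while the hold-out score is an untouched estimate of performance.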

08:42

So now let's see these four experiments in detail. I identified these six states: A, B, B dash, C, D, and D dash. And using these states, I designed these four experiments. So, for example, experiment number one is designed using these two

09:03

states, A plus B. So this is those 64 subjects, who are not infected at zero hours but turn out to be infected at 48 hours. Now, the next experiment is A plus D: the same 64 subjects, but at a different time point, at onset time.

09:28

Similarly, experiment number three is the combination of these states: A, B, B dash, and C. So it's all the subjects at zero hours, and then at 48 hours, these

09:48

64 subjects plus those subjects who never got infected, 40 subjects. Because there are 40 subjects who, even when given the virus, never got infected throughout the study. So these are

10:03

the important subjects. And then experiment number four is the combination of states A, C, D, and D dash. So here are those 104 subjects in total, at zero hours and at onset time. So this is all about these four experiments

10:28

and for each experiment, here you can see the number of negative and positive samples. So, for example, for experiment one, at zero hours, you can see 64 subjects were there

10:40

and they all were not infected, negative. And at 48 hours, they became infected. So similarly, for each experiment, you can see at what hour how many subjects were infected and non-infected. So this is all about the experiment design. And here you can see the methodology and algorithms I used. So for predicting health and disease of those

11:05

subjects, I'm exploiting the capability of machine learning algorithms to learn the pattern. There are four experiments, and for each experiment, I am using these four machine learning algorithms to predict the state of infection. One is k-nearest

11:25

neighbor, another is random forest, then linear SVM, and SVM with RBF kernel, which is a non-linear SVM. And then I'm predicting whether a given blood sample is infected or not infected. So this is a binary classification problem. Here

11:42

I have two classes. One is not infected and the other is infected. So now let's talk about the results. I have results for four experiments. This is the result of experiment one, and this is the result of experiment two. Here, you

12:01

can see for each experiment, I have trained four machine learning algorithms, and then their model parameters, the accuracy at tenfold cross-validation and the accuracy on the holdout test set. So you can see for experiment one, where we have 64 subjects, and for experiment two, where we have 64 subjects as well but at a different time point, we are able

12:26

to get similar accuracy, in the range of nearly 67 to 75 percent. So I did a t-test to compare these, and I found that there is no significant difference between experiment one and experiment two. However, when we go to experiment three and experiment

12:43

four, which involve 104 subjects at different time points, 48 hours and onset time, we are able to get quite good accuracy. Previously, we were able to get around 67 to 75, and here we can get, for experiment three, around, yeah,

13:03

78 to 81, and here around 81 to 85. So here the t-statistic shows that there is a significant improvement in the accuracy. So why is there an improvement?

13:23

So in experiment three and experiment four, we have 104 subjects, whereas in experiment one and experiment two, we have 64 subjects. So the inclusion of these 40 subjects, who were exposed to the virus but never got infected, gives the machine learning classifier an extra

13:43

capability to learn the differences between what happened to those 40 subjects and what did not happen to the subjects that got infected. And then another observation was that, at the highest, we are able to achieve 85% accuracy with experiment four,

14:07

which is at onset time, around 84% and 85% using SVM and random forest. But the t-test says that there is no significant difference. However, random forest has

14:22

an extra capability to assign a rank, or say an importance score, to the features, and here the genes are the features. So I used random forest to assign an importance score to each gene.
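
Once each gene has an importance score, turning the scores into a ranked short list is a small step. The sketch below covers only that step and rests on assumptions: the raw scores are invented, the placeholder names `GENE_B` to `GENE_E` are hypothetical, and rescaling so that the top gene scores 100 is an assumed convention chosen to match the 0 to 100 scores and the cut-off of about 53 mentioned later in the talk. How the forest itself computes importance is not shown.

```python
def scale_importance(raw):
    """Rescale raw importance scores so that the largest becomes 100."""
    top = max(raw.values())
    return {gene: 100.0 * score / top for gene, score in raw.items()}


def select_genes(scaled, threshold):
    """Return (gene, score) pairs at or above the threshold, highest score first."""
    ranked = sorted(scaled.items(), key=lambda item: item[1], reverse=True)
    return [(gene, score) for gene, score in ranked if score >= threshold]


# Hypothetical raw scores; only the gene symbol IFIT1 is taken from the talk.
raw_scores = {"GENE_D": 10, "IFIT1": 40, "GENE_C": 22, "GENE_B": 31, "GENE_E": 4}

scaled = scale_importance(raw_scores)
selected = select_genes(scaled, threshold=53)
print(selected)
```

With real scores, the threshold would be read off a visible gap in the ranked list rather than fixed in advance.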

14:40

So here I am identifying the most important genes at an early stage. Here you can see the genes and here the overall importance score. So here clearly we can see that there is a threshold around 53, and these top 10 genes are the most important

15:03

genes. So here you can see the details of all those top genes. These are the gene symbols and the overall importance scores assigned by random forest, which is a well-known machine learning algorithm. And then I plotted box plots for those top 10 genes at different

15:24

time points, at 0 hours, 48 hours and onset time, and you can see that these top 10 genes are actually differentially expressed genes. For example, you can see for this gene, IFIT1, which has importance score 100, that there is a very, very

15:44

significant difference between 0 hours and 48 hours. So these are actually highly differentially expressed genes, and it supports our finding. Then we went one step further and did GSEA, gene set enrichment analysis. And this is a kind of analysis

16:07

which is done by domain experts to find out whether a particular set of genes is associated with some kind of disease or not. So here we use this

16:24

analysis to find out whether our set of genes is significantly associated with respiratory viral infection or not. And here you can see the p-value showing that yes, these top 10 genes are strongly associated with respiratory viral infection. So

16:43

in conclusion, we answered those two important questions, which address the issues of the early time stage. Question one was how accurately we can predict the state of infection at an early stage by analyzing gene expression data. And yes, we can do that with 82% accuracy nearest to 48 hours and with 85% accuracy nearest to onset time. And the second question,

17:06

which also addresses the early stage, is: what are the genes that contribute most to discriminating infected samples from non-infected samples? We identified the top 10 important genes, and I believe that these top genes can help in

17:23

early treatment and drug discovery at an early stage. Thank you very much.