
Development of an AI-assisted algorithm for the prediction of novel causal genes and variants for mendelian disorders from whole genome sequencing


Formal Metadata

Title
Development of an AI-assisted algorithm for the prediction of novel causal genes and variants for mendelian disorders from whole genome sequencing
Title of Series
Number of Parts
34
Author
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Advances in DNA sequencing technologies have now enabled the rapid and cost-efficient identification of causal genes and variants for a number of diseases. This is especially true for Mendelian disorders, where patients who carry a causative variant in their genome can finally obtain a definitive diagnosis of their disease. However, even with this revolutionary technology, the actual success rate of genetic diagnosis via next-generation sequencing is currently only around 30% for undiagnosed Mendelian disease cases. This is in part due to the limitations of the analytical methods that are available to identify and prioritize causal variants from the vast amounts of sequencing data generated. Currently, the genetic diagnosis of Mendelian disorders is performed by comparing the genome of a patient to those of a large number of controls. Such comparisons generally produce a large list of genetic variants that are unique to the patient. Many of these are probably benign, and identifying the causal gene and variant can be a real challenge. To address this problem, we have developed a novel method that ranks candidate genes and variants using an AI-assisted algorithm that relies on IBM Watson's text mining approach. As a proof of concept, we used a large whole-genome sequencing (WGS) dataset on retinitis pigmentosa (RP) with 523 cases and 2,143 controls. Our method consists of the following steps: 1) Select the inclusion criteria for variants that maximize the difference between the true positive rate for patients and the false positive rate for controls, based on previously known causal genes from a public database. 2) Using these inclusion criteria, create a list of candidate genes and variants. 3) Use IBM Watson to sort and prioritize this list of genes.
Using this strategy on the RP WGS dataset, we were able to identify and prioritize 994 candidate genes. Notably, many of our top-ranked genes shared structural and functional features with previously known RP genes. We also succeeded in increasing the diagnosis rate of RP from 37% to 52% by incorporating these top-ranked candidates without increasing the rate of false positives in controls. Going forward, we plan to further improve the approach by integrating other AI technologies that rely on omics or image analysis data. We also plan to develop a gene and variant registry with the aim of constructing a comprehensive infrastructure in Japan for studying the genetics of intractable diseases. In this registry, various AI technologies will be implemented to perform integrative analyses across various diseases.
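Step 1 of the abstract, choosing the inclusion criteria that maximize the gap between the true positive rate in patients and the false positive rate in controls, can be illustrated with a minimal Python sketch. The `Criterion` fields, the `select_criterion` helper and all counts except the cohort sizes (523 cases, 2,143 controls) are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """A hypothetical inclusion criterion for variants (illustrative only)."""
    min_damaging_tools: int  # prediction tools that must call the variant damaging
    min_maf_ratio: float     # required case/control minor allele frequency ratio


def rate(hits: int, total: int) -> float:
    """Fraction of individuals carrying a qualifying genotype."""
    return hits / total if total else 0.0


def select_criterion(counts, n_cases, n_controls):
    """Return the criterion with the largest TPR - FPR.

    counts maps each Criterion to (cases_hit, controls_hit), i.e. how many
    patients and controls carry a qualifying genotype in a known causal gene.
    """
    def score(criterion):
        cases_hit, controls_hit = counts[criterion]
        return rate(cases_hit, n_cases) - rate(controls_hit, n_controls)

    return max(counts, key=score)


# Made-up counts for three candidate criteria.
counts = {
    Criterion(2, 1.0): (120, 600),  # loose: many cases, but many controls too
    Criterion(4, 1.5): (70, 2),     # intermediate
    Criterion(6, 2.0): (30, 0),     # strict: few controls, but few cases
}
best = select_criterion(counts, n_cases=523, n_controls=2143)
print(best)
```

With these invented counts, the intermediate criterion is selected because the loose one admits too many controls and the strict one loses too many patients.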
Keywords
Lecture/Conference
Transcript: English(auto-generated)
I'm Shuji Kawaguchi of Kyoto University, not Sushi.
Today, I want to talk about methods for the prediction of causal genes for Mendelian disorders. For Mendelian disorders, identification of novel causal genes is very important for genetic diagnosis.
It works. However, the actual success rate of genetic diagnosis is still only around 30%, even though, as you can see, the technology has improved and become low-cost.
To address this problem, we developed a method to predict novel causal genes by using whole-genome sequencing and AI technology based on IBM Watson.
Then, we applied the developed method in a feasibility study of retinitis pigmentosa. Our final target is to improve the diagnosis rate to more than 70%, but our research is only halfway there.
This slide shows why detection of causal genes is difficult.
By searching the whole-genome sequence, we can detect variants specific to cases. So, we set this gene as a candidate and then do advanced analysis, such as the
real data or family data, protein interaction, metabolic pathway or ontology, by researching such information. And then, we decide whether this candidate is causal or not.
But there are many variants when we analyze whole-genome sequences.
So, we must do advanced analysis for all candidates. Then, we decide whether each candidate is causal or unrelated.
But unfortunately, most of the candidates are false positives by chance. So, analyzing all candidates is very time-consuming and expensive.
So, we wanted a breakthrough to identify optimal candidates. So, we used IBM Watson to solve the problem.
There is Watson for Drug Discovery, WDD. WDD is one of the solutions of IBM Watson. WDD incorporates tens of millions of articles in MEDLINE and discovers relations between genes, diseases and drugs.
Predictive analysis is one of the functions of WDD. Predictive analysis, PA, needs two lists. One is a known gene list and the other is a candidate gene list.
Predictive analysis ranks the candidate gene list by using the similarity of the known genes to the candidate genes.
However, WDD does not work well when all genes are input as the candidate gene list. So, appropriate selection of these lists is important for predictive analysis.
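The idea of ranking a candidate list by its similarity to a known list can be illustrated with a toy Python sketch. This is not WDD's algorithm, which the talk does not describe in detail: the feature vectors, gene names and the use of mean cosine similarity are all assumptions made purely for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(known, candidates):
    """Rank candidate genes by mean cosine similarity to the known genes.

    known and candidates map gene name -> feature vector, for example term
    weights from the literature (a hypothetical representation).
    """
    def score(vec):
        return sum(cosine(vec, k) for k in known.values()) / len(known)

    scores = {gene: score(vec) for gene, vec in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical genes and features, not real data.
known = {"KNOWN_A": [1.0, 0.0, 1.0], "KNOWN_B": [1.0, 0.2, 0.8]}
candidates = {
    "CAND_1": [0.9, 0.1, 0.9],
    "CAND_2": [0.0, 1.0, 0.0],
    "CAND_3": [0.5, 0.5, 0.5],
}
ranking = rank_candidates(known, candidates)
print(ranking)
```

In this toy setting, a candidate whose features resemble the known genes rises to the top, which is the behavior the speaker relies on when passing the two lists to predictive analysis.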
So, we created these two lists. First, the creation of the known gene list. Several years ago, our group searched for the causal variants of retinitis
pigmentosa by targeted exome analysis of 365 genes in 326 RP patients. We detected causal variants in 30 genes, and 122 cases, corresponding to 37.4%, were genetically diagnosed.
There is another source of information, called RetNet.
RetNet provides information on genes and genetic loci causing inherited retinal diseases. In the database, 90 causal genes of RP were registered as of September of last year.
This past study and RetNet are suitable for the known gene list and were used to create the candidate list, as I will explain on the next slide. This slide shows the creation of the candidate gene list.
We used the whole-genome sequence data of 523 RP cases and 2,143 controls. Then we decided the criteria for variants so that they fit the known causal genes.
The criteria are as follows. The first is that the variant is stop-gain, splicing or frameshift.
The second is, for nonsynonymous mutations, that 4 or more of 9 protein function prediction software tools predict the mutation as damaging.
The third is that the minor allele frequency in RP cases is 1.5 times greater than in the controls.
Then, if at least one RP case is a homozygote or compound heterozygote satisfying conditions 1, 2 and 3 above,
we picked up this gene. 1,028 genes satisfied this condition 4, among which 34 genes were known causal genes.
For these, 68 are RP cases and 1 is a control. So these criteria fit the known genes very well.
However, for the other 994 genes, 435 are RP cases, but 423 controls also have variants,
as homozygotes or compound heterozygotes, in these genes. So these criteria also give many false positives. I think many true causal genes are included in these 994 genes.
We set these genes as the candidate list and ranked them by WDD.
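The variant criteria described above can be sketched in Python as follows. This is only an illustration under stated assumptions: conditions 1 and 2 are read as alternatives (a variant is either loss-of-function or a damaging nonsynonymous change) that must be combined with condition 3, compound heterozygosity is approximated as two qualifying heterozygous variants in the same gene without phasing, and the `Variant` fields, effect labels and example data are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

LOF_EFFECTS = {"stop_gain", "splicing", "frameshift"}


@dataclass(frozen=True)
class Variant:
    gene: str
    effect: str           # hypothetical labels, e.g. "stop_gain", "nonsynonymous"
    damaging_calls: int   # how many of the 9 prediction tools say "damaging"
    maf_cases: float      # minor allele frequency in RP cases
    maf_controls: float   # minor allele frequency in controls


def passes_variant_filter(v: Variant) -> bool:
    severe = v.effect in LOF_EFFECTS                                         # condition 1
    damaging = v.effect == "nonsynonymous" and v.damaging_calls >= 4          # condition 2
    enriched = v.maf_cases >= 1.5 * v.maf_controls                            # condition 3
    return (severe or damaging) and enriched


def candidate_genes(case_genotypes):
    """Condition 4: genes with at least one homozygous or compound heterozygous case.

    case_genotypes is an iterable of (sample_id, Variant, alt_allele_count).
    """
    per_sample_gene = defaultdict(list)
    for sample, variant, alt_count in case_genotypes:
        if passes_variant_filter(variant):
            per_sample_gene[(sample, variant.gene)].append(alt_count)

    genes = set()
    for (_sample, gene), alt_counts in per_sample_gene.items():
        homozygous = any(c == 2 for c in alt_counts)
        compound_het = sum(1 for c in alt_counts if c == 1) >= 2  # unphased approximation
        if homozygous or compound_het:
            genes.add(gene)
    return genes


# Hypothetical genotypes for four RP cases.
genotypes = [
    ("S1", Variant("GENE_A", "stop_gain", 0, 0.01, 0.001), 2),     # homozygous, passes
    ("S2", Variant("GENE_B", "nonsynonymous", 5, 0.02, 0.01), 1),  # two hets in GENE_B
    ("S2", Variant("GENE_B", "frameshift", 0, 0.005, 0.0), 1),
    ("S3", Variant("GENE_C", "nonsynonymous", 2, 0.02, 0.001), 2), # only 2 of 9 tools
    ("S4", Variant("GENE_D", "stop_gain", 0, 0.01, 0.01), 2),      # not enriched in cases
]
selected = candidate_genes(genotypes)
print(sorted(selected))
```

Applying the same function to control genotypes would give the per-gene control counts that the speaker uses to judge false positives.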
This slide shows the workflow to identify causal genes. First, input the known information and set it as the known gene list.
Then, by using the whole-genome sequence dataset, create the criteria for variants and create the candidate gene list. Shuji, how do you show one is a causal gene? How do you prove it's a causal gene?
That is not decided by us for the known RP genes.
They are recorded in databases. These databases were created from other papers and research results. This is known already. If you find a new one, how do you prove it's a causal gene?
How do you demonstrate a gene is causal for RP? Can I answer for him? I think at the end, whatever he will have as the top candidate, he will take it for genetic diagnosis.
That doesn't prove it's a causal gene. So, by checking the territorial data or family data... If you see it segregated, it's something like that.
Then, we set the two lists in Watson for Drug Discovery, and WDD sorts the candidates and calculates a score for each.
Then, we used two strategies for WDD. I will explain them later. Then, we tested the developed method by using the RP cases.
Some of them are already diagnosed by known genes. There are 326 patients, and 37.4% of the cases are already diagnosed.
Then, we performed whole-genome sequencing on 135 of the remaining cases. The rest is now ongoing.
We estimated the rate of diagnosis as follows. Here is the calculated diagnosis rate. The blue solid line is the ordinary use of WDD.
This data point shows the diagnosis rate by only using known genes.
The diagnosis rate increased monotonically. With the ordinary use of WDD, after the top 50 ranked candidates, the false positive rate also increased slightly.
I think this is because the lower-ranked genes are affected by the top-ranked genes. So, we used another strategy, called the recursive method.
At first, WDD ranked the candidates, and we picked up only the top 20 genes and removed these 20 genes from the candidate list.
Then, WDD ranked the remaining candidates again, we picked up and removed the next top 20 genes, and we continued until the false positive rate reached some threshold.
By using the recursive method, the false positive rate stays low down to much lower-ranked candidates.
By using the top 80 ranked candidates, the diagnosis rate improved to 52%.
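The recursive method described above can be sketched in Python. This is a minimal illustration, not the speaker's implementation: `rank_fn` stands in for one ranking call (WDD in the talk), `fpr_fn` stands in for however the false positive rate in controls is measured, and the stopping threshold and toy data are assumptions.

```python
from typing import Callable, List, Sequence


def recursive_rank(
    candidates: Sequence[str],
    rank_fn: Callable[[List[str]], List[str]],
    fpr_fn: Callable[[List[str]], float],
    batch_size: int = 20,
    max_fpr: float = 0.05,
) -> List[str]:
    """Repeatedly rank the remaining candidates and keep only the top batch.

    Stops when adding the next batch would push the false positive rate
    above max_fpr, or when no candidates remain.
    """
    remaining = list(candidates)
    selected: List[str] = []
    while remaining:
        batch = rank_fn(remaining)[:batch_size]   # re-rank without earlier picks
        if not batch or fpr_fn(selected + batch) > max_fpr:
            break
        selected.extend(batch)
        picked = set(batch)
        remaining = [gene for gene in remaining if gene not in picked]
    return selected


# Toy stand-ins: alphabetical "ranking" and a false positive rate that grows
# by 1% per selected gene. Both are hypothetical.
toy_candidates = [f"G{i}" for i in range(10)]
result = recursive_rank(
    toy_candidates,
    rank_fn=lambda genes: sorted(genes),
    fpr_fn=lambda chosen: len(chosen) / 100,
    batch_size=3,
    max_fpr=0.07,
)
print(result)
```

The point of re-ranking after each removal is that the lower-ranked genes are scored without the influence of the genes already picked, which is the effect the speaker attributes to the ordinary single-pass ranking.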
Then, I rechecked the top 50 ranked genes. Among the top 50 genes, 17 genes are causal genes of other retinal diseases.
Indeed, 37 of the 90 RP causal genes are also causal for other retinal diseases in the RetNet database. Four of the 50 genes are in the same gene families to which known RP causal genes belong.
We used the known causal gene list as of September of last year. After that, two genes were very recently added to RetNet.
So, I don't think we can say that all of the top-ranked genes are truly causal genes, but WDD seems to rank these genes correctly.
So, we developed a prediction method of causal genes for Mendelian disorders by using WGS and Watson for Drug Discovery.
Then, we did a feasibility study on RP. We found that many top-ranked candidates shared structural and functional features with known RP genes.
Two top-ranked genes were registered very recently in RetNet, which suggests that our AI-assisted approach is useful. Then, by including the top-ranked genes, the diagnosis rate was increased to 50% without increasing the false positive rate.
So, finally, we want to introduce the vision for the developed method. Our group is working to construct an integrated platform, a registry for rare diseases, called RADDAR-J.
The developed method will also be integrated into this system. So, I want to say many thanks to my collaborators, and thank you for your attention.
Any radar questions? So, is it possible that your false positives are coming because of penetrance issues?
So, it is possible that the mutation exists, but doesn't express itself into disease. This is called penetrance in the United States. The penetrance of a disease. Have you heard of that concept? Sometimes a mutation exists, but it does not express itself into disease.
He is giving you more credit, the program more credit than we see. He is optimistic. Can I just say something about all these short talks? I asked the speakers of the short talks and the workshops to also have their abstracts.
So, the idea is to draw you also to their, sorry, their posters. So, if you are interested in more discussion, they should also stay there, stay near their posters. Do you focus on exons rather than intergenic variations?
Yes, for now I only focus on exons. Just another level of debate. Yes. I mean, the other thing that's weird is in yeast, a lot of people are finding that synonymous mutations actually lead to a phenotype.
So, maybe 25%, yeah, that's what you're saying. So, it's always intriguing to think of all the variation that we're ignoring when we do these studies
because it's so difficult to capture it all. Yeah, you've got a very good point. I'm going to correlate out here. For example, if you have mutations in a very important part of the regulatory element, and if the patient is heterozygous for that mutation, and heterozygous for the protein-coding mutation, then only one allele can be transcribed in the yeast.
Right. Yeah, behaves like a homozygote. Yeah, it's very important for me. Just a quick one. Mr. Shifu said it. How many cases of retinitis do you think are considered to be of a genetic origin now?
Or how many? Or fraction, yeah. Can I ask a question for him? Sure, sure. I'm sorry. It is considered as at least 50% are sort of family cluster or hereditary.
And the other 50%, we don't know much about that. It can be de novo mutation. It can be because of the lack of the information of family. So we don't have very clear score or number how much fraction of the disease is hereditary.
But it is said to be more than 50%, which is generally accepted. But the variants are not yet properly validated. This is a simulation result, and it assumes that the top-ranked genes are causal genes.
And we considered only a recessive model, not including a dominant model.
That is because, with a dominant model, it is more difficult to predict.
The false positive rate increases greatly, because there are many variants if we assume a dominant model. But that is future work.
Okay. Well, thank you very much, Shuji, for a great talk. Thank you very much.