Development of an AI-assisted algorithm for the prediction of novel causal genes and variants for mendelian disorders from whole genome sequencing
Formal Metadata
Title |
| |
Title of Series | ||
Number of Parts | 34 | |
Author | ||
License | CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor. | |
Identifiers | 10.5446/50904 (DOI) | |
Publisher | ||
Release Date | ||
Language |
Content Metadata
Subject Area | ||
Genre | ||
Abstract |
| |
Keywords |
00:00
Lecture/Conference
Transcript: English(auto-generated)
00:16
I'm Shuji Kawaguchi of Kyoto University, not Sushi.
00:22
Today, I want to talk about a method for predicting causal genes for Mendelian disorders. For Mendelian disorders, identifying novel causal genes is very important for genetic diagnosis.
00:44
It works. However, the actual success rate of genetic diagnosis is still around 30%, even though, as you can see, sequencing technology has improved and become low-cost.
01:03
To address this problem, we developed a method to predict novel causal genes by using whole-genome sequencing and AI technology based on IBM Watson.
01:25
Then, we applied the developed method in a feasibility study of retinitis pigmentosa. Our final target is to improve the diagnosis rate to more than 70%, but our research is only halfway there.
01:54
This slide shows why detection of causal genes is difficult.
02:04
By searching the whole-genome sequence, we can detect variants specific to cases. So, we set this gene as a candidate and then do advanced analysis such as
02:27
trio data or family data, protein interaction, metabolic pathway or ontology, by researching such information. And then, we decide whether this candidate is causal or not.
02:53
But there are many variants when we analyze the whole-genome sequence.
03:03
So, we must do advanced analysis for all candidates. Then, we decide whether each candidate is causal or unrelated.
03:27
But unfortunately, most of the candidates are false positives by chance. So, analyzing all candidates is very time-consuming and expensive.
03:50
So, we wanted a breakthrough to identify the optimal candidates. We used IBM Watson to solve the problem.
04:07
There is Watson for Drug Discovery, WDD. WDD is one of the solutions of IBM Watson. WDD incorporates tens of millions of articles in MEDLINE and discovers relations between genes, diseases and drugs.
04:29
Predictive analysis is one of the functions of WDD. Predictive analysis, PA, needs two lists. One is a known gene list and the other is a candidate gene list.
04:44
Predictive analysis ranks the candidate gene list by the similarity of the candidate genes to the known gene list.
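[Editor's illustration] The talk does not describe how WDD computes similarity internally. The following is only a minimal sketch of the general idea of ranking candidates by similarity to a known list, assuming each gene is represented by a made-up numeric feature vector and using mean cosine similarity as a stand-in score. The gene names, vectors and the functions `cosine` and `rank_candidates` are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_candidates(known, candidates):
    """Rank candidate genes by mean similarity to the known genes.

    known, candidates: dict mapping gene name -> feature vector (hypothetical).
    Returns a list of (gene, score) pairs, highest score first.
    """
    scores = {}
    for gene, vec in candidates.items():
        sims = [cosine(vec, known_vec) for known_vec in known.values()]
        scores[gene] = sum(sims) / len(sims) if sims else 0.0
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy data, not real genes or real features.
known = {"KNOWN_A": [1.0, 0.0, 1.0], "KNOWN_B": [1.0, 0.2, 0.8]}
candidates = {"CAND_X": [0.9, 0.1, 0.9], "CAND_Y": [0.0, 1.0, 0.0]}
ranking = rank_candidates(known, candidates)
for gene, score in ranking:
    print(f"{gene}: {score:.3f}")
```

In this toy example CAND_X resembles both known genes and ranks first, which is the behaviour the two-list input is meant to produce.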
05:02
However, WDD does not work when all genes are input as the candidate gene list. So, appropriate selection of these lists is important for predictive analysis.
05:27
So, we created these two lists. First, the creation of the known gene list. Several years ago, our group searched for the causal variants of retinitis
05:44
pigmentosa by targeted exome analysis of 365 genes in 326 RP patients. We detected causal variants in 30 genes, and 122 cases, corresponding to 37.4%, were genetically diagnosed.
06:17
There is another information source called the RetNet database.
06:24
The RetNet database provides information on genes and genetic loci causing inherited retinal diseases. In the database, 90 causal genes of RP were registered as of September of last year.
06:41
This past study and the RetNet database are suitable for the known gene list and were also used to create the candidate list, as I will explain on the next slide. This slide shows the creation of the candidate gene list.
07:00
We used the whole-genome sequence data of 523 RP cases and 2,143 controls. Then we decided the criteria for variants so that they fit the known causal genes.
07:29
We decided the criteria for variants as follows. The first is that the variant is stop-gain, splicing or frameshift.
07:45
The second is for non-synonymous mutations: 4 or more of 9 protein function prediction software tools predict the mutation as damaging.
08:12
The third is that the minor allele frequency in RP cases is 1.5 times greater than in the controls.
08:24
Then, if at least one RP case, as a homozygote or compound heterozygote, satisfies conditions 1, 2 and 3 above,
08:41
we picked up this gene. 1,028 genes satisfied this condition 4, among which 34 genes were known causal genes.
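[Editor's illustration] The four conditions can be sketched as a filter. This is only a sketch under stated assumptions, not the speaker's pipeline: conditions 1 and 2 are read as alternatives (a variant is either loss-of-function or a non-synonymous change with at least 4 of 9 damaging predictions), both combined with condition 3; "1.5 times greater" is implemented as `>=`; and compound heterozygosity is approximated as two qualifying heterozygous variants in the same gene without checking phase. The `Variant` fields, function names and example values are hypothetical.

```python
from dataclasses import dataclass

LOSS_OF_FUNCTION = {"stop_gain", "splicing", "frameshift"}

@dataclass
class Variant:
    gene: str
    consequence: str      # e.g. "stop_gain", "nonsynonymous", "synonymous"
    damaging_votes: int   # how many of the 9 predictors call it damaging
    maf_cases: float      # minor allele frequency in RP cases
    maf_controls: float   # minor allele frequency in controls

def passes_variant_criteria(v: Variant) -> bool:
    """Conditions 1-3: (loss-of-function OR damaging non-synonymous) AND enriched in cases."""
    if v.consequence in LOSS_OF_FUNCTION:
        functional = True
    elif v.consequence == "nonsynonymous":
        functional = v.damaging_votes >= 4
    else:
        functional = False
    enriched = v.maf_cases >= 1.5 * v.maf_controls
    return functional and enriched

def case_selects_gene(case_genotypes) -> bool:
    """Condition 4 for one case and one gene.

    case_genotypes: list of (Variant, alt_allele_count) with count 1 or 2.
    True if the case is homozygous for a qualifying variant or carries
    two or more qualifying heterozygous variants (phase is not checked).
    """
    qualifying = [(v, n) for v, n in case_genotypes if passes_variant_criteria(v)]
    homozygous = any(n == 2 for _, n in qualifying)
    compound_het = sum(1 for _, n in qualifying if n == 1) >= 2
    return homozygous or compound_het

# Toy variants in a made-up gene.
v1 = Variant("GENE1", "stop_gain", 0, 0.004, 0.001)
v2 = Variant("GENE1", "nonsynonymous", 5, 0.003, 0.001)
v3 = Variant("GENE1", "nonsynonymous", 2, 0.003, 0.001)
print(case_selects_gene([(v1, 1), (v2, 1)]))  # two qualifying heterozygous variants
print(case_selects_gene([(v1, 1), (v3, 1)]))  # only one qualifying variant
```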
09:11
For these, 68 are RP cases and 1 is a control. These criteria fit the known genes very well.
09:24
However, for the other 994 genes, 435 are RP cases, but 423 controls also have variants,
09:45
as homozygotes or compound heterozygotes, in these genes. So these criteria also give many false positives. I think many true causal genes are included in these 994 genes.
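[Editor's illustration] The counts quoted in the talk can be turned into carrier fractions to see why the criteria are specific for the known genes but not for the remaining ones. This sketch assumes the quoted numbers are counts of individuals out of the 523 RP cases and 2,143 controls mentioned earlier; that reading is the editor's assumption.

```python
N_CASES = 523
N_CONTROLS = 2143

# (RP cases with qualifying genotypes, controls with qualifying genotypes)
groups = {
    "34 known causal genes": (68, 1),
    "994 other selected genes": (435, 423),
}

fractions = {}
for name, (cases, controls) in groups.items():
    fractions[name] = (cases / N_CASES, controls / N_CONTROLS)
    print(f"{name}: {cases / N_CASES:.1%} of cases, {controls / N_CONTROLS:.2%} of controls")

# The gene counts are consistent: 1,028 selected genes minus 34 known ones.
assert 1028 - 34 == 994
```

Under this reading, roughly 13% of cases but almost no controls qualify in the known genes, whereas about 83% of cases and about 20% of controls qualify somewhere in the other 994 genes, which is why a further ranking step is needed.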
10:15
We set these genes as the candidate list and ranked them by WDD.
10:25
This slide shows the workflow to identify causal genes. First, input known information and set it as the known gene list.
10:40
Then, by using the whole-genome sequence dataset, create criteria for variants and create the candidate gene list. Shuji, how do you show one is a causal gene? How do you prove it's a causal gene?
11:02
This is not decided by our RP and known genes.
11:22
They are recorded in databases. This database was created by other papers and research results. This is known already. If you find a new one, how do you prove it's a causal gene?
11:45
How do you demonstrate a gene is causal for RP? Can I answer for him? I think at the end, whatever he will have as the top candidate, he will take it for genetic diagnosis.
12:04
That doesn't prove it's a causal gene. So, by checking the trio data or family data... If you see it segregating, or something like that.
12:23
Then, input the two lists to Watson for Drug Discovery, and WDD sorts the candidates and calculates a score for each.
12:40
Then, we used two strategies for WDD. I will explain them later. Then, we tested the developed method by using the RP cases.
13:03
Some of them are already diagnosed by known genes. There are 326 patients and 37.4% of cases are already diagnosed.
13:26
Then, we performed whole-genome sequencing on 135 of the remaining cases. The rest is now ongoing.
13:40
We estimated the rate of diagnosis as follows. Here is the calculated diagnosis rate. The blue solid line is the ordinary use of WDD.
14:14
This data point shows the diagnosis rate by only using known genes.
14:21
The diagnosis rate increased monotonically. With ordinary use of WDD, after the top 50 ranked candidates, the false positive rate also slightly increased.
14:41
I think this is because the lower-ranked genes are affected by the top-ranked genes. So, we used another strategy, called the recursive method.
15:08
At first, WDD ranked the candidates, and we picked up only the top 20 genes and removed these 20 genes from the candidate list.
15:26
Then, WDD ranked the remaining candidates again, we picked up and removed the next top 20 genes, and continued until the false positive rate reached some threshold.
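[Editor's illustration] The recursive method described above can be sketched as a loop: rank, keep the top 20, remove them from the candidate list, and rank the remainder again. The ranking call and the stopping test are passed in as functions because the talk does not specify either; `rank_fn`, `stop_fn`, `recursive_rank` and the toy data are hypothetical names, and this is a sketch rather than the speaker's implementation.

```python
def recursive_rank(known, candidates, rank_fn, stop_fn, batch_size=20):
    """Select candidates in batches, re-ranking the remainder after each batch.

    rank_fn(known, remaining) -> list of genes, best first (stand-in for WDD).
    stop_fn(selected) -> True when the selection would exceed the allowed
    false positive rate (stand-in for the threshold test).
    """
    remaining = list(candidates)
    selected = []
    while remaining:
        batch = rank_fn(known, remaining)[:batch_size]
        if not batch or stop_fn(selected + batch):
            break
        selected.extend(batch)
        chosen = set(batch)
        remaining = [gene for gene in remaining if gene not in chosen]
    return selected

# Toy stand-ins: a fixed score per gene and a size-based stopping rule.
toy_score = {f"G{i:03d}": 100 - i for i in range(100)}

def toy_rank(known, remaining):
    return sorted(remaining, key=lambda gene: toy_score[gene], reverse=True)

def toy_stop(selected):
    return len(selected) > 60

picked = recursive_rank(known=[], candidates=list(toy_score),
                        rank_fn=toy_rank, stop_fn=toy_stop)
print(len(picked), picked[:3])
```

With this toy ranking the scores do not depend on the other candidates, so the recursive result equals a single pass; the method only makes a difference when, as the speaker suggests for WDD, a gene's rank is influenced by which other genes are in the candidate list.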
15:57
By using the recursive method, the false positive rate stays low down to much lower-ranked candidates.
16:11
By using the top 80 ranked candidates, the diagnosis rate improved to 52%.
16:25
Then, I rechecked the top 50 ranked genes. Among the top 50 genes, 17 genes are causal genes of other retinal diseases.
16:41
Indeed, 90 of 37 RP causal genes are also causal for other retinal diseases in the RetNet database. Four of the 50 genes are in the same gene families to which known RP causal genes belong.
17:10
We used the known causal gene list as of September of last year. After that, two genes were very recently added to RetNet.
17:28
So, I don't think we can say all of the top-ranked genes are truly causal genes, but WDD seems to rank these genes correctly.
17:48
So, we developed a prediction method of causal genes for Mendelian disorders by using WGS and Watson for Drug Discovery.
18:04
Then, we did a feasibility study on RP. We found that many top-ranked candidates shared structural or functional features with known RP genes.
18:22
Two top-ranked genes were registered very recently in RetNet, which suggests that our AI-assisted approach is useful. Then, by including the top-ranked genes, the diagnosis rate was increased to about 50% without increasing the false positive rate.
18:48
So, finally, we want to introduce the vision for the developed method. Our group is going to construct an integrated platform and registry for rare diseases, called RADDAR-J.
19:09
The developed method will also be integrated in this system. So, I want to say many thanks to my collaborators, and thank you for your attention.
19:28
Any questions? So, is it possible that your false positives are coming because of penetrance issues?
19:42
So, it is possible that the mutation exists, but doesn't express itself into disease. This is called penetrance in the United States. The penetrance of a disease. Have you heard of that concept? Sometimes a mutation exists, but it does not express itself into disease.
20:06
He is giving you more credit, the program more credit than we see. He is optimistic. Can I just say something about all these short talks? I asked the speakers of the short talks and the workshops to also have their abstracts.
20:21
So, the idea is to draw you also to their, sorry, their posters. So, if you are interested in more discussion, they should also stay there, stay near their posters. Do you focus on exons rather than intragenic variations?
20:43
Yes, I will now only focus on exons. Just another level of debate. Yes. I mean, the other thing that's weird is in yeast, a lot of people are finding that synonymous mutations actually lead to a phenotype.
21:09
So, maybe 25%, yeah, that's what you're saying. So, it's always intriguing to think of all the variation that we're ignoring when we do these studies
21:20
because it's so difficult to capture it all. Yeah, you've got a very good point. I'm going to correlate out here. For example, if you have a mutation in a very important part of the regulatory element, and if the patient is heterozygous for that mutation and heterozygous for the protein-coding mutation, then only one allele can be transcribed in the yeast.
21:47
Right. Yeah, behaves like a homozygote. Yeah, it's very important for me. Just a quick one. Mr. Shifu said it. How many cases of retinitis do you think are considered to be of a genetic origin now?
22:10
Or how many? Or fraction, yeah. Can I ask a question for him? Sure, sure. I'm sorry. It is considered as at least 50% are sort of family cluster or hereditary.
22:26
And the other 50%, we don't know much about that. It can be de novo mutation. It can be because of the lack of the information of family. So we don't have very clear score or number how much fraction of the disease is hereditary.
22:48
But it is said to be more than 50%, which is generally accepted. But the variants are not yet checked properly. It is a test simulation result, and we assumed that the top-ranked genes are causal genes.
23:15
And we considered only a recessive model; a dominant model is not included.
23:33
Because with the dominant model, it's more difficult to predict.
23:40
The false positive rate increases greatly because there are many variants if we assume a dominant model. But it is future work.
24:03
Okay. Well, thank you very much, Shuji, for a great talk. Thank you very much.