Development of an AI-assisted algorithm for the prediction of novel causal genes and variants for mendelian disorders from whole genome sequencing

Institut des Hautes Études Scientifiques (IHÉS)

Formal Metadata

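Title

Development of an AI-assisted algorithm for the prediction of novel causal genes and variants for mendelian disorders from whole genome sequencing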

Title of Series

From Molecules and Cells to Human Health

Number of Parts

Author

Kawaguchi, Shuji

License

CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Identifiers

10.5446/50904 (DOI)

Publisher

Institut des Hautes Études Scientifiques (IHÉS)

Release Date

2018

Language

English

Content Metadata

Subject Area

Life Sciences, Mathematics

Genre

Conference/Talk

Abstract

Advances in DNA sequencing technologies have now enabled the rapid and cost-efficient identification of causal genes and variants for a number of diseases. This is especially true for Mendelian disorders, where patients who carry a causative variant in their genome can finally obtain a definitive diagnosis of their disease. However, even with this revolutionary technology, the actual success rate of genetic diagnosis via next-generation sequencing is currently only around 30% for undiagnosed Mendelian disease cases. This is in part due to the limitations of the analytical methods that are available to identify and prioritize causal variants from the vast amounts of sequencing data generated. Currently, the genetic diagnosis of Mendelian disorders is performed by comparing the genome of a patient to those of a large number of controls. Such comparisons generally produce a large list of genetic variants that are unique to the patient. Many of these are probably benign, and identifying the causal gene and variant can be a real challenge. To address this problem, we have developed a novel method that ranks candidate genes and variants using an AI-assisted algorithm that relies on IBM Watson's text mining approach. As a proof of concept, we used a large whole-genome sequencing (WGS) dataset on retinitis pigmentosa (RP) with 523 cases and 2,143 controls. Our method consists of the following steps: 1) Select inclusion criteria for variants that maximize the difference between the true positive rate in patients and the false positive rate in controls, based on previously known causal genes from a public database. 2) Using these inclusion criteria, create a list of candidate genes and variants. 3) Use IBM Watson to sort and prioritize this list of genes.
Using this strategy on the RP WGS dataset, we were able to identify and prioritize 994 candidate genes. Notably, many of our top-ranked genes shared structural and functional features with previously known RP genes. We also succeeded in increasing the diagnosis rate of RP from 37% to 52% by incorporating these top-ranked candidates without increasing the rate of false positives in controls. Going forward, we plan to further improve the approach by integrating other AI technologies that rely on omics or image analysis data. We also plan to develop a gene and variant registry with the aim of constructing a comprehensive infrastructure in Japan for studying the genetics of intractable diseases. In this registry, various AI technologies will be implemented to perform integrative analyses across various diseases.
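
Step 1 of the method, choosing inclusion criteria that maximize the gap between the true positive rate in patients and the false positive rate in controls, could be sketched roughly as follows. The abstract does not say which criteria were tuned, so the two thresholds used here (a maximum allele frequency and a minimum deleteriousness score), the function names, and the toy data are all hypothetical assumptions for illustration and are not the authors' implementation.

```python
from itertools import product

# Each individual is a list of (allele_frequency, score) pairs for variants
# found in previously known causal genes. Both fields are assumed for this
# sketch; the abstract does not specify the actual variant annotations.

def carries_qualifying_variant(variants, max_af, min_score):
    """True if at least one variant passes both inclusion thresholds."""
    return any(af <= max_af and score >= min_score for af, score in variants)

def carrier_rate(individuals, max_af, min_score):
    """Fraction of individuals carrying at least one qualifying variant."""
    if not individuals:
        return 0.0
    hits = sum(carries_qualifying_variant(v, max_af, min_score) for v in individuals)
    return hits / len(individuals)

def select_criteria(cases, controls, af_grid, score_grid):
    """Grid search for the thresholds that maximize TPR (cases) minus FPR (controls)."""
    best = None
    for max_af, min_score in product(af_grid, score_grid):
        tpr = carrier_rate(cases, max_af, min_score)
        fpr = carrier_rate(controls, max_af, min_score)
        diff = tpr - fpr
        # Strict comparison keeps the first combination on ties.
        if best is None or diff > best[0]:
            best = (diff, max_af, min_score, tpr, fpr)
    return best

# Toy data, invented purely to exercise the functions.
cases = [[(0.0001, 0.9)], [(0.002, 0.8)], [(0.0005, 0.4)], []]
controls = [[(0.02, 0.9)], [(0.002, 0.3)], [], [(0.0001, 0.2)]]

best = select_criteria(cases, controls,
                       af_grid=[0.001, 0.01, 0.05],
                       score_grid=[0.0, 0.5])
diff, max_af, min_score, tpr, fpr = best
print(f"max_af={max_af}, min_score={min_score}, TPR={tpr}, FPR={fpr}, diff={diff}")
```

The quantity being maximized, TPR minus FPR, is the same as Youden's J statistic. A real analysis would search a richer set of criteria on the full cohort rather than a small grid.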

Keywords

Molecular machines

cellular pathways and mechanism

intra- and extra-cellular coordination and communication

genomes and cell fate

disease, cancer and aging