Rare Disease Diagnosis with a Knowledge Graph Trained with Synthetic Data

(T) Yesterday I attended a workshop at Stanford Medicine given by Emily Alsentzer, a researcher at Brigham and Women’s Hospital and Harvard Medical School. Ms. Alsentzer presented SHEPHERD, a few-shot predictor for hard-to-diagnose diseases.

The following is a summary of my key takeaways from this outstanding work, drawing on the content of the papers about SHEPHERD.

Rare disease diagnosis: Between 300 and 400 million people in the world (one out of twenty) suffer from one of the 7,000 unique rare diseases. It can take four to five years to diagnose a rare disease, and many patients with these diseases remain undiagnosed. Eighty percent of these diseases have a genetic origin.

To overcome the challenges of diagnosis, SHEPHERD introduces a deep learning model that learns from limited data sets and extrapolates to never-before-seen genetic conditions.

SHEPHERD modeling approach: SHEPHERD frames diagnosis as a self-supervised subgraph prediction task. A graph neural network represents the patient’s phenotypic and genetic data in relation to a knowledge graph of known phenotype, gene, and disease associations (100,272 nodes of 7 types and 1,678,274 edges of 15 types).
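
To make the graph framing more concrete, here is a minimal Python sketch of a typed knowledge graph and of a patient represented as a subgraph of it. The node names, edge types, and the `patient_subgraph` helper are hypothetical toy values chosen for illustration; they do not come from SHEPHERD's actual graph or code.

```python
from collections import defaultdict

# Toy typed knowledge graph. All identifiers below are made up for illustration.
NODES = {
    "phen:seizures": "phenotype",
    "phen:hypotonia": "phenotype",
    "phen:short_stature": "phenotype",
    "gene:GENE_A": "gene",
    "gene:GENE_B": "gene",
    "dis:DISEASE_X": "disease",
}

# Edges as (source, edge_type, target) triples.
EDGES = [
    ("phen:seizures", "phenotype-gene", "gene:GENE_A"),
    ("phen:hypotonia", "phenotype-gene", "gene:GENE_A"),
    ("phen:short_stature", "phenotype-gene", "gene:GENE_B"),
    ("gene:GENE_A", "gene-disease", "dis:DISEASE_X"),
    ("phen:seizures", "phenotype-disease", "dis:DISEASE_X"),
]


def build_adjacency(edges):
    """Build an undirected adjacency map that keeps the edge type."""
    adj = defaultdict(set)
    for src, etype, dst in edges:
        adj[src].add((etype, dst))
        adj[dst].add((etype, src))
    return adj


def patient_subgraph(adj, phenotypes, hops=1):
    """Return all nodes within `hops` edges of the patient's phenotypes."""
    seen = set(phenotypes)
    frontier = set(phenotypes)
    for _ in range(hops):
        next_frontier = set()
        for node in frontier:
            for _etype, neighbor in adj[node]:
                if neighbor not in seen:
                    next_frontier.add(neighbor)
        seen |= next_frontier
        frontier = next_frontier
    return seen


ADJ = build_adjacency(EDGES)
# A hypothetical patient presenting with two phenotypes.
patient_nodes = patient_subgraph(ADJ, ["phen:seizures", "phen:hypotonia"], hops=1)
```

In this toy example the patient's one-hop neighborhood contains GENE_A and DISEASE_X but not GENE_B, which is the kind of local structure a graph neural network could learn from.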

SHEPHERD training data: SHEPHERD is trained on 42,680 simulated patients spanning 2,134 rare diseases.
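
The counts above work out to 20 simulated patients per disease (42,680 / 2,134). As a rough illustration of what simulating a patient can mean, the sketch below keeps a random subset of a disease's known phenotypes and adds unrelated "noise" phenotypes. The function name, the `drop_prob` and `n_noise` parameters, and the toy disease are assumptions made for illustration, not the authors' simulation pipeline.

```python
import random


def simulate_patient(disease_phenotypes, all_phenotypes,
                     drop_prob=0.3, n_noise=1, rng=None):
    """Sample one toy patient: drop some true phenotypes, add unrelated ones."""
    if rng is None:
        rng = random.Random()
    # Keep each true phenotype with probability 1 - drop_prob.
    kept = [p for p in disease_phenotypes if rng.random() >= drop_prob]
    if not kept:
        # Guarantee at least one true phenotype so the patient is not pure noise.
        kept = [rng.choice(sorted(disease_phenotypes))]
    # Add phenotypes that are not associated with the disease.
    unrelated = sorted(set(all_phenotypes) - set(disease_phenotypes))
    noise = rng.sample(unrelated, min(n_noise, len(unrelated)))
    return sorted(set(kept) | set(noise))


# Hypothetical phenotype vocabulary and disease definition.
ALL_PHENOTYPES = ["seizures", "hypotonia", "short_stature", "hearing_loss", "ataxia"]
DISEASE_X = ["seizures", "hypotonia", "ataxia"]

rng = random.Random(0)  # fixed seed so the toy cohort is reproducible
cohort = [simulate_patient(DISEASE_X, ALL_PHENOTYPES, rng=rng) for _ in range(20)]

# Average cohort size per disease implied by the figures in the text.
patients_per_disease = 42_680 // 2_134
```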

SHEPHERD model predictions: SHEPHERD takes as input the patient’s set of phenotypes and a list of candidate genes, patients, or diseases, and predicts a multi-faceted rare disease diagnosis that includes:

  • Discovery of the causal genes
  • Identification of “patients-like-me” with the same causal gene or disease
  • Interpretable characterizations of novel diseases
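
One common way to produce outputs like these is to score each candidate by its similarity to a patient embedding. The sketch below ranks candidate genes and retrieves similar patients using cosine similarity over hand-written toy vectors. The embeddings, the names, and the choice of cosine similarity are assumptions for illustration; SHEPHERD's learned embeddings and scoring may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_candidates(query, candidates):
    """Return candidate names sorted from most to least similar to `query`."""
    return sorted(candidates,
                  key=lambda name: cosine(query, candidates[name]),
                  reverse=True)


def rank_of(name, ranking):
    """1-based position of `name` in a ranking."""
    return ranking.index(name) + 1


# Toy embeddings, invented for this example.
patient = [0.9, 0.1, 0.3]
genes = {
    "GENE_A": [0.8, 0.2, 0.4],
    "GENE_B": [0.1, 0.9, 0.0],
    "GENE_C": [0.5, 0.5, 0.5],
}
other_patients = {
    "patient_1": [0.85, 0.15, 0.25],
    "patient_2": [0.0, 1.0, 0.1],
}

gene_ranking = rank_candidates(patient, genes)               # causal gene discovery
similar_patients = rank_candidates(patient, other_patients)  # "patients-like-me"
causal_gene_rank = rank_of("GENE_A", gene_ranking)           # assuming GENE_A is causal
```

Averaging a value like `causal_gene_rank` over a cohort gives an average-rank metric of the kind quoted in the first paper abstract under References.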

Rare disease diagnosis pipeline: Once a patient is accepted to the Undiagnosed Diseases Network (UDN), they receive a thorough clinical workup and genetic sequencing, and their case is analyzed in an iterative process to identify the candidate genes likely to explain the patient’s symptoms. SHEPHERD can be used throughout the pipeline to accelerate the diagnosis process: after the clinical workup to find similar patients, after the sequencing analysis to identify strong candidate genes, and after case review to further prioritize candidate genes, characterize the patient’s disease, and/or validate candidate genes by finding phenotype- and genotype-matched patients.

References:

Papers:

There are more than 7,000 rare diseases, some of which affect 3,500 or fewer patients in the US. Due to clinicians’ limited experience with such diseases and the considerable heterogeneity of their clinical presentations, many patients with rare genetic diseases remain undiagnosed. While artificial intelligence has demonstrated success in assisting diagnosis, its success is usually contingent on the availability of large labeled datasets. Here, we present SHEPHERD, a deep learning approach for multi-faceted rare disease diagnosis. SHEPHERD is guided by existing knowledge of diseases, phenotypes, and genes to learn novel connections between a patient’s clinico-genetic information and phenotype and gene relationships. We train SHEPHERD exclusively on simulated patients and evaluate on a cohort of 465 patients representing 299 diseases (79% of genes and 83% of diseases are represented in only a single patient) in the Undiagnosed Diseases Network. SHEPHERD excels at several diagnostic facets: performing causal gene discovery (causal genes are predicted at rank = 3.52 on average), retrieving “patients-like-me” with the same gene or disease, and providing interpretable characterizations of novel disease presentations. SHEPHERD demonstrates the potential of artificial intelligence to accelerate the diagnosis of rare disease patients and has implications for the use of deep learning on medical datasets with very few labels.

Rare Mendelian disorders pose a major diagnostic challenge and collectively affect 300-400 million patients worldwide. Many automated tools aim to uncover causal genes in patients with suspected genetic disorders, but evaluation of these tools is limited due to the lack of comprehensive benchmark datasets that include previously unpublished conditions. Here, we present a computational pipeline that simulates realistic clinical datasets to address this deficit. Our framework jointly simulates complex phenotypes and challenging candidate genes and produces patients with novel genetic conditions. We demonstrate the similarity of our simulated patients to real patients from the Undiagnosed Diseases Network and evaluate common gene prioritization methods on the simulated cohort. These prioritization methods recover known gene-disease associations but perform poorly on diagnosing patients with novel genetic disorders. Our publicly-available dataset and codebase can be utilized by medical genetics researchers to evaluate, compare, and improve tools that aid in the diagnostic process.

Web site: Deep Learning for Diagnosing Patients with Rare Genetic Diseases

Code: https://github.com/mims-harvard/SHEPHERD

Note 1: The expression “few-shot learning” is sometimes used in this research project to indicate that the learning algorithm learns from limited data sets, but it does not mean that the algorithm is based on meta-learning.

Note 2: The pictures in this blog post are from the papers.

Note 3: The picture above shows two kids, free from rare disease, living on a tropical island.

Copyright © 2005-2024 by Serge-Paul Carrasco. All rights reserved.
Contact Us: asvinsider at gmail dot com.