kpgbrock/CS89_VariantEffectPrediction

CS89_VariantEffectPrediction

Final Project for CSCI-S89, Introduction to Deep Learning

Harvard Summer School 2022

Kelly Brock

Installation hint for cloning: you must have Git LFS (https://git-lfs.github.com/) installed before cloning, otherwise the initial ClinVar dataset will not download properly.

Important note: these toy models were built for training purposes only, and DEFINITELY should not be used for clinical application! Please refer to https://evemodel.org/ for an example of a more rigorous computational predictor.

Link to YouTube presentation: https://youtu.be/qcD5TAge05U

For a detailed set of installation instructions and details about implementation, please refer to the write-up document in the top level of this repository:

GeneticVariantEffectPrediction_KellyBrock_CS89_Writeup.pdf

Abstract: Genetic disorders can involve changes to the coding region of our DNA, the part of our genome that can be transcribed and translated to form proteins, which are the building blocks of cells. Missense mutations (where a change in one or more DNA nucleotides leads to a change in an amino acid of the corresponding protein sequence) are of special interest, because their effect on the encoded protein can be more ambiguous than that of other mutation types such as premature truncations. Although the scientific community has identified thousands of these single-amino-acid changes that are thought to be pathogenic (contributing to disease), genetics studies can be time-consuming and statistically underpowered, particularly in the case of rare variants. Most genetic variants found in humans remain of unknown significance. To fill this gap, a growing number of computational methods aim to predict whether a given missense variant is pathogenic or benign in the context of human disease. For this case study in biology, I built three separate neural net architectures that take as input progressively more information about the protein sequence and the missense mutation of interest, and output a classification for that missense mutant as either pathogenic or benign. A particular challenge was how to split the available data for training and testing, so I also experimented with different split methods. Across the three architectures, I obtained a validation accuracy of up to 80%. I then used these architectures to generate predictions for all possible missense mutations in the KCNQ1 gene, which encodes a potassium channel implicated in a congenital heart arrhythmia disorder called long QT syndrome.
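To make the final step of the abstract more concrete, here is a minimal sketch of how one might enumerate every single-amino-acid substitution for a protein sequence and turn each variant into a simple numeric input. It is not the code used in this project: the function names (`enumerate_missense`, `one_hot`, `encode_variant`), the one-hot encoding, and the four-residue toy sequence are all assumptions made for illustration, and the sketch assumes that "all possible missense mutations" means every substitution at every position rather than only those reachable by a single nucleotide change.

```python
# The 20 standard amino acids as one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def enumerate_missense(sequence):
    """Yield (wild_type, position, mutant) for every single-residue substitution.

    Positions are 1-based, following the usual protein-variant notation.
    """
    for index, wild_type in enumerate(sequence):
        for mutant in AMINO_ACIDS:
            if mutant != wild_type:
                yield wild_type, index + 1, mutant


def one_hot(residue):
    """Return a length-20 one-hot vector for a single amino acid."""
    vector = [0] * len(AMINO_ACIDS)
    vector[AMINO_ACIDS.index(residue)] = 1
    return vector


def encode_variant(wild_type, mutant):
    """Concatenate the one-hot vectors of the wild-type and mutant residues.

    Hypothetical minimal encoding; a real model would likely add more context.
    """
    return one_hot(wild_type) + one_hot(mutant)


# Hypothetical toy sequence for illustration only (not the KCNQ1 sequence).
toy_sequence = "MAVK"
variants = list(enumerate_missense(toy_sequence))
labels = [f"{wt}{pos}{mut}" for wt, pos, mut in variants]
features = [encode_variant(wt, mut) for wt, _, mut in variants]
```

Each position contributes 19 substitutions, so a protein of length N yields 19 × N candidate variants. A trained classifier could then be applied to each row of `features` to produce a pathogenic-or-benign prediction per variant.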
