M.S. Candidate: Sana Basharat
Program: Data Informatics
Date: 07.06.2024 / 10:00
Place: A-212
Abstract: Driver coding mutations are extensively studied and frequently detected because they cause deleterious amino acid changes that affect protein function. Driver non-coding mutations, however, are harder to detect and require further analysis and experimental validation. Here, we employ the XGBoost (eXtreme Gradient Boosting) algorithm to predict driver non-coding mutations based on novel long-range interaction features and engineered transcription factor binding site features, augmented with features from existing annotation and variant effect prediction tools. We introduce an array-based method to capture the frequency and distribution of long-range interacting regions of interest. We use transcription factor (TF) models trained with the stochastic gradient descent (SGD) algorithm to predict loss and gain of function at TF binding sites. The resulting dataset is passed through a forward stepwise selection and feature engineering pipeline and is then used to train our gradient boosting model to classify driver versus passenger non-coding mutations. We also pass our dataset through a known driver discovery model from the literature, which combines 50 gradient-boosted tree models. To evaluate the predictive capability of our models, we take non-coding driver mutations reported in other state-of-the-art studies, annotate them in the same way, and predict their driver-ness. Furthermore, we use explainable AI methods to analyze the generated predictions in depth. Our results show above-average performance on unseen validation data and suggest that, with our annotations and gradient-boosted trees trained on the resulting data, driver and passenger non-coding mutations can be distinguished with relatively high accuracy.
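
The forward stepwise selection step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the scoring criterion, the stopping rule, or the feature set, so everything below is an assumption made for illustration: a leave-one-out nearest-centroid classifier stands in for the gradient boosting model, the data are a hand-made toy set (1 = driver, 0 = passenger), and the names `centroid_accuracy`, `forward_stepwise`, `TOY_X`, and `TOY_Y` are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Row = Sequence[float]


def centroid_accuracy(X: List[Row], y: List[int], features: List[int]) -> float:
    """Leave-one-out accuracy of a nearest-centroid classifier.

    Assumption: this simple scorer only stands in for a real model
    (for example a gradient-boosted classifier with cross-validation).
    """
    correct = 0
    for i in range(len(X)):
        sums: Dict[int, List[float]] = {}
        counts: Dict[int, int] = {}
        # Build per-class centroids from every row except the held-out row i.
        for j, (row, label) in enumerate(zip(X, y)):
            if j == i:
                continue
            acc = sums.setdefault(label, [0.0] * len(features))
            for k, f in enumerate(features):
                acc[k] += row[f]
            counts[label] = counts.get(label, 0) + 1
        # Predict the class whose centroid is closest (squared Euclidean distance).
        best_label, best_dist = None, float("inf")
        for label in sorted(sums):
            dist = sum(
                (X[i][f] - sums[label][k] / counts[label]) ** 2
                for k, f in enumerate(features)
            )
            if dist < best_dist:
                best_label, best_dist = label, dist
        correct += int(best_label == y[i])
    return correct / len(X)


def forward_stepwise(
    X: List[Row],
    y: List[int],
    score: Callable[[List[Row], List[int], List[int]], float] = centroid_accuracy,
    min_gain: float = 1e-9,
) -> Tuple[List[int], float]:
    """Greedily add the feature that improves the score most; stop when none does."""
    remaining = list(range(len(X[0])))
    selected: List[int] = []
    best = 0.0
    while remaining:
        candidates = [(score(X, y, selected + [f]), f) for f in remaining]
        # Highest score wins; ties go to the lowest feature index.
        top_score, top_f = max(candidates, key=lambda c: (c[0], -c[1]))
        if top_score - best <= min_gain:
            break
        selected.append(top_f)
        remaining.remove(top_f)
        best = top_score
    return selected, best


# Toy data (hypothetical): column 0 separates the classes, column 1 is noise,
# column 2 is constant and therefore uninformative.
TOY_X = [
    [0.90, 0.5, 1.0], [0.80, 0.1, 1.0], [0.85, 0.9, 1.0], [0.95, 0.3, 1.0],
    [0.10, 0.4, 1.0], [0.20, 0.8, 1.0], [0.15, 0.2, 1.0], [0.05, 0.6, 1.0],
]
TOY_Y = [1, 1, 1, 1, 0, 0, 0, 0]

if __name__ == "__main__":
    chosen, accuracy = forward_stepwise(TOY_X, TOY_Y)
    print(chosen, accuracy)
```

On this toy set the procedure keeps only the first column, since adding either of the other two cannot raise the score further. In practice the scorer would be replaced by the actual model and validation scheme used in the thesis, which are not described here.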