Classification of Cancer-Associated Mutations Integrating Machine Learning with Structural and Topological Parameters of Residue Interaction Networks
Missense Mutations, Predictors, Residue Interaction Networks, Machine Learning
The large volume of single nucleotide polymorphism data currently available has driven the development of methods capable of distinguishing neutral alterations from those associated with diseases such as cancer. Obtaining experimental evidence on the pathogenicity of variants is labor-intensive, time-consuming, and costly. Several in silico tools have been employed for pathogenicity prediction, including PolyPhen-2, PROVEAN, SIFT, FATHMM, MutationTaster, MutationAssessor, and LRT, as well as ensemble-based methods that combine multiple independent predictors, such as ClinPred, MetaLR, and MetaSVM. However, most of these approaches rely primarily on genomic information and allele frequency data. In recent years, tools that integrate topological features from residue interaction networks (RINs) with the outputs of conventional predictors have demonstrated superior performance. The objective of this work was to develop a classification model for assessing how structural and topological RIN features affect the accuracy of mutation classifiers. To this end, curated databases were constructed containing the outputs of functional predictors together with genomic, structural, and functional information associated with 33 cancer types, followed by the application and evaluation of several supervised machine learning algorithms. The results showed that integrating structural and topological parameters derived from RINs enhances the predictive performance of machine learning models in classifying cancer-associated missense mutations. The XGBoost-based model achieved consistent performance, with an accuracy of 74.0%, sensitivity of 73.9%, specificity of 74.1%, and an F1-score of 74.5%.
These findings indicate that the proposed model balances sensitivity and specificity, avoids bias toward either class, and generalizes well in a highly heterogeneous scenario comprising multiple genes and distinct tumor contexts.
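For reference, the four reported metrics follow from the counts of a binary confusion matrix. The sketch below is illustrative only: the function name `binary_metrics` and the example counts are hypothetical and are not taken from the study's data. It assumes the cancer-associated class is treated as the positive class.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute accuracy, sensitivity, specificity and F1 from confusion counts.

    tp/fn: positive (assumed cancer-associated) variants classified right/wrong.
    tn/fp: negative (assumed neutral) variants classified right/wrong.
    """
    sensitivity = tp / (tp + fn)          # true positive rate (recall)
    specificity = tn / (tn + fp)          # true negative rate
    precision = tp / (tp + fp)            # positive predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts chosen only to illustrate a balanced classifier.
example = binary_metrics(tp=74, fp=26, tn=74, fn=26)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

When sensitivity and specificity are close to each other, as reported here, the classifier is not favoring either class. F1 additionally depends on precision, and therefore on the class proportions in the evaluation set.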