Transformers and Embeddings of Amino Acid Substitutions for Analysis and Classification of SARS-CoV-2 Variants
SARS-CoV-2; Artificial Intelligence; Amino Acid Substitutions; Embeddings; Transformers; Fuzzy Clustering; Supervised Classification; Genomic Surveillance
The COVID-19 pandemic, caused by SARS-CoV-2, highlighted the need for scalable genomic surveillance methods that can keep pace with the continuous emergence of variants of concern. This work proposes an approach based on vector representations of amino acid substitutions, generated with Transformer models, for the analysis and classification of viral variants. Genomic sequences were processed into high-dimensional embeddings, which served as the foundation for two complementary experiments. In the first, unsupervised techniques such as Fuzzy C-Means clustering and t-SNE projection revealed groupings consistent with known variants and identified transitional zones and ambiguous samples. In the second, supervised classification models were developed and several algorithms were evaluated, including SVM, Random Forest, k-NN, and XGBoost, with XGBoost achieving 99.83% accuracy and macro-averaged F1 score on an external test set. The results show that representations derived from amino acid substitutions enable robust variant discrimination and the interpretation of biologically relevant mutational signatures, without requiring genomic alignment. The proposed methodology offers a scalable and adaptable solution for automated genomic surveillance, with potential applications in public health.
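To make the unsupervised experiment more concrete, the following is a minimal sketch of Fuzzy C-Means applied to embedding vectors, written in plain NumPy. It is not the authors' implementation. The synthetic 16-dimensional blobs stand in for the Transformer embeddings, and the function name `fuzzy_c_means`, the fuzzifier `m = 2.0`, and the 0.6 membership threshold used to flag ambiguous samples are all illustrative assumptions.

```python
import numpy as np


def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Standard Fuzzy C-Means; returns (centers, membership matrix U)."""
    rng = np.random.default_rng(seed)
    # Random initial memberships; each row is normalized to sum to 1.
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Cluster centers are membership-weighted means of the samples.
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Euclidean distance from every sample to every center.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.maximum(dist, 1e-12)  # guard against division by zero
        # Membership update: u_ij proportional to d_ij^(-2 / (m - 1)).
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U


# Synthetic stand-in for substitution embeddings (assumption, not real data):
# three Gaussian blobs of 50 samples each in 16 dimensions.
rng = np.random.default_rng(42)
blob_means = rng.normal(scale=5.0, size=(3, 16))
X = np.vstack([mean + rng.normal(size=(50, 16)) for mean in blob_means])

centers, U = fuzzy_c_means(X, n_clusters=3)
hard_labels = U.argmax(axis=1)

# Samples whose strongest membership is weak are treated as ambiguous;
# the 0.6 cutoff is a hypothetical choice for illustration.
ambiguous = U.max(axis=1) < 0.6
print("cluster sizes:", np.bincount(hard_labels, minlength=3))
print("ambiguous samples:", int(ambiguous.sum()))
```

Unlike hard clustering, the membership matrix `U` assigns every sample a degree of belonging to each cluster, which is what allows transitional zones between variants to be examined rather than forced into a single group.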