Stacked Sparse Autoencoder applied to SARS-CoV-2 virus classification based on image representations of genome sequences
Deep Learning, SARS-CoV-2, COVID-19, Viral classification, Image representations.
Since December 2019, the COVID-19 pandemic caused by the SARS-CoV-2 virus has intensely affected the world. In the case of a novel virus identification, the early elucidation of taxonomic classification and origin of the virus genomic sequence is essential for strategic planning, containment, and treatments. Deep learning techniques have been successfully used in many viral classification problems associated with viral infection diagnosis, metagenomics, phylogenetics, and analysis. Considering that motivation, this work proposed an efficient viral genome classifier for SARS-CoV-2 using the deep neural network based on the stacked sparse autoencoder (SSAE). For the best performance of the model, we explored the utilization of image representations of the complete genome sequences as the SSAE input to provide a classification of the SARS-CoV-2. For that, two datasets were explored: based on k-mers image representation and based on CGR image representation. The dataset based on k-mers image representation was applied in the experiments of different levels of taxonomic classification of the SARS-CoV-2 virus, and the dataset based on CGR images was applied to the experiments of classification of SARS-CoV-2 variants of concern (VOC). For the experiments of taxonomy classification, the SSAE technique provided great performance results, achieving classification accuracy between 92% and 100% for the validation set and between 98.9% and 100% when the SARS-CoV-2 samples were applied for the test set. These results indicate that our model can be adapted to classify other emerging viruses. For the experiments of SARS-CoV-2 variants classification using CGR images, the SSAE technique provided even better results, achieving classification accuracy of 99.9% for the validation set and 99.8% for the test set. Finally, the results indicated the applicability of this deep learning technique in genome classification problems.