DEFENSE Committee: GIOVANNA ASSUNÇÃO PEREIRA SOARES

A MASTER'S DEFENSE committee was registered by the program.
STUDENT: GIOVANNA ASSUNÇÃO PEREIRA SOARES
DATE: 31/07/2026
TIME: 14:00
LOCATION: Remote
TITLE:

A Unified Domain-Agnostic Pipeline for Protein and Small Molecule Representation via k-mer Images and Vision Transformers


KEY WORDS:

Bioinformatics; k-mer Encoding; Vision Transformer; Representation Learning; Protein and Molecular Embeddings; Classification and Clustering.


PAGES: 89
BROAD AREA: Engineering
AREA: Electrical Engineering
SUMMARY:

The representation and analysis of biological sequences and molecular structures are fundamental tasks in bioinformatics and cheminformatics, yet traditional methods based on sequence alignments and molecular descriptors face limitations in scalability and in capturing subtle structural and functional relationships. This work investigates machine learning approaches based on k-mer image representations and Vision Transformers for the analysis of proteins and small molecules. The proposed approach converts biological sequences and molecular SMILES strings into fixed-size images derived from k-mer frequency patterns, which are processed by a pretrained Vision Transformer to produce discriminative embedding vectors. This unified, alignment-free pipeline is evaluated under both supervised and unsupervised paradigms across two application domains. For proteins, supervised classification of clusters from the UniRef100 and UniRef90 datasets is performed using Logistic Regression, Random Forest, k-Nearest Neighbors, and XGBoost, while unsupervised analysis is conducted using DBSCAN with two proposed metrics (contamination and spreading) to assess cluster quality. For small molecules, blood-brain barrier permeability prediction is addressed on the BBBP and B3DB datasets using both classical machine learning classifiers and deep learning architectures, including MLP, ResMLP, DCN, DCNv2, FT-Transformer, and TabNet, while the unsupervised analysis extends the methodology developed for proteins to the molecular domain, applying DBSCAN and the contamination and spreading metrics. The results obtained demonstrate that the generated embeddings are effective for both classification and clustering tasks across domains, showing that the proposed representation captures structural and functional information relevant to both protein sequences and small molecules.
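
As a rough illustration of the first stage described in the summary (turning a sequence into a fixed-size image of k-mer frequencies), the sketch below counts overlapping k-mers and lays the normalized counts out on a square grid. It is not the encoding used in the dissertation, whose details are not given here: the function name `kmer_image`, the row-major layout, the max-count normalization, and the default 20-letter amino-acid alphabet are all assumptions made for the example. A SMILES string would need its own symbol alphabet, and resizing the grid to a Vision Transformer's input resolution is left out.

```python
from itertools import product
from math import isqrt

# Assumption: the 20 standard amino-acid letters as the default alphabet.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_image(sequence, alphabet=AMINO_ACIDS, k=2):
    """Hypothetical helper: map a sequence to a square grid of k-mer frequencies.

    Each possible k-mer gets one cell (row-major order); cell values are
    counts divided by the largest count, so they lie in [0, 1].
    """
    n_kmers = len(alphabet) ** k
    # Smallest square side whose area can hold every possible k-mer.
    side = isqrt(n_kmers - 1) + 1
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}

    counts = [0] * (side * side)
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        if kmer in index:  # skip k-mers with symbols outside the alphabet
            counts[index[kmer]] += 1

    peak = max(counts) or 1  # avoid division by zero for empty input
    return [[counts[r * side + c] / peak for c in range(side)]
            for r in range(side)]


if __name__ == "__main__":
    image = kmer_image("ACAC")  # overlapping 2-mers: AC, CA, AC
    print(len(image), len(image[0]))
    print(image[0][1], image[1][0])
```

Because the grid size depends only on the alphabet and `k`, sequences of any length map to images of the same shape, which is the property that lets a single pretrained image model consume them.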


COMMITTEE MEMBERS:
Chair - 1837240 - MARCELO AUGUSTO COSTA FERNANDES
Internal - 347628 - ADRIAO DUARTE DORIA NETO
External to the Program - 1458979 - ANDRE LUIS FONSECA FAUSTINO - UFRN
External to the Institution - RAQUEL DE MELO BARBOSA - UGR
Notice registered on: 11/06/2026 16:47