Using Hierarchical Classification and Language Models to Improve Protein Functional Annotation
Function Prediction; Multi-label Classification; Deep Learning; Hierarchical Classification; Database
The human body produces tens of thousands of different proteins, which perform a wide variety of functions. A protein's molecular function is determined by its structure, physicochemical characteristics, environment, and biological context. Despite recent advances in protein structure prediction, the accurate identification of molecular functions remains limited. This work explores the use of protein language models (PLMs) based on the Transformer architecture, combined with hierarchical and multi-label classification techniques, to capture the semantic complexity of the Gene Ontology (GO). In addition to using models such as ProtT5, Ankh, and ESM2, the research proposes integrating pre-computed protein embeddings and experimentally validated annotations into a unified database. This database, the Protein Dimension DB, was recently published and is already being used by the community. Benchmarking of different feature sets for molecular function prediction demonstrated the importance of combining multiple models and taxonomic information. The benchmarking results guided the development of a new molecular function prediction tool called MF Swarm. Tests with experimentally validated data demonstrate promising performance in predicting 1,325 molecular functions. This methodological proposal aims to provide an accurate and scalable tool to aid biomedical research, the functional understanding of newly sequenced proteins, and the development of pharmacological applications.
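To make the hierarchical, multi-label setting concrete, the following minimal sketch shows one common way to build training targets from an ontology: each annotated term is propagated to all of its ancestors, and the result is encoded as a multi-hot vector. This is an illustration only, not the MF Swarm implementation. The five-term hierarchy, the `PARENTS` map, and the function names `propagate` and `to_multi_hot` are assumptions introduced here.

```python
from typing import Dict, Iterable, List, Set

# Toy child -> parents map forming a small DAG.
# Assumption: these names are illustrative and are not loaded from the real GO.
PARENTS: Dict[str, List[str]] = {
    "molecular_function": [],
    "binding": ["molecular_function"],
    "catalytic_activity": ["molecular_function"],
    "protein_binding": ["binding"],
    "hydrolase_activity": ["catalytic_activity"],
}


def propagate(terms: Iterable[str], parents: Dict[str, List[str]]) -> Set[str]:
    """Return the given terms together with all of their ancestors."""
    closed: Set[str] = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term in closed:
            continue  # already visited through another path in the DAG
        closed.add(term)
        stack.extend(parents[term])
    return closed


def to_multi_hot(terms: Set[str], label_order: List[str]) -> List[int]:
    """Encode a set of terms as a 0/1 vector following a fixed label order."""
    return [1 if label in terms else 0 for label in label_order]


# Fixed label order shared by every protein in the data set.
LABELS = sorted(PARENTS)

# A protein annotated only with the most specific term.
annotated = propagate({"protein_binding"}, PARENTS)
target = to_multi_hot(annotated, LABELS)
```

With this encoding a single protein can carry several labels at once, and a positive label for a specific term always implies positive labels for its more general ancestors, which is the consistency constraint a hierarchical classifier is expected to respect.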