DEFENSE committee: NALBERT GABRIEL MELO LEAL

A MASTER'S DEFENSE committee has been registered by the program.
STUDENT: NALBERT GABRIEL MELO LEAL
DATE: 22/09/2025
TIME: 16:30
LOCATION: meet.google.com/iqw-xeaq-krp
TITLE:

Automatic error detection techniques applied to supervised datasets for handling labels from weakly supervised learning pipelines


KEYWORDS:

Weakly supervised learning; data programming; noisy labels; automatic noise detection.


PAGES: 132
BROAD AREA: Exact and Earth Sciences
AREA: Computer Science
SUMMARY:

The high cost of labeling data to train machine learning models has motivated the development of weakly supervised learning (WSL), which in turn introduces noise into the labels and degrades model performance. Among WSL techniques, data programming (DP) stands out for using noisy sources, such as heuristics and pre-trained models, to label data automatically at low cost; the resulting labels are potentially inaccurate and can harm the end-model’s performance. The objective of this work is to evaluate whether techniques that detect noisy instances can improve the performance of the final model produced by the DP pipeline in classification tasks. To this end, an experiment was conducted to measure the impact of noisy-instance detection on the performance and cost of the DP pipeline. Some of the techniques in the experiment were already known to the author but had not previously been linked to WSL, while others were selected through a literature review of noise detection techniques already applied to WSL. The impact of each technique on the end-model’s performance was measured with the Matthews correlation coefficient, and its cost by the execution time of the pipeline into which the technique was introduced. The results show that, in most cases, applying detection techniques degraded the performance of the end-models in a statistically significant manner. Only 4% of the pipelines with detection showed a statistically significant performance improvement over the baseline. The improvements, when they occurred, were sporadic and came with a high computational cost. Furthermore, the baselines, especially those using the Hyper Label Model and majority vote label models (LMs), showed a better balance between performance and cost. Thus, the DP pipeline without detection techniques proved to be the more efficient approach.
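
As a rough illustration of two ingredients named in the summary, the sketch below shows a majority vote label model over the votes of several noisy sources, and the Matthews correlation coefficient for binary labels. It is not the implementation used in the thesis: the function names, the ABSTAIN = -1 convention, the tie-handling rule, and the toy vote matrix are all assumptions made for this example.

```python
import math
from collections import Counter

ABSTAIN = -1  # assumed convention: a source that casts no vote returns -1


def majority_vote(votes):
    """Return the most common non-abstain vote, or ABSTAIN on no votes or a tie."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common(2)
    # Assumed rule: a tie between the two most voted labels yields ABSTAIN.
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0.0 when the denominator vanishes.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Toy vote matrix: one row per instance, one column per noisy source.
label_matrix = [
    [1, 1, 0],
    [0, ABSTAIN, 0],
    [1, 0, ABSTAIN],
    [ABSTAIN, ABSTAIN, ABSTAIN],
    [0, 1, 1],
]
weak_labels = [majority_vote(row) for row in label_matrix]
score = mcc([1, 1, 0, 0], [1, 0, 0, 0])
```

In this sketch, instances on which the sources tie or all abstain receive no weak label; a real DP pipeline would need a policy for them, such as dropping them before training the end-model.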


COMMITTEE MEMBERS:
Chair - 1669545 - DANIEL SABINO AMORIM DE ARAUJO
Internal - 2353000 - ELIAS JACOB DE MENEZES NETO
Internal - 4351681 - JOAO CARLOS XAVIER JUNIOR
External to the Institution - ARAKEN DE MEDEIROS SANTOS - UFERSA
Notice registered on: 09/09/2025 16:21