Automatic error detection techniques applied to supervised datasets for detecting noise in labels originating from weakly supervised learning pipelines
Weakly supervised learning; data programming; noisy labels; automatic noise detection.
The high cost of labeling data for training machine learning models has motivated the development of weakly supervised learning (WSL), which introduces noise into the labels and thereby affects model performance. Among WSL techniques, data programming (DP) stands out for combining noisy sources, such as heuristics and pre-trained models, to automate data labeling at low cost; the resulting labels may be imprecise, which affects the final model’s performance. The objective of this work is to evaluate whether techniques that detect noisy instances can improve the performance of the final model obtained through DP pipelines for classification tasks. To this end, a preliminary experiment was conducted to assess the viability of applying noise detection to instances labeled by a DP pipeline. Some of the techniques used were already known to the author but had not previously been linked to WSL, while others were selected through a literature review of noise detection techniques already applied to WSL. The impact of each technique on the final model’s performance was evaluated using accuracy, the Matthews correlation coefficient, F1-score, and execution time.
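As an illustration of the kind of pipeline described above, the following minimal sketch aggregates the votes of several noisy labeling sources by majority vote, flags instances whose sources disagree, and compares label quality before and after removing the flagged instances. Everything in it is an assumption made for illustration: the toy vote matrix, the names `majority_vote`, `flag_noisy`, and `matthews_corrcoef`, the 0.75 agreement threshold, and the agreement-based filter itself, which is a simple stand-in and not one of the detection techniques evaluated in this work. It also scores the weak labels against known true labels rather than the performance of a trained model.

```python
from collections import Counter
from math import sqrt

ABSTAIN = -1  # hypothetical convention: a source that casts no vote


def majority_vote(votes):
    """Aggregate one instance's votes; ties and all-abstain yield ABSTAIN."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


def agreement(votes, label):
    """Fraction of non-abstaining votes that match the given label."""
    active = [v for v in votes if v != ABSTAIN]
    return sum(v == label for v in active) / len(active) if active else 0.0


def flag_noisy(vote_matrix, labels, threshold=0.75):
    """Return indices whose label is missing or weakly supported by the sources."""
    return [
        i
        for i, (votes, y) in enumerate(zip(vote_matrix, labels))
        if y == ABSTAIN or agreement(votes, y) < threshold
    ]


def matthews_corrcoef(y_true, y_pred):
    """Binary Matthews correlation coefficient; 0.0 when undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy data: rows are instances, columns are three noisy labeling sources.
vote_matrix = [
    [1, 1, 1],
    [0, 0, ABSTAIN],
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, ABSTAIN],
    [0, 0, 0],
]
true_labels = [1, 0, 0, 1, 1, 0]  # known only because this is a toy example

weak_labels = [majority_vote(votes) for votes in vote_matrix]
flagged = flag_noisy(vote_matrix, weak_labels)
kept = [i for i in range(len(weak_labels)) if i not in flagged]

mcc_before = matthews_corrcoef(true_labels, weak_labels)
mcc_after = matthews_corrcoef(
    [true_labels[i] for i in kept], [weak_labels[i] for i in kept]
)
print(f"flagged={flagged} mcc_before={mcc_before:.3f} mcc_after={mcc_after:.3f}")
```

In a full experiment, a classifier would be trained on the filtered weak labels and scored on a held-out test set with the metrics listed above; the sketch stops at label quality to stay self-contained.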