A Framework for Semi-supervised Data Stream Classification in Non-stationary Environments
Semi-supervised data stream classification; Concept drift; DyDaSL
Semi-supervised learning is an area of machine learning in which a classifier is trained
with only a few labelled instances in the dataset. This strategy is used especially when
the number of instances whose label is known may not be sufficient to generate an effective
model. In recent years, data has been generated in ever larger volumes, and much of it is
generated quickly. Combining these two conditions, data that is generated quickly and too
few labelled instances to train a classifier, gives rise to a new classification task in the
data stream scenario. Data stream classification becomes even more challenging in the
semi-supervised scenario, where only a few labelled instances are available. A semi-supervised framework
called Dynamic Data Stream Learner (DyDaSL) has been proposed in the literature to
address this issue. In its first iteration, this framework generates a fixed-size ensemble
that is then used to classify the remaining data. This research proposes extensions for each module of the
DyDaSL framework, aiming to optimise the ensemble generation, drift detection, and drift
reaction processes to increase classification effectiveness in the semi-supervised task. Two
extensions are proposed for the training module to generate more effective ensembles that
start with a single classifier and grow throughout the learning process. Three
extensions are proposed for the drift detection module to increase the effectiveness of
anomaly detection based on flexible thresholds or statistical tests. Two extensions are
proposed for the reaction module to improve the ensemble's response to these anomalies.
The preliminary results are positive for all three modules. One of the training
extensions outperforms the standard training module in 17 out of 20 (85%) cases. For the
drift detection module, one extension outperforms the standard drift detector in 12 out
of 15 (80%) cases. Finally, for the drift reaction module, one of the extensions achieves
superior results in 73 out of 80 (91.25%) cases.