A Data Stream Framework for Semi-supervised Classification in Non-Stationary Environments
Keywords: semi-supervised learning, classification in data stream, concept drift
Data stream applications receive large volumes of data at high speed and must process
them sequentially. In these applications, the data may change while the model is in use;
in addition, the number of labelled instances may not be sufficient to build
an effective model. Semi-supervised learning can be used to mitigate the scarcity
of labelled instances, and an ensemble of classifiers can assist in
detecting concept drift. In this work, we therefore propose a framework for
semi-supervised classification in a data stream context, using an approach based
on an ensemble of classifiers. To evaluate the effectiveness of this proposal, empirical
tests were carried out on eleven datasets using two different batch sizes and nine
supervised approaches (three simple classifiers and six ensembles), evaluated with the metrics accuracy,
precision, recall and F-score. With respect to the number of instances processed, the
supervised approaches achieved practically stable performance, while the proposal showed
an improvement of 8.28% and 3.81% using 5% and 10% of labelled instances, respectively.
In general, the results show that increasing the number of instances processed per batch
improves, in most cases, the results of the semi-supervised approach.