DC Health: Node-Level Online Anomaly Detection in Datacenter Performance Data Monitoring
Anomaly detection; Datacenter; Half-Space-Trees.
Datacenters are critical environments for the availability of technology-based services. To keep these services highly available, operators widely monitor the performance metrics of nodes such as Virtual Machines (VMs) or VM clusters. These metrics, such as CPU and memory utilization, can show anomalous patterns associated with failures and performance degradation, which may culminate in resource exhaustion and total node failure. Early detection of anomalies can therefore enable remediation measures, such as VM migration and resource reallocation, before losses occur. However, traditional monitoring tools often rely on fixed thresholds to detect problems on nodes and lack automatic ways to detect anomalies at runtime. Machine learning techniques, in both online and offline approaches, have been reported to detect anomalies in computer systems. This work proposes and evaluates the DC Health (DCH) application, which seeks to detect anomalies in datacenter nodes earlier. The research was conducted in three stages: i) a Systematic Literature Mapping, ii) problem modeling from real VM data, and iii) an evaluation of DCH using the prequential method on 6 real-world datasets. Preliminary results showed that DCH maintained constant memory usage and achieved detection accuracy above 75%. As a continuation of this research, we expect to develop a case study with datacenter operators and to evaluate the tool on a large volume of nodes.
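The prequential (test-then-train) method mentioned in the abstract can be illustrated with a minimal Python sketch. Each observation is first scored by the current model and only then used to update it, so the whole stream serves for both testing and training. Everything here is an assumption made for illustration: the detector is a simple running z-score stand-in, not Half-Space-Trees and not the DCH implementation, the stream is synthetic rather than real VM data, and plain accuracy is only one possible reading of "detection accuracy".

```python
import math
import random


class RunningZScoreDetector:
    """Hypothetical stand-in detector (NOT Half-Space-Trees, NOT DCH).

    Keeps a running mean and variance with Welford's algorithm, so
    memory usage stays constant regardless of stream length.
    """

    def __init__(self, threshold=3.0):
        self.threshold = threshold  # assumed cut-off, in standard deviations
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def score(self, x):
        """Absolute z-score of x under the statistics seen so far."""
        if self.n < 2:
            return 0.0
        std = math.sqrt(self.m2 / (self.n - 1))
        if std == 0.0:
            return 0.0
        return abs(x - self.mean) / std

    def predict(self, x):
        """Flag x as anomalous when its score exceeds the threshold."""
        return self.score(x) > self.threshold

    def learn(self, x):
        """Update the running statistics with one observation."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)


def prequential_accuracy(stream, detector):
    """Test-then-train loop over (value, is_anomaly) pairs."""
    correct = 0
    total = 0
    for value, is_anomaly in stream:
        prediction = detector.predict(value)  # 1) test on the unseen sample
        correct += int(prediction == is_anomaly)
        total += 1
        detector.learn(value)  # 2) then train on the same sample
    return correct / total if total else 0.0


def synthetic_cpu_stream(n=500, seed=0):
    """Synthetic CPU-utilization stream (illustrative, not real VM data)."""
    rng = random.Random(seed)
    stream = []
    for i in range(n):
        if i > 50 and i % 100 == 0:
            stream.append((95.0 + rng.random(), True))  # injected spike
        else:
            stream.append((40.0 + rng.gauss(0.0, 2.0), False))
    return stream


if __name__ == "__main__":
    accuracy = prequential_accuracy(synthetic_cpu_stream(), RunningZScoreDetector())
    print(f"prequential accuracy: {accuracy:.3f}")
```

Because the model is always tested on a sample before learning from it, no separate hold-out set is needed, which suits the online setting described above. Any other online detector exposing `predict` and `learn` methods could be dropped into the same loop.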