Machine learning-based run-time anomaly detection in software systems: An industrial evaluation


Anomalies are an inevitable occurrence while operating enterprise software systems. Traditionally, anomalies are detected by threshold-based alarms for critical metrics, or health probing requests. However, fully automated detection in complex systems is challenging, since it is very difficult to distinguish truly anomalous behavior from normal operation. To this end, the traditional approaches may not be sufficient. Thus, we propose machine learning classifiers to predict the system’s health status. We evaluated our approach in an industrial case study, on a large, real-world dataset of 7.5*10^6 data points for 231 features. Our results show that recurrent neural networks with long short-term memory (LSTM) are more effective in detecting anomalies and health issues, as compared to other classifiers. We achieved an area under precision-recall curve of 0.44. At the default threshold, we can automatically detect 70% of the anomalies. Despite the low precision of 31 %, the rate in which false positives occur is only 4 %.

2018 IEEE Workshop on Machine Learning Techniques for Software Quality Evaluation (MaLTeSQuE)