As business systems grow in scale, their structure becomes increasingly complex. Conventional rule-based methods struggle to diagnose compound faults that arise from the interaction of multiple systems, and they are poorly suited to predicting potential faults. This study addresses the complex setting of multiple business systems. First, the ELK platform is used to manage logs centrally. Next, the relationships between logs and the individual business systems, hosts, and processes are mapped out, and the log files directly related to faults are selected. Finally, this large volume of data is used to train an LSTM model in the deep learning framework TensorFlow, enabling real-time fault prediction for the system.
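The abstract does not say how the selected logs are turned into training samples for the LSTM. As a minimal sketch of one common approach, assuming each log line has already been mapped to an integer event ID and tagged with a 0/1 fault flag, a sliding window can pair each run of recent events with a label indicating whether a fault follows shortly afterwards. The function name `make_windows`, the parameters `window` and `horizon`, and the sample data are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def make_windows(
    events: List[int],
    fault_flags: List[int],
    window: int,
    horizon: int,
) -> List[Tuple[List[int], int]]:
    """Build (event window, label) pairs for sequence-model training.

    Each sample holds `window` consecutive event IDs. Its label is 1 if any
    of the next `horizon` events is flagged as a fault, otherwise 0.
    (Hypothetical preprocessing; the paper's actual scheme is not stated.)
    """
    if len(events) != len(fault_flags):
        raise ValueError("events and fault_flags must have the same length")
    if window <= 0 or horizon <= 0:
        raise ValueError("window and horizon must be positive")

    samples = []
    # Stop early enough that every window has a full look-ahead horizon.
    for start in range(len(events) - window - horizon + 1):
        end = start + window
        label = int(any(fault_flags[end:end + horizon]))
        samples.append((events[start:end], label))
    return samples


# Illustrative data: event IDs from parsed log lines, with one fault at index 4.
events = [3, 3, 7, 1, 9, 3, 7, 1]
fault_flags = [0, 0, 0, 0, 1, 0, 0, 0]
samples = make_windows(events, fault_flags, window=3, horizon=2)

for event_window, label in samples:
    print(event_window, label)
```

Samples of this shape could then be fed to a sequence model, for example an embedding layer followed by `tf.keras.layers.LSTM` in TensorFlow. The window length and look-ahead horizon are tuning choices that would depend on how far in advance a fault needs to be predicted.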