Paper Title

Improving Problem Identification via Automated Log Clustering using Dimensionality Reduction

Authors

Carl Martin Rosenberg, Leon Moonen

Abstract

Goal: We consider the problem of automatically grouping logs of runs that failed for the same underlying reasons, so that they can be treated more effectively, and investigate the following questions: (1) Does an approach developed to identify problems in system logs generalize to identifying problems in continuous deployment logs? (2) How does dimensionality reduction affect the quality of automated log clustering? (3) How does the criterion used for merging clusters in the clustering algorithm affect clustering quality?

Method: We replicate and extend earlier work on clustering system log files to assess its generalization to continuous deployment logs. We consider the optional inclusion of one of these dimensionality reduction techniques: Principal Component Analysis (PCA), Latent Semantic Indexing (LSI), and Non-negative Matrix Factorization (NMF). Moreover, we consider three alternative cluster merge criteria (Single Linkage, Average Linkage, and Weighted Linkage), in addition to the Complete Linkage criterion used in earlier work. We empirically evaluate the 16 resulting configurations on continuous deployment logs provided by our industrial collaborator.

Results: Our study shows that (1) identifying problems in continuous deployment logs via clustering is feasible, (2) including NMF significantly improves overall accuracy and robustness, and (3) Complete Linkage performs best of all merge criteria analyzed.

Conclusions: We conclude that problem identification via automated log clustering is improved by including dimensionality reduction, as it decreases the pipeline's sensitivity to parameter choice, thereby increasing its robustness for handling different inputs.
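To make the experimental grid and the four merge criteria concrete, here is a minimal sketch in plain Python. It is not the authors' pipeline. It assumes the textbook definitions of single, complete, average (UPGMA) and weighted (WPGMA) linkage, and it assumes a simple distance threshold as the stopping rule. The names CONFIGURATIONS, merged_distance and agglomerate are hypothetical and chosen only for illustration. The dimensionality reduction step itself is not implemented here.

```python
from itertools import product

# Options named in the abstract; None stands for "no dimensionality reduction".
REDUCERS = [None, "PCA", "LSI", "NMF"]
LINKAGES = ["single", "complete", "average", "weighted"]
CONFIGURATIONS = list(product(REDUCERS, LINKAGES))  # 4 x 4 = 16 combinations


def merged_distance(linkage, d_ik, d_jk, size_i, size_j):
    """Distance from the merged cluster (i and j) to another cluster k."""
    if linkage == "single":
        return min(d_ik, d_jk)
    if linkage == "complete":
        return max(d_ik, d_jk)
    if linkage == "average":  # UPGMA: weight each side by its cluster size
        return (size_i * d_ik + size_j * d_jk) / (size_i + size_j)
    if linkage == "weighted":  # WPGMA: plain mean, ignoring cluster sizes
        return (d_ik + d_jk) / 2
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerate(dist, linkage, threshold):
    """Merge the closest pair of clusters until it is farther than threshold.

    dist is a symmetric matrix (list of lists) of pairwise distances.
    The threshold stopping rule is an assumption made for this sketch.
    Returns a sorted list of clusters, each a sorted list of item indices.
    """
    clusters = {i: [i] for i in range(len(dist))}
    # Pairwise cluster distances, keyed by (smaller id, larger id).
    d = {(i, j): dist[i][j] for i in clusters for j in clusters if i < j}
    next_id = len(dist)
    while len(clusters) > 1:
        (i, j), best = min(d.items(), key=lambda kv: kv[1])
        if best > threshold:
            break
        size_i, size_j = len(clusters[i]), len(clusters[j])
        merged = sorted(clusters.pop(i) + clusters.pop(j))
        del d[(i, j)]
        # Replace the distances to i and j with one distance to the new cluster.
        for k in clusters:
            d_ik = d.pop((min(i, k), max(i, k)))
            d_jk = d.pop((min(j, k), max(j, k)))
            d[(k, next_id)] = merged_distance(linkage, d_ik, d_jk, size_i, size_j)
        clusters[next_id] = merged
        next_id += 1
    return sorted(clusters.values())
```

Given a distance matrix computed from log feature vectors, calling agglomerate(dist, "complete", threshold) and agglomerate(dist, "single", threshold) with the same threshold can yield different groupings: single linkage chains nearby items together, while complete linkage only merges clusters whose farthest members are still close.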
