Paper Title
PerfCE: Performance Debugging on Databases with Chaos Engineering-Enhanced Causality Analysis
Paper Authors
Paper Abstract
Debugging performance anomalies in real-world databases is challenging. Causal inference techniques enable qualitative and quantitative root cause analysis of performance degradation. Nevertheless, causality analysis is challenging in practice, particularly due to limited observability. Recently, chaos engineering has been applied to test complex real-world software systems. Chaos frameworks like Chaos Mesh mutate a set of chaos variables to inject catastrophic events (e.g., network slowdowns) to "stress" software systems. The systems under chaos stress are then tested using methods like differential testing to check whether they retain their normal functionality (e.g., SQL query output is always correct under stress). Despite its ubiquity in industry, chaos engineering is currently employed mostly to aid software testing rather than performance debugging. This paper identifies a novel use of chaos engineering: helping developers diagnose performance anomalies in databases. Our framework, PERFCE, comprises an offline phase and an online phase. The offline phase learns statistical models of the target database system, whilst the online phase diagnoses the root cause of monitored performance anomalies on the fly. During the offline phase, PERFCE leverages both passive observations and proactive chaos experiments to construct accurate causal graphs and structural equation models (SEMs). When performance anomalies are observed during the online phase, causal graphs enable qualitative root cause identification (e.g., high CPU usage) and SEMs enable quantitative counterfactual analysis (e.g., determining that "when CPU usage is reduced to 45%, performance returns to normal"). PERFCE notably outperforms prior works on common synthetic datasets, and our evaluation on real-world databases, MySQL and TiDB, shows that PERFCE is highly accurate and moderately expensive.
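
To make the counterfactual query in the abstract concrete, the following is a minimal sketch of how a structural equation model can answer "what would latency have been if CPU usage were 45%?". It is not PERFCE's implementation: the linear form of the equation, the coefficients `B_CPU` and `B_IO`, the metric names, the observed values, and the 120 ms "normal" threshold are all assumptions chosen for illustration. The sketch follows the standard three-step counterfactual recipe (abduction, action, prediction).

```python
# Hypothetical linear structural equation (assumed, not taken from the paper):
#   latency_ms = B_CPU * cpu_pct + B_IO * io_wait_pct + noise
B_CPU = 2.0   # assumed effect of one percentage point of CPU usage on latency (ms)
B_IO = 1.5    # assumed effect of one percentage point of I/O wait on latency (ms)
NORMAL_LATENCY_MS = 120.0  # assumed threshold for "performance is normal"


def latency(cpu_pct, io_wait_pct, noise):
    """Evaluate the assumed structural equation for query latency."""
    return B_CPU * cpu_pct + B_IO * io_wait_pct + noise


def counterfactual_latency(observed, cpu_pct_new):
    """Latency this same anomalous observation would have had under a different CPU usage."""
    # Abduction: recover the noise term that explains the observed latency.
    noise = observed["latency_ms"] - (
        B_CPU * observed["cpu_pct"] + B_IO * observed["io_wait_pct"]
    )
    # Action: override CPU usage. Prediction: re-evaluate the equation,
    # keeping the other parent and the recovered noise fixed.
    return latency(cpu_pct_new, observed["io_wait_pct"], noise)


# A made-up anomalous observation collected during the online phase.
observed = {"cpu_pct": 90.0, "io_wait_pct": 10.0, "latency_ms": 200.0}

cf = counterfactual_latency(observed, cpu_pct_new=45.0)
back_to_normal = cf <= NORMAL_LATENCY_MS
print(f"counterfactual latency at 45% CPU: {cf} ms, normal: {back_to_normal}")
```

In this toy setting the coefficients are simply given. In the workflow the abstract describes, the causal graph and SEMs would instead be learned during the offline phase from passive observations and chaos experiments, and the online phase would run queries of this kind against the learned model.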