Paper Title

Lumos: A Library for Diagnosing Metric Regressions in Web-Scale Applications

Authors

Jamie Pool, Ebrahim Beyrami, Vishak Gopal, Ashkan Aazami, Jayant Gupchup, Jeff Rowland, Binlong Li, Pritesh Kanani, Ross Cutler, Johannes Gehrke

Abstract

Web-scale applications can ship code on a daily to weekly cadence. These applications rely on online metrics to monitor the health of new releases. Regressions in metric values need to be detected and diagnosed as early as possible to reduce the disruption to users and product owners. Regressions in metrics can surface for a variety of reasons; genuine product regressions, changes in user population, and bias due to telemetry loss (or processing) are among the common causes. Diagnosing the cause of these metric regressions is costly for engineering teams, as they need to invest time in finding the root cause of the issue as soon as possible. We present Lumos, a Python library built using the principles of A/B testing to systematically diagnose metric regressions and automate such analysis. Lumos has been deployed across the component teams in Microsoft's Real-Time Communication (RTC) applications Skype and Microsoft Teams. It has enabled engineering teams to detect hundreds of real changes in metrics and reject thousands of false alarms raised by anomaly detectors. The application of Lumos has freed up as much as 95% of the time allocated to metric-based investigations. In this work, we open source Lumos and present our results from applying it to two different components within the RTC group over millions of sessions. This general library can be coupled with any production system to manage the volume of alerting efficiently.
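To make the A/B-testing idea concrete, the following is a minimal sketch of how an alert on a failure-rate metric might be checked by treating the period before the alert as a control group and the period after it as a treatment group. It is not the Lumos API: the function names, the choice of a two-proportion z-test, and the `alpha` threshold are all assumptions made for illustration. It also does not correct for changes in user population or telemetry loss, which the abstract names as separate causes of apparent regressions.

```python
import math


def two_proportion_z_test(fail_a, n_a, fail_b, n_b):
    """Two-sided z-test for a difference between two failure rates.

    Hypothetical helper, not part of Lumos. Group A is the control
    window, group B is the treatment (post-alert) window.
    """
    p_a = fail_a / n_a
    p_b = fail_b / n_b
    # Pooled rate under the null hypothesis that both windows share one rate.
    pooled = (fail_a + fail_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


def is_real_regression(fail_a, n_a, fail_b, n_b, alpha=0.05):
    """Flag a regression only if the failure rate rose significantly.

    The alpha threshold is an illustrative assumption.
    """
    z, p_value = two_proportion_z_test(fail_a, n_a, fail_b, n_b)
    return z > 0 and p_value < alpha
```

Under these assumptions, an alert where failures rise from 100 to 150 per 10,000 sessions would be kept as a likely real change, while a rise from 100 to 105 would be rejected as noise.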
