Paper title
Data Leakage in Notebooks: Static Detection and Better Processes
Authors
Abstract
Data science pipelines to train and evaluate models with machine learning may contain bugs just like any other code. Leakage between training and test data can lead to overestimating the model's accuracy during offline evaluations, possibly leading to deployment of low-quality models in production. Such leakage can happen easily by mistake or by following poor practices, but may be tedious and challenging to detect manually. We develop a static analysis approach to detect common forms of data leakage in data science code. Our evaluation shows that our analysis accurately detects data leakage and that such leakage is pervasive among over 100,000 analyzed public notebooks. We discuss how our static analysis approach can help both practitioners and educators, and how leakage prevention can be designed into the development process.
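To make the idea of statically detecting leakage concrete, here is a minimal sketch in Python. It is not the analysis the paper develops: it is a toy line-order heuristic built on the standard `ast` module, and it assumes that one form of leakage is fitting a preprocessing step on the full dataset before the train/test split. The function name `find_preprocessing_before_split` and the two sample notebook strings are hypothetical, introduced only for illustration. The sample sources are parsed, never executed, so scikit-learn does not need to be installed.

```python
import ast


def find_preprocessing_before_split(source: str) -> list[int]:
    """Return line numbers of fit/fit_transform calls that occur
    before the first train_test_split call in the given source.

    Toy heuristic (an assumption for illustration): it compares line
    order only and does not track which data flows into which call.
    """
    tree = ast.parse(source)
    split_line = None
    fit_lines = []

    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # obj.method(...) is an Attribute; plain name(...) is a Name.
        if isinstance(func, ast.Attribute):
            name = func.attr
        else:
            name = getattr(func, "id", None)

        if name == "train_test_split":
            if split_line is None or node.lineno < split_line:
                split_line = node.lineno
        elif name in {"fit", "fit_transform"}:
            fit_lines.append(node.lineno)

    if split_line is None:
        # No split found, so this heuristic has nothing to report.
        return []
    return sorted(line for line in fit_lines if line < split_line)


# Hypothetical notebook cell: the scaler sees all rows, including
# rows that later end up in the test set.
LEAKY_NOTEBOOK = """\
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)
"""

# Hypothetical notebook cell: split first, then fit on training data only.
CLEAN_NOTEBOOK = """\
X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = StandardScaler().fit(X_train)
X_test_scaled = scaler.transform(X_test)
"""

print(find_preprocessing_before_split(LEAKY_NOTEBOOK))
print(find_preprocessing_before_split(CLEAN_NOTEBOOK))
```

The first call flags the `fit_transform` line in the leaky cell, while the second reports nothing. A practical analysis would need to follow data flow between variables rather than rely on line order, which is why this should be read only as an illustration of the problem.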