Paper Title
Data Quality Evaluation using Probability Models
Paper Authors
Abstract
This paper discusses an approach that uses machine-learning probability models to distinguish good-quality from bad-quality data in a dataset. A decision tree algorithm is used to predict data quality without any domain knowledge of the datasets under examination. For the data examined, predictions of data quality based on simple good/bad pre-labelled learning examples are shown to be accurate; in general, however, this approach may not be sufficient for useful production data quality assessment.
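The abstract does not describe the features or the tree implementation, so the following is only a minimal sketch of the general idea: a small decision tree trained on good/bad pre-labelled records, using features that need no domain knowledge. The feature choices (fraction of empty fields, fraction of numeric fields), all function names, and the toy records are assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter

def _is_number(text):
    """True if the field parses as a float (assumed generic feature)."""
    try:
        float(text)
        return True
    except ValueError:
        return False

def record_features(record):
    """Domain-agnostic features of one record (a list of string fields)."""
    n = len(record)
    empty = sum(1 for f in record if f.strip() == "")
    numeric = sum(1 for f in record if _is_number(f))
    return [empty / n, numeric / n]

def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature index, threshold) that lowers weighted Gini, or None."""
    best, best_score = None, gini(labels)
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (j, t), score
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Grow a tree; each leaf stores class frequencies as probabilities."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return {"leaf": {k: v / len(labels) for k, v in Counter(labels).items()}}
    j, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[j] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[j] > t]
    return {
        "feature": j,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth),
    }

def predict_proba(tree, row):
    """Walk the tree and return the class probabilities at the reached leaf."""
    while "leaf" not in tree:
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["leaf"]

# Toy pre-labelled examples (invented for this sketch, not the paper's data).
good = [["42", "alice", "2020"], ["17", "bob", "2019"], ["8", "carol", "2021"]]
bad = [["", "", "2020"], ["x", "", ""], ["", "dave", ""]]
rows = [record_features(r) for r in good + bad]
labels = ["good"] * len(good) + ["bad"] * len(bad)
tree = build_tree(rows, labels)

complete_record = predict_proba(tree, record_features(["9", "erin", "2018"]))
gappy_record = predict_proba(tree, record_features(["5", "", "2022"]))
```

On such a tiny, cleanly separated toy set the tree splits on the fraction of empty fields alone, which mirrors the abstract's caution: accuracy on simple good/bad examples does not by itself show that the approach is sufficient for production data.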