Quantity vs Quality: Investigating the Trade-Off between Sample Size and Label Reliability

Authors

Timo Bertram, Johannes Fürnkranz, Martin Müller

Abstract

In this paper, we study learning in probabilistic domains where the learner may receive incorrect labels but can improve the reliability of labels by repeatedly sampling them. In such a setting, one faces the question of whether a fixed budget for obtaining training examples should be spent entirely on distinct examples or on improving the label quality of a smaller number of examples by re-sampling their labels. We motivate this problem with an application that compares the strength of poker hands, where the training signal depends on the hidden community cards, and then study it in depth in an artificial setting where we insert controlled levels of noise into the MNIST database. Our results show that, as the noise level increases, re-sampling previous examples becomes increasingly more important than obtaining new examples, because classifier performance deteriorates when the number of incorrect labels is too high. In addition, we propose two validation strategies: switching from lower to higher validations over the course of training, and using chi-square statistics to approximate the confidence in obtained labels.
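
The budget trade-off described in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' experimental setup: the noise model (a label is replaced by a uniformly chosen wrong class with probability `noise`), the plurality vote over `k` samples, the budget of 60000 label queries, and all function names are hypothetical choices made for illustration. It does not implement the chi-square strategy mentioned above.

```python
import random
from collections import Counter

NUM_CLASSES = 10  # MNIST has ten digit classes


def noisy_label(true_label, noise, rng):
    """Return the true label, or with probability `noise` a uniformly chosen wrong one.

    Assumed noise model for illustration only.
    """
    if rng.random() < noise:
        wrong = [c for c in range(NUM_CLASSES) if c != true_label]
        return rng.choice(wrong)
    return true_label


def plurality_label(true_label, k, noise, rng):
    """Sample the label k times and keep the most frequent value (ties: first seen)."""
    votes = Counter(noisy_label(true_label, noise, rng) for _ in range(k))
    return votes.most_common(1)[0][0]


def split_budget(budget, k, noise, trials=2000, seed=0):
    """Estimate (distinct examples, label accuracy) when each example is sampled k times."""
    rng = random.Random(seed)
    n_examples = budget // k  # more samples per example means fewer distinct examples
    correct = 0
    for _ in range(trials):
        true_label = rng.randrange(NUM_CLASSES)
        correct += plurality_label(true_label, k, noise, rng) == true_label
    return n_examples, correct / trials


if __name__ == "__main__":
    # Hypothetical budget of 60000 label queries at an assumed noise level of 0.4.
    for k in (1, 3, 5):
        n, acc = split_budget(budget=60000, k=k, noise=0.4)
        print(f"k={k}: {n} distinct examples, estimated label accuracy {acc:.2f}")
```

Raising `k` shrinks the number of distinct examples while making each retained label more reliable, which is the tension the abstract describes. Which side wins for a given classifier and noise level is an empirical question that this sketch does not answer.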
