Paper Title

Data Appraisal Without Data Sharing

Authors

Mimee Xu, Laurens van der Maaten, Awni Hannun

Abstract

One of the most effective approaches to improving the performance of a machine learning model is to procure additional training data. A model owner seeking relevant training data from a data owner needs to appraise the data before acquiring it. However, without a formal agreement, the data owner does not want to share data. The resulting Catch-22 prevents efficient data markets from forming. This paper proposes adding a data appraisal stage that requires no data sharing between data owners and model owners. Specifically, we use multi-party computation to implement an appraisal function computed on private data. The appraised value serves as a guide to facilitate data selection and transaction. We propose an efficient data appraisal method based on forward influence functions that approximates data value through its first-order loss reduction on the current model. The method requires no additional hyper-parameters or re-training. We show that in private, forward influence functions provide an appealing trade-off between high quality appraisal and required computation, in spite of label noise, class imbalance, and missing data. Our work seeks to inspire an open market that incentivizes efficient, equitable exchange of domain-specific training data.
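
The idea of appraising data by a first-order estimate of loss reduction can be illustrated with a small sketch. This is not the paper's forward influence function and it involves no multi-party computation: it swaps in a plain gradient step on a logistic regression model, computed in the clear. The names `mean_grad`, `appraise`, and the `step` scale are hypothetical and introduced only for illustration. If the model moved one gradient step along the candidate data, the validation loss would change by roughly minus `step` times the dot product of the two gradients, so a positive score suggests helpful data.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mean_grad(w, data):
    """Mean gradient of the logistic loss over (features, label) pairs, label in {0, 1}."""
    grad = [0.0] * len(w)
    for x, y in data:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        for i, xi in enumerate(x):
            grad[i] += (p - y) * xi
    return [g / len(data) for g in grad]


def appraise(w, candidate, validation, step=1.0):
    """First-order estimate of the validation-loss reduction after one
    gradient step on the candidate data (illustrative assumption only)."""
    g_cand = mean_grad(w, candidate)
    g_val = mean_grad(w, validation)
    return step * sum(a * b for a, b in zip(g_val, g_cand))


# Toy data: the label follows the sign of the first feature.
w = [0.0, 0.0]
validation = [([1.0, 0.0], 1), ([-1.0, 0.0], 0)]
clean = [([2.0, 1.0], 1), ([-2.0, 1.0], 0)]
noisy = [([2.0, 1.0], 0), ([-2.0, 1.0], 1)]  # same points, labels flipped

clean_score = appraise(w, clean, validation)
noisy_score = appraise(w, noisy, validation)
print(clean_score, noisy_score)
```

In this toy setting the correctly labeled candidate set receives a positive score and the label-flipped set a negative one, which mirrors the abstract's claim that appraisal can help select data without any re-training.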
