Optimal Data Selection: An Online Distributed View

Paper Title

Optimal Data Selection: An Online Distributed View

Authors

Mariel Werner, Anastasios Angelopoulos, Stephen Bates, Michael I. Jordan

Abstract

The blessing of ubiquitous data also comes with a curse: the communication, storage, and labeling of massive, mostly redundant datasets. We seek to solve this problem at its core, collecting only valuable data and throwing out the rest via submodular maximization. Specifically, we develop algorithms for the online and distributed version of the problem, where data selection occurs in an uncoordinated fashion across multiple data streams. We design a general and flexible core selection routine for our algorithms which, given any stream of data, any assessment of its value, and any formulation of its selection cost, extracts the most valuable subset of the stream up to a constant factor while using minimal memory. Notably, our methods have the same theoretical guarantees as their offline counterparts, and, as far as we know, provide the first guarantees for online distributed submodular optimization in the literature. Finally, in learning tasks on ImageNet and MNIST, we show that our selection methods outperform random selection by $5-20\%$.
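To make the idea of selecting a valuable subset from a stream concrete, here is a minimal sketch of single-pass, threshold-based selection under a monotone submodular value function. It is not the authors' algorithm: the coverage objective, the fixed `threshold`, the `budget` cap, and the names `coverage` and `stream_select` are all assumptions chosen for illustration only.

```python
from typing import Callable, FrozenSet, Iterable, List

Item = FrozenSet[int]


def coverage(selected: List[Item]) -> int:
    """Illustrative monotone submodular value: number of distinct elements covered."""
    covered = set()
    for s in selected:
        covered |= s
    return len(covered)


def stream_select(
    stream: Iterable[Item],
    value: Callable[[List[Item]], int],
    budget: int,
    threshold: int,
) -> List[Item]:
    """Single pass over the stream: keep an item only if its marginal gain
    meets the (assumed, fixed) threshold and the budget is not exhausted."""
    selected: List[Item] = []
    current = value(selected)
    for item in stream:
        if len(selected) >= budget:
            break  # budget reached; remaining items are discarded unseen
        gain = value(selected + [item]) - current
        if gain >= threshold:
            selected.append(item)
            current += gain
    return selected


# Hypothetical toy stream: each item "covers" a few integer-labeled concepts.
stream = [
    frozenset({1, 2, 3}),
    frozenset({2, 3}),      # redundant given the first item
    frozenset({4, 5}),
    frozenset({1, 5}),      # redundant given earlier picks
    frozenset({6, 7, 8}),
]
picked = stream_select(stream, coverage, budget=3, threshold=2)
print(picked, coverage(picked))
```

In this toy run the redundant items are skipped because their marginal gain falls below the threshold. Memory use is bounded by the budget, since only the selected items are stored.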
