基于一般核心的多样性方法在矩阵约束下最大化

论文标题

基于一般核心的多样性方法在矩阵约束下最大化

A General Coreset-Based Approach to Diversity Maximization under Matroid Constraints

论文作者

Ceccarello, Matteo, Pietracaprina, Andrea, Pucci, Geppino

论文摘要

多样性最大化是网络搜索和数据挖掘中的一个基本问题。对于给定的数据集$ n $元素的$ s $，问题需要确定包含$ k \ ll n $“代表”的$ s $的子集，该$ k \ ll n $“代表”最小化了以成对距离表示的一些多样性函数，其中距离模型的相似性不同。问题的一个重要变体是规定该解决方案满足额外的正交需求，可以将其指定为矩阵约束（即，可行的解决方案必须是给定Matroid的尺寸$ K $的独立集合）。尽管不受限制的多样性最大化允许多种多样性功能的有效核心策略，但处理额外的矩阵约束的已知方法仅适用于一个多样性函数（距离之和），并且基于整个输入数据集中昂贵的，固有的，固有的，固有的，固有的，本质上的本地搜索。我们设计了第一个基于核心的算法，用于在Matroid约束下为各种多样性函数以及有效的顺序，MAPREDUCE和流式传输实现的多样性最大化。从技术上讲，我们的算法依赖于一个小核心的构建，即包含可行解决方案的$ s $的子集，该解决方案不超过$1-ε$的$1-ε$，距离最佳解决方案的$ s $。虽然我们的算法是完全一般的，但对于分区和横向矩阵，如果$ε$是$（0,1）$的常数，并且$ s $具有限制的两倍尺寸，则核心尺寸与$ n $无关，并且足够小，并且足以使缓慢的顺序algorithm在最终的ALGORITHM中的执行，以提取最终，准确，准确的时间，解决方案。广泛的实验表明，我们的算法是准确，快速和可扩展的，因此它们能够处理大数据方案的典型输入实例。

Diversity maximization is a fundamental problem in web search and data mining. For a given dataset $S$ of $n$ elements, the problem requires to determine a subset of $S$ containing $k\ll n$ "representatives" which minimize some diversity function expressed in terms of pairwise distances, where distance models dissimilarity. An important variant of the problem prescribes that the solution satisfy an additional orthogonal requirement, which can be specified as a matroid constraint (i.e., a feasible solution must be an independent set of size $k$ of a given matroid). While unconstrained diversity maximization admits efficient coreset-based strategies for several diversity functions, known approaches dealing with the additional matroid constraint apply only to one diversity function (sum of distances), and are based on an expensive, inherently sequential, local search over the entire input dataset. We devise the first coreset-based algorithms for diversity maximization under matroid constraints for various diversity functions, together with efficient sequential, MapReduce and Streaming implementations. Technically, our algorithms rely on the construction of a small coreset, that is, a subset of $S$ containing a feasible solution which is no more than a factor $1-ε$ away from the optimal solution for $S$. While our algorithms are fully general, for the partition and transversal matroids, if $ε$ is a constant in $(0,1)$ and $S$ has bounded doubling dimension, the coreset size is independent of $n$ and it is small enough to afford the execution of a slow sequential algorithm to extract a final, accurate, solution in reasonable time. Extensive experiments show that our algorithms are accurate, fast and scalable, and therefore they are capable of dealing with the large input instances typical of the big data scenario.

下载PDF全文

下载文献需遵守相关版权规定

论文标题