论文标题

频率功能的合成草图:超越最坏的情况

Composable Sketches for Functions of Frequencies: Beyond the Worst Case

论文作者

Cohen, Edith, Geri, Ofir, Pagh, Rasmus

论文摘要

最近,人们对使用机器学习技术来改善经典算法的兴趣增加了。在本文中,我们研究了何时根据数据频率的功能构建用于加权采样和统计估算的可组合草图。这些结构现在是大规模数据分析和机器学习管道的中心组成部分。但是,已知许多常见的功能,例如阈值和p> 2的频率矩,在最坏情况下需要多项式大小的草图。在两种不同类型的假设下,我们探索了最坏情况以外的性能。首先是访问有关项目频率的嘈杂建议。这延续了Hsu等人的工作。 (ICLR 2019),假设预测是由机器学习模型提供的。第二个是在限制的输入频率分布类别上提供保证的性能,这些频率分布与实践中观察到的情况更好。这扩展了Charikar等人的开创性论文中Zipfian分布的重击球手的工作。 (ICALP 2002)。出乎意料的是,我们在分析和经验上表明,“在实践中”“在”小粒子大小的草图为“硬”功能提供了准确性。

Recently there has been increased interest in using machine learning techniques to improve classical algorithms. In this paper we study when it is possible to construct compact, composable sketches for weighted sampling and statistics estimation according to functions of data frequencies. Such structures are now central components of large-scale data analytics and machine learning pipelines. However, many common functions, such as thresholds and p-th frequency moments with p > 2, are known to require polynomial-size sketches in the worst case. We explore performance beyond the worst case under two different types of assumptions. The first is having access to noisy advice on item frequencies. This continues the line of work of Hsu et al. (ICLR 2019), who assume predictions are provided by a machine learning model. The second is providing guaranteed performance on a restricted class of input frequency distributions that are better aligned with what is observed in practice. This extends the work on heavy hitters under Zipfian distributions in a seminal paper of Charikar et al. (ICALP 2002). Surprisingly, we show analytically and empirically that "in practice" small polylogarithmic-size sketches provide accuracy for "hard" functions.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源