论文标题

基于样本的查询的尖锐频率边界

Sharp Frequency Bounds for Sample-Based Queries

论文作者

Bax, Eric, Donald, John

论文摘要

数据草图算法扫描一个大数据集,收集少量数据 - 该草图可用于从统计上推断大数据集的属性。一些数据草图算法对大数据集进行了固定尺寸的随机样本,并使用该样本来推断符合大数据集中各种标准的项目频率。本文展示了如何有效且准确地推断出统计学上的近似正确(PAC)边界,以至于频率边界的频率界限是尖锐的或仅一个一个频率,这是最佳的结果,而无需精确计算。

A data sketch algorithm scans a big data set, collecting a small amount of data -- the sketch, which can be used to statistically infer properties of the big data set. Some data sketch algorithms take a fixed-size random sample of a big data set, and use that sample to infer frequencies of items that meet various criteria in the big data set. This paper shows how to statistically infer probably approximately correct (PAC) bounds for those frequencies, efficiently, and precisely enough that the frequency bounds are either sharp or off by only one, which is the best possible result without exact computation.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源