数据流中的频率估计：学习最佳哈希方案

论文标题

数据流中的频率估计：学习最佳哈希方案

Frequency Estimation in Data Streams: Learning the Optimal Hashing Scheme

论文作者

Bertsimas, Dimitris, Digalakis Jr, Vassilis

论文摘要

我们为基于优化和机器学习的数据流中频率估计问题提供了一种新颖的方法。与最新的流频率估计算法相反，该算法在很大程度上依赖于随机哈希来使用有限的存储来维持数据蒸汽的频率分布，而拟议的方法利用了观察到的流前缀来近距离地哈希元素并压缩目标频率分布。我们开发了精确的混合构成线性优化公式，使我们能够为观察到的流前缀中看到的元素计算最佳或近乎最佳的哈希方案；然后，我们使用机器学习来散布看不见的元素。此外，我们开发了一种有效的块坐标下降算法，正如我们经验所表明的那样，该算法产生了高质量的解决方案，并且在特殊情况下，我们能够使用动态编程在线性时间中精确地求解所提出的配方。我们在综合数据集和现实世界搜索查询数据上凭经验评估了所提出的方法。我们表明，所提出的方法以其平均值（每个元素）估计误差（每个元素）估计误差（每个元素）估计误差（每元素）的预期估计幅度的估计幅度而言，以其平均值（每个元素）估计误差（每个元素）估计误差（每个元素）估计误差（每个元素）估计误差，均优于一到两个数量级。

We present a novel approach for the problem of frequency estimation in data streams that is based on optimization and machine learning. Contrary to state-of-the-art streaming frequency estimation algorithms, which heavily rely on random hashing to maintain the frequency distribution of the data steam using limited storage, the proposed approach exploits an observed stream prefix to near-optimally hash elements and compress the target frequency distribution. We develop an exact mixed-integer linear optimization formulation, which enables us to compute optimal or near-optimal hashing schemes for elements seen in the observed stream prefix; then, we use machine learning to hash unseen elements. Further, we develop an efficient block coordinate descent algorithm, which, as we empirically show, produces high quality solutions, and, in a special case, we are able to solve the proposed formulation exactly in linear time using dynamic programming. We empirically evaluate the proposed approach both on synthetic datasets and on real-world search query data. We show that the proposed approach outperforms existing approaches by one to two orders of magnitude in terms of its average (per element) estimation error and by 45-90% in terms of its expected magnitude of estimation error.

下载PDF全文

下载文献需遵守相关版权规定

论文标题