论文标题
HyperLoglog和PCSA的简单,更好的核心估计器
Simpler and Better Cardinality Estimators for HyperLogLog and PCSA
论文作者
论文摘要
\ emph {基数估计}(aka \ emph {不同的元素})是用许多工业应用绘制的经典问题。尽管素描\ emph {算法}非常简单,但众所周知,分析基数\ emph {估算师}还是很难的,即使到了今天,诸如Hyperloglog和(压缩)\ PCSA {}等最先进的草图也没有在研究生水平的大数据课程中覆盖。 在本文中,我们定义了一类\ emph {广义剩余区域}(\ tgra)估计量,并观察到pcSA的超置,logog和某些估计量仅仅是$ tgra {}的实例化\ tgra {}的各种积分值$ $τ$。然后,我们分析\ tgra {}估算器的限制相对方差。事实证明,可以通过选择$τ$的\ emph {分数}值来改进超置log和PCSA的标准估计器。由此产生的估计量来到\ emph {非常}靠近cramér-rao的较低范围,用于超置槽{}和pcsa,并从其渔民信息中得出。尽管可以使用最大似然估计器(MLE)来实现cramér-rao下限\ emph {},但MLE很麻烦地计算并动态更新。相比之下,\ tgra {}估计器在恒定时间内进行更新是微不足道的。 我们的演讲仅假定基本的演算和概率,没有任何复杂的分析〜\ cite {flajoletm85,durandf03,flajoletfgm07}。
\emph{Cardinality Estimation} (aka \emph{Distinct Elements}) is a classic problem in sketching with many industrial applications. Although sketching \emph{algorithms} are fairly simple, analyzing the cardinality \emph{estimators} is notoriously difficult, and even today the state-of-the-art sketches such as HyperLogLog and (compressed) \PCSA{} are not covered in graduate level Big Data courses. In this paper we define a class of \emph{generalized remaining area} (\tGRA) estimators, and observe that HyperLogLog, LogLog, and some estimators for PCSA are merely instantiations of \tGRA{} for various integral values of $τ$. We then analyze the limiting relative variance of \tGRA{} estimators. It turns out that the standard estimators for HyperLogLog and PCSA can be improved by choosing a \emph{fractional} value of $τ$. The resulting estimators come \emph{very} close to the Cramér-Rao lower bounds for HyperLogLog{} and PCSA derived from their Fisher information. Although the Cramér-Rao lower bound \emph{can} be achieved with the Maximum Likelihood Estimator (MLE), the MLE is cumbersome to compute and dynamically update. In contrast, \tGRA{} estimators are trivial to update in constant time. Our presentation assumes only basic calculus and probability, not any complex analysis~\cite{FlajoletM85,DurandF03,FlajoletFGM07}.