Paper Title

Spectral Analysis of Word Statistics

Authors

Chaim Even-Zohar, Tsviqa Lakrec, Ran J. Tessler

Abstract

Given a random text over a finite alphabet, we study the frequencies at which fixed-length words occur as subsequences. As the data size grows, the joint distribution of word counts exhibits a rich asymptotic structure. We investigate all linear combinations of subword statistics, and fully characterize their different orders of magnitude using diverse algebraic tools. Moreover, we establish the spectral decomposition of the space of word statistics of each order. We provide explicit formulas for the eigenvectors and eigenvalues of the covariance matrix of the multivariate distribution of these statistics. Our techniques include and elaborate on a set of algebraic word operators, recently studied and employed by Dieker and Saliola (Adv Math, 2018). Subword counts find applications in Combinatorics, Statistics, and Computer Science. We revisit special cases from the combinatorial literature, such as intransitive dice, random core partitions, and questions on random walks. Our structural approach describes several classical statistical tests in a unified framework. We propose further potential applications to data analysis and machine learning.
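As a rough illustration of the basic object in the abstract, the sketch below counts how often a fixed word occurs as a (not necessarily contiguous) subsequence of a text, and collects these counts for all words of length k. The function names `count_subsequence_occurrences` and `word_statistics` are hypothetical, chosen here for illustration, and the dynamic program is a standard counting technique, not necessarily the method used in the paper.

```python
from itertools import product


def count_subsequence_occurrences(text, word):
    """Count index tuples i_1 < ... < i_k with text[i_j] == word[j] for all j."""
    # ways[j] = number of ways to match the first j letters of `word`
    # using the part of `text` read so far (the empty prefix matches once).
    ways = [1] + [0] * len(word)
    for ch in text:
        # Go right to left so one text position is never used twice in a match.
        for j in range(len(word), 0, -1):
            if word[j - 1] == ch:
                ways[j] += ways[j - 1]
    return ways[len(word)]


def word_statistics(text, alphabet, k):
    """Map every length-k word over `alphabet` to its subsequence count in `text`."""
    words = ("".join(letters) for letters in product(alphabet, repeat=k))
    return {w: count_subsequence_occurrences(text, w) for w in words}


if __name__ == "__main__":
    # Illustrative input only; the paper studies random texts of growing length.
    print(word_statistics("abab", "ab", 2))
```

For a text of length n, the counts over all words of length k sum to the binomial coefficient C(n, k), since every choice of k positions spells exactly one word. This gives a quick sanity check on the output.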
