Paper title
The hypergeometric test performs comparably to TF-IDF on standard text analysis tasks
Paper authors
Abstract
Term frequency-inverse document frequency (TF-IDF) and its many variants form a class of term weighting functions whose members are widely used in text analysis applications. While TF-IDF was originally proposed as a heuristic, theoretical justifications grounded in information theory, probability, and the divergence from randomness paradigm have since been advanced. In this work, we present an empirical study showing that TF-IDF corresponds very closely to the hypergeometric test of statistical significance on selected real-data document retrieval, summarization, and classification tasks. These findings suggest that a fundamental mathematical connection between TF-IDF and the negative logarithm of the hypergeometric test P-value (i.e., a hypergeometric distribution tail probability) remains to be elucidated. We advance the empirical analyses herein as a first step toward explaining the long-standing effectiveness of TF-IDF through the lens of statistical significance testing. It is our aspiration that these results will open the door to the systematic evaluation of significance-testing-derived term weighting functions in text analysis applications.
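To make the two quantities compared in the abstract concrete, the following is a minimal sketch that computes both for the terms of one document in a toy corpus. The abstract does not specify the exact formulation, so everything here is an assumption: a token-level urn model for the hypergeometric test (N tokens in the collection, K of them the term, n tokens in the document, k of them the term), the basic tf * log(D/df) form of TF-IDF, and the hypothetical helper names `neg_log_pvalue`, `tf_idf`, and `score_terms`.

```python
from collections import Counter
from math import comb, log


def neg_log_pvalue(k, n, K, N):
    """-log P(X >= k) for X ~ Hypergeometric(population N, successes K, draws n).

    Assumed urn model: the collection holds N tokens, K of which are the term;
    the document draws n tokens, k of which are the term.
    """
    upper = min(n, K)
    # Number of ways to draw n tokens with at least k occurrences of the term.
    # comb() returns 0 when the second argument exceeds the first.
    numer = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    # Take logs of the exact integers to avoid floating-point underflow.
    return log(comb(N, n)) - log(numer)


def tf_idf(tf, df, num_docs):
    """Basic TF-IDF variant (an assumption): raw term frequency times log(D / df)."""
    return tf * log(num_docs / df)


def score_terms(docs, doc_index):
    """Return {term: (tf_idf, neg_log_pvalue)} for every term in one document."""
    doc = docs[doc_index]
    doc_counts = Counter(doc)
    collection_counts = Counter(tok for d in docs for tok in d)
    N = sum(collection_counts.values())
    n = len(doc)
    scores = {}
    for term, k in doc_counts.items():
        df = sum(1 for d in docs if term in d)
        scores[term] = (
            tf_idf(k, df, len(docs)),
            neg_log_pvalue(k, n, collection_counts[term], N),
        )
    return scores


# Toy corpus, invented purely for illustration.
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["the", "cat", "chased", "the", "dog"],
    ["stocks", "fell", "as", "the", "market", "closed"],
]

if __name__ == "__main__":
    for term, (w_tfidf, w_hyper) in sorted(score_terms(docs, 3).items()):
        print(f"{term:8s} tf-idf={w_tfidf:.3f}  -log(p)={w_hyper:.3f}")
```

On this toy corpus both weightings rank a term unique to the document ("stocks") above a term found in every document ("the"). This only illustrates the kind of agreement the abstract describes and does not reproduce the study's experiments.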