Two halves of a meaningful text are statistically different

Paper title

Two halves of a meaningful text are statistically different

Authors

Deng, Weibing, Xie, R., Deng, S., Allahverdyan, Armen E.

Abstract

Which statistical features distinguish a meaningful text (possibly written in an unknown system) from a meaningless set of symbols? Here we answer this question by comparing features of the first half of a text to those of its second half. This comparison can uncover hidden effects, because the two halves have the same values of many parameters (style, genre, etc.). We found that the first half has more distinct words and more rare words than the second half. Also, words in the first half are distributed less homogeneously over the text, in the sense of the difference between the frequency and the inverse spatial period. These differences hold for a significant majority of the several hundred relatively short texts we studied. The statistical significance is confirmed via the Wilcoxon test. The differences disappear after a random permutation of words that destroys the linear structure of the text. The differences reveal a temporal asymmetry in meaningful texts, which is confirmed by showing that texts compress much better in their natural order (i.e. along the narrative) than in the word-inverted form. We conjecture that these results connect the semantic organization of a text (defined by the flow of its narrative) to its statistical features.
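
The half-versus-half comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code, and several choices are assumptions made only for illustration: the regex tokenizer, the definition of a "rare" word as one that occurs exactly once in the whole text, and the use of `zlib` as the compressor. The function names `tokenize`, `half_stats`, `compressed_size` and `shuffled` are hypothetical.

```python
import random
import re
import zlib
from collections import Counter


def tokenize(text):
    """Lowercase the text and extract word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def half_stats(words):
    """Count distinct and rare words in each half of a word list.

    Assumption: a word is "rare" if it occurs exactly once in the whole text.
    Returns one dict per half, in the order (first, second).
    """
    mid = len(words) // 2
    totals = Counter(words)
    stats = []
    for half in (words[:mid], words[mid:]):
        distinct = set(half)
        rare = {w for w in distinct if totals[w] == 1}
        stats.append({"distinct": len(distinct), "rare": len(rare)})
    return stats[0], stats[1]


def compressed_size(words):
    """Size in bytes of the zlib-compressed word sequence (assumed compressor)."""
    return len(zlib.compress(" ".join(words).encode("utf-8"), 9))


def shuffled(words, seed=0):
    """Return a randomly permuted copy, destroying the linear order of the text."""
    copy = list(words)
    random.Random(seed).shuffle(copy)
    return copy
```

For a single text, one would compare `half_stats(words)` with `half_stats(shuffled(words))`, and `compressed_size(words)` with `compressed_size(words[::-1])`. Across a collection of texts, the per-text differences between the two halves could then be passed to a paired test such as `scipy.stats.wilcoxon`; whether this matches the exact procedure of the paper is not stated in the abstract.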
