Using word embeddings to improve the discriminability of co-occurrence text networks

Paper title

Using word embeddings to improve the discriminability of co-occurrence text networks

Authors

Quispe, Laura V. C., Tohalino, Jorge A. V., Amancio, Diego R.

Abstract

Word co-occurrence networks have been employed to analyze texts in both practical and theoretical scenarios. Despite their relative success in several applications, traditional co-occurrence networks fail to establish links between similar words whenever those words appear far apart in the text. Here we investigate whether using word embeddings to create virtual links in co-occurrence networks can improve the quality of classification systems. Our results reveal that discriminability in the stylometry task improves when using GloVe, Word2Vec and FastText. In addition, we find that optimized results are obtained when stopwords are not disregarded and a simple global thresholding strategy is used to establish virtual links. Because the proposed approach improves the representation of texts as complex networks, we believe it could be extended to other natural language processing tasks. Likewise, theoretical language studies could benefit from the enriched representation of word co-occurrence networks adopted here.
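
The idea of enriching a co-occurrence network with virtual links can be illustrated with a minimal sketch. It is not the authors' implementation: the adjacency window of two words, the cosine similarity measure, the threshold value of 0.95, the helper names, and the hand-made two-dimensional vectors in `toy_embeddings` are all assumptions chosen for illustration. In practice the vectors would come from a pretrained model such as GloVe, Word2Vec or FastText.

```python
import math
from itertools import combinations


def cooccurrence_edges(tokens):
    """Link each pair of adjacent, distinct words (window size is an assumption)."""
    edges = set()
    for a, b in zip(tokens, tokens[1:]):
        if a != b:
            edges.add(frozenset((a, b)))
    return edges


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def virtual_edges(tokens, embeddings, threshold):
    """Add a link between two unconnected words when their embedding
    similarity reaches one global threshold shared by all pairs."""
    existing = cooccurrence_edges(tokens)
    vocab = [w for w in sorted(set(tokens)) if w in embeddings]
    added = set()
    for a, b in combinations(vocab, 2):
        pair = frozenset((a, b))
        if pair not in existing and cosine(embeddings[a], embeddings[b]) >= threshold:
            added.add(pair)
    return added


# Hypothetical toy vectors, not taken from any real embedding model.
toy_embeddings = {
    "cat": (1.0, 0.1),
    "kitten": (0.9, 0.2),
    "sat": (0.1, 1.0),
    "slept": (0.2, 0.9),
    "the": (0.5, 0.5),
}

# Stopwords such as "the" are kept, in line with the abstract's finding.
tokens = "the cat sat the kitten slept".split()
base = cooccurrence_edges(tokens)
virtual = virtual_edges(tokens, toy_embeddings, threshold=0.95)
enriched = base | virtual
```

In this toy text, "cat" and "kitten" never appear side by side, so the plain co-occurrence network leaves them unconnected; the virtual link joins them because their vectors are nearly parallel. Network measurements for classification would then be computed on `enriched` rather than on `base`.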
