Paper Title

Text classification with word embedding regularization and soft similarity measure

Paper Authors

Vít Novotný, Eniafe Festus Ayetiran, Michal Štefánik, Petr Sojka

Paper Abstract

Since the seminal work of Mikolov et al., word embeddings have become the preferred word representations for many natural language processing tasks. Document similarity measures extracted from word embeddings, such as the soft cosine measure (SCM) and the Word Mover's Distance (WMD), were reported to achieve state-of-the-art performance on semantic text similarity and text classification. Despite the strong performance of the WMD on text classification and semantic text similarity, its super-cubic average time complexity is impractical. The SCM has quadratic worst-case time complexity, but its performance on text classification has never been compared with the WMD. Recently, two word embedding regularization techniques were shown to reduce storage and memory costs, and to improve training speed, document processing speed, and task performance on word analogy, word similarity, and semantic text similarity. However, the effect of these techniques on text classification has not yet been studied. In our work, we investigate the individual and joint effect of the two word embedding regularization techniques on the document processing speed and the task performance of the SCM and the WMD on text classification. For evaluation, we use the $k$NN classifier and six standard datasets: BBCSPORT, TWITTER, OHSUMED, REUTERS-21578, AMAZON, and 20NEWS. We show 39% average $k$NN test error reduction with regularized word embeddings compared to non-regularized word embeddings. We describe a practical procedure for deriving such regularized embeddings through Cholesky factorization. We also show that the SCM with regularized word embeddings significantly outperforms the WMD on text classification and is over 10,000 times faster.
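
The abstract refers to the soft cosine measure (SCM) and to a practical procedure for deriving regularized embeddings through Cholesky factorization. The following NumPy sketch illustrates the general idea under stated assumptions: it is not the authors' released code, and the helper names (`similarity_matrix`, `soft_cosine_measure`, `cholesky_document_vectors`), the unthresholded similarity matrix, and the small `ridge` term are illustrative choices.

```python
import numpy as np

def similarity_matrix(embeddings):
    """Word-to-word cosine similarity matrix S from word embeddings.
    Implementations often threshold small or negative entries; the raw
    cosine Gram matrix is kept here so S stays positive semidefinite."""
    normalized = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return normalized @ normalized.T

def soft_cosine_measure(x, y, sim):
    """SCM(x, y) = x^T S y / sqrt((x^T S x) (y^T S y)) for bag-of-words vectors."""
    return (x @ sim @ y) / np.sqrt((x @ sim @ x) * (y @ sim @ y))

def cholesky_document_vectors(sim, docs, ridge=1e-6):
    """Factor S ~= L L^T and map each document row x to x^T L = (L^T x)^T,
    so that x^T S y becomes a plain dot product of the mapped vectors."""
    L = np.linalg.cholesky(sim + ridge * np.eye(sim.shape[0]))  # ridge keeps S positive definite
    return docs @ L

# Toy usage: the cosine of the mapped vectors reproduces the SCM
# (up to the tiny ridge added before factorization).
rng = np.random.default_rng(0)
S = similarity_matrix(rng.normal(size=(5, 8)))  # 5 words, 8-dim embeddings
x = np.array([1.0, 0.0, 2.0, 0.0, 1.0])         # term counts of document 1
y = np.array([0.0, 1.0, 1.0, 1.0, 0.0])         # term counts of document 2
print(soft_cosine_measure(x, y, S))
tx, ty = cholesky_document_vectors(S, np.stack([x, y]))
print(tx @ ty / (np.linalg.norm(tx) * np.linalg.norm(ty)))
```

Because S factors as LL^T, mapping every document vector x to L^T x once means each later comparison is an ordinary dot product rather than an evaluation of the quadratic form x^T S y per document pair.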
