Paper Title
PBoS: Probabilistic Bag-of-Subwords for Generalizing Word Embedding
Paper Authors
Paper Abstract
We look into the task of \emph{generalizing} word embeddings: given a set of pre-trained word vectors over a finite vocabulary, the goal is to predict embedding vectors for out-of-vocabulary words, \emph{without} extra contextual information. We rely solely on the spellings of words and propose a model, along with an efficient algorithm, that simultaneously models subword segmentation and computes subword-based compositional word embeddings. We call the model probabilistic bag-of-subwords (PBoS), as it applies bag-of-subwords to all possible segmentations of a word, weighted by their likelihoods. Inspections and affix prediction experiments show that PBoS is able to produce meaningful subword segmentations and subword rankings without any source of explicit morphological knowledge. Word similarity and POS tagging experiments show clear advantages of PBoS over previous subword-level models in the quality of generated word embeddings across languages.
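
The abstract's idea of weighting bag-of-subwords over all possible segmentations can be made concrete with a short sketch. The code below is a minimal illustration under stated assumptions, not the authors' released implementation: it assumes a toy unigram subword model (`subword_prob`) and toy subword vectors (`subword_vec`), both hypothetical, and uses a forward-backward dynamic program to marginalize subword occurrences over every segmentation in time linear in the word length (times a maximum subword length), instead of enumerating the exponentially many segmentations.

```python
import numpy as np

# Hypothetical toy inputs: a unigram subword model and subword vectors.
# In PBoS proper these come from training; here they only illustrate the math.
subword_prob = {"un": 0.3, "fathom": 0.2, "able": 0.3, "fa": 0.1, "thom": 0.1}
DIM, MAX_SUB = 4, 6
rng = np.random.default_rng(0)
subword_vec = {s: rng.normal(size=DIM) for s in subword_prob}

def pbos_embedding(word: str) -> np.ndarray:
    """Bag-of-subwords embedding marginalized over all segmentations of `word`.

    alpha[i] = total probability of segmenting word[:i];
    beta[i]  = total probability of segmenting word[i:].
    The weight of a subword occurrence word[i:j], summed over all
    segmentations containing it, is alpha[i] * p(word[i:j]) * beta[j],
    normalized by alpha[len(word)].
    """
    n = len(word)
    # Forward pass over prefix segmentation probabilities.
    alpha = [0.0] * (n + 1)
    alpha[0] = 1.0
    for j in range(1, n + 1):
        for i in range(max(0, j - MAX_SUB), j):
            sub = word[i:j]
            if sub in subword_prob:
                alpha[j] += alpha[i] * subword_prob[sub]
    # Backward pass over suffix segmentation probabilities.
    beta = [0.0] * (n + 1)
    beta[n] = 1.0
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, min(n, i + MAX_SUB) + 1):
            sub = word[i:j]
            if sub in subword_prob:
                beta[i] += subword_prob[sub] * beta[j]
    vec = np.zeros(DIM)
    if alpha[n] == 0.0:  # no valid segmentation under the toy vocabulary
        return vec
    # Accumulate each subword vector with its probability-weighted count.
    for i in range(n):
        for j in range(i + 1, min(n, i + MAX_SUB) + 1):
            sub = word[i:j]
            if sub in subword_prob:
                weight = alpha[i] * subword_prob[sub] * beta[j] / alpha[n]
                vec += weight * subword_vec[sub]
    return vec

print(pbos_embedding("unfathomable"))
```

On the toy vocabulary above, "unfathomable" has two segmentations (un|fathom|able and un|fa|thom|able), so the returned vector is a likelihood-weighted combination of the five subword vectors rather than the output of a single hard segmentation.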