Title

Robust and Consistent Estimation of Word Embedding for Bangla Language by fine-tuning Word2Vec Model

Authors

Rahman, Rifat

Abstract

Word embedding, the vector representation of a word, captures the syntactic and semantic characteristics of that word and can be an informative feature for any machine learning-based model of natural language processing. There are several models and toolkits for the vectorization of words, such as word2vec, fasttext, gensim, and glove. In this study, we analyze the word2vec model for learning word vectors by tuning different hyper-parameters and present the most effective word embedding for the Bangla language. To test the performance of the different word embeddings generated by fine-tuning the word2vec model, we perform both intrinsic and extrinsic evaluations. For intrinsic evaluation, we cluster the word vectors to examine the relational similarity of words; for extrinsic evaluation, we use the different word embeddings as features of a news article classifier. From our experiments, we find that 300-dimensional word vectors, generated with the "skip-gram" method of the word2vec model using a sliding window size of 4, give the most robust vector representations for the Bangla language.
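
As a rough illustration of the pipeline the abstract describes, the sketch below trains a word2vec model with the hyper-parameters reported as best (skip-gram, 300 dimensions, sliding window of 4) and outlines both evaluation steps. This is a minimal sketch, not the paper's exact code: it assumes gensim >= 4.0 and scikit-learn, and the toy corpus, the cluster count, and the averaged-vector document features are illustrative assumptions.

from gensim.models import Word2Vec
from sklearn.cluster import KMeans
import numpy as np

# Hypothetical stand-in for a large tokenized Bangla corpus;
# each document is a list of tokens.
corpus = [
    ["বাংলা", "ভাষা", "প্রক্রিয়াকরণ"],
    ["সংবাদ", "নিবন্ধ", "শ্রেণীবিভাগ"],
]

# Train with the hyper-parameters the abstract reports as most robust.
model = Word2Vec(
    sentences=corpus,
    vector_size=300,  # embedding dimensionality (gensim >= 4.0 argument name)
    window=4,         # sliding window size
    sg=1,             # 1 = skip-gram, 0 = CBOW
    min_count=1,      # kept at 1 only because the toy corpus is tiny
    workers=4,
)

# Intrinsic evaluation sketch: cluster all learned vectors and inspect
# whether semantically related words land in the same cluster.
words = model.wv.index_to_key
X = np.stack([model.wv[w] for w in words])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Extrinsic evaluation sketch: represent a news article as the average of
# its word vectors, a fixed-size feature usable by any standard classifier.
def doc_vector(tokens):
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)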
