Paper Title
Corpora Compared: The Case of the Swedish Gigaword & Wikipedia Corpora
Paper Authors
Paper Abstract
In this work, we show that differences in the performance of embeddings from differently sourced data for a given language can be due to factors other than data size. Natural language processing (NLP) tasks usually perform better with embeddings from bigger corpora; however, the breadth of the covered domain and the amount of noise can also play important roles. We evaluate embeddings based on two Swedish corpora, the Gigaword and Wikipedia corpora, on analogy (intrinsic) tests and find that the embeddings from the Wikipedia corpus generally outperform those from the Gigaword corpus, even though the latter is the bigger corpus. Downstream tests will be required for a definitive evaluation.
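The analogy (intrinsic) test mentioned above is commonly scored with the 3CosAdd method: given a pair relation a : b :: c : ?, the predicted word is the one whose vector is most cosine-similar to b − a + c, excluding the query words. A minimal sketch with toy 4-dimensional vectors for a few Swedish words (the vectors are illustrative only, not taken from either corpus or from the paper):

```python
import numpy as np

# Toy embedding table; a real evaluation would load vectors trained on
# the Gigaword or Wikipedia corpus. Values here are hand-picked so the
# example analogy resolves cleanly.
vectors = {
    "kung":      np.array([1.0, 1.0, 0.0, 0.0]),  # "king"
    "man":       np.array([1.0, 0.0, 0.0, 0.0]),  # "man"
    "kvinna":    np.array([0.0, 0.0, 1.0, 0.0]),  # "woman"
    "drottning": np.array([0.0, 1.0, 1.0, 0.0]),  # "queen"
}

def analogy(a, b, c, vocab):
    """Solve a : b :: c : ? by 3CosAdd: return the vocabulary word whose
    vector has the highest cosine similarity to b - a + c, with the three
    query words excluded from the candidates."""
    target = vocab[b] - vocab[a] + vocab[c]
    best, best_sim = None, -2.0
    for word, vec in vocab.items():
        if word in (a, b, c):
            continue
        sim = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

print(analogy("man", "kung", "kvinna", vectors))  # → drottning
```

An analogy test set scores an embedding by the fraction of such questions it answers correctly; the abstract's comparison rests on scores of this kind for the two corpora.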