Tailoring and Evaluating the Wikipedia for in-Domain Comparable Corpora Extraction

Paper title

Tailoring and Evaluating the Wikipedia for in-Domain Comparable Corpora Extraction

Authors

Cristina España-Bonet, Alberto Barrón-Cedeño, Lluís Màrquez

Abstract

We propose an automatic language-independent graph-based method to build à-la-carte article collections on user-defined domains from the Wikipedia. The core model is based on the exploration of the encyclopaedia's category graph and can produce both monolingual and multilingual comparable collections. We run thorough experiments to assess the quality of the obtained corpora in 10 languages and 743 domains. According to an extensive manual evaluation, our graph-based model outperforms a retrieval-based approach and reaches an average precision of 84% on in-domain articles. As manual evaluations are costly, we introduce the concept of "domainness" and design several automatic metrics to account for the quality of the collections. Our best metric for domainness shows a strong correlation with the human-judged precision, representing a reasonable automatic alternative to assess the quality of domain-specific corpora. We release the WikiTailor toolkit with the implementation of the extraction methods, the evaluation measures and several utilities. WikiTailor makes obtaining multilingual in-domain data from the Wikipedia easy.
