Paper Title

Multilingual Open Text Release 1: Public Domain News in 44 Languages

Authors

Chester Palen-Michel, June Kim, Constantine Lignos

Abstract

We present Multilingual Open Text (MOT), a new multilingual corpus containing text in 44 languages, many of which have limited existing text resources for natural language processing. The first release of the corpus contains over 2.8 million news articles and an additional 1 million short snippets (photo captions, video descriptions, etc.) published between 2001 and 2022 and collected from Voice of America's news websites. We describe our process for collecting, filtering, and processing the data. The source material is in the public domain, our collection is licensed under a Creative Commons license (CC BY 4.0), and all software used to create the corpus is released under the MIT License. The corpus will be regularly updated as additional documents are published.
