论文标题
借用还是代码开关?在语言混合中对更细粒度的区分进行注释
Borrowing or Codeswitching? Annotating for Finer-Grained Distinctions in Language Mixing
论文作者
论文摘要
我们提出了一个新的Twitter数据语料库,该数据注释了西班牙语和英语之间的代码交换和借用。该语料库包含带有代码开关,借款和命名实体的令牌级别注释的9,500条推文。该语料库与先前的代码开关情况有所不同,因为我们试图清楚地定义和注释codeswitching and Loarding之间的边界,而在原本单语的上下文中使用时,并不将常见的“互联网说话”('LOL'等)视为代码开关。结果是一个语料库,可以在一个数据集中的Twitter上进行西班牙语 - 英语借用和代码开关的研究和建模。我们提出了使用基于变压器的语言模型对该语料库的标签进行建模的基线得分。注释本身由CC by 4.0许可发布,而其适用的文本则按照Twitter服务条款分发。
We present a new corpus of Twitter data annotated for codeswitching and borrowing between Spanish and English. The corpus contains 9,500 tweets annotated at the token level with codeswitches, borrowings, and named entities. This corpus differs from prior corpora of codeswitching in that we attempt to clearly define and annotate the boundary between codeswitching and borrowing and do not treat common "internet-speak" ('lol', etc.) as codeswitching when used in an otherwise monolingual context. The result is a corpus that enables the study and modeling of Spanish-English borrowing and codeswitching on Twitter in one dataset. We present baseline scores for modeling the labels of this corpus using Transformer-based language models. The annotation itself is released with a CC BY 4.0 license, while the text it applies to is distributed in compliance with the Twitter terms of service.