Detecting Foodborne Illness Complaints in Multiple Languages Using English Annotations Only

Paper Title

Detecting Foodborne Illness Complaints in Multiple Languages Using English Annotations Only

Authors

Ziyi Liu, Giannis Karamanolakis, Daniel Hsu, Luis Gravano

Abstract

Health departments have been deploying text classification systems for the early detection of foodborne illness complaints in social media documents such as Yelp restaurant reviews. Current systems have been successfully applied for documents in English and, as a result, a promising direction is to increase coverage and recall by considering documents in additional languages, such as Spanish or Chinese. Training previous systems for more languages, however, would be expensive, as it would require the manual annotation of many documents for each new target language. To address this challenge, we consider cross-lingual learning and train multilingual classifiers using only the annotations for English-language reviews. Recent zero-shot approaches based on pre-trained multi-lingual BERT (mBERT) have been shown to effectively align languages for aspects such as sentiment. Interestingly, we show that those approaches are less effective for capturing the nuances of foodborne illness, our public health application of interest. To improve performance without extra annotations, we create artificial training documents in the target language through machine translation and train mBERT jointly for the source (English) and target language. Furthermore, we show that translating labeled documents to multiple languages leads to additional performance improvements for some target languages. We demonstrate the benefits of our approach through extensive experiments with Yelp restaurant reviews in seven languages. Our classifiers identify foodborne illness complaints in multilingual reviews from the Yelp Challenge dataset, which highlights the potential of our general approach for deployment in health departments.
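The data-augmentation step described in the abstract (translating labeled English reviews into the target languages and training on the source and target languages jointly) can be illustrated with a minimal sketch. It is not the authors' implementation: `build_joint_training_set` and `toy_translate` are hypothetical names, the translator is a placeholder standing in for a real machine translation system, and the fine-tuning of mBERT on the resulting set is not shown.

```python
from typing import Callable, List, Tuple

# (review text, label, language code); label 1 = foodborne illness complaint
Example = Tuple[str, int, str]


def build_joint_training_set(
    english_docs: List[Tuple[str, int]],
    target_langs: List[str],
    translate: Callable[[str, str], str],
) -> List[Example]:
    """Combine labeled English reviews with machine-translated copies.

    Each translated copy inherits the label of its English source, so no
    manual annotation is needed in the target languages.
    """
    # Keep the original English examples so training covers source and target.
    joint: List[Example] = [(text, label, "en") for text, label in english_docs]
    for lang in target_langs:
        for text, label in english_docs:
            joint.append((translate(text, lang), label, lang))
    return joint


def toy_translate(text: str, lang: str) -> str:
    # Hypothetical placeholder: a real pipeline would call an MT system here.
    return f"[{lang}] {text}"


english_docs = [
    ("I got food poisoning after eating here.", 1),
    ("Great tacos and friendly staff.", 0),
]

# Translating into more than one target language mirrors the multi-language
# variant mentioned in the abstract.
joint = build_joint_training_set(english_docs, ["es", "zh"], toy_translate)
for text, label, lang in joint:
    print(lang, label, text)
```

The combined list would then serve as training data for a single multilingual classifier. Language codes are kept alongside each example only to make the composition of the set easy to inspect.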
