Paper Title

Cross-Lingual Text Classification with Minimal Resources by Transferring a Sparse Teacher

Authors

Giannis Karamanolakis, Daniel Hsu, Luis Gravano

Abstract

Cross-lingual text classification alleviates the need for manually labeled documents in a target language by leveraging labeled documents from other languages. Existing approaches for transferring supervision across languages require expensive cross-lingual resources, such as parallel corpora, while less expensive cross-lingual representation learning approaches train classifiers without target labeled documents. In this work, we propose a cross-lingual teacher-student method, CLTS, that generates "weak" supervision in the target language using minimal cross-lingual resources, in the form of a small number of word translations. Given a limited translation budget, CLTS extracts and transfers only the most important task-specific seed words across languages and initializes a teacher classifier based on the translated seed words. Then, CLTS iteratively trains a more powerful student that also exploits the context of the seed words in unlabeled target documents and outperforms the teacher. CLTS is simple and surprisingly effective in 18 diverse languages: by transferring just 20 seed words, even a bag-of-words logistic regression student outperforms state-of-the-art cross-lingual methods (e.g., based on multilingual BERT). Moreover, CLTS can accommodate any type of student classifier: leveraging a monolingual BERT student leads to further improvements and outperforms even more expensive approaches by up to 12% in accuracy. Finally, CLTS addresses emerging tasks in low-resource languages using just a small number of word translations.
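
The teacher-student pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The seed words, their weights, the English-to-Spanish dictionary, the toy documents, and all function names are hypothetical. A multinomial naive Bayes student stands in for the logistic regression or BERT students named in the abstract, and the iteration is plain self-training, which is an assumption about how the rounds could work.

```python
import math
from collections import Counter, defaultdict

# Hypothetical source-language seed words: word -> (class, weight).
source_seeds = {"great": ("pos", 2.0), "love": ("pos", 1.5),
                "terrible": ("neg", 2.0), "boring": ("neg", 1.2)}
# Hypothetical word-translation budget (English -> Spanish).
translations = {"great": "genial", "love": "encanta",
                "terrible": "terrible", "boring": "aburrido"}

def transfer_seeds(seeds, dictionary):
    """Translate seed words; seeds without a translation are dropped."""
    return {dictionary[w]: cw for w, cw in seeds.items() if w in dictionary}

def teacher_predict(tokens, target_seeds):
    """Sum seed weights per class; return None when no seed word occurs."""
    scores = defaultdict(float)
    for tok in tokens:
        if tok in target_seeds:
            label, weight = target_seeds[tok]
            scores[label] += weight
    return max(scores, key=scores.get) if scores else None

def train_student(docs, labels, smoothing=1.0):
    """Fit a multinomial naive Bayes model over all words, not just seeds."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {t for tokens in docs for t in tokens}
    return word_counts, class_counts, vocab, smoothing

def student_predict(tokens, model):
    """Return the class with the highest smoothed log-probability."""
    word_counts, class_counts, vocab, smoothing = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        total = sum(word_counts[label].values()) + smoothing * len(vocab)
        score = math.log(class_counts[label] / total_docs)
        for tok in tokens:
            if tok in vocab:
                score += math.log((word_counts[label][tok] + smoothing) / total)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def clts_sketch(unlabeled_docs, target_seeds, rounds=2):
    # First round: the teacher labels only documents that contain a seed word.
    pseudo = [(d, teacher_predict(d, target_seeds)) for d in unlabeled_docs]
    pseudo = [(d, y) for d, y in pseudo if y is not None]
    model = train_student([d for d, _ in pseudo], [y for _, y in pseudo])
    # Later rounds (assumed self-training): the student relabels every
    # document, including those the teacher abstained on, and is retrained.
    for _ in range(rounds - 1):
        labels = [student_predict(d, model) for d in unlabeled_docs]
        model = train_student(unlabeled_docs, labels)
    return model

target_seeds = transfer_seeds(source_seeds, translations)
unlabeled_docs = [
    ["la", "película", "es", "genial", "y", "divertida"],
    ["me", "encanta", "esta", "película", "divertida"],
    ["un", "guion", "terrible", "y", "lento"],
    ["muy", "aburrido", "y", "lento"],
    ["divertida"],
    ["lento"],
]
model = clts_sketch(unlabeled_docs, target_seeds)
print(teacher_predict(["divertida"], target_seeds))  # teacher abstains
print(student_predict(["divertida"], model))         # student uses context
```

In this toy data, "divertida" and "lento" are not seed words, so the teacher has no opinion on them. The student picks up their class association because they co-occur with translated seed words in the unlabeled documents, which is the intuition behind a student that outperforms its teacher.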
