Text Data Augmentation: Towards better detection of spear-phishing emails

Paper Title

Text Data Augmentation: Towards better detection of spear-phishing emails

Authors

Mehdi Regina, Maxime Meyer, Sébastien Goutal

Abstract

Text data augmentation, i.e., the creation of new textual data from an existing text, is challenging. Indeed, augmentation transformations should take into account language complexity while remaining relevant to the target Natural Language Processing (NLP) task (e.g., Machine Translation, Text Classification). Initially motivated by an application to Business Email Compromise (BEC) detection, we propose a corpus- and task-agnostic augmentation framework used as a service to augment English texts within our company. Our proposal combines different methods, utilizing the BERT language model, multi-step back-translation, and heuristics. We show that our augmentation framework improves performance on several text classification tasks using publicly available models and corpora, as well as on a BEC detection task. We also provide a comprehensive discussion of the limitations of our augmentation framework.

Paper Title