Paper Title
Domain Specific Fine-tuning of Denoising Sequence-to-Sequence Models for Natural Language Summarization
Paper Authors
Paper Abstract
Summarization of long-form text data is a problem especially pertinent to knowledge-economy fields such as medicine and finance, which require practitioners to stay continuously informed about a sophisticated and evolving body of knowledge. Automatically isolating and summarizing key content using Natural Language Processing (NLP) techniques therefore holds the potential for extensive time savings in these industries. We explore applications of a state-of-the-art NLP model (BART) and strategies for tuning it to optimal performance using data augmentation and various fine-tuning regimes. We show that our end-to-end fine-tuning approach can yield a 5-6% absolute ROUGE-1 improvement over an out-of-the-box pre-trained BART summarizer when tested on domain-specific data, and we make our end-to-end pipeline available to achieve these results on finance, medical, or other user-specified domains.
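For readers who want a concrete starting point, below is a minimal sketch of the kind of domain-specific BART fine-tuning and ROUGE-1 evaluation the abstract describes, assuming the Hugging Face transformers, datasets, and evaluate libraries. The checkpoint (facebook/bart-large-cnn), file names, column names, and hyperparameters are illustrative assumptions, not the authors' released pipeline.

```python
# Sketch: fine-tune a pre-trained BART summarizer on a domain-specific
# corpus, then compare against references with ROUGE-1. All names and
# hyperparameters below are illustrative assumptions.
from transformers import (
    BartForConditionalGeneration,
    BartTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)
from datasets import load_dataset
import evaluate

checkpoint = "facebook/bart-large-cnn"  # an out-of-the-box pre-trained summarizer
tokenizer = BartTokenizer.from_pretrained(checkpoint)
model = BartForConditionalGeneration.from_pretrained(checkpoint)

# Hypothetical domain-specific corpus with "document" and "summary" fields.
dataset = load_dataset("json", data_files={"train": "finance_train.json",
                                           "test": "finance_test.json"})

def preprocess(batch):
    # Tokenize source documents and target summaries for seq2seq training.
    inputs = tokenizer(batch["document"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=dataset["train"].column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="bart-domain-finetuned",
        per_device_train_batch_size=4,
        num_train_epochs=3,
        learning_rate=3e-5,
    ),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()

# Evaluate one held-out example with ROUGE-1, as in the abstract's comparison.
rouge = evaluate.load("rouge")
example = dataset["test"][0]
inputs = tokenizer(example["document"], max_length=1024, truncation=True,
                   return_tensors="pt").to(model.device)
summary_ids = model.generate(inputs.input_ids, max_length=128, num_beams=4)
prediction = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
scores = rouge.compute(predictions=[prediction], references=[example["summary"]])
print(scores["rouge1"])
```

Running the same ROUGE-1 computation with the unmodified checkpoint gives the out-of-the-box baseline against which a fine-tuned model's absolute improvement can be measured.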