Neural CRF Model for Sentence Alignment in Text Simplification

Paper title

Neural CRF Model for Sentence Alignment in Text Simplification

Authors

Chao Jiang, Mounica Maddela, Wuwei Lan, Yang Zhong, Wei Xu

Abstract

The success of a text simplification system heavily depends on the quality and quantity of complex-simple sentence pairs in the training corpus, which are extracted by aligning sentences between parallel articles. To evaluate and improve sentence alignment quality, we create two manually annotated sentence-aligned datasets from two commonly used text simplification corpora, Newsela and Wikipedia. We propose a novel neural CRF alignment model which not only leverages the sequential nature of sentences in parallel documents but also utilizes a neural sentence pair model to capture semantic similarity. Experiments demonstrate that our proposed approach outperforms all previous work on the monolingual sentence alignment task by more than 5 points in F1. We apply our CRF aligner to construct two new text simplification datasets, Newsela-Auto and Wiki-Auto, which are much larger and of better quality compared to the existing datasets. A Transformer-based seq2seq model trained on our datasets establishes a new state-of-the-art for text simplification in both automatic and human evaluation.
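
The abstract describes combining per-pair semantic similarity with the sequential order of sentences. As a rough illustration of that general idea only (not the authors' model or code), the sketch below runs Viterbi decoding over a similarity matrix. The function name `viterbi_align`, the `jump_penalty` parameter, and the hand-written transition score are all assumptions made for this example; a neural CRF would learn both the similarity and the transition scores.

```python
from typing import List


def viterbi_align(sim: List[List[float]], jump_penalty: float = 0.5) -> List[int]:
    """For each simple sentence i, return the index of one complex sentence.

    sim[i][j] is a hypothetical similarity score between simple sentence i
    and complex sentence j. The transition score -jump_penalty * |j - k| is
    an illustrative stand-in for a learned CRF transition function.
    """
    if not sim:
        return []
    n, m = len(sim), len(sim[0])
    # best[i][j]: best total score of labelling sentences 0..i with sentence i -> j
    best = [[0.0] * m for _ in range(n)]
    # back[i][j]: label chosen for sentence i-1 on that best path
    back = [[0] * m for _ in range(n)]
    best[0] = list(sim[0])
    for i in range(1, n):
        for j in range(m):
            # Pick the predecessor label k that maximizes score-so-far plus transition.
            k_best = max(
                range(m),
                key=lambda k: best[i - 1][k] - jump_penalty * abs(j - k),
            )
            best[i][j] = (
                best[i - 1][k_best] - jump_penalty * abs(j - k_best) + sim[i][j]
            )
            back[i][j] = k_best
    # Backtrack from the best final label.
    j = max(range(m), key=lambda c: best[n - 1][c])
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return path


if __name__ == "__main__":
    # Toy scores: sentence 1 is slightly more similar to complex sentence 3
    # than to complex sentence 1, but its neighbours align to 0 and 2.
    sim = [
        [0.9, 0.0, 0.0, 0.0],
        [0.0, 0.5, 0.0, 0.6],
        [0.0, 0.0, 0.9, 0.0],
    ]
    print(viterbi_align(sim, jump_penalty=0.0))  # similarity only
    print(viterbi_align(sim, jump_penalty=0.2))  # similarity plus order
```

With `jump_penalty=0.0` each sentence simply takes its most similar candidate, so the middle sentence jumps out of order; a small positive penalty lets the neighbouring alignments pull it back into sequence.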

Paper title