Title
Semi-Supervised Text Simplification with Back-Translation and Asymmetric Denoising Autoencoders
Authors
Abstract
Text simplification (TS) rephrases long sentences into simplified variants while preserving their inherent semantics. Traditional sequence-to-sequence models rely heavily on the quantity and quality of parallel sentences, which limits their applicability across languages and domains. This work investigates how to leverage large amounts of unpaired corpora for the TS task. We adopt the back-translation architecture from unsupervised neural machine translation (NMT), including denoising autoencoders for language modeling and automatic generation of parallel data through iterative back-translation. However, it is non-trivial to generate appropriate complex-simple pairs if we directly treat the simple and complex corpora as two different languages, since the two types of sentences are quite similar and it is hard for the model to capture the characteristics of each type. To tackle this problem, we propose asymmetric denoising methods for sentences of differing complexity. When modeling simple and complex sentences with autoencoders, we introduce different types of noise into their training processes. Such methods can significantly improve simplification performance. Our model can be trained in both unsupervised and semi-supervised manners. Automatic and human evaluations show that our unsupervised model outperforms previous systems, and that with limited supervision our model performs competitively with multiple state-of-the-art simplification systems.
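The asymmetric denoising idea can be illustrated with a minimal sketch. The specific noise functions below (word dropping plus local shuffling for complex sentences, token substitution for simple sentences) are illustrative assumptions for exposition, not the paper's exact noise configuration:

```python
import random

def noise_complex(tokens, drop_prob=0.1, shuffle_k=3):
    """Illustrative noise for complex-sentence autoencoding:
    randomly drop words, then shuffle locally within a small window."""
    kept = [t for t in tokens if random.random() > drop_prob] or tokens[:1]
    # Local shuffle: each token is displaced by at most shuffle_k positions.
    keys = [i + random.uniform(0, shuffle_k) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept), key=lambda p: p[0])]

def noise_simple(tokens, sub_prob=0.1, placeholder="<unk>"):
    """Illustrative noise for simple-sentence autoencoding:
    replace random tokens with a placeholder symbol."""
    return [placeholder if random.random() < sub_prob else t for t in tokens]
```

During autoencoder training, each sentence would be corrupted by the noise function matching its side of the corpus, and the model is trained to reconstruct the clean sentence; using distinct noise per side is what lets the model learn separate characteristics for simple and complex text.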