Paper Title

Text Generation with Diffusion Language Models: A Pre-training Approach with Continuous Paragraph Denoise

Paper Authors

Zhenghao Lin, Yeyun Gong, Yelong Shen, Tong Wu, Zhihao Fan, Chen Lin, Nan Duan, Weizhu Chen

Paper Abstract

In this paper, we introduce a novel dIffusion language modEl pre-training framework for text generation, which we call GENIE. GENIE is a large-scale pretrained diffusion language model that consists of an encoder and a diffusion-based decoder, and it can generate text by gradually transforming a random noise sequence into a coherent text sequence. To pre-train GENIE on a large-scale language corpus, we design a new continuous paragraph denoise objective, which encourages the diffusion decoder to reconstruct a clean text paragraph from a corrupted version while preserving semantic and syntactic coherence. We evaluate GENIE on four downstream text generation benchmarks, namely XSum, CNN/DailyMail, Gigaword, and CommonGen. Our experimental results show that GENIE achieves performance comparable to state-of-the-art autoregressive models on these benchmarks and generates more diverse text samples. The code and models of GENIE are available at https://github.com/microsoft/ProphetNet/tree/master/GENIE.
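The abstract describes an encoder-decoder architecture in which a diffusion-based decoder iteratively transforms a random noise sequence into text embeddings, conditioned on the encoded source. Below is a minimal PyTorch sketch of such a reverse-diffusion sampling loop, included only as a rough illustration of the idea; every name here (Encoder, DiffusionDecoder, sample, the crude linear noise schedule) is a hypothetical stand-in, not GENIE's actual API or implementation (see the repository above for the real code).

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the paper's components: an encoder that
# produces conditioning states, and a decoder that predicts denoised
# target embeddings from noisy ones.
class Encoder(nn.Module):
    def __init__(self, vocab_size=100, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)

    def forward(self, src_ids):
        # Encode the source token sequence into conditioning states.
        return self.embed(src_ids)

class DiffusionDecoder(nn.Module):
    """Predicts a denoised target-embedding sequence from a noisy one."""
    def __init__(self, dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, 64), nn.GELU(), nn.Linear(64, dim)
        )

    def forward(self, noisy_emb, cond, t):
        # Condition on mean-pooled encoder states; a real model would
        # also embed the timestep `t` (omitted here for brevity).
        pooled = cond.mean(dim=1, keepdim=True).expand_as(noisy_emb)
        return self.net(torch.cat([noisy_emb, pooled], dim=-1))

@torch.no_grad()
def sample(encoder, decoder, src_ids, tgt_len=16, dim=32, num_steps=10):
    """Toy reverse-diffusion loop: start from pure Gaussian noise and
    repeatedly replace it with the decoder's denoised estimate plus a
    shrinking amount of fresh noise (a crude DDPM-style schedule)."""
    cond = encoder(src_ids)
    x = torch.randn(src_ids.size(0), tgt_len, dim)  # random noise sequence
    for t in reversed(range(num_steps)):
        x0_hat = decoder(x, cond, t)     # predicted clean embeddings
        noise_scale = t / num_steps      # anneal noise toward zero
        x = x0_hat + noise_scale * torch.randn_like(x)
    return x  # final continuous embeddings

encoder, decoder = Encoder(), DiffusionDecoder()
src = torch.randint(0, 100, (2, 8))      # batch of source token ids
emb = sample(encoder, decoder, src)
print(emb.shape)                         # torch.Size([2, 16, 32])
```

In the actual model, the final continuous embeddings would still have to be mapped back to discrete tokens (for example, by rounding against the embedding table), and pre-training would use the continuous paragraph denoise objective sketched in the abstract, teaching the decoder to recover a clean paragraph from a corrupted version.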
