Paper Title

A Theoretical Analysis of the Repetition Problem in Text Generation

Paper Authors

Zihao Fu, Wai Lam, Anthony Man-Cho So, Bei Shi

Paper Abstract

Text generation tasks, including translation, summarization, and language modeling, have seen rapid growth in recent years. Despite the remarkable achievements, the repetition problem has been observed in nearly all text generation models, severely undermining generation performance. To solve the repetition problem, many methods have been proposed, but there is no existing theoretical analysis to show why this problem happens and how it is resolved. In this paper, we propose a new theoretical analysis framework for the repetition problem. We first define the Average Repetition Probability (ARP) to characterize the repetition problem quantitatively. Then, we conduct an extensive analysis of the Markov generation model and derive several upper bounds of the average repetition probability with intuitive understanding. We show that most of the existing methods are essentially minimizing the upper bounds explicitly or implicitly. Grounded on our theory, we show that the repetition problem is, unfortunately, caused by the traits of our language itself. One major reason is attributed to the fact that there exist too many words predicting the same word as the subsequent word with high probability. Consequently, it is easy to go back to that word and form repetitions, which we dub the high inflow problem. Furthermore, we derive a concentration bound of the average repetition probability for a general generation model. Finally, based on the theoretical upper bounds, we propose a novel rebalanced encoding approach to alleviate the high inflow problem. The experimental results show that our theoretical framework is applicable to general generation models and our proposed rebalanced encoding approach alleviates the repetition problem significantly. The source code of this paper can be obtained from https://github.com/fuzihaofzh/repetition-problem-nlg.
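The high-inflow argument in the abstract can be illustrated with a toy word-level Markov chain. The Python sketch below is a hypothetical illustration only, not the paper's ARP definition, its upper bounds, or its rebalanced encoding method: the transition table, the vocabulary, and the crude bigram repetition statistic are all assumptions made for demonstration. The point it shows is that when several words route most of their probability mass into the same word, sampled sequences keep flowing back to that word and repetitive loops become likely.

```python
# Toy sketch (hypothetical, not the paper's formulation) of the "high inflow"
# intuition: many words send high probability mass to the same next word,
# so a sampler keeps returning to it and forms repetitions.
import random

random.seed(0)

# Toy word-level Markov model: transition[w] maps next word -> probability.
# "the" and "cat" send most of their mass to each other, so "cat" has a high
# inflow and loops like "the cat the cat ..." are likely.
transition = {
    "the": {"cat": 0.9, "dog": 0.1},
    "cat": {"the": 0.8, "sat": 0.2},
    "dog": {"sat": 1.0},
    "sat": {"the": 1.0},
}

def sample_next(word):
    """Sample the next word from the toy Markov transition distribution."""
    nexts, probs = zip(*transition[word].items())
    return random.choices(nexts, weights=probs, k=1)[0]

def generate(start, length=30):
    """Generate a word sequence by repeatedly sampling from the chain."""
    seq = [start]
    for _ in range(length - 1):
        seq.append(sample_next(seq[-1]))
    return seq

def bigram_repetition_rate(seq):
    """Fraction of adjacent word pairs that occur more than once; a crude
    stand-in for a repetition statistic, not the paper's ARP."""
    bigrams = list(zip(seq, seq[1:]))
    repeated = sum(1 for b in bigrams if bigrams.count(b) > 1)
    return repeated / max(len(bigrams), 1)

if __name__ == "__main__":
    seq = generate("the")
    print(" ".join(seq))
    print("bigram repetition rate:", round(bigram_repetition_rate(seq), 2))
```

Spreading the probability mass of "the" over more successor words (i.e., reducing the inflow into "cat") and re-running the sketch lowers the repetition rate, which mirrors the intuition the abstract attributes to reducing high inflow; the paper's actual rebalanced encoding operates on the tokenization rather than on a toy transition table.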
