Paper Title

Off-Policy Self-Critical Training for Transformer in Visual Paragraph Generation

Authors

Shiyang Yan, Yang Hua, Neil M. Robertson

Abstract

Recently, several approaches have been proposed to solve language generation problems. The Transformer is currently the state-of-the-art sequence-to-sequence (seq-to-seq) model for language generation. Reinforcement Learning (RL) is useful for mitigating exposure bias and for optimising non-differentiable metrics in seq-to-seq language learning. However, the Transformer is hard to combine with RL because its sampling requires costly computing resources. We tackle this problem by proposing an off-policy RL algorithm in which a behaviour policy, represented by GRUs, performs the sampling. We reduce the high variance of importance sampling (IS) by applying the truncated relative importance sampling (TRIS) technique and the Kullback-Leibler (KL)-control concept. TRIS is a simple yet effective technique, and there is a theoretical proof that KL-control helps to reduce the variance of IS. We formulate this off-policy RL scheme based on self-critical sequence training. Specifically, we use a Transformer-based captioning model as the target policy and an image-guided language auto-encoder as the behaviour policy to explore the environment. The proposed algorithm achieves state-of-the-art performance on visual paragraph generation and improves results on image captioning.
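
To make the training objective concrete, below is a minimal PyTorch sketch of one way the pieces described in the abstract could fit together: behaviour-policy samples weighted by a truncated relative importance ratio (TRIS), a self-critical baseline from the target policy's greedy decode, and a KL-control term folded into the reward. This is an illustrative assumption, not the authors' released implementation; the function name `off_policy_scst_loss`, the argument names, the default values of `trunc_c` and `kl_beta`, and the exact placement of the truncation and KL terms are all hypothetical.

```python
import torch

def off_policy_scst_loss(logp_target, logp_behaviour,
                         sample_reward, greedy_reward,
                         trunc_c=1.0, kl_beta=0.1):
    """Hypothetical off-policy self-critical loss with TRIS and KL-control.

    All arguments are per-sequence tensors of shape (batch,):
      logp_target:    log pi_theta(y|x) of the behaviour samples under the
                      Transformer target policy (requires grad)
      logp_behaviour: log pi_b(y|x) under the GRU behaviour policy
      sample_reward:  metric reward (e.g. CIDEr) of the sampled sequence
      greedy_reward:  reward of the target policy's own greedy decode
    """
    # TRIS: relative importance weight pi_theta / pi_b, truncated at
    # trunc_c to bound the variance of the off-policy estimator.
    ratio = torch.exp(logp_target - logp_behaviour).detach()
    weight = torch.clamp(ratio, max=trunc_c)

    # KL-control (one common realisation, assumed here): shape the reward
    # with a log-ratio penalty so the target policy stays close to pi_b.
    shaped_reward = sample_reward - kl_beta * (
        logp_target - logp_behaviour).detach()

    # Self-critical baseline: advantage relative to the greedy decode.
    advantage = shaped_reward - greedy_reward

    # REINFORCE-style surrogate on samples drawn from the behaviour policy.
    return -(weight * advantage * logp_target).mean()
```

Folding the log-ratio penalty into the reward is a common way to realise KL-control in policy-gradient training; the paper's exact formulations of the TRIS weight and the KL term may differ from this sketch.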
