Paper Title

Off-Policy Self-Critical Training for Transformer in Visual Paragraph Generation

Authors

Shiyang Yan, Yang Hua, Neil M. Robertson

Abstract

Recently, several approaches have been proposed to solve language generation problems. The Transformer is currently the state-of-the-art sequence-to-sequence (seq-to-seq) model for language generation. Reinforcement Learning (RL) is useful for mitigating exposure bias and for optimising non-differentiable metrics in seq-to-seq language learning. However, the Transformer is hard to combine with RL because its sampling requires costly computing resources. We tackle this problem by proposing an off-policy RL algorithm in which a behaviour policy, represented by GRUs, performs the sampling. We reduce the high variance of importance sampling (IS) by applying the truncated relative importance sampling (TRIS) technique and the Kullback-Leibler (KL)-control concept. TRIS is a simple yet effective technique, and there is a theoretical proof that KL-control helps to reduce the variance of IS. We formulate this off-policy RL scheme based on self-critical sequence training. Specifically, we use a Transformer-based captioning model as the target policy and an image-guided language auto-encoder as the behaviour policy to explore the environment. The proposed algorithm achieves state-of-the-art performance on visual paragraph generation and improves results on image captioning.
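
To make the training objective concrete, below is a minimal PyTorch sketch of one way the pieces described in the abstract could fit together: behaviour-policy samples weighted by a truncated relative importance ratio (TRIS), a self-critical baseline from the target policy's greedy decode, and a KL-control term folded into the reward. This is an illustrative assumption, not the authors' released implementation; the function name `off_policy_scst_loss`, the argument names, the default values of `trunc_c` and `kl_beta`, and the exact placement of the truncation and KL terms are all hypothetical.

```python
import torch

def off_policy_scst_loss(logp_target, logp_behaviour,
                         sample_reward, greedy_reward,
                         trunc_c=1.0, kl_beta=0.1):
    """Hypothetical off-policy self-critical loss with TRIS and KL-control.

    All arguments are per-sequence tensors of shape (batch,):
      logp_target:    log pi_theta(y|x) of the behaviour samples under the
                      Transformer target policy (requires grad)
      logp_behaviour: log pi_b(y|x) under the GRU behaviour policy
      sample_reward:  metric reward (e.g. CIDEr) of the sampled sequence
      greedy_reward:  reward of the target policy's own greedy decode
    """
    # TRIS: relative importance weight pi_theta / pi_b, truncated at
    # trunc_c to bound the variance of the off-policy estimator.
    ratio = torch.exp(logp_target - logp_behaviour).detach()
    weight = torch.clamp(ratio, max=trunc_c)

    # KL-control (one common realisation, assumed here): shape the reward
    # with a log-ratio penalty so the target policy stays close to pi_b.
    shaped_reward = sample_reward - kl_beta * (
        logp_target - logp_behaviour).detach()

    # Self-critical baseline: advantage relative to the greedy decode.
    advantage = shaped_reward - greedy_reward

    # REINFORCE-style surrogate on samples drawn from the behaviour policy.
    return -(weight * advantage * logp_target).mean()
```

Folding the log-ratio penalty into the reward is a common way to realise KL-control in policy-gradient training; the paper's exact formulations of the TRIS weight and the KL term may differ from this sketch.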
