Paper Title
DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction
Paper Authors
Paper Abstract
Deep reinforcement learning can learn effective policies for a wide range of tasks, but is notoriously difficult to use due to instability and sensitivity to hyperparameters. The reasons for this remain unclear. When using standard supervised methods (e.g., for bandits), on-policy data collection provides "hard negatives" that correct the model in precisely those states and actions that the policy is likely to visit. We call this phenomenon "corrective feedback." We show that bootstrapping-based Q-learning algorithms do not necessarily benefit from this corrective feedback, and training on the experience collected by the algorithm is not sufficient to correct errors in the Q-function. In fact, Q-learning and related methods can exhibit pathological interactions between the distribution of experience collected by the agent and the policy induced by training on that experience, leading to potential instability, sub-optimal convergence, and poor results when learning from noisy, sparse, or delayed rewards. We demonstrate the existence of this problem, both theoretically and empirically. We then show that a specific correction to the data distribution can mitigate this issue. Based on these observations, we propose a new algorithm, DisCor, which computes an approximation to this optimal distribution and uses it to re-weight the transitions used for training, resulting in substantial improvements in a range of challenging RL settings, such as multi-task learning and learning from noisy reward signals. A blog post presenting a summary of this work is available at: https://bair.berkeley.edu/blog/2020/03/16/discor/.
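To make the re-weighting idea in the abstract concrete, below is a minimal sketch (PyTorch, not the authors' reference implementation) of how replay transitions could be down-weighted according to an estimate of accumulated bootstrap error. The network names (`q_net`, `err_net`), the temperature `tau`, and the softmax normalization of the weights are illustrative assumptions, not details taken from the paper.

```python
# A minimal sketch of a DisCor-style re-weighted Q-learning update.
# Assumptions (not from the paper): discrete actions, an auxiliary network
# `err_net` that estimates accumulated Bellman error per action, and a
# softmax normalization of the per-transition weights.
import torch
import torch.nn.functional as F

def discor_weighted_q_update(batch, q_net, target_q_net, err_net, target_err_net,
                             q_opt, err_opt, gamma=0.99, tau=10.0):
    s, a, r, s2, done = batch  # states, actions (long), rewards, next states, done flags (float)

    with torch.no_grad():
        # Standard bootstrapped Q-learning target.
        next_q = target_q_net(s2).max(dim=1).values
        target = r + gamma * (1.0 - done) * next_q

        # Estimated accumulated error at the next state-action pair; transitions
        # whose bootstrap targets are believed to be more erroneous get lower weight.
        next_a = q_net(s2).argmax(dim=1)
        next_err = target_err_net(s2).gather(1, next_a.unsqueeze(1)).squeeze(1)
        weights = torch.softmax(-gamma * next_err / tau, dim=0) * next_err.shape[0]

    # Re-weighted Bellman error minimization for the Q-function.
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    q_loss = (weights * (q_sa - target) ** 2).mean()
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()

    # Train the error estimator on the current |Bellman error| plus its own
    # bootstrapped estimate at the next state (an assumed recursion for illustration).
    with torch.no_grad():
        err_target = (q_sa - target).abs() + gamma * (1.0 - done) * next_err
    err_sa = err_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    err_loss = F.mse_loss(err_sa, err_target)
    err_opt.zero_grad(); err_loss.backward(); err_opt.step()

    return q_loss.item(), err_loss.item()
```

The intent of the sketch is only to show where the correction enters the training loop: the Q-function update itself is unchanged, and the data-distribution correction appears purely as per-transition weights derived from an estimate of how erroneous each bootstrap target is.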