Paper Title
Provably Good Batch Reinforcement Learning Without Great Exploration
Paper Authors
Paper Abstract
Batch reinforcement learning (RL) is important for applying RL algorithms to many high-stakes tasks. Doing batch RL in a way that yields a reliable new policy in large domains is challenging: a new decision policy may visit states and actions outside the support of the batch data, and function approximation and optimization with limited samples can further increase the risk of learning policies with overly optimistic estimates of their future performance. Recent algorithms have shown promise but can still be overly optimistic about their expected outcomes. Theoretical work that provides strong guarantees on the performance of the output policy relies on a strong concentrability assumption, which makes it unsuitable for cases where the ratio between the state-action distributions of the behavior policy and some candidate policies is large; in the traditional analysis, the error bound scales with this ratio. We show that a small modification to the Bellman optimality and evaluation backups, taking a more conservative update, can yield much stronger guarantees. In certain settings, the resulting algorithms can find an approximately best policy within the state-action space explored by the batch data, without requiring a priori concentrability assumptions. We highlight the necessity of our conservative update and the limitations of previous algorithms and analyses with illustrative MDP examples, and present an empirical comparison of our algorithm against other state-of-the-art batch RL baselines on standard benchmarks.
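To make the idea concrete, here is a minimal sketch of what such a conservative Bellman optimality backup can look like, assuming access to an estimate $\hat{\mu}$ of the batch data's state-action distribution and a support threshold $b$ (this notation is ours for illustration and is not fixed by the abstract):

$$
(\widehat{\mathcal{T}} Q)(s,a) \;=\; r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\Big[\max_{a'}\, \zeta(s',a')\, Q(s',a')\Big],
\qquad
\zeta(s,a) \;=\; \mathbf{1}\{\hat{\mu}(s,a) \ge b\}.
$$

State-action pairs whose estimated batch density falls below $b$ contribute a pessimistic value of zero to the backup, so the induced policy is steered toward the region the batch data actually explores rather than extrapolating optimistically beyond its support.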