Paper Title
BATS: Best Action Trajectory Stitching
Paper Authors
Paper Abstract
The problem of offline reinforcement learning focuses on learning a good policy from a log of environment interactions. Past efforts to develop algorithms in this area have revolved around modifying online reinforcement learning algorithms so that the actions of the learned policy are constrained to the logged data. In this work, we explore an alternative approach by planning on the fixed dataset directly. Specifically, we introduce an algorithm that forms a tabular Markov Decision Process (MDP) over the logged data by adding new transitions to the dataset. We do this by using learned dynamics models to plan short trajectories between states. Since exact value iteration can be performed on this constructed MDP, it becomes easy to identify which trajectories are advantageous to add to the MDP. Crucially, since most transitions in this MDP come from the logged data, trajectories from the MDP can be rolled out for long periods with confidence. We prove that this property allows one to place upper and lower bounds on the value function, up to appropriate distance metrics. Finally, we demonstrate empirically how algorithms that uniformly constrain the learned policy to the entire dataset can result in unwanted behavior, and we show an example in which simply behavior cloning the optimal policy of the MDP created by our algorithm avoids this problem.
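To make the core idea of the abstract concrete, the following is a minimal sketch (not the paper's actual BATS implementation) of exact value iteration on a tabular MDP whose edges are logged transitions plus a few "stitched" transitions proposed by a learned dynamics model. All function names, the toy transition table, and the discount/tolerance settings are illustrative assumptions; the real algorithm additionally plans the stitched trajectories with learned models and handles their uncertainty, which is omitted here.

```python
import numpy as np

def value_iteration(n_states, transitions, gamma=0.99, tol=1e-8):
    """Exact value iteration on a tabular MDP.

    `transitions` maps a state index to a list of candidate edges,
    each given as (next_state, reward). Logged transitions and
    stitched (model-planned) transitions are treated uniformly here.
    """
    V = np.zeros(n_states)
    while True:
        V_new = np.array([
            max((r + gamma * V[s_next] for s_next, r in transitions[s]),
                default=0.0)
            for s in range(n_states)
        ])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def greedy_policy(V, transitions, gamma=0.99):
    """Recover the optimal tabular policy as an argmax over candidate edges."""
    policy = {}
    for s, edges in transitions.items():
        policy[s] = max(range(len(edges)),
                        key=lambda i: edges[i][1] + gamma * V[edges[i][0]])
    return policy

# Toy example: states 0-3, logged edges plus one stitched edge (0 -> 2)
# that shortcuts toward the rewarding transition. Purely illustrative.
transitions = {
    0: [(1, 0.0), (2, 0.0)],   # second edge is the stitched transition
    1: [(3, 0.0)],
    2: [(3, 1.0)],
    3: [(3, 0.0)],             # absorbing terminal state
}
V = value_iteration(4, transitions)
pi = greedy_policy(V, transitions)  # picks the stitched edge at state 0
```

In the paper's framing, the optimal policy of this small constructed MDP would then serve as a target for behavior cloning, rather than being executed directly.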