Paper Title
Zeroth-Order Actor-Critic: An Evolutionary Framework for Sequential Decision Problems
Paper Authors
Paper Abstract
Evolutionary algorithms (EAs) have shown promise in solving sequential decision problems (SDPs) by simplifying them to static optimization problems and searching for the optimal policy parameters in a zeroth-order way. While these methods are highly versatile, they often suffer from high sample complexity because they ignore the underlying temporal structure. In contrast, reinforcement learning (RL) methods typically formulate SDPs as Markov decision processes (MDPs). Although more sample efficient than EAs, RL methods are restricted to differentiable policies and prone to getting stuck in local optima. To address these issues, we propose a novel evolutionary framework, Zeroth-Order Actor-Critic (ZOAC). ZOAC performs step-wise exploration in parameter space, for which we theoretically derive the zeroth-order policy gradient. We further adopt an actor-critic architecture to effectively exploit the Markov property of SDPs and reduce the variance of the gradient estimator. In each iteration, ZOAC employs samplers to collect trajectories with parameter-space exploration and alternates between first-order policy evaluation (PEV) and zeroth-order policy improvement (PIM). To evaluate the effectiveness of ZOAC, we apply it to a challenging multi-lane driving task, optimizing the parameters of a rule-based, non-differentiable driving policy that consists of three sub-modules: behavior selection, path planning, and trajectory tracking. We also compare it with gradient-based RL methods on three Gymnasium tasks, optimizing neural network policies with thousands of parameters. Experimental results demonstrate the strong capability of ZOAC in solving SDPs: it significantly outperforms EAs that treat the problem as static optimization and, even without first-order information, matches the performance of gradient-based RL methods in terms of total average return across all tasks.
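To make the alternation between first-order evaluation and zeroth-order improvement described above more concrete, below is a minimal sketch of a ZOAC-style loop on a toy scalar control problem. The environment, the linear policy and critic, the TD(0) evaluation step, and all hyperparameters are illustrative assumptions inferred only from this abstract; the paper's actual update rules, advantage estimator, and trajectory-collection scheme may differ.

```python
# Hypothetical ZOAC-style sketch (not the authors' reference implementation).
# Actor parameters are updated only through zeroth-order estimates built from
# step-wise parameter-space perturbations; the critic is updated with
# first-order TD(0) learning.
import numpy as np

rng = np.random.default_rng(0)

# Toy SDP: push a scalar state toward zero; reward is the negative distance.
def env_step(state, action):
    next_state = state + 0.1 * action + 0.01 * rng.normal()
    return next_state, -abs(next_state)

# Assumed linear policy (actor) and linear state-value function (critic).
def policy(theta, state):
    return theta[0] * state + theta[1]

def value(w, state):
    return w[0] * state + w[1]

theta = np.zeros(2)                      # actor parameters (zeroth-order only)
w = np.zeros(2)                          # critic parameters (first-order TD)
sigma, alpha, beta, gamma = 0.1, 0.05, 0.1, 0.99

for iteration in range(200):
    grad_theta = np.zeros_like(theta)
    state, steps_per_rollout = 1.0, 20
    for _ in range(steps_per_rollout):
        # Step-wise exploration in parameter space: fresh perturbation each step.
        eps = rng.normal(size=theta.shape)
        action = policy(theta + sigma * eps, state)
        next_state, reward = env_step(state, action)

        # First-order policy evaluation (PEV): TD(0) update of the critic.
        td_error = reward + gamma * value(w, next_state) - value(w, state)
        w += beta * td_error * np.array([state, 1.0])

        # Zeroth-order policy improvement (PIM): the TD error serves as an
        # advantage estimate weighting the perturbation direction.
        grad_theta += td_error * eps / sigma
        state = next_state

    theta += alpha * grad_theta / steps_per_rollout
```

Using the critic's TD error rather than the raw episodic return to weight each perturbation direction is what distinguishes this sketch from plain evolution strategies: it exploits the temporal structure of the SDP and should reduce the variance of the zeroth-order gradient estimate, which is the intuition the abstract attributes to the actor-critic design.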