Paper Title

Convex Q-Learning, Part 1: Deterministic Optimal Control

Paper Authors

Mehta, Prashant G., Meyn, Sean P.

Paper Abstract

It is well known that the extension of Watkins' algorithm to general function approximation settings is challenging: does the projected Bellman equation have a solution? If so, is the solution useful in the sense of generating a good policy? And, if the preceding questions are answered in the affirmative, is the algorithm consistent? These questions are unanswered even in the special case of Q-function approximations that are linear in the parameter. The challenge seems paradoxical, given the long history of convex analytic approaches to dynamic programming. The paper begins with a brief survey of linear programming approaches to optimal control, leading to a particular over-parameterization that lends itself to applications in reinforcement learning. The main conclusions are summarized as follows: (i) A new class of convex Q-learning algorithms is introduced, based on a convex relaxation of the Bellman equation. Convergence is established under general conditions, including linear function approximation for the Q-function. (ii) A batch implementation appears similar to the famed DQN algorithm (one engine behind AlphaZero). It is shown that in fact the algorithms are very different: while convex Q-learning solves a convex program that approximates the Bellman equation, the theory for DQN is no stronger than for Watkins' algorithm with function approximation: (a) both algorithms seek solutions to the same fixed-point equation, and (b) their ODE approximations coincide, and little is known about the stability of this ODE. These results are obtained for deterministic nonlinear systems with a total-cost criterion. Many extensions are proposed, including a kernel implementation and an extension to MDP models.
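To make the convex relaxation in (i) concrete, here is a minimal sketch in the standard notation of the linear-programming approach to optimal control; the symbols (dynamics F, cost c, weighting measure mu, parameterized Q-function Q_theta) are reconstructions from that literature, not excerpts from the paper. For a deterministic system x^+ = F(x,u) with a total-cost criterion, the Q-function form of the Bellman equation reads

\[
Q^*(x,u) = c(x,u) + \underline{Q}^*\bigl(F(x,u)\bigr), \qquad \underline{Q}(x) := \min_{u} Q(x,u).
\]

Relaxing the equation to an inequality yields a linear-program-style problem over the parameter $\theta$:

\[
\max_{\theta}\; \langle \mu, Q_\theta \rangle
\quad \text{s.t.} \quad
Q_\theta(x,u) \le c(x,u) + \underline{Q}_\theta\bigl(F(x,u)\bigr) \;\; \text{for all } (x,u).
\]

When $Q_\theta$ is linear in $\theta$, the function $\underline{Q}_\theta$ is a pointwise minimum of linear functions of $\theta$ and hence concave in $\theta$, so each constraint is convex: this is what makes the relaxation a convex program rather than a fixed-point problem of the Watkins/DQN type.

A batch implementation in the spirit of (ii) might then replace the hard constraints by a penalty over a batch of observed transitions. The following Python sketch, using cvxpy, assumes linear function approximation with features psi and a finite action set; the function name and penalty form are hypothetical, illustrating the convex-program structure rather than the paper's algorithm.

import numpy as np
import cvxpy as cp

def convex_q_batch(psi, psi_next, cost, weights, lam=10.0, reg=1e-3):
    """One batch solve of a convex-Q-learning-style program (a sketch).

    psi:      (N, d) features psi(x_i, u_i) at observed state-input pairs
    psi_next: (N, A, d) features psi(x_i^+, a) for each action a at the next state
    cost:     (N,) observed one-step costs c(x_i, u_i)
    weights:  (N,) positive weights standing in for the measure mu
    lam:      penalty weight replacing the hard Bellman inequality constraints
    reg:      small regularizer to keep the relaxed program bounded
    """
    N, A, d = psi_next.shape
    theta = cp.Variable(d)

    q = psi @ theta  # Q_theta(x_i, u_i): affine in theta
    # underline{Q}_theta(x_i^+) = min_a theta' psi(x_i^+, a): concave in theta
    q_next_min = cp.hstack([cp.min(psi_next[i] @ theta) for i in range(N)])

    gap = q - (cost + q_next_min)  # convex in theta; feasible iff gap <= 0
    objective = cp.Maximize(
        weights @ q                    # push Q_theta up, as in the LP objective
        - lam * cp.sum(cp.pos(gap))    # penalize Bellman-inequality violations
        - reg * cp.sum_squares(theta)  # keep the relaxed problem bounded
    )
    cp.Problem(objective).solve()
    return theta.value

Because the objective is concave in theta, each batch solve is a single convex program; contrast this with DQN-style updates, which iterate toward a fixed point whose stability is not guaranteed under function approximation.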
