Title

q-Learning in Continuous Time

Authors

Yanwei Jia and Xun Yu Zhou

Abstract

We study the continuous-time counterpart of Q-learning for reinforcement learning (RL) under the entropy-regularized, exploratory diffusion process formulation introduced by Wang et al. (2020). As the conventional (big) Q-function collapses in continuous time, we consider its first-order approximation and coin the term ``(little) q-function''. This function is related to the instantaneous advantage rate function as well as the Hamiltonian. We develop a ``q-learning'' theory around the q-function that is independent of time discretization. Given a stochastic policy, we jointly characterize the associated q-function and value function by martingale conditions of certain stochastic processes, in both on-policy and off-policy settings. We then apply the theory to devise different actor-critic algorithms for solving underlying RL problems, depending on whether or not the density function of the Gibbs measure generated from the q-function can be computed explicitly. One of our algorithms interprets the well-known Q-learning algorithm SARSA, and another recovers a policy gradient (PG) based continuous-time algorithm proposed in Jia and Zhou (2022b). Finally, we conduct simulation experiments to compare the performance of our algorithms with those of PG-based algorithms in Jia and Zhou (2022b) and time-discretized conventional Q-learning algorithms.
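As a minimal sketch of the two relations the abstract alludes to (in LaTeX notation; the symbols J for the value function of a policy and \gamma for the entropy-regularization temperature are standard-usage assumptions, not spelled out in the abstract itself): the (little) q-function arises as the first-order coefficient of the conventional Q-function when an action is held over a short time window \Delta t,

\[ Q_{\Delta t}(t,x,a) \;=\; J(t,x) \;+\; q(t,x,a)\,\Delta t \;+\; o(\Delta t), \]

so the zeroth-order term collapses to the value function and action dependence survives only at order \Delta t. The Gibbs measure mentioned in the algorithmic discussion is then the stochastic policy

\[ \pi(a \mid t,x) \;\propto\; \exp\!\bigl(q(t,x,a)/\gamma\bigr), \]

whose normalizing density may or may not be computable in closed form; that distinction is what separates the two families of actor-critic algorithms described above.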
