Title

q-Learning in Continuous Time

Authors

Yanwei Jia and Xun Yu Zhou

Abstract

We study the continuous-time counterpart of Q-learning for reinforcement learning (RL) under the entropy-regularized, exploratory diffusion process formulation introduced by Wang et al. (2020). As the conventional (big) Q-function collapses in continuous time, we consider its first-order approximation and coin the term ``(little) q-function''. This function is related to the instantaneous advantage rate function as well as the Hamiltonian. We develop a ``q-learning'' theory around the q-function that is independent of time discretization. Given a stochastic policy, we jointly characterize the associated q-function and value function by martingale conditions of certain stochastic processes, in both on-policy and off-policy settings. We then apply the theory to devise different actor-critic algorithms for solving underlying RL problems, depending on whether or not the density function of the Gibbs measure generated from the q-function can be computed explicitly. One of our algorithms interprets the well-known Q-learning algorithm SARSA, and another recovers a policy gradient (PG) based continuous-time algorithm proposed in Jia and Zhou (2022b). Finally, we conduct simulation experiments to compare the performance of our algorithms with those of PG-based algorithms in Jia and Zhou (2022b) and time-discretized conventional Q-learning algorithms.
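As a minimal sketch of the two relations the abstract alludes to (in LaTeX notation; the symbols J for the value function of a policy and \gamma for the entropy-regularization temperature are standard-usage assumptions, not spelled out in the abstract itself): the (little) q-function arises as the first-order coefficient of the conventional Q-function when an action is held over a short time window \Delta t,

\[ Q_{\Delta t}(t,x,a) \;=\; J(t,x) \;+\; q(t,x,a)\,\Delta t \;+\; o(\Delta t), \]

so the zeroth-order term collapses to the value function and action dependence survives only at order \Delta t. The Gibbs measure mentioned in the algorithmic discussion is then the stochastic policy

\[ \pi(a \mid t,x) \;\propto\; \exp\!\bigl(q(t,x,a)/\gamma\bigr), \]

whose normalizing density may or may not be computable in closed form; that distinction is what separates the two families of actor-critic algorithms described above.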
