Logarithmic Regret Bound in Partially Observable Linear Dynamical Systems

Paper Title

Logarithmic Regret Bound in Partially Observable Linear Dynamical Systems

Authors

Sahin Lale, Kamyar Azizzadenesheli, Babak Hassibi, Anima Anandkumar

Abstract

We study the problem of system identification and adaptive control in partially observable linear dynamical systems. Adaptive and closed-loop system identification is challenging because of the correlations introduced during data collection. In this paper, we present the first model estimation method with finite-time guarantees in both open- and closed-loop system identification. Deploying this estimation method, we propose adaptive control online learning (AdaptOn), an efficient reinforcement learning algorithm that adaptively learns the system dynamics and continuously updates its controller through online learning steps. AdaptOn estimates the model dynamics by occasionally solving a linear regression problem on data gathered through interactions with the environment. Using policy re-parameterization and the estimated model, AdaptOn constructs counterfactual loss functions that are used to update the controller through online gradient descent. Over time, AdaptOn improves its model estimates and obtains more accurate gradient updates, which in turn improve the controller. We show that AdaptOn achieves a regret upper bound of $\text{polylog}(T)$ after $T$ time steps of agent-environment interaction. To the best of our knowledge, AdaptOn is the first algorithm that achieves $\text{polylog}(T)$ regret in adaptive control of unknown partially observable linear dynamical systems, a setting that includes linear quadratic Gaussian (LQG) control.
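
The idea of learning a partially observable system by linear regression can be illustrated with a generic open-loop sketch. This is not AdaptOn's estimator. The abstract says that estimator also handles closed-loop data, and this sketch does not. It assumes a stable system driven by i.i.d. Gaussian exploratory inputs, a made-up 2-state example, and hypothetical helper names (`simulate_lds`, `estimate_markov_params`, `true_markov_params`). Each output y_t is regressed on the last H inputs, and least squares recovers the truncated Markov parameters C A^{k-1} B.

```python
import numpy as np


def simulate_lds(A, B, C, T, rng, noise_std=0.1):
    """Roll out x_{t+1} = A x_t + B u_t + w_t, y_t = C x_t + z_t from x_0 = 0
    under i.i.d. Gaussian exploratory inputs (open loop, no controller)."""
    n, m = B.shape
    p = C.shape[0]
    x = np.zeros(n)
    us = rng.standard_normal((T, m))
    ys = np.zeros((T, p))
    for t in range(T):
        ys[t] = C @ x + noise_std * rng.standard_normal(p)
        x = A @ x + B @ us[t] + noise_std * rng.standard_normal(n)
    return us, ys


def estimate_markov_params(us, ys, H):
    """Least-squares fit of y_t ~ sum_{k=1}^{H} G_k u_{t-k}.

    Returns [G_1, ..., G_H] as a (p, H*m) matrix; G_k estimates C A^{k-1} B.
    The omitted term C A^H x_{t-H} shrinks like ||A^H|| when A is stable."""
    T, m = us.shape
    # Row t of Phi stacks the past inputs u_{t-1}, ..., u_{t-H}.
    Phi = np.array([np.concatenate([us[t - k] for k in range(1, H + 1)])
                    for t in range(H, T)])   # shape (T-H, H*m)
    Y = ys[H:]                               # shape (T-H, p)
    Theta, *_ = np.linalg.lstsq(Phi, Y, rcond=None)  # shape (H*m, p)
    return Theta.T


def true_markov_params(A, B, C, H):
    """Ground-truth blocks [CB, CAB, ..., CA^{H-1}B] for comparison."""
    blocks, Ak = [], np.eye(A.shape[0])
    for _ in range(H):
        blocks.append(C @ Ak @ B)
        Ak = A @ Ak
    return np.hstack(blocks)


# Hypothetical stable 2-state, 1-input, 1-output system.
A = np.array([[0.7, 0.2], [0.0, 0.5]])
B = np.array([[1.0], [0.5]])
C = np.array([[1.0, 0.0]])
H = 10

rng = np.random.default_rng(0)
us, ys = simulate_lds(A, B, C, T=5000, rng=rng)
G_hat = estimate_markov_params(us, ys, H)
G_true = true_markov_params(A, B, C, H)
print("estimation error:", np.linalg.norm(G_hat - G_true))
```

With open-loop i.i.d. inputs, the regressors are independent of the noise, and the error should shrink roughly like 1/sqrt(T). Under closed-loop control, the inputs depend on past outputs and therefore on past noise. This is the correlation the abstract identifies as the difficulty, and this simple sketch avoids it rather than resolving it.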