Paper Title
Off-Policy Fitted Q-Evaluation with Differentiable Function Approximators: Z-Estimation and Inference Theory
Paper Authors
Paper Abstract
Off-Policy Evaluation (OPE) serves as one of the cornerstones in Reinforcement Learning (RL). Fitted Q Evaluation (FQE) with various function approximators, especially deep neural networks, has gained practical success. While statistical analysis has proved FQE to be minimax-optimal with tabular, linear, and several nonparametric function families, its practical performance with more general function approximators is less theoretically understood. We focus on FQE with general differentiable function approximators, making our theory applicable to neural function approximations. We approach this problem using Z-estimation theory and establish the following results: The FQE estimation error is asymptotically normal with explicit variance determined jointly by the tangent space of the function class at the ground truth, the reward structure, and the distribution shift due to off-policy learning; The finite-sample FQE error bound is dominated by the same variance term, and it can also be bounded by a function-class-dependent divergence, which measures how the off-policy distribution shift intertwines with the function approximator. In addition, we study bootstrapping FQE estimators for error-distribution inference and confidence-interval estimation, accompanied by a Cramér-Rao lower bound that matches our upper bounds. The Z-estimation analysis provides a generalizable theoretical framework for studying off-policy estimation in RL and yields sharp statistical theory for FQE with differentiable function approximators.
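To make the procedure the abstract refers to concrete, the sketch below runs FQE on a toy off-policy dataset with a linear (hence differentiable) function class: each iteration regresses the Q-function onto its bootstrapped Bellman target under the target policy, and the final fit is plugged in to estimate the policy's value. All names, sizes, and hyperparameters here are illustrative assumptions, not taken from the paper.

```python
# Minimal FQE sketch on a random toy MDP; everything below is illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, n_samples = 5, 2, 0.9, 5000

# Toy MDP and policies (transition kernel P, reward R, behavior and target policies).
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(size=(n_states, n_actions))
behavior = rng.dirichlet(np.ones(n_actions), size=n_states)
target = rng.dirichlet(np.ones(n_actions), size=n_states)

# Off-policy transitions (s, a, r, s') collected by the behavior policy.
s = rng.integers(n_states, size=n_samples)
a = np.array([rng.choice(n_actions, p=behavior[si]) for si in s])
r = R[s, a]
s_next = np.array([rng.choice(n_states, p=P[si, ai]) for si, ai in zip(s, a)])

def features(s_idx, a_idx):
    """One-hot (s, a) features: a simple differentiable (linear) function class."""
    phi = np.zeros((len(s_idx), n_states * n_actions))
    phi[np.arange(len(s_idx)), s_idx * n_actions + a_idx] = 1.0
    return phi

theta = np.zeros(n_states * n_actions)
for _ in range(200):  # FQE iterations: regress Q onto the Bellman target under the target policy.
    q_next = features(np.repeat(s_next, n_actions),
                      np.tile(np.arange(n_actions), n_samples)) @ theta
    q_next = q_next.reshape(n_samples, n_actions)
    y = r + gamma * np.sum(target[s_next] * q_next, axis=1)   # bootstrapped regression target
    theta, *_ = np.linalg.lstsq(features(s, a), y, rcond=None)  # least-squares fit of Q_{k+1}

# Plug-in value estimate for the target policy from a uniform initial-state distribution.
q_hat = (features(np.repeat(np.arange(n_states), n_actions),
                  np.tile(np.arange(n_actions), n_states)) @ theta).reshape(n_states, n_actions)
v_hat = np.mean(np.sum(target * q_hat, axis=1))
print(f"FQE value estimate: {v_hat:.3f}")
```

With a neural network in place of the one-hot features, the least-squares step would be replaced by gradient-based regression on the same targets; the paper's Z-estimation analysis covers such differentiable approximators, and the asymptotic variance it derives is what a bootstrap over the transition data would be used to estimate in practice.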