Paper Title

Near-Optimal Differentially Private Reinforcement Learning

Paper Authors

Dan Qiao, Yu-Xiang Wang

Paper Abstract

Motivated by personalized healthcare and other applications involving sensitive data, we study online exploration in reinforcement learning with differential privacy (DP) constraints. Existing work on this problem established that no-regret learning is possible under joint differential privacy (JDP) and local differential privacy (LDP) but did not provide an algorithm with optimal regret. We close this gap for the JDP case by designing an $\epsilon$-JDP algorithm with a regret of $\widetilde{O}(\sqrt{SAH^2T}+S^2AH^3/\epsilon)$ which matches the information-theoretic lower bound of non-private learning for all choices of $\epsilon > S^{1.5}A^{0.5}H^2/\sqrt{T}$. In the above, $S$, $A$ denote the number of states and actions, $H$ denotes the planning horizon, and $T$ is the number of steps. To the best of our knowledge, this is the first private RL algorithm that achieves \emph{privacy for free} asymptotically as $T\rightarrow \infty$. Our techniques -- which could be of independent interest -- include privately releasing Bernstein-type exploration bonuses and an improved method for releasing visitation statistics. The same techniques also imply a slightly improved regret bound for the LDP case.
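To make the two ingredients mentioned in the abstract more concrete, below is a minimal sketch of how privately released visitation statistics can feed a Bernstein-type exploration bonus. It uses the standard Laplace mechanism with an assumed per-count sensitivity of 1 and generic UCB-style constants; the function names, the noise calibration, and the exact bonus form are illustrative assumptions, not the paper's improved release method or its precise bonus construction.

```python
import numpy as np

def private_visit_counts(true_counts, epsilon, rng=None):
    """Release visitation counts under epsilon-DP via the Laplace mechanism.

    Generic sketch (not the paper's improved release method): each count
    receives independent Laplace(1/epsilon) noise, assuming sensitivity 1
    per count for illustration.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon, size=np.shape(true_counts))
    # Floor at 1 so the noisy counts can safely appear in denominators below.
    return np.maximum(true_counts + noise, 1.0)

def bernstein_bonus(var_hat, n_tilde, H, delta):
    """Bernstein-type exploration bonus built from released statistics.

    `var_hat` is an (empirical) variance estimate, `n_tilde` a noisy count,
    `H` the planning horizon. Generic UCB-style form:
        sqrt(2 * Var * log(1/delta) / n) + H * log(1/delta) / n.
    The paper's bonus uses the same structure but different correction terms.
    """
    log_term = np.log(1.0 / delta)
    return np.sqrt(2.0 * var_hat * log_term / n_tilde) + H * log_term / n_tilde

# Example: release counts for a toy state-action table, then compute bonuses.
counts = np.array([[12.0, 3.0], [0.0, 25.0]])
noisy = private_visit_counts(counts, epsilon=1.0)
bonus = bernstein_bonus(var_hat=1.0, n_tilde=noisy, H=10, delta=0.05)
```

The point of the sketch is only the interface: the planner never touches raw counts, so every downstream quantity (empirical model, variance estimate, bonus) is a post-processing of DP releases and inherits the privacy guarantee.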
