Paper Title

Causal Bandits for Linear Structural Equation Models

Paper Authors

Burak Varici, Karthikeyan Shanmugam, Prasanna Sattigeri, Ali Tajer

Paper Abstract

This paper studies the problem of designing an optimal sequence of interventions in a causal graphical model to minimize cumulative regret with respect to the best intervention in hindsight. This is, naturally, posed as a causal bandit problem. The focus is on causal bandits for linear structural equation models (SEMs) and soft interventions. The graph's structure is assumed to be known, with $N$ nodes. Each node is assumed to have two linear mechanisms, one observational and one under a soft intervention, giving rise to $2^N$ possible interventions. The majority of existing causal bandit algorithms assume that at least the interventional distributions of the reward node's parents are fully specified. However, there are $2^N$ such distributions (one corresponding to each intervention), and acquiring them becomes prohibitive even in moderate-sized graphs. This paper dispenses with the assumption of knowing these distributions or their marginals. Two algorithms are proposed, for the frequentist (UCB-based) and Bayesian (Thompson Sampling-based) settings. The key idea of both algorithms is to avoid directly estimating the $2^N$ reward distributions and instead estimate the parameters that fully specify the SEMs (whose number is linear in $N$) and use them to compute the rewards. In both algorithms, under boundedness assumptions on the noise and the parameter space, the cumulative regret scales as $\tilde{\cal O}(d^{L+\frac{1}{2}}\sqrt{NT})$, where $d$ is the graph's maximum degree and $L$ is the length of its longest causal path. Additionally, a minimax lower bound of $\Omega(d^{\frac{L}{2}-2}\sqrt{T})$ is presented, which shows that the achievable and lower bounds agree in their scaling behavior with respect to the horizon $T$ and the graph parameters $d$ and $L$.
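To make the parameter-estimation idea concrete, here is a minimal Python sketch, not the authors' implementation. All names (`B_obs`, `B_int`, `expected_reward`) and the unit-mean noise are illustrative assumptions; the point it demonstrates is that the $2N$ weight vectors of a linear SEM (one observational and one interventional per node) suffice to score all $2^N$ candidate interventions, so the reward distributions never need to be estimated arm by arm.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

N = 4  # nodes in topological order; the last node is the reward node

# Hypothetical ground-truth weights. Upper-triangular matrices (k=1)
# respect the topological order, so the graph is a DAG.
B_obs = np.triu(rng.uniform(0.5, 1.0, (N, N)), k=1)  # observational mechanism
B_int = np.triu(rng.uniform(0.0, 0.5, (N, N)), k=1)  # soft-intervention mechanism
nu = np.ones(N)  # assumed noise mean (nonzero so interventions shift the reward)

def expected_reward(intervened, B_obs, B_int, nu):
    """Expected value of the reward node under the intervention set `intervened`.

    Node i keeps its observational incoming weights B_obs[:, i] unless i is
    intervened, in which case B_int[:, i] is used. With X = W^T X + eps and
    E[eps] = nu, we get E[X] = (I - W^T)^{-1} nu.
    """
    W = B_obs.copy()
    for i in intervened:
        W[:, i] = B_int[:, i]
    mean = np.linalg.solve(np.eye(N) - W.T, nu)
    return mean[-1]  # reward node is last in the topological order

# 2N weight vectors are enough to score every one of the 2^N arms.
best = max(
    (frozenset(a) for r in range(N + 1) for a in combinations(range(N), r)),
    key=lambda a: expected_reward(a, B_obs, B_int, nu),
)
print("best intervention set:", sorted(best))
```

In the paper's algorithms the weight matrices are not known but estimated online from samples (with optimism or posterior sampling driving exploration), and the arm scores are recomputed each round; the exhaustive enumeration above is only for illustration on a small graph.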
