Paper Title
Fast and Three-rious: Speeding Up Weak Supervision with Triplet Methods
Paper Authors
Paper Abstract
Weak supervision is a popular method for building machine learning models without relying on ground truth annotations. Instead, it generates probabilistic training labels by estimating the accuracies of multiple noisy labeling sources (e.g., heuristics, crowd workers). Existing approaches use latent variable estimation to model the noisy sources, but these methods can be computationally expensive, scaling superlinearly in the data. In this work, we show that, for a class of latent variable models highly applicable to weak supervision, we can find a closed-form solution to model parameters, obviating the need for iterative solutions like stochastic gradient descent (SGD). We use this insight to build FlyingSquid, a weak supervision framework that runs orders of magnitude faster than previous weak supervision approaches and requires fewer assumptions. In particular, we prove bounds on generalization error without assuming that the latent variable model can exactly parameterize the underlying data distribution. Empirically, we validate FlyingSquid on benchmark weak supervision datasets and find that it achieves the same or higher quality compared to previous approaches without the need to tune an SGD procedure, recovers model parameters 170 times faster on average, and enables new video analysis and online learning applications.
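The closed-form claim in the abstract rests on a simple identity: if binary labeling sources are conditionally independent given the (unobserved) true label, the expected pairwise agreement of two sources factorizes into the product of their individual accuracy parameters, so any triplet of sources pins down each parameter without iterative optimization. Below is a minimal NumPy sketch of such a triplet estimate; the setup (votes in {-1, +1}, the function name, the toy simulation) is our illustration under those assumptions, not the paper's actual FlyingSquid implementation.

```python
import numpy as np

# Sketch of a closed-form "triplet" accuracy estimate, assuming:
#   - a binary task with hidden true label y in {-1, +1},
#   - each source i emits votes lambda_i in {-1, +1},
#   - sources are conditionally independent given y, so
#       E[lambda_i * lambda_j] = a_i * a_j,  where a_i = E[lambda_i * y].
# For any triplet (i, j, k) this gives, up to sign (resolved here by
# assuming better-than-random sources, a_i > 0):
#   a_i = sqrt( E[l_i l_j] * E[l_i l_k] / E[l_j l_k] )

def triplet_accuracies(votes: np.ndarray) -> np.ndarray:
    """votes: (n_examples, 3) array of {-1, +1} votes from three sources.
    Returns estimates of the accuracy parameters a_i = E[lambda_i * y]."""
    m_ij = np.mean(votes[:, 0] * votes[:, 1])
    m_ik = np.mean(votes[:, 0] * votes[:, 2])
    m_jk = np.mean(votes[:, 1] * votes[:, 2])
    a_i = np.sqrt(np.abs(m_ij * m_ik / m_jk))
    a_j = np.sqrt(np.abs(m_ij * m_jk / m_ik))
    a_k = np.sqrt(np.abs(m_ik * m_jk / m_ij))
    return np.array([a_i, a_j, a_k])

# Toy usage: simulate a hidden label and three conditionally independent
# noisy sources, then recover their accuracies from votes alone.
rng = np.random.default_rng(0)
n = 100_000
y = rng.choice([-1, 1], size=n)
true_acc = np.array([0.9, 0.7, 0.6])        # P(lambda_i == y)
votes = np.where(rng.random((n, 3)) < true_acc, y[:, None], -y[:, None])
print(triplet_accuracies(votes))            # approx 2*true_acc - 1 = [0.8, 0.4, 0.2]
```

Because each estimate is just three sample means and a square root, the cost is a single pass over the data, which is the intuition behind the abstract's claim of orders-of-magnitude speedups over SGD-based latent variable estimation.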