Title
Unsupervised Learning under Latent Label Shift
Authors
Abstract
What sorts of structure might enable a learner to discover classes from unlabeled data? Traditional approaches rely on feature-space similarity and heroic assumptions on the data. In this paper, we introduce unsupervised learning under Latent Label Shift (LLS), where we have access to unlabeled data from multiple domains such that the label marginals $p_d(y)$ can shift across domains but the class conditionals $p(\mathbf{x}|y)$ do not. This work instantiates a new principle for identifying classes: elements that shift together group together. For finite input spaces, we establish an isomorphism between LLS and topic modeling: inputs correspond to words, domains to documents, and labels to topics. Addressing continuous data, we prove that when each label's support contains a separable region, analogous to an anchor word, oracle access to $p(d|\mathbf{x})$ suffices to identify $p_d(y)$ and $p_d(y|\mathbf{x})$ up to permutation. Thus motivated, we introduce a practical algorithm that leverages domain-discriminative models as follows: (i) push examples through domain discriminator $p(d|\mathbf{x})$; (ii) discretize the data by clustering examples in $p(d|\mathbf{x})$ space; (iii) perform non-negative matrix factorization on the discrete data; (iv) combine the recovered $p(y|d)$ with the discriminator outputs $p(d|\mathbf{x})$ to compute $p_d(y|\mathbf{x}) \; \forall d$. With semi-synthetic experiments, we show that our algorithm can leverage domain information to improve upon competitive unsupervised classification methods. We reveal a failure mode of standard unsupervised classification methods when feature-space similarity does not indicate true groupings, and show empirically that our method better handles this case. Our results establish a deep connection between distribution shift and topic modeling, opening promising lines for future work.
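The four-step procedure above can be sketched in code. What follows is a minimal toy illustration, not the paper's implementation: the domain discriminator is stood in for by a logistic-regression classifier on synthetic data, the cluster count `m` and class count `k` are assumed knowns, and step (iv) is realized by one possible Bayes-rule combination consistent with the label-shift assumption (recovering a pooled $p(y|\mathbf{x})$ from $p(d|\mathbf{x}) = \sum_y p(d|y)\,p(y|\mathbf{x})$, then reweighting by $p(y|d)/p(y)$). All variable names are ours.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Toy data: n examples from D domains, k latent classes assumed known.
n, D, k = 600, 3, 2
X = rng.normal(size=(n, 5))
d = rng.integers(0, D, size=n)

# (i) Train a domain discriminator and push every example through it
#     to obtain p(d|x). A logistic model stands in for any classifier.
disc = LogisticRegression(max_iter=1000).fit(X, d)
p_d_given_x = disc.predict_proba(X)          # shape (n, D)

# (ii) Discretize: cluster examples in p(d|x) space; clusters play the
#      role of "words" in the topic-model analogy.
m = 10
clusters = KMeans(n_clusters=m, n_init=10, random_state=0).fit_predict(p_d_given_x)

# Cluster-by-domain counts (analogous to a word-document matrix),
# normalized to p(cluster|d).
counts = np.zeros((m, D))
np.add.at(counts, (clusters, d), 1)
p_cluster_given_d = counts / counts.sum(axis=0, keepdims=True)

# (iii) NMF: p(cluster|d) ~ p(cluster|y) p(y|d).
nmf = NMF(n_components=k, init="nndsvda", max_iter=1000, random_state=0)
W = nmf.fit_transform(p_cluster_given_d)     # ~ p(cluster|y), shape (m, k)
H = nmf.components_                          # ~ p(y|d) up to scale, shape (k, D)
p_y_given_d = H / H.sum(axis=0, keepdims=True)

# (iv) Combine p(y|d) with discriminator outputs. First invert Bayes' rule
#      to get p(d|y) and the pooled label marginal p(y).
p_dom = np.bincount(d, minlength=D) / n
joint = p_y_given_d * p_dom[None, :]         # ∝ p(y, d), shape (k, D)
p_y = joint.sum(axis=1)                      # pooled p(y)
p_d_given_y = (joint / p_y[:, None]).T       # shape (D, k), columns sum to 1

# Under label shift, p(d|x) = sum_y p(d|y) p(y|x): solve for pooled p(y|x)
# by least squares, then clip/renormalize onto the simplex.
sol, *_ = np.linalg.lstsq(p_d_given_y, p_d_given_x.T, rcond=None)
p_y_given_x = np.clip(sol.T, 1e-12, None)
p_y_given_x /= p_y_given_x.sum(axis=1, keepdims=True)

# Reweight per domain: p_d(y|x) ∝ p(y|x) * p(y|d) / p(y), for all d.
p_d_y_given_x = p_y_given_x[:, :, None] * (p_y_given_d / p_y[:, None])[None, :, :]
p_d_y_given_x /= p_d_y_given_x.sum(axis=1, keepdims=True)  # shape (n, k, D)
```

On real data, steps (i)–(ii) would use a deep domain discriminator, and the recovered labels are only identified up to permutation, as the theory states.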