Paper Title
Sparsity-Constrained Optimal Transport
Paper Authors
Paper Abstract
Regularized optimal transport (OT) is now increasingly used as a loss or as a matching layer in neural networks. Entropy-regularized OT can be computed using the Sinkhorn algorithm, but it leads to fully-dense transportation plans, meaning that all sources are (fractionally) matched with all targets. To address this issue, several works have investigated quadratic regularization instead. This regularization preserves sparsity and leads to unconstrained and smooth (semi) dual objectives that can be solved with off-the-shelf gradient methods. Unfortunately, quadratic regularization does not give direct control over the cardinality (number of nonzeros) of the transportation plan. We propose in this paper a new approach for OT with explicit cardinality constraints on the transportation plan. Our work is motivated by an application to sparse mixtures of experts, where OT can be used to match input tokens, such as image patches, with expert models, such as neural networks. Cardinality constraints ensure that at most $k$ tokens are matched with an expert, which is crucial for computational performance reasons. Despite the nonconvexity of cardinality constraints, we show that the corresponding (semi) dual problems are tractable and can be solved with first-order gradient methods. Our method can be thought of as a middle ground between unregularized OT (recovered in the limit case $k=1$) and quadratically-regularized OT (recovered when $k$ is large enough). The smoothness of the objectives increases as $k$ increases, giving rise to a trade-off between convergence speed and sparsity of the optimal plan.
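To make the cardinality constraint concrete, here is a minimal NumPy sketch of one ingredient such a method needs: mapping a score vector to a probability vector with at most $k$ nonzeros. It keeps the $k$ largest entries and projects them onto the probability simplex (a standard sort-based projection), zeroing out the rest. The toy "routing" at the end (matrix `C`, potential `f`, function name `topk_simplex_projection`) is our own illustrative construction, not the paper's exact algorithm or notation.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    # Largest index rho such that u[rho] stays positive after the shift.
    rho = np.nonzero(u * np.arange(1, v.size + 1) > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - theta, 0.0)

def topk_simplex_projection(v, k):
    """Map v to a probability vector with at most k nonzeros:
    keep the k largest entries, project them onto the simplex,
    and set all other entries to zero."""
    idx = np.argpartition(v, -k)[-k:]
    p = np.zeros_like(v)
    p[idx] = project_simplex(v[idx])
    return p

# Toy example: 5 tokens, 3 experts. Column j of the plan is built from
# hypothetical dual scores f - C[:, j]; each expert receives at most k tokens.
rng = np.random.default_rng(0)
C = rng.random((5, 3))   # token-to-expert cost matrix (made up for the demo)
f = rng.random(5)        # hypothetical dual potential over tokens
k = 2
P = np.stack([topk_simplex_projection(f - C[:, j], k) for j in range(3)], axis=1)

assert np.allclose(P.sum(axis=0), 1.0)            # each column is a distribution
assert (np.count_nonzero(P, axis=0) <= k).all()   # at most k tokens per expert
```

With $k=1$ each column collapses to a single token (the hard-assignment regime the abstract associates with unregularized OT), while a large $k$ leaves the plain simplex projection, mirroring the quadratically-regularized limit.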