Paper Title

Hybrid Stochastic-Deterministic Minibatch Proximal Gradient: Less-Than-Single-Pass Optimization with Nearly Optimal Generalization

Paper Authors

Pan Zhou, Xiaotong Yuan

Paper Abstract

Stochastic variance-reduced gradient (SVRG) algorithms have been shown to work favorably in solving large-scale learning problems. Despite the remarkable success, the stochastic gradient complexity of SVRG-type algorithms usually scales linearly with data size and thus could still be expensive for huge data. To address this deficiency, we propose a hybrid stochastic-deterministic minibatch proximal gradient (HSDMPG) algorithm for strongly-convex problems that enjoys provably improved data-size-independent complexity guarantees. More precisely, for a quadratic loss $F(\theta)$ of $n$ components, we prove that HSDMPG can attain an $\epsilon$-optimization-error $\mathbb{E}[F(\theta)-F(\theta^*)]\leq\epsilon$ within $\mathcal{O}\Big(\frac{\kappa^{1.5}\epsilon^{0.75}\log^{1.5}(\frac{1}{\epsilon})+1}{\epsilon}\wedge\Big(\kappa\sqrt{n}\log^{1.5}\big(\frac{1}{\epsilon}\big)+n\log\big(\frac{1}{\epsilon}\big)\Big)\Big)$ stochastic gradient evaluations, where $\kappa$ is the condition number. For generic strongly convex loss functions, we prove a nearly identical complexity bound, though at the cost of slightly increased logarithmic factors. For large-scale learning problems, our complexity bounds are superior to those of the prior state-of-the-art SVRG algorithms, with or without dependence on data size. In particular, in the case of $\epsilon=\mathcal{O}\big(1/\sqrt{n}\big)$, which is on the order of the intrinsic excess error bound of a learning model and thus sufficient for generalization, the stochastic gradient complexity bounds of HSDMPG for quadratic and generic loss functions are respectively $\mathcal{O}(n^{0.875}\log^{1.5}(n))$ and $\mathcal{O}(n^{0.875}\log^{2.25}(n))$, which, to the best of our knowledge, are the first to achieve optimal generalization in less than a single pass over the data. Extensive numerical results demonstrate the computational advantages of our algorithm over prior ones.
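
As a quick sanity check on the $\epsilon=\mathcal{O}(1/\sqrt{n})$ claim for the quadratic-loss bound, the short calculation below is a sketch that additionally assumes the common statistical-learning regime $\kappa=\mathcal{O}(\sqrt{n})$ (not stated in the abstract itself); under that assumption the left branch of the minimum is the smaller one and yields the quoted rate:

$$
\frac{\kappa^{1.5}\epsilon^{0.75}\log^{1.5}(1/\epsilon)+1}{\epsilon}
= \kappa^{1.5}\,\epsilon^{-0.25}\log^{1.5}(1/\epsilon)+\epsilon^{-1}
= \mathcal{O}\big(n^{0.75}\cdot n^{0.125}\cdot\log^{1.5}(n)\big)+\mathcal{O}\big(n^{0.5}\big)
= \mathcal{O}\big(n^{0.875}\log^{1.5}(n)\big),
$$

which is below the $\kappa\sqrt{n}\log^{1.5}(1/\epsilon)+n\log(1/\epsilon)=\mathcal{O}(n\log^{1.5}(n))$ branch, i.e., sublinear in $n$.

The abstract does not spell out the HSDMPG update itself, so the snippet below is only an illustrative sketch of the general family it belongs to: a minibatch proximal-gradient loop whose gradient estimator combines a deterministic full-gradient anchor with stochastic minibatch corrections (a prox-SVRG-style estimator), applied to a ridge-regression quadratic loss. All names (`minibatch_prox_svrg`, `prox_l2`) and parameter choices here are hypothetical illustrations, not the paper's algorithm.

```python
import numpy as np

def prox_l2(v, lam, eta):
    # Proximal operator of the regularizer (lam/2)*||theta||^2 with step eta:
    # argmin_x (lam/2)||x||^2 + (1/(2*eta))||x - v||^2, which has a closed form.
    return v / (1.0 + eta * lam)

def minibatch_prox_svrg(A, b, lam=1e-3, eta=0.1, batch=16, outer=20, inner=None, seed=0):
    """Prox-SVRG-style minibatch proximal gradient on ridge regression
    F(theta) = (1/2n)*||A theta - b||^2 + (lam/2)*||theta||^2.
    Illustrative sketch only; eta should be <= 1/L for the smooth term."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    batch = min(batch, n)
    inner = inner or max(1, n // batch)
    theta = np.zeros(d)
    for _ in range(outer):
        # Deterministic anchor: full gradient of the smooth data-fitting term.
        anchor = theta.copy()
        full_grad = A.T @ (A @ anchor - b) / n
        x = theta.copy()
        for _ in range(inner):
            idx = rng.choice(n, size=batch, replace=False)
            Ai, bi = A[idx], b[idx]
            # Hybrid estimator: stochastic minibatch gradient corrected by the anchor.
            g = Ai.T @ (Ai @ x - bi) / batch - Ai.T @ (Ai @ anchor - bi) / batch + full_grad
            # Proximal step handles the strongly convex regularizer.
            x = prox_l2(x - eta * g, lam, eta)
        theta = x
    return theta
```

For example, `theta = minibatch_prox_svrg(A, b)` on a synthetic design matrix `A` and targets `b` converges linearly to the ridge solution, which is the strongly convex quadratic setting the abstract analyzes.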
