Paper title
Better scalability under potentially heavy-tailed gradients
Paper authors
Paper abstract
We study a scalable alternative to robust gradient descent (RGD) techniques that can be used when gradients may be heavy-tailed, though this is unknown to the learner. The core technique is simple: instead of trying to robustly aggregate gradients at each step, which is costly and leads to sub-optimal dimension dependence in risk bounds, we choose a candidate that does not diverge too far from the majority of cheap stochastic sub-processes, each run for a single pass over partitioned data. In addition to formal guarantees, we provide an empirical analysis of robustness to perturbations of experimental conditions, under both sub-Gaussian and heavy-tailed data. The result is a procedure that is simple to implement and trivial to parallelize, which keeps the formal strength of RGD methods but scales much better to large learning problems.
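
To make the selection idea concrete, here is a minimal sketch, not the paper's exact procedure: it runs k cheap single-pass SGD sub-processes on disjoint partitions of the data and then keeps the candidate that stays close to the majority, implemented here as the candidate with the smallest median distance to the other candidates. The toy least-squares setup, the learning rate, and the median-distance selection rule are all illustrative assumptions.

import numpy as np

def sgd_single_pass(part, w0, grad_fn, lr=0.01):
    # Cheap stochastic sub-process: one pass of plain SGD over one partition.
    w = w0.copy()
    for sample in part:
        w -= lr * grad_fn(w, sample)
    return w

def select_near_majority(candidates):
    # Keep the candidate that does not diverge too far from the majority:
    # here, the one minimizing its median distance to the other candidates
    # (an illustrative selection rule, not necessarily the paper's exact one).
    W = np.stack(candidates)                                        # shape (k, d)
    dists = np.linalg.norm(W[:, None, :] - W[None, :, :], axis=-1)  # shape (k, k)
    return W[np.argmin(np.median(dists, axis=1))]

# Hypothetical usage: linear regression with squared loss under heavy-tailed data.
rng = np.random.default_rng(0)
d, n, k = 5, 10_000, 10
w_star = rng.normal(size=d)
X = rng.standard_t(df=2.5, size=(n, d))          # heavy-tailed inputs
y = X @ w_star + rng.standard_t(df=2.5, size=n)  # heavy-tailed noise

def grad_fn(w, sample):
    x, yi = sample
    return (x @ w - yi) * x                      # squared-loss gradient

data = list(zip(X, y))
parts = [data[i::k] for i in range(k)]           # disjoint partitions, one per sub-process
candidates = [sgd_single_pass(p, np.zeros(d), grad_fn) for p in parts]
w_hat = select_near_majority(candidates)
print("estimation error:", np.linalg.norm(w_hat - w_star))

The k sub-processes are independent, so they can be run in parallel across workers, and only the final k candidate vectors need to be communicated for the selection step.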