Paper Title
Predicting Training Time Without Training
Paper Authors
Paper Abstract
We tackle the problem of predicting the number of optimization steps that a pre-trained deep network needs to converge to a given value of the loss function. To do so, we leverage the fact that the training dynamics of a deep network during fine-tuning are well approximated by those of a linearized model. This allows us to approximate the training loss and accuracy at any point during training by solving a low-dimensional Stochastic Differential Equation (SDE) in function space. Using this result, we are able to predict the time it takes for Stochastic Gradient Descent (SGD) to fine-tune a model to a given loss without having to perform any training. In our experiments, we are able to predict the training time of a ResNet within a 20% error margin on a variety of datasets and hyper-parameters, at a 30 to 45-fold reduction in cost compared to actual training. We also discuss how to further reduce the computational and memory cost of our method, and in particular we show that by exploiting the spectral properties of the gradients' matrix it is possible to predict training time on a large dataset while processing only a subset of the samples.
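To illustrate the core idea of predicting loss curves from a linearization, the sketch below works through a deliberately simplified setting: full-batch gradient descent on an MSE loss, where the linearized residual decays mode-by-mode according to the eigenvalues of the Gram matrix of per-sample gradients. This is not the paper's SDE-based estimator for SGD; the function name, inputs, and the MSE/full-batch assumptions are all illustrative simplifications.

```python
import numpy as np

def predict_steps_to_loss(J, y0, y_star, lr, target_loss, max_steps=100_000):
    """Minimal sketch (assumed simplification, not the paper's SDE estimator).

    Predict how many full-batch gradient-descent steps a *linearized* model
    needs to reach `target_loss` under an MSE objective, using only quantities
    available at initialization:

    J      : (n, p) matrix of per-sample output gradients w.r.t. the weights
    y0     : (n,) model outputs at initialization
    y_star : (n,) regression targets
    lr     : learning rate
    """
    n = len(y_star)
    # Gram ("kernel") matrix of the gradients; its spectrum governs how fast
    # each mode of the residual decays under linearized dynamics.
    G = J @ J.T
    eigvals, eigvecs = np.linalg.eigh(G)
    r0 = eigvecs.T @ (y0 - y_star)          # initial residual in the eigenbasis

    for t in range(1, max_steps + 1):
        # Linearized discrete-time dynamics: each eigen-mode decays geometrically,
        # residual_t = (I - lr/n * G)^t residual_0 in the eigenbasis.
        decay = (1.0 - lr * eigvals / n) ** t
        loss = np.mean((decay * r0) ** 2)   # predicted MSE at step t
        if loss <= target_loss:
            return t
    return None  # target loss not reached within max_steps
```

Because the predicted loss only depends on the spectrum of `G` and the projection of the initial residual onto its eigenvectors, the whole curve can be evaluated without ever running the actual optimizer; the paper's contribution is to make this kind of prediction accurate for SGD (via an SDE) and cheap for large datasets (via the spectral structure of the gradient matrix).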