Paper Title
Temperature check: theory and practice for training models with softmax-cross-entropy losses
Paper Authors
Paper Abstract
The softmax function combined with a cross-entropy loss is a principled approach to modeling probability distributions that has become ubiquitous in deep learning. The softmax function is defined by a lone hyperparameter, the temperature, that is commonly set to one or regarded as a way to tune model confidence after training; however, less is known about how the temperature impacts training dynamics or generalization performance. In this work we develop a theory of early learning for models trained with softmax-cross-entropy loss and show that the learning dynamics depend crucially on the inverse-temperature $\beta$ as well as the magnitude of the logits at initialization, $\|\beta{\bf z}\|_{2}$. We follow up these analytic results with a large-scale empirical study of a variety of model architectures trained on CIFAR10, ImageNet, and IMDB sentiment analysis. We find that generalization performance depends strongly on the temperature, but only weakly on the initial logit magnitude. We provide evidence that the dependence of generalization on $\beta$ is not due to changes in model confidence, but is a dynamical phenomenon. It follows that the addition of $\beta$ as a tunable hyperparameter is key to maximizing model performance. Although we find the optimal $\beta$ to be sensitive to the architecture, our results suggest that tuning $\beta$ over the range $10^{-2}$ to $10^{1}$ improves performance over all architectures studied. We find that smaller $\beta$ may lead to better peak performance at the cost of learning stability.
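
For reference, a minimal sketch of the loss the abstract refers to, assuming the standard convention in which the inverse temperature $\beta = 1/T$ scales the logits ${\bf z}$ before the softmax (the paper's exact parameterization may differ in detail):

$$ p_i(\beta{\bf z}) = \frac{\exp(\beta z_i)}{\sum_{j}\exp(\beta z_j)}, \qquad \mathcal{L}(\beta{\bf z}, y) = -\log p_y(\beta{\bf z}). $$

Under this convention, small $\beta$ (high temperature) flattens the predicted distribution while large $\beta$ sharpens it; the suggested tuning range $\beta \in [10^{-2}, 10^{1}]$ corresponds to temperatures $T$ between $10^{-1}$ and $10^{2}$.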