Paper Title
ESSOP: Efficient and Scalable Stochastic Outer Product Architecture for Deep Learning
Paper Authors
Paper Abstract
Deep neural networks (DNNs) have surpassed human-level accuracy in a variety of cognitive tasks, but at the cost of significant memory/time requirements for DNN training. This limits their deployment in energy- and memory-limited applications that require real-time learning. Matrix-vector multiplication (MVM) and vector-vector outer product (VVOP) are the two most expensive operations associated with the training of DNNs. Strategies to improve the efficiency of MVM computation in hardware have been demonstrated with minimal impact on training accuracy. However, the VVOP computation remains a relatively less explored bottleneck even with the aforementioned strategies. Stochastic computing (SC) has been proposed to improve the efficiency of VVOP computation, but only on relatively shallow networks with bounded activation functions and floating-point (FP) scaling of activation gradients. In this paper, we propose ESSOP, an efficient and scalable stochastic outer product architecture based on the SC paradigm. We introduce efficient techniques to generalize SC for weight-update computation in DNNs with unbounded activation functions (e.g., ReLU), as required by many state-of-the-art networks. Our architecture reduces the computational cost by re-using random numbers and replacing certain FP multiplication operations with bit-shift scaling. We show that the ResNet-32 network, with 33 convolution layers and a fully-connected layer, can be trained with ESSOP on the CIFAR-10 dataset to achieve accuracy comparable to the baseline. A hardware design of ESSOP at the 14 nm technology node shows that, compared to a highly pipelined FP16 multiplier design, ESSOP is 82.2% and 93.7% better in energy and area efficiency, respectively, for outer product computation.
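To make the two ideas highlighted in the abstract more concrete (power-of-two, i.e., bit-shift, scaling of unbounded values before stochastic encoding, and re-use of random numbers across vector elements), the sketch below shows how a sign-magnitude stochastic outer product could be modeled in software. This is not the paper's implementation or hardware design; the function name, the `n_bits` parameter, and the NumPy-based encoding are illustrative assumptions only.

```python
import numpy as np


def stochastic_outer_product(x, delta, n_bits=64, rng=None):
    # Hypothetical software model of a stochastic outer product:
    # each magnitude is scaled into (0, 1] by its nearest power of two
    # (a bit-shift in hardware), encoded as a Bernoulli bitstream, and
    # products are approximated by AND-ing bitstreams. One random
    # sequence per vector is shared by all of its elements, loosely
    # mimicking the random-number re-use described in the abstract.
    rng = np.random.default_rng() if rng is None else rng

    def pow2_scale(v):
        mag = np.abs(v).astype(np.float64)
        exp = np.where(mag > 0, np.ceil(np.log2(np.maximum(mag, 1e-38))), 0.0)
        scale = 2.0 ** exp
        prob = np.where(mag > 0, mag / scale, 0.0)   # in (0, 1], or exactly 0
        return prob, scale, np.sign(v)

    px, sx, sgx = pow2_scale(x)
    pd, sd, sgd = pow2_scale(delta)

    # Independent random sequences for the two operands (so that AND
    # approximates a product), each re-used across all elements of its vector.
    rx = rng.random(n_bits)
    rd = rng.random(n_bits)
    bits_x = px[:, None] > rx[None, :]     # shape (len(x), n_bits)
    bits_d = pd[:, None] > rd[None, :]     # shape (len(delta), n_bits)

    # Mean of AND-ed bitstreams approximates the product of scaled magnitudes.
    prod = (bits_x[:, None, :] & bits_d[None, :, :]).mean(axis=-1)

    # Undo the power-of-two scaling and restore the signs.
    return (sgx[:, None] * sgd[None, :]) * prod * (sx[:, None] * sd[None, :])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.maximum(rng.standard_normal(4), 0.0)   # ReLU-like (unbounded) activations
    d = 0.1 * rng.standard_normal(3)              # backpropagated errors
    print("stochastic:", stochastic_outer_product(x, d, n_bits=4096, rng=rng))
    print("exact     :", np.outer(x, d))
```

In this toy model the scale factors are undone with FP multiplications for clarity; in a hardware realization such power-of-two factors would correspond to exponent additions or bit shifts, which is the cost saving the abstract refers to.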