Paper Title
S4: a High-sparsity, High-performance AI Accelerator
Paper Authors
Paper Abstract
Exploiting the sparsity underlying neural networks has become one of the most promising ways to reduce the memory footprint, I/O cost, and computation workload during inference. Moreover, the degree of sparsity that can be exploited has grown as larger model sizes are considered, following the trend of pre-training giant models. On the other hand, in contrast to quantization, which is already widely supported, acceleration through high-degree sparsity is not supported on most computing platforms. In this work, we introduce S4, the first commercial hardware platform supporting high-degree sparsity acceleration of up to 32 times. Combined with state-of-the-art sparse pruning techniques, we demonstrate a several-fold practical inference speedup on S4 over mainstream inference platforms such as Nvidia T4. We also show that, in practice, a sparse model of larger size can achieve both higher accuracy and higher throughput on S4 than a dense model of smaller size.
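To make the "up to 32 times" sparsity figure concrete, below is a minimal, hypothetical sketch of unstructured magnitude pruning down to 1/32 density. It is not the paper's pruning pipeline or the S4 hardware path; the function name and thresholding choice are illustrative assumptions. It only shows how keeping the top-magnitude 1/32 of weights shrinks the set of nonzeros that a sparsity-aware accelerator would need to store and compute on.

# Hypothetical sketch (not the paper's method): unstructured magnitude pruning
# to ~1/32 density, illustrating the degree of sparsity the abstract refers to.
import numpy as np

def magnitude_prune(weights: np.ndarray, keep_ratio: float = 1 / 32) -> np.ndarray:
    """Zero out all but the largest-magnitude `keep_ratio` fraction of weights."""
    k = max(1, int(weights.size * keep_ratio))
    # k-th largest absolute value serves as the pruning threshold
    threshold = np.partition(np.abs(weights).ravel(), -k)[-k]
    return np.where(np.abs(weights) >= threshold, weights, 0.0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((1024, 1024)).astype(np.float32)
    w_sparse = magnitude_prune(w)
    density = np.count_nonzero(w_sparse) / w_sparse.size
    # Storing only the nonzeros (plus their indices) shrinks the memory footprint
    # roughly in proportion to the density, which is what sparsity acceleration exploits.
    print(f"density: {density:.4f}  (~{1 / density:.0f}x fewer nonzero weights)")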