Paper Title

Design of Reconfigurable Multi-Operand Adder for Massively Parallel Processing

Paper Authors

Shilpa Mayannavar, Uday Wali

Abstract

The paper presents a systematic study and implementation of a reconfigurable combinatorial multi-operand adder for use in Deep Learning systems. The size of the carry changes with the number of operands, so a reliable algorithm to estimate the exact number of carry bits is needed for an optimal implementation of a reconfigurable multi-operand adder. A combinatorial multi-operand adder can be faster than a sequential implementation using a two-operand adder. Use cases for such adders occur in modern processors for deep neural networks, which require massively parallel computing resources on chip. This paper presents a method to estimate the upper bound on the size of the carry and a method to compute the exact number of carry bits required for a multi-operand addition operation. A fast combinatorial parallel 4-operand adder module is presented, along with an algorithm to reconfigure these adder modules to implement larger adders. Further, the paper presents two compact but slower iterative structures that implement multi-operand addition, iterating over one column at a time until the entire word is covered. Such serial/iterative operations are slow but occupy little space, while parallel operations are fast but use a large silicon area on chip. Interestingly, the area-to-throughput ratio of the two architectures can tilt in favor of a large number of slower, smaller units instead of a smaller number of fast, large compute units. A lemma presented in the paper may be used to identify the condition under which such a tilt occurs. Potentially, this can save silicon space and increase the throughput of chips for high-performance computing. Simulation results are presented for a 16-operand adder and for a set of 4-operand adders used in neural networks. The results show that the performance gain improves as the number of operations or operands increases.
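The upper bound on the carry size mentioned in the abstract can be illustrated with the standard counting argument: the sum of n operands, each smaller than 2^k, is smaller than n * 2^k, so at most ceil(log2(n)) bits are needed beyond the operand width. The sketch below shows this bound; function names are illustrative and this is the textbook bound, not the paper's exact carry-bit formula or lemma.

```python
import math

def carry_bits_upper_bound(num_operands: int) -> int:
    """Upper bound on extra carry bits when adding num_operands k-bit words.

    The sum of n values, each < 2**k, is < n * 2**k, so at most
    ceil(log2(n)) bits are needed beyond the word width k.
    (Standard bound; the paper derives the exact count.)
    """
    return math.ceil(math.log2(num_operands))

def multi_operand_add(operands, width):
    """Reference multi-operand addition; the result always fits in
    width + carry_bits_upper_bound(len(operands)) bits."""
    total = sum(operands)
    assert total < len(operands) << width  # sum is bounded by n * 2**width
    return total

# Adding sixteen 8-bit operands needs at most 4 extra carry bits,
# so a 16-operand adder for 8-bit words produces a 12-bit result.
print(carry_bits_upper_bound(16))   # → 4
print(multi_operand_add([255] * 16, 8))  # worst case: 16 * 255 = 4080
```

For the paper's 4-operand adder module, the bound gives ceil(log2(4)) = 2 extra carry bits, which matches the worst case of four maximal k-bit operands.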
