Title
Parallel Algorithms for Tensor Train Arithmetic
Authors
Abstract
We present efficient and scalable parallel algorithms for performing mathematical operations for low-rank tensors represented in the tensor train (TT) format. We consider algorithms for addition, elementwise multiplication, computing norms and inner products, orthogonalization, and rounding (rank truncation). These are the kernel operations for applications such as iterative Krylov solvers that exploit the TT structure. The parallel algorithms are designed for distributed-memory computation, and we use a data distribution and strategy that parallelizes computations for individual cores within the TT format. We analyze the computation and communication costs of the proposed algorithms to show their scalability, and we present numerical experiments that demonstrate their efficiency on both shared-memory and distributed-memory parallel systems. For example, we observe better single-core performance than the existing MATLAB TT-Toolbox in rounding a 2GB TT tensor, and our implementation achieves a $34\times$ speedup using all 40 cores of a single node. We also show nearly linear parallel scaling on larger TT tensors up to over 10,000 cores for all mathematical operations.
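To make the TT operations listed above concrete, here is a minimal serial sketch (not from the paper; the function names `tt_entry` and `tt_add` are illustrative) of the standard TT representation and of TT addition, in which the summands' cores are stacked block-diagonally, with the first and last cores simply concatenated along their rank mode:

```python
import numpy as np

def tt_entry(cores, idx):
    # Evaluate one entry of a TT tensor: the product of the
    # rank-(r_{k-1} x r_k) matrix slices G_k[:, i_k, :].
    v = np.ones((1, 1))
    for G, i in zip(cores, idx):
        v = v @ G[:, i, :]
    return v[0, 0]

def tt_add(A, B):
    # TT addition: interior cores are combined block-diagonally,
    # so the TT ranks of the sum are the sums of the ranks.
    d = len(A)
    C = []
    for k, (Ga, Gb) in enumerate(zip(A, B)):
        ra0, n, ra1 = Ga.shape
        rb0, _, rb1 = Gb.shape
        if k == 0:
            G = np.concatenate([Ga, Gb], axis=2)      # row block
        elif k == d - 1:
            G = np.concatenate([Ga, Gb], axis=0)      # column block
        else:
            G = np.zeros((ra0 + rb0, n, ra1 + rb1))   # block diagonal
            G[:ra0, :, :ra1] = Ga
            G[ra0:, :, ra1:] = Gb
        C.append(G)
    return C
```

Note that `tt_add` requires no floating-point arithmetic at all, only data movement; the cost of addition in practice comes from the subsequent rounding step, which truncates the summed ranks back down.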