Paper title
Vectorization of a thread-parallel Jacobi singular value decomposition method
Paper authors
Paper abstract
The eigenvalue decomposition (EVD) of (a batch of) Hermitian matrices of order two plays a role in many numerical algorithms, of which the one-sided Jacobi method for the singular value decomposition (SVD) is the prime example. In this paper the batched EVD is vectorized, using a vector-friendly data layout and the AVX-512 SIMD instructions of Intel CPUs, alongside other key components of a real and a complex OpenMP-parallel Jacobi-type SVD method inspired by the sequential xGESVJ routines from LAPACK. These vectorized building blocks should be portable to other platforms that support similar vector operations. Unconditional numerical reproducibility is guaranteed for the batched EVD, whether sequential or threaded, and for the column transformations, which, like the scaled dot-products, are presently sequential but can be threaded if nested parallelism is desired. No avoidable overflow of the results can occur with the proposed EVD or the whole SVD. The measured accuracy of the proposed EVD often surpasses that of the xLAEV2 routines from LAPACK. While the batched EVD outperforms the matching sequence of xLAEV2 calls, the speedup of the parallel SVD is modest, though it can be improved and is already beneficial with enough threads. Regardless of the number of threads, the proposed SVD method gives identical results, but of somewhat lower accuracy than xGESVJ.
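To make the central building block concrete, here is a minimal sketch of a batched EVD of real symmetric matrices of order two, using the textbook Jacobi rotation formula. It is not the paper's algorithm: it covers only the real case, uses plain Python lists instead of AVX-512 vectors, and makes no claim to the overflow protection or accuracy described above. The function name `evd2_batch` and the three-array input layout (one array per distinct matrix entry, so that lane `i` of each array belongs to matrix `i`) are assumptions chosen for illustration.

```python
import math

def evd2_batch(a11, a21, a22):
    """EVD of a batch of real symmetric 2x2 matrices A = [[a11, a21], [a21, a22]].

    The inputs are three equal-length sequences (a structure-of-arrays layout,
    assumed here for illustration). For each lane the function returns cs, sn,
    lam1, lam2 such that J = [[cs, sn], [-sn, cs]] gives J^T A J = diag(lam1, lam2).
    """
    cs, sn, lam1, lam2 = [], [], [], []
    for a, b, d in zip(a11, a21, a22):
        if b == 0.0:
            t = 0.0  # the matrix is already diagonal, no rotation needed
        else:
            tau = (d - a) / (2.0 * b)
            # Smaller-magnitude root of t^2 + 2*tau*t - 1 = 0, where t = tan(phi).
            # hypot(1, tau) evaluates sqrt(1 + tau^2) without squaring tau.
            t = math.copysign(1.0, tau) / (abs(tau) + math.hypot(1.0, tau))
        c = 1.0 / math.hypot(1.0, t)  # cos(phi)
        cs.append(c)
        sn.append(t * c)              # sin(phi)
        lam1.append(a - t * b)        # eigenvalue replacing a11
        lam2.append(d + t * b)        # eigenvalue replacing a22
    return cs, sn, lam1, lam2

# Usage: three matrices processed as one batch.
cs, sn, lam1, lam2 = evd2_batch([2.0, 4.0, 1.0], [1.0, 0.0, 2.0], [2.0, 1.0, -2.0])
for i in range(3):
    print(i, cs[i], sn[i], lam1[i], lam2[i])
```

Because every lane runs the same arithmetic on its own data, the loop body maps naturally onto SIMD lanes once the `b == 0` branch is replaced by a masked or branch-free formulation, which is the kind of restructuring that a vector-friendly layout is meant to enable.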