Paper Title

On the Parameterization and Initialization of Diagonal State Space Models

Authors

Albert Gu, Ankit Gupta, Karan Goel, Christopher Ré

Abstract

State space models (SSMs) have recently been shown to be very effective as a deep learning layer and a promising alternative to sequence models such as RNNs, CNNs, or Transformers. The first version to show this potential was the S4 model, which is particularly effective on tasks involving long-range dependencies because it uses a prescribed state matrix called the HiPPO matrix. While this gives S4 an interpretable mathematical mechanism for modeling long dependencies, it also introduces a custom representation and algorithm that can be difficult to implement. On the other hand, a recent variant of S4 called DSS showed that restricting the state matrix to be fully diagonal can still preserve the performance of the original model when using a specific initialization based on approximating S4's matrix. This work seeks to systematically understand how to parameterize and initialize such diagonal state space models. While it follows from classical results that almost all SSMs have an equivalent diagonal form, we show that the initialization is critical for performance. We explain why DSS works mathematically by showing that the diagonal restriction of S4's matrix surprisingly recovers the same kernel in the limit of infinite state dimension. We also systematically describe various design choices in parameterizing and computing diagonal SSMs, and perform a controlled empirical study ablating the effects of these choices. Our final model, S4D, is a simple diagonal version of S4 whose kernel computation requires just 2 lines of code. It performs comparably to S4 in almost all settings, with state-of-the-art results for image, audio, and medical time-series domains, and averages 85% on the Long Range Arena benchmark.
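To make the "2 lines of code" claim concrete, here is a minimal sketch of a diagonal SSM convolution kernel. It is not the authors' implementation. It assumes a zero-order-hold discretization with step size `dt`, so that each diagonal entry contributes `C_n * B_n * (exp(dt * A_n) - 1) / A_n * exp(dt * A_n * l)` to kernel position `l`. The function name `s4d_kernel` and the parameter values in the usage lines are hypothetical and chosen only for illustration.

```python
import cmath

def s4d_kernel(A, B, C, dt, L):
    """Length-L kernel K[l] = sum_n C_n * Bbar_n * Abar_n**l for a diagonal SSM.

    A, B, C are equal-length sequences of (complex) numbers; every A_n must be nonzero.
    Assumed discretization (zero-order hold): Abar = exp(dt*A), Bbar = (Abar - 1)/A * B.
    """
    # Line 1: fold C, B and the discretization factor into one coefficient per state.
    coeff = [c * b * (cmath.exp(dt * a) - 1) / a for a, b, c in zip(A, B, C)]
    # Line 2: Vandermonde-style sum over states of coeff_n * exp(dt * A_n * l).
    return [sum(k * cmath.exp(dt * a * l) for k, a in zip(coeff, A)) for l in range(L)]

# Illustrative values only: a stable diagonal with negative real parts.
A = [complex(-0.5, 3.14159 * n) for n in range(4)]
B = [1.0] * 4
C = [complex(0.3, -0.2 * n) for n in range(4)]
K = s4d_kernel(A, B, C, dt=0.1, L=8)
```

The kernel `K` is complex in general. An implementation that stores only one eigenvalue from each conjugate pair would typically keep twice the real part, but that detail is an assumption here rather than something the abstract states.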
