Title
Your diffusion model secretly knows the dimension of the data manifold
Authors
Abstract
In this work, we propose a novel framework for estimating the dimension of the data manifold using a trained diffusion model. A diffusion model approximates the score function, i.e., the gradient of the log density of a noise-corrupted version of the target distribution, for varying levels of corruption. We prove that, if the data concentrates around a manifold embedded in the high-dimensional ambient space, then as the level of corruption decreases, the score function points towards the manifold, as this direction becomes the direction of maximal likelihood increase. Therefore, for small levels of corruption, the diffusion model provides us with access to an approximation of the normal bundle of the data manifold. This allows us to estimate the dimension of the tangent space and thus the intrinsic dimension of the data manifold. To the best of our knowledge, our method is the first estimator of the data manifold dimension based on diffusion models, and it outperforms well-established statistical estimators in controlled experiments on both Euclidean and image data.
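The idea in the abstract can be illustrated with a small numerical sketch. It is not the estimator from the paper: the abstract does not specify one, so the SVD-plus-threshold rule below is an assumption, as are all function names and parameter values. In place of a trained diffusion model, the sketch uses the exact score of a toy distribution (standard Gaussian data on a random 3-dimensional linear subspace of a 10-dimensional space, corrupted with Gaussian noise of scale sigma). Score vectors evaluated near a point on the subspace lie almost entirely in the normal space, so the number of dominant singular values of the stacked scores approximates the normal dimension.

```python
import numpy as np


def toy_score(x, tangent_basis, sigma):
    """Exact score of a toy distribution (assumption, stands in for a trained model).

    Data: N(0, I_k) on the subspace spanned by the orthonormal columns of
    `tangent_basis`, corrupted with N(0, sigma^2 I) noise. The corrupted
    density is Gaussian with variance 1 + sigma^2 along the subspace and
    sigma^2 across it, so the score is linear in x.
    """
    d = x.shape[-1]
    proj_tangent = tangent_basis @ tangent_basis.T
    proj_normal = np.eye(d) - proj_tangent
    return -(x @ proj_tangent) / (1.0 + sigma**2) - (x @ proj_normal) / sigma**2


def estimate_manifold_dim(score_fn, x0, sigma, n_samples=200,
                          rel_threshold=0.1, seed=0):
    """Hypothetical estimator: ambient dim minus the number of dominant score directions."""
    rng = np.random.default_rng(seed)
    d = x0.shape[0]
    # Perturb x0 with noise at the same small corruption level.
    x = x0 + sigma * rng.standard_normal((n_samples, d))
    scores = score_fn(x)  # shape (n_samples, d)
    singular_values = np.linalg.svd(scores, compute_uv=False)
    # Heuristic (assumption): directions whose singular value is within a
    # factor rel_threshold of the largest are counted as normal directions.
    normal_dim = int(np.sum(singular_values > rel_threshold * singular_values[0]))
    return d - normal_dim, singular_values


# Toy setup: a random 3-dimensional subspace of R^10.
rng = np.random.default_rng(42)
ambient_dim, manifold_dim, sigma = 10, 3, 0.01
q, _ = np.linalg.qr(rng.standard_normal((ambient_dim, ambient_dim)))
tangent_basis = q[:, :manifold_dim]
x0 = tangent_basis @ rng.standard_normal(manifold_dim)  # a point on the subspace

est, sv = estimate_manifold_dim(
    lambda x: toy_score(x, tangent_basis, sigma), x0, sigma
)
print("estimated manifold dimension:", est)
```

With a real diffusion model, `score_fn` would be replaced by the network's score prediction at a small noise level, and the fixed relative threshold would likely need to be replaced by a more careful rule for locating the drop in the singular value spectrum.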