Reliable Measures of Spread in High Dimensional Latent Spaces

Paper Title

Reliable Measures of Spread in High Dimensional Latent Spaces

Authors

Marbut, Anna C., McKinney-Bock, Katy, Wheeler, Travis J.

Abstract

Understanding geometric properties of natural language processing models' latent spaces allows the manipulation of these properties for improved performance on downstream tasks. One such property is the amount of data spread in a model's latent space, or how fully the available latent space is being used. In this work, we define data spread and demonstrate that the commonly used measures of data spread, Average Cosine Similarity and a partition function min/max ratio I(V), do not provide reliable metrics to compare the use of latent space across models. We propose and examine eight alternative measures of data spread, all but one of which improve over these current metrics when applied to seven synthetic data distributions. Of our proposed measures, we recommend one principal component-based measure and one entropy-based measure that provide reliable, relative measures of spread and can be used to compare models of different sizes and dimensionalities.
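As a rough illustration of one of the existing measures named in the abstract, Average Cosine Similarity can be sketched as the mean cosine similarity over all distinct pairs of embedding vectors. The function name `average_cosine_similarity`, the use of NumPy, and the choice to average over pairs i < j are assumptions made for this sketch; the paper's exact definition may differ.

```python
import numpy as np


def average_cosine_similarity(points):
    """Mean cosine similarity over all distinct pairs of rows in `points`.

    Hypothetical helper for illustration only; assumes at least two rows
    and no all-zero vectors.
    """
    x = np.asarray(points, dtype=float)
    # Scale each row to unit length so dot products equal cosine similarities.
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = unit @ unit.T
    # Keep only pairs i < j, excluding self-similarity on the diagonal.
    rows, cols = np.triu_indices(x.shape[0], k=1)
    return float(sims[rows, cols].mean())
```

Under this definition, a value near 1 means the vectors point in nearly the same direction, while a value near 0 is commonly read as a sign of well-spread data. The abstract argues that this reading is not a reliable basis for comparing latent space use across models.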
