Fast Data Driven Estimation of Cluster Number in Multiplex Images using Embedded Density Outliers

Paper title

Fast Data Driven Estimation of Cluster Number in Multiplex Images using Embedded Density Outliers

Author

Thomas, Spencer A.

Abstract

The usage of chemical imaging technologies is becoming a routine accompaniment to traditional methods in pathology. Significant technological advances have developed these next generation techniques to provide rich, spatially resolved, multidimensional chemical images. The rise of digital pathology has significantly enhanced the synergy of these imaging modalities with optical microscopy and immunohistochemistry, enhancing our understanding of the biological mechanisms and progression of diseases. Techniques such as imaging mass cytometry provide labelled multidimensional (multiplex) images of specific components used in conjunction with digital pathology techniques. These powerful techniques generate a wealth of high dimensional data that create significant challenges in data analysis. Unsupervised methods such as clustering are an attractive way to analyse these data, however, they require the selection of parameters such as the number of clusters. Here we propose a methodology to estimate the number of clusters in an automatic data-driven manner using a deep sparse autoencoder to embed the data into a lower dimensional space. We compute the density of regions in the embedded space, the majority of which are empty, enabling the high density regions to be detected as outliers and provide an estimate for the number of clusters. This framework provides a fully unsupervised and data-driven method to analyse multidimensional data. In this work we demonstrate our method using 45 multiplex imaging mass cytometry datasets. Moreover, our model is trained using only one of the datasets and the learned embedding is applied to the remaining 44 images providing an efficient process for data analysis. Finally, we demonstrate the high computational efficiency of our method which is two orders of magnitude faster than estimating via computing the sum squared distances as a function of cluster number.
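
The density-outlier idea in the abstract can be illustrated with a small sketch. It is not the author's implementation: the abstract does not specify the density estimator or the outlier test, so the fixed grid, the mean plus three standard deviations rule, and the merging of adjacent dense cells are all assumptions made here. The synthetic 2D blobs stand in for the output of the deep sparse autoencoder, and the names `make_blobs` and `estimate_cluster_number` are hypothetical.

```python
import random
from collections import Counter, deque
from statistics import mean, pstdev


def make_blobs(centers, n_per_blob=500, spread=0.3, seed=0):
    """Synthetic 2D points; a stand-in for an autoencoder embedding (assumption)."""
    rng = random.Random(seed)
    return [(rng.gauss(cx, spread), rng.gauss(cy, spread))
            for cx, cy in centers for _ in range(n_per_blob)]


def estimate_cluster_number(points, bins=20, z_thresh=3.0):
    """Count groups of grid cells whose point density is an outlier.

    The grid size and the mean + z_thresh * std rule are illustrative
    choices, not taken from the paper.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)

    def cell(value, lo, hi):
        # Map a coordinate to a bin index in [0, bins - 1].
        if hi == lo:
            return 0
        return min(int((value - lo) / (hi - lo) * bins), bins - 1)

    counts = Counter((cell(x, x_min, x_max), cell(y, y_min, y_max))
                     for x, y in points)

    # Densities of every cell, including the (mostly) empty ones.
    all_counts = [counts.get((i, j), 0)
                  for i in range(bins) for j in range(bins)]
    mu, sigma = mean(all_counts), pstdev(all_counts)

    # Cells far denser than the typical (empty) cell are flagged as outliers.
    dense = {c for c, n in counts.items() if n > mu + z_thresh * sigma}

    # Merge touching dense cells (8-connectivity) so one blob counts once.
    seen, groups = set(), 0
    for start in dense:
        if start in seen:
            continue
        groups += 1
        seen.add(start)
        queue = deque([start])
        while queue:
            i, j = queue.popleft()
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    nb = (i + di, j + dj)
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        queue.append(nb)
    return groups


points = make_blobs([(0.0, 0.0), (10.0, 0.0), (5.0, 10.0)])
print(estimate_cluster_number(points))
```

Because most grid cells are empty, the mean and standard deviation of the cell counts stay small, so the few occupied cells stand out clearly. This single pass over the data avoids rerunning a clustering algorithm for every candidate cluster number, which is the cost incurred by the sum-of-squared-distances approach the abstract compares against.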
