多个假设测试以估计稀疏随机块模型中的社区数量

论文标题

多个假设测试以估计稀疏随机块模型中的社区数量

Multiple Hypothesis Testing To Estimate The Number of Communities in Sparse Stochastic Block Models

论文作者

Jha, Chetkar, Li, Mingyao, Barnett, Ian

论文摘要

基于网络的聚类方法通常要求指定\ emph {a先验}的社区数量。此外，大多数现有的估计社区数量的现有方法假设社区数量是固定的，而不是随着网络尺寸$ n $而扩展的。假设社区数量随着网络尺寸$ n $增加的几种方法仅在网络的平均度$ d $增长至少在$ O（n）$（即密集的情况下）或位于狭窄范围内的速度时才有效。这在聚集大规模网络数据方面提出了一个挑战，尤其是当网络的平均度$ d $增长速度慢于$ O（n）$（即稀疏案例）时。为了解决这个问题，我们提出了一个新的顺序程序，利用多个假设检验和ErdösRényi图的光谱特性来估计稀疏随机块模型（SBMS）中的社区数量。我们证明了对于稀疏参数的稀疏SBM方法的一致性。结果，我们发现我们的方法可以估计$ k^{（n）} _ {\ star} $，$ k^{（n）} _ {\ star} $以$ o（n^{（1-3γ）/（4-3γ）/（4-3γ）} $，$ o（n^{（4-3γ）} $，$ d = o（n）$ d = o（n^= o）的速率增加的速率增加。此外，我们证明我们的方法可以作为估计二进制树随机块模型中社区数量的停止规则。我们在六个参考单细胞RNA测序数据集上对其他竞争方法进行基准测试方法。最后，我们通过数值模拟来证明我们的方法的有用性，并通过将其用于聚类实际单细胞RNA序列数据集。

Network-based clustering methods frequently require the number of communities to be specified \emph{a priori}. Moreover, most of the existing methods for estimating the number of communities assume the number of communities to be fixed and not scale with the network size $n$. The few methods that assume the number of communities to increase with the network size $n$ are only valid when the average degree $d$ of a network grows at least as fast as $O(n)$ (i.e., the dense case) or lies within a narrow range. This presents a challenge in clustering large-scale network data, particularly when the average degree $d$ of a network grows slower than the rate of $O(n)$ (i.e., the sparse case). To address this problem, we proposed a new sequential procedure utilizing multiple hypothesis tests and the spectral properties of Erdös Rényi graphs for estimating the number of communities in sparse stochastic block models (SBMs). We prove the consistency of our method for sparse SBMs for a broad range of the sparsity parameter. As a consequence, we discover that our method can estimate the number of communities $K^{(n)}_{\star}$ with $K^{(n)}_{\star}$ increasing at the rate as high as $O(n^{(1 - 3γ)/(4 - 3γ)})$, where $d = O(n^{1 - γ})$. Moreover, we show that our method can be adapted as a stopping rule in estimating the number of communities in binary tree stochastic block models. We benchmark the performance of our method against other competing methods on six reference single-cell RNA sequencing datasets. Finally, we demonstrate the usefulness of our method through numerical simulations and by using it for clustering real single-cell RNA-sequencing datasets.

下载PDF全文

下载文献需遵守相关版权规定

论文标题