Paper title
Scaling pattern mining through non-overlapping variable partitioning
Paper authors
Paper abstract
Biclustering algorithms play a central role in the biotechnological and biomedical domains. The patterns they discover support the identification of putative regulatory modules, which is essential to understanding diseases, aiding therapy research, and advancing biological knowledge. However, given the NP-hard nature of the biclustering task, algorithms with optimality guarantees tend to scale poorly on high-dimensional data. To address this, we propose a pipeline for clustering-based vertical partitioning that accounts for both parallelization and cross-partition pattern merging needs. Given a specific type of pattern coherence, clusters are built based on the likelihood that variables form patterns of that type. The patterns extracted from each cluster are then merged into a final set of closed patterns. The approach is evaluated on five published datasets. Results show that, for some of the tested datasets, execution times improve by a statistically significant margin when variables are clustered based on their likelihood of forming specific types of patterns, as opposed to partitions based on dissimilarity or randomness. This work offers a first step toward understanding the efficiency impact of vertical partitioning criteria along the different stages of pattern mining and biclustering algorithms. Availability: All the code is freely available at https://github.com/JupitersMight/pattern_merge under the MIT license.
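The partition, mine, and merge idea in the abstract can be illustrated with a toy sketch. It makes several assumptions that are not taken from the paper or the pattern_merge repository: the data are binary (each variable is stored as the set of row ids where it is active), closed patterns are enumerated by brute force, and the two partitions are hand-picked rather than produced by likelihood-based clustering. The function names `mine_closed`, `merge_closed`, and `partitioned_mining` are hypothetical. The merge step relies on the fact that, for non-overlapping column partitions, every closed row set of the full table is an intersection of closed row sets from the individual partitions.

```python
def mine_closed(columns, n_rows):
    """Brute-force closed patterns of a vertical table (column -> set of row ids).

    Returns a dict mapping each closed row set to the set of all columns
    shared by those rows. Illustrative only; not the authors' miner.
    """
    extents = {frozenset(range(n_rows))}
    # Closed row sets are exactly the intersections of column row sets.
    for rows in columns.values():
        extents |= {e & rows for e in extents}
    return {e: frozenset(c for c, r in columns.items() if e <= r) for e in extents}


def merge_closed(per_partition):
    """Merge closed patterns mined on non-overlapping column partitions."""
    # Candidate row sets: one closed row set per partition, intersected.
    extents = set(per_partition[0])
    for result in per_partition[1:]:
        extents = {a & b for a in extents for b in result}
    merged = {}
    for e in extents:
        cols = set()
        for result in per_partition:
            # Columns of every partition pattern whose rows cover e also hold on e.
            for ext, intent in result.items():
                if e <= ext:
                    cols |= intent
        merged[e] = frozenset(cols)
    return merged


def partitioned_mining(columns, n_rows, partitions):
    """Mine each column partition independently, then merge the results."""
    per_partition = [
        mine_closed({c: columns[c] for c in part}, n_rows) for part in partitions
    ]
    return merge_closed(per_partition)


# Toy data: four variables over four observations (hypothetical values).
columns = {"g1": {0, 1, 2}, "g2": {0, 1}, "g3": {1, 2, 3}, "g4": {0, 1, 3}}
partitions = [["g1", "g2"], ["g3", "g4"]]  # hand-picked, not clustered

merged = partitioned_mining(columns, 4, partitions)
direct = mine_closed(columns, 4)
```

In this sketch the per-partition calls to `mine_closed` are independent, so they could run in parallel. How well the partitions are chosen affects only the cost of mining and merging, not the final set of closed patterns, which matches what mining the full table directly would return.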