Paper title
Benchmarking database performance for genomic data
Paper authors
Paper abstract
Genomic regions represent features such as gene annotations, transcription factor binding sites and epigenetic modifications. Performing genomic operations, such as identifying overlapping or non-overlapping regions or finding the nearest gene annotations, is a common research need. The data can be stored in a database system for easy management; however, at present there is no comprehensive built-in database algorithm for identifying overlapping regions. I have therefore developed a region-mapping (RegMap) SQL-based algorithm to perform genomic operations and have benchmarked the performance of different databases. Benchmarking showed that PostgreSQL extracts overlapping regions much faster than MySQL. Insertion and data uploads also performed better in PostgreSQL, although the general search capability of the two databases was almost equivalent. In addition, the algorithm was applied pairwise to more than 1000 datasets of transcription factor binding sites and histone marks collected from previous publications. The resulting overlaps were reported, and HNF4G was found to co-locate significantly with the cohesin subunit STAG1 (SA1).
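To make the overlap operation concrete, the following is a minimal sketch of a plain SQL interval join. It is not the RegMap algorithm, whose details the abstract does not give. It uses SQLite through Python's standard `sqlite3` module purely so that the example is self-contained, whereas the benchmark itself compared PostgreSQL and MySQL. The table names (`a`, `b`), column names (`start_pos`, `end_pos`), the function `find_overlaps` and the toy regions are all assumptions made for illustration, as is the choice of half-open coordinates.

```python
import sqlite3


def find_overlaps(regions_a, regions_b):
    """Return (name_a, name_b) pairs of overlapping regions.

    Each region is a (chrom, start_pos, end_pos, name) tuple with
    half-open coordinates [start_pos, end_pos). This is a naive join
    used only for illustration, not the RegMap algorithm.
    """
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    # Hypothetical schema: one table per dataset being compared.
    for table in ("a", "b"):
        cur.execute(
            f"CREATE TABLE {table} "
            "(chrom TEXT, start_pos INTEGER, end_pos INTEGER, name TEXT)"
        )
    cur.executemany("INSERT INTO a VALUES (?, ?, ?, ?)", regions_a)
    cur.executemany("INSERT INTO b VALUES (?, ?, ?, ?)", regions_b)
    # Two intervals on the same chromosome overlap when each one
    # starts before the other one ends.
    cur.execute(
        """
        SELECT a.name, b.name
        FROM a
        JOIN b
          ON a.chrom = b.chrom
         AND a.start_pos < b.end_pos
         AND b.start_pos < a.end_pos
        ORDER BY a.name, b.name
        """
    )
    rows = cur.fetchall()
    conn.close()
    return rows


# Toy regions, invented for this example.
set_a = [("chr1", 100, 200, "a1"), ("chr1", 500, 600, "a2"), ("chr2", 100, 200, "a3")]
set_b = [("chr1", 150, 250, "b1"), ("chr1", 600, 700, "b2"), ("chr2", 50, 120, "b3")]

print(find_overlaps(set_a, set_b))
```

With half-open coordinates, regions that merely touch (here `a2` and `b2`) are not counted as overlapping. A join of this kind compares every pair of regions on the same chromosome unless the database can use a suitable index, which is one reason overlap extraction is a useful operation to benchmark.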