Paper Title
Robust and Scalable Content-and-Structure Indexing (Extended Version)
Authors
Abstract
A frequent type of query on semi-structured hierarchical data is the Content-and-Structure (CAS) query, which filters data items based on their location in the hierarchical structure and their value for some attribute. We propose the Robust and Scalable Content-and-Structure (RSCAS) index to efficiently answer CAS queries on big semi-structured data. To obtain an index that is robust against queries with varying selectivities, we introduce a novel dynamic interleaving that merges the path and value dimensions of composite keys in a balanced manner. We store the interleaved keys in our trie-based RSCAS index, which efficiently supports a wide range of CAS queries, including queries with wildcards and descendant axes. We implement RSCAS as a log-structured merge (LSM) tree to scale it to data-intensive applications with a high insertion rate. We illustrate RSCAS's robustness and scalability by indexing data from the Software Heritage (SWH) archive, which is the world's largest publicly available source code archive.
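
To make the notion of a CAS query concrete, the following minimal sketch evaluates one by a naive linear scan. It is not the RSCAS index and does not use dynamic interleaving, a trie, or an LSM tree; it only illustrates what such a query selects. The pattern syntax (`*` as a wildcard within one path label, `//` as the descendant axis), the record layout, and the function names are assumptions made for illustration.

```python
import re
from typing import Iterable, List, Tuple

# Hypothetical record layout: (path in the hierarchy, attribute value).
Record = Tuple[str, int]


def path_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a path pattern into a compiled regular expression.

    Assumed syntax (illustrative only):
      '*'  matches any characters within a single path label
      '//' is the descendant axis: zero or more intermediate labels
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("//", i):
            # One separator, then any number of complete "label/" steps.
            parts.append("/(?:[^/]+/)*")
            i += 2
        elif pattern[i] == "*":
            # Wildcard that never crosses a path separator.
            parts.append("[^/]*")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def cas_query(records: Iterable[Record], pattern: str,
              lo: int, hi: int) -> List[Record]:
    """Return records whose path matches `pattern` (structure predicate)
    and whose value lies in the closed range [lo, hi] (content predicate)."""
    rx = path_pattern_to_regex(pattern)
    return [(p, v) for p, v in records if lo <= v <= hi and rx.fullmatch(p)]


if __name__ == "__main__":
    # Made-up sample data, loosely modelled on file paths in a source archive.
    sample = [
        ("/src/lib/index.c", 10),
        ("/src/lib/util/trie.c", 25),
        ("/src/main.py", 15),
        ("/doc/readme.md", 12),
        ("/src/lib/lsm.c", 99),
    ]
    # All .c files anywhere below /src whose value is between 0 and 50.
    for path, value in cas_query(sample, "/src//*.c", 0, 50):
        print(path, value)
```

A scan like this inspects every record regardless of how selective either predicate is. The abstract's point is that an index over keys combining both dimensions can avoid that cost for queries of varying selectivity.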