论文标题
可扩展的数据系列子序列与Ulisse匹配
Scalable Data Series Subsequence Matching with ULISSE
论文作者
论文摘要
数据系列相似性搜索是一个重要的操作,也是与数据系列集合有关的几个分析任务和应用程序的核心。尽管数据系列索引实现了快速相似性搜索,但所有现有索引只能回答单个长度的查询(固定在索引构造时间),这是一个严重的限制。在这项工作中,我们提出了Ulisse,这是第一个数据系列索引结构,旨在回答可变长度的相似性搜索查询(在某些范围内)。我们的贡献是两个方面。首先,我们介绍了一种新颖的表示技术,该技术有效而简洁地总结了不同长度的多个序列。基于提出的索引,我们描述了有效的算法,用于近似和精确的相似性搜索,结合了基于磁盘的索引访问和内存顺序扫描。我们的方法支持非Z级别化和Z级别化的序列,并且可以在欧几里得距离和动态时间扭曲的情况下无需使用,以回答K-NN和Epsilon-range查询。我们使用几个合成和真实数据集对我们的方法进行实验评估。结果表明,与竞争方法相比,乌里斯(Ulisse)是几次,并且在时空成本和时间成本方面效率更高。 (发表在VLDBJ 2020中的论文)
Data series similarity search is an important operation and at the core of several analysis tasks and applications related to data series collections. Despite the fact that data series indexes enable fast similarity search, all existing indexes can only answer queries of a single length (fixed at index construction time), which is a severe limitation. In this work, we propose ULISSE, the first data series index structure designed for answering similarity search queries of variable length (within some range). Our contribution is two-fold. First, we introduce a novel representation technique, which effectively and succinctly summarizes multiple sequences of different length. Based on the proposed index, we describe efficient algorithms for approximate and exact similarity search, combining disk based index visits and in-memory sequential scans. Our approach supports non Z-normalized and Z-normalized sequences, and can be used with no changes with both Euclidean Distance and Dynamic Time Warping, for answering both k-NN and epsilon-range queries. We experimentally evaluate our approach using several synthetic and real datasets. The results show that ULISSE is several times, and up to orders of magnitude more efficient in terms of both space and time cost, when compared to competing approaches. (Paper published in VLDBJ 2020)