Paper Title
Convolutional Embedding for Edit Distance
Paper Authors
Paper Abstract
Edit-distance-based string similarity search has many applications, such as spell correction, data de-duplication, and sequence alignment. However, computing edit distance is known to have high complexity, which makes string similarity search challenging for large datasets. In this paper, we propose a deep learning pipeline (called CNN-ED) that embeds edit distance into Euclidean distance for fast approximate similarity search. A convolutional neural network (CNN) is used to generate fixed-length vector embeddings for a dataset of strings, and the loss function is a combination of the triplet loss and the approximation error. To justify our choice of a CNN over other structures (e.g., RNN) as the model, we conduct a theoretical analysis showing that some basic operations in our CNN model preserve edit distance. Experimental results show that CNN-ED outperforms the data-independent CGK embedding and the RNN-based GRU embedding by a large margin in terms of both accuracy and efficiency. We also show that string similarity search can be significantly accelerated using CNN-based embeddings, sometimes by orders of magnitude.
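The loss described in the abstract, a combination of a triplet loss on embedding distances and an approximation error between embedding distance and true edit distance, can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the function names, the `margin` and `alpha` weighting parameters, and the use of an absolute-difference approximation error are all assumptions.

```python
import numpy as np

def euclidean(x, y):
    """Euclidean distance between two embedding vectors."""
    return float(np.linalg.norm(x - y))

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Hinge loss: push the anchor-negative distance to exceed the
    # anchor-positive distance by at least `margin` (hypothetical value).
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

def approximation_error(x, y, edit_dist):
    # Penalize deviation of the embedding distance from the true edit
    # distance, so Euclidean distance approximates edit distance directly.
    return abs(euclidean(x, y) - edit_dist)

def combined_loss(anchor, positive, negative, d_ap, d_an, margin=1.0, alpha=0.5):
    # Weighted sum of the two terms; `alpha` is a hypothetical weight,
    # and the paper's actual combination may differ.
    t = triplet_loss(anchor, positive, negative, margin)
    a = approximation_error(anchor, positive, d_ap) + \
        approximation_error(anchor, negative, d_an)
    return t + alpha * a

# Usage: toy 2-D "embeddings" whose Euclidean distances already match the
# given edit distances (3 and 5), so both loss terms vanish.
a = np.array([0.0, 0.0])
p = np.array([3.0, 0.0])
n = np.array([0.0, 5.0])
print(combined_loss(a, p, n, d_ap=3.0, d_an=5.0))  # 0.0
```

In practice the embeddings would be the CNN's outputs and the edit distances would be precomputed for the training pairs; the structure of the objective stays the same.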