Paper Title
SentPWNet: A Unified Sentence Pair Weighting Network for Task-specific Sentence Embedding
Paper Authors
Paper Abstract
Pair-based metric learning has been widely adopted to learn sentence embeddings in many NLP tasks, such as semantic text similarity, due to its computational efficiency. Most existing works employ a sequence encoder model and use a limited set of sentence pairs with a pair-based loss to learn discriminative sentence representations. However, it is known that sentence representations can be biased when the sampled sentence pairs deviate from the true distribution of all sentence pairs. In this paper, our theoretical analysis shows that existing works suffer severely from the lack of a good pair sampling and instance weighting strategy. Instead of one-time pair selection and learning on equally weighted pairs, we propose a unified locality weighting and learning framework to learn task-specific sentence embeddings. Our model, SentPWNet, exploits the neighboring spatial distribution of each sentence as a locality weight to indicate the informativeness of each sentence pair. These weights are updated along with the pair-loss optimization in each round, ensuring that the model keeps learning the most informative sentence pairs. Extensive experiments on four publicly available datasets and a self-collected place search benchmark with 1.4 million places clearly demonstrate that our model consistently outperforms existing sentence embedding methods with comparable efficiency.
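The abstract's core idea — weighting each sentence pair by the local neighborhood density of its sentences, and recomputing those weights as training proceeds — can be sketched as below. This is a minimal illustrative heuristic, not the paper's actual implementation: the weighting function `locality_weights` (inverse mean k-nearest-neighbor distance) and the contrastive form of `weighted_pair_loss` are assumptions for demonstration.

```python
import numpy as np

def locality_weights(embeddings, k=2):
    """Assign each sentence a locality weight from its neighborhood density.

    Sentences in dense, hard-to-separate regions receive larger weights,
    so pairs involving them count as more informative. This inverse
    mean-kNN-distance heuristic is illustrative; the paper's exact
    weighting scheme may differ.
    """
    # Pairwise Euclidean distances between all sentence embeddings.
    dists = np.linalg.norm(embeddings[:, None, :] - embeddings[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)               # ignore self-distance
    knn = np.sort(dists, axis=1)[:, :k]           # k nearest-neighbor distances
    return 1.0 / (knn.mean(axis=1) + 1e-8)        # denser locality -> larger weight

def weighted_pair_loss(embeddings, pairs, labels, weights, margin=1.0):
    """Contrastive pair loss where each pair is scaled by the product of
    its two sentences' locality weights (hypothetical combination rule)."""
    total = 0.0
    for (i, j), y in zip(pairs, labels):
        d = np.linalg.norm(embeddings[i] - embeddings[j])
        w = weights[i] * weights[j]
        # Pull positives together, push negatives beyond the margin.
        total += w * (d ** 2 if y == 1 else max(0.0, margin - d) ** 2)
    return total / len(pairs)
```

In a training loop, `locality_weights` would be recomputed from the current encoder outputs each round, so the loss keeps emphasizing whichever pairs are currently most informative.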