Paper Title
Text Embeddings by Weakly-Supervised Contrastive Pre-training
Paper Authors
Paper Abstract
This paper presents E5, a family of state-of-the-art text embeddings that transfer well to a wide range of tasks. The model is trained in a contrastive manner with weak supervision signals from our curated large-scale text pair dataset (called CCPairs). E5 can be readily used as a general-purpose embedding model for any task requiring a single-vector representation of texts, such as retrieval, clustering, and classification, achieving strong performance in both zero-shot and fine-tuned settings. We conduct extensive evaluations on 56 datasets from the BEIR and MTEB benchmarks. For zero-shot settings, E5 is the first model that outperforms the strong BM25 baseline on the BEIR retrieval benchmark without using any labeled data. When fine-tuned, E5 obtains the best results on the MTEB benchmark, beating existing embedding models with 40x more parameters.
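The "contrastive manner" mentioned in the abstract typically denotes an InfoNCE-style objective with in-batch negatives: each text pair from a dataset like CCPairs is a positive, and the other passages in the batch act as negatives. The PyTorch sketch below illustrates this objective under stated assumptions; it is not the authors' released code, and the function name info_nce_loss and the temperature value 0.05 are illustrative choices, not values taken from the paper.

import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor,
                  passage_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    # query_emb, passage_emb: (batch, dim) single-vector text representations
    # produced by the embedding model for the two sides of each text pair.
    q = F.normalize(query_emb, dim=-1)   # unit vectors, so dot product = cosine similarity
    p = F.normalize(passage_emb, dim=-1)
    logits = q @ p.T / temperature       # (batch, batch) similarity matrix
    # The positive for query i is passage i, so the targets lie on the
    # diagonal; every other passage in the batch is an in-batch negative.
    labels = torch.arange(q.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

Because the softmax inside the cross-entropy contrasts each query against every passage in the batch, the number of negatives grows with batch size, which is one reason this style of weakly supervised contrastive pre-training benefits from large training batches.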