Paper Title

Language-agnostic BERT Sentence Embedding

Paper Authors

Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, Wei Wang

Paper Abstract

While BERT is an effective method for learning monolingual sentence embeddings for semantic similarity and embedding based transfer learning (Reimers and Gurevych, 2019), BERT based cross-lingual sentence embeddings have yet to be explored. We systematically investigate methods for learning multilingual sentence embeddings by combining the best methods for learning monolingual and cross-lingual representations including: masked language modeling (MLM), translation language modeling (TLM) (Conneau and Lample, 2019), dual encoder translation ranking (Guo et al., 2018), and additive margin softmax (Yang et al., 2019a). We show that introducing a pre-trained multilingual language model dramatically reduces the amount of parallel training data required to achieve good performance by 80%. Composing the best of these methods produces a model that achieves 83.7% bi-text retrieval accuracy over 112 languages on Tatoeba, well above the 65.5% achieved by Artetxe and Schwenk (2019b), while still performing competitively on monolingual transfer learning benchmarks (Conneau and Kiela, 2018). Parallel data mined from CommonCrawl using our best model is shown to train competitive NMT models for en-zh and en-de. We publicly release our best multilingual sentence embedding model for 109+ languages at https://tfhub.dev/google/LaBSE.
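
The core training signal named in the abstract is dual-encoder translation ranking with additive margin softmax: each source sentence must score its true translation above every other target sentence in the batch, with the positive score discounted by a fixed margin. Below is a minimal PyTorch sketch of such a loss under those assumptions; it is not the authors' implementation, and the `margin`, `scale`, and bidirectional-averaging choices are illustrative.

```python
import torch
import torch.nn.functional as F

def additive_margin_ranking_loss(src_emb, tgt_emb, margin=0.3, scale=10.0):
    """In-batch dual-encoder translation ranking loss with an additive margin.

    src_emb, tgt_emb: (batch, dim) L2-normalized embeddings of aligned
    source/target sentences; non-matching pairs in the batch act as negatives.
    The margin and scale values here are illustrative, not from the paper.
    """
    # Cosine similarity matrix: entry (i, j) scores source i against target j.
    sim = src_emb @ tgt_emb.t()
    # Subtract the margin from the positive (diagonal) scores only, so a true
    # translation pair must beat every in-batch negative by at least `margin`.
    sim = sim - margin * torch.eye(sim.size(0), device=sim.device)
    logits = sim * scale
    labels = torch.arange(sim.size(0), device=sim.device)
    # Rank in both directions (source->target and target->source) and average.
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2
```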

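The abstract also points to the public model release at https://tfhub.dev/google/LaBSE. As a quick way to observe the cross-lingual retrieval behavior it describes, the sketch below loads the community sentence-transformers port of LaBSE (an assumption; the paper itself only references the TF Hub release) and compares embeddings of mutual translations.

```python
# Sketch using the community sentence-transformers port of LaBSE; the paper
# itself only references the TF Hub release at https://tfhub.dev/google/LaBSE.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")
sentences = [
    "Dogs are playing in the park.",  # English
    "Hunde spielen im Park.",         # German translation
    "公园里有狗在玩耍。",               # Chinese translation
]
# L2-normalized embeddings, so dot products equal cosine similarities.
embeddings = model.encode(sentences, normalize_embeddings=True)
print(embeddings @ embeddings.T)  # translation pairs should score near 1.0
```

Because the embeddings share a single cross-lingual space, the same nearest-neighbor comparison scales up to the CommonCrawl bi-text mining described in the abstract.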