Paper Title

Pre-training technique to localize medical BERT and enhance biomedical BERT

Authors

Shoya Wada, Toshihiro Takeda, Shiro Manabe, Shozo Konishi, Jun Kamohara, Yasushi Matsumura

Abstract

Pre-training large-scale neural language models on raw texts has made a significant contribution to improving transfer learning in natural language processing (NLP). With the introduction of transformer-based language models, such as bidirectional encoder representations from transformers (BERT), the performance of information extraction from free text by NLP has significantly improved for both the general domain and the medical domain; however, it is difficult to train specific BERT models that perform well for domains in which there are few publicly available databases of high quality and large size. We hypothesized that this problem can be addressed by up-sampling a domain-specific corpus and using it for pre-training with a larger corpus in a balanced manner. Our proposed method consists of a single intervention with one option: simultaneous pre-training after up-sampling and amplified vocabulary. We conducted three experiments and evaluated the resulting products. We confirmed that our Japanese medical BERT outperformed conventional baselines and the other BERT models on the medical document classification task and that our English BERT pre-trained using both the general- and medical-domain corpora performed sufficiently well for practical use on the biomedical language understanding evaluation (BLUE) benchmark. Moreover, our enhanced biomedical BERT model, in which clinical notes were not used during pre-training, showed that both the clinical and biomedical scores on the BLUE benchmark were 0.3 points above those of the ablation model trained without our proposed method. Well-balanced pre-training by up-sampling instances derived from a corpus appropriate for the target task allows us to construct a high-performance BERT model.
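
The up-sampling-and-balancing idea described in the abstract can be sketched in a few lines. The Python snippet below is a minimal illustration only, not the authors' actual pipeline: the function name `upsample_domain_corpus`, the toy corpora, and the target ratio are assumptions made for the example.

```python
# Minimal sketch of balanced pre-training data construction by up-sampling
# a small domain-specific corpus against a larger general-domain corpus.
# All names and parameters here are illustrative assumptions.
import math
import random


def upsample_domain_corpus(general_sentences, domain_sentences, target_ratio=1.0, seed=0):
    """Duplicate the smaller domain corpus so that its share of the combined
    pre-training data is roughly target_ratio times the general corpus size."""
    if not domain_sentences:
        return list(general_sentences)
    # Number of copies of the domain corpus needed to reach the target size.
    copies = max(1, math.ceil(target_ratio * len(general_sentences) / len(domain_sentences)))
    combined = list(general_sentences) + list(domain_sentences) * copies
    random.Random(seed).shuffle(combined)  # interleave before pre-training
    return combined


if __name__ == "__main__":
    general = [f"general sentence {i}" for i in range(1000)]
    medical = [f"medical sentence {i}" for i in range(50)]
    mixed = upsample_domain_corpus(general, medical, target_ratio=1.0)
    print(len(mixed))  # ~2000: the medical text is up-sampled to balance the mix
```

In this sketch the domain corpus is simply repeated until it roughly matches the general corpus in size and the two are shuffled together, which is one straightforward way to realize the "balanced" mixing the abstract refers to; the paper's actual sampling and vocabulary-amplification details are described in the full text.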
