Paper Title
Localising In-Domain Adaptation of Transformer-Based Biomedical Language Models
Paper Authors
Paper Abstract
In the era of digital healthcare, the huge volumes of textual information generated every day in hospitals constitute an essential but underused asset that could be exploited with task-specific, fine-tuned biomedical language representation models, improving patient care and management. For such specialized domains, previous research has shown that fine-tuned models stemming from broad-coverage checkpoints can largely benefit from additional training rounds over large-scale in-domain resources. However, these resources are often unreachable for less-resourced languages like Italian, preventing local medical institutions from employing in-domain adaptation. To reduce this gap, our work investigates two accessible approaches to derive biomedical language models in languages other than English, taking Italian as a concrete use case: one based on neural machine translation of English resources, favoring quantity over quality; the other based on a high-grade, narrow-scoped corpus natively written in Italian, thus preferring quality over quantity. Our study shows that data quantity is a harder constraint than data quality for biomedical adaptation, but that concatenating high-quality data can improve model performance even when dealing with relatively size-limited corpora. The models published from our investigation have the potential to unlock important research opportunities for Italian hospitals and academia. Finally, the set of lessons learned from the study constitutes valuable insight towards building biomedical language models that are generalizable to other less-resourced languages and different domain settings.