Paper Title

Structured information extraction from complex scientific text with fine-tuned large language models

Paper Authors

Alexander Dunn, John Dagdelen, Nicholas Walker, Sanghoon Lee, Andrew S. Rosen, Gerbrand Ceder, Kristin Persson, Anubhav Jain

Paper Abstract

Intelligently extracting and linking complex scientific information from unstructured text is a challenging endeavor particularly for those inexperienced with natural language processing. Here, we present a simple sequence-to-sequence approach to joint named entity recognition and relation extraction for complex hierarchical information in scientific text. The approach leverages a pre-trained large language model (LLM), GPT-3, that is fine-tuned on approximately 500 pairs of prompts (inputs) and completions (outputs). Information is extracted either from single sentences or across sentences in abstracts/passages, and the output can be returned as simple English sentences or a more structured format, such as a list of JSON objects. We demonstrate that LLMs trained in this way are capable of accurately extracting useful records of complex scientific knowledge for three representative tasks in materials chemistry: linking dopants with their host materials, cataloging metal-organic frameworks, and general chemistry/phase/morphology/application information extraction. This approach represents a simple, accessible, and highly-flexible route to obtaining large databases of structured knowledge extracted from unstructured text. An online demo is available at http://www.matscholar.com/info-extraction.
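
The approach described in the abstract amounts to supplying GPT-3 with prompt (input) and completion (output) pairs in which the completion is a structured record such as a list of JSON objects. Below is a minimal sketch of how one such training pair for the dopant-host task might be assembled into a JSONL file; the instruction text, field names ("host", "dopants"), stop marker, and example sentence are illustrative assumptions, not the exact schema used in the paper.

```python
import json

# A sketch (not the authors' exact schema) of a prompt/completion training
# pair for the dopant-host linking task, written in the JSONL format commonly
# used for GPT-3 style fine-tuning: one {"prompt": ..., "completion": ...}
# object per line.
examples = [
    {
        # Prompt: the raw source sentence plus a fixed instruction and cue.
        "prompt": (
            "Extract dopant-host relationships from the text.\n\n"
            "Text: The ZnO thin films were doped with 2% Al to improve conductivity.\n\n"
            "Output:"
        ),
        # Completion: the structured record the model is trained to emit,
        # here a list of JSON objects linking each host to its dopants,
        # terminated by a stop marker.
        "completion": " " + json.dumps(
            [{"host": "ZnO", "dopants": ["Al"]}]
        ) + "\nEND",
    },
]

with open("dopant_host_train.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```

A file of roughly 500 such pairs, per the abstract, is what the model is fine-tuned on; at inference time a new sentence is formatted into the same prompt template and the model's completion is parsed back into JSON.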
