FOLIO: Natural Language Reasoning with First-Order Logic

Paper Title

FOLIO: Natural Language Reasoning with First-Order Logic

Authors

Han, Simeng; Schoelkopf, Hailey; Zhao, Yilun; Qi, Zhenting; Riddell, Martin; Zhou, Wenfei; Coady, James; Peng, David; Qiao, Yujie; Benson, Luke; Sun, Lucy; Wardle-Solano, Alex; Szabo, Hannah; Zubova, Ekaterina; Burtell, Matthew; Fan, Jonathan; Liu, Yixin; Wong, Brian; Sailor, Malcolm; Ni, Ansong; Nan, Linyong; Kasai, Jungo; Yu, Tao; Zhang, Rui; Fabbri, Alexander R.; Kryscinski, Wojciech; Yavuz, Semih; Liu, Ye; Lin, Xi Victoria; Joty, Shafiq; Zhou, Yingbo; Xiong, Caiming; Ying, Rex; Cohan, Arman; Radev, Dragomir

Abstract

Large language models (LLMs) have achieved remarkable performance on a variety of natural language understanding tasks. However, existing benchmarks are inadequate in measuring the complex logical reasoning capabilities of a model. We present FOLIO, a human-annotated, logically complex and diverse dataset for reasoning in natural language (NL), equipped with first-order logic (FOL) annotations. FOLIO consists of 1,430 examples (unique conclusions), each paired with one of 487 sets of premises used to deductively reason for the validity of each conclusion. The logical correctness of the premises and conclusions is ensured by their FOL annotations, which are automatically verified by an FOL inference engine. In addition to the main NL reasoning task, NL-FOL pairs in FOLIO constitute a new NL-FOL translation dataset. Our experiments on FOLIO systematically evaluate the FOL reasoning ability of supervised fine-tuning on medium-sized language models. For both NL reasoning and NL-FOL translation, we benchmark multiple state-of-the-art language models. Our results show that a subset of FOLIO presents a challenge for one of the most capable large language models (LLMs) publicly available, GPT-4.
