Paper Title

Commonsense knowledge adversarial dataset that challenges ELECTRA

Paper Authors

Lin, Gongqi; Miao, Yuan; Yang, Xiaoyong; Ou, Wenwu; Cui, Lizhen; Guo, Wei; Miao, Chunyan

Paper Abstract

Commonsense knowledge is critical in human reading comprehension. While machine comprehension has made significant progress in recent years, its ability to handle commonsense knowledge remains limited. Synonyms are one of the most widely used forms of commonsense knowledge. Constructing adversarial datasets is an important approach to finding the weak points of machine comprehension models and supporting the design of solutions. To investigate machine comprehension models' ability to handle commonsense knowledge, we created a Question and Answer Dataset with commonsense knowledge of Synonyms (QADS). QADS consists of questions generated from SQuAD 2.0 by applying commonsense knowledge of synonyms. The synonyms are extracted from WordNet. Because words often have multiple meanings and synonyms, we used an enhanced Lesk algorithm to perform word sense disambiguation and identify the synonyms appropriate to each context. ELECTRA achieved the state-of-the-art result on the SQuAD 2.0 dataset in 2019, and with scale it can reach performance similar to BERT's. However, QADS shows that ELECTRA has little ability to handle the commonsense knowledge of synonyms. In our experiments, ELECTRA-small achieves 70% accuracy on SQuAD 2.0 but only 20% on QADS. ELECTRA-large does not perform much better: its accuracy of 88% on SQuAD 2.0 drops significantly to 26% on QADS. In our earlier experiments, BERT also failed badly on QADS, though not as badly as ELECTRA. These results show that even top-performing NLP models have little ability to handle commonsense knowledge, which is essential in reading comprehension.
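The abstract describes extracting synonyms from WordNet and using a Lesk-based word sense disambiguation step so that only synonyms fitting the question's context are substituted. Below is a minimal sketch of that kind of pipeline, not the authors' implementation: it uses NLTK's plain `lesk()` as a stand-in for the paper's enhanced Lesk algorithm, and the `context_synonyms` helper and example question are hypothetical illustrations.

```python
# Sketch of a synonym-substitution step for building adversarial questions,
# assuming NLTK's classic Lesk in place of the paper's enhanced Lesk algorithm.
import nltk
from nltk.wsd import lesk

nltk.download("wordnet", quiet=True)  # WordNet data required by lesk()


def context_synonyms(question, target_word):
    """Return WordNet synonyms of target_word for the sense it carries in question."""
    context = question.split()            # simple whitespace tokenization of the question
    sense = lesk(context, target_word)    # word sense disambiguation (classic Lesk)
    if sense is None:
        return []
    # Lemmas of the disambiguated synset, excluding the original word itself.
    return [lemma.name().replace("_", " ")
            for lemma in sense.lemmas()
            if lemma.name().lower() != target_word.lower()]


# Build adversarial variants by swapping in context-appropriate synonyms
# (hypothetical example question, not drawn from SQuAD 2.0 or QADS).
question = "In what year did the company purchase the factory?"
for synonym in context_synonyms(question, "purchase"):
    print(question.replace("purchase", synonym))
```

Each printed variant keeps the original answer valid while changing the surface wording, which is the kind of perturbation the abstract reports ELECTRA handling poorly.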
