Paper title
Paraphrase Identification with Deep Learning: A Review of Datasets and Methods
Paper authors
Paper abstract
The rapid progress of Natural Language Processing (NLP) technologies has led to the widespread availability and effectiveness of text generation tools such as ChatGPT and Claude. While highly useful, these technologies also pose significant risks to the credibility of various media forms if they are employed for paraphrased plagiarism -- one of the most subtle forms of content misuse in scientific literature and general text media. Although automated methods for paraphrase identification have been developed, detecting this type of plagiarism remains challenging due to the inconsistent nature of the datasets used to train these methods. In this article, we examine traditional and contemporary approaches to paraphrase identification, investigating how the under-representation of certain paraphrase types in popular datasets, including those used to train Large Language Models (LLMs), affects the ability to detect plagiarism. We introduce and validate a new refined typology for paraphrases (ReParaphrased, REfined PARAPHRASE typology definitions) to better understand the disparities in paraphrase type representation. Lastly, we propose new directions for future research and dataset development to enhance AI-based paraphrase detection.