Paper Title
BERTuit: Understanding Spanish language in Twitter through a native transformer
Paper Authors
Paper Abstract
The appearance of complex attention-based language models such as BERT, RoBERTa or GPT-3 has made it possible to address highly complex tasks in a plethora of scenarios. However, when applied to specific domains, these models encounter considerable difficulties. This is the case of social networks such as Twitter, an ever-changing stream of information written in informal and complex language, where each message requires careful evaluation to be understood even by humans, given the important role that context plays. Addressing tasks in this domain through Natural Language Processing involves severe challenges. When powerful state-of-the-art multilingual language models are applied to this scenario, language-specific nuances tend to get lost in translation. To face these challenges we present \textbf{BERTuit}, the largest transformer proposed so far for the Spanish language, pre-trained on a massive dataset of 230M Spanish tweets using RoBERTa optimization. Our motivation is to provide a powerful resource to better understand Spanish Twitter and to be used in applications focused on this social network, with special emphasis on solutions devoted to tackling the spread of misinformation on this platform. BERTuit is evaluated on several tasks and compared against M-BERT, XLM-RoBERTa and XLM-T, highly competitive multilingual transformers. The utility of our approach is demonstrated with applications, in this case: a zero-shot methodology to visualize groups of hoaxes and to profile authors spreading disinformation. Misinformation spreads wildly on platforms such as Twitter in languages other than English, meaning the performance of transformers may suffer when transferred outside English-speaking communities.
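As a minimal sketch of how a RoBERTa-style Spanish-tweet transformer like the one described here could be fine-tuned or queried for a downstream task (e.g., hoax classification), the snippet below uses the Hugging Face `transformers` library. The checkpoint identifier `your-org/bertuit-base`, the number of labels, and the example tweet are assumptions for illustration, not details confirmed by the abstract.

```python
# Hypothetical sketch: loading a RoBERTa-style Spanish-tweet transformer
# for sequence classification with Hugging Face transformers.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "your-org/bertuit-base"  # hypothetical checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=2  # e.g., hoax vs. non-hoax (assumed label set)
)

# Example Spanish tweet to classify.
tweet = "Esta noticia es completamente falsa, no la compartas."
inputs = tokenizer(tweet, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    logits = model(**inputs).logits

prediction = logits.argmax(dim=-1).item()
print(f"Predicted class: {prediction}")
```

In practice, the classification head would first be fine-tuned on labeled Spanish tweets before predictions like the one above are meaningful.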