Paper Title
Approach to Predicting News -- A Precise Multi-LSTM Network With BERT
Paper Authors
Abstract
Varieties of Democracy (V-Dem) is a new approach to conceptualizing and measuring democracy and politics. With information on 200 countries, it is one of the largest databases in political science. According to the 2019 V-Dem annual democracy report, Taiwan is one of the two countries most targeted by false information disseminated by foreign governments. The report also shows that "made-up news" has caused a great deal of confusion in Taiwanese society and has had serious impacts on global stability. Although several applications help distinguish false information, we found that the pre-processing step of categorizing news is still done by human labor. However, human labor is error-prone and cannot be sustained for long periods. The growing demand for automation in recent decades shows that when a machine performs as well as humans or better, using it can ease the human burden and cut costs. Therefore, in this work, we build a predictive model to classify news by category. The corpus we used contains 28,358 news articles plus 200 additional articles scraped from the website of the online newspaper Liberty Times Net (LTN), covering 8 categories: Technology, Entertainment, Fashion, Politics, Sports, International, Finance, and Health. First, we use Bidirectional Encoder Representations from Transformers (BERT) for word embeddings, which transform each Chinese character into a (1, 768) vector. Then, we use a Long Short-Term Memory (LSTM) layer to transform the word embeddings into sentence embeddings, and add another LSTM layer to transform those into document embeddings. Each document embedding is the input to the final predictive model, which contains two Dense layers and one Activation layer: each document embedding is mapped to a vector of 8 real numbers, and the highest value indicates the predicted one of the 8 news categories, with up to 99% accuracy.
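The hierarchical pipeline described above (per-character BERT embeddings, a sentence-level LSTM, a document-level LSTM, then two Dense layers with a softmax activation over the 8 categories) can be sketched as follows. This is a minimal illustration in PyTorch, not the authors' actual implementation; the hidden sizes (256, 128, 64) and all variable names are assumptions, and the BERT embeddings are stood in for by random tensors of the stated (…, 768) shape.

```python
# Hypothetical sketch of the abstract's architecture: BERT character
# embeddings -> sentence LSTM -> document LSTM -> two Dense layers + softmax.
# Hidden sizes are illustrative assumptions, not values from the paper.
import torch
import torch.nn as nn

class NewsClassifier(nn.Module):
    def __init__(self, embed_dim=768, sent_hidden=256, doc_hidden=128, n_classes=8):
        super().__init__()
        # Sentence encoder: runs over the BERT character embeddings of one sentence.
        self.sent_lstm = nn.LSTM(embed_dim, sent_hidden, batch_first=True)
        # Document encoder: runs over the sentence embeddings of one document.
        self.doc_lstm = nn.LSTM(sent_hidden, doc_hidden, batch_first=True)
        # Two Dense layers and a softmax activation, as in the abstract.
        self.classifier = nn.Sequential(
            nn.Linear(doc_hidden, 64),
            nn.ReLU(),
            nn.Linear(64, n_classes),
            nn.Softmax(dim=-1),
        )

    def forward(self, docs):
        # docs: (batch, n_sentences, n_chars, 768) precomputed BERT embeddings
        b, s, c, d = docs.shape
        # Encode every sentence; keep the final hidden state as its embedding.
        _, (h_sent, _) = self.sent_lstm(docs.reshape(b * s, c, d))
        sent_emb = h_sent[-1].reshape(b, s, -1)
        # Encode the sentence sequence into one document embedding.
        _, (h_doc, _) = self.doc_lstm(sent_emb)
        # Map the document embedding to 8 category probabilities.
        return self.classifier(h_doc[-1])

model = NewsClassifier()
fake_docs = torch.randn(2, 5, 30, 768)  # 2 documents, 5 sentences, 30 characters
probs = model(fake_docs)                # shape (2, 8); each row sums to 1
```

The predicted category is then simply `probs.argmax(dim=-1)`, i.e. the index of the highest of the 8 real numbers, matching the abstract's description.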