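Introduction to Rare-Event Predictive Modeling for Inferential Statisticians -- A Hands-On Application in the Prediction of Breakthrough Patents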

Paper Title

Introduction to Rare-Event Predictive Modeling for Inferential Statisticians -- A Hands-On Application in the Prediction of Breakthrough Patents

Authors

Daniel Hain, Roman Jurowetzki

Abstract

Recent years have seen substantial development of quantitative methods, mostly led by the computer science community with the goal of developing better machine learning applications, mainly focused on predictive modeling. However, economic, management, and technology forecasting research has so far been hesitant to apply predictive modeling techniques and workflows. In this paper, we introduce a machine learning (ML) approach to quantitative analysis geared towards optimizing predictive performance, and contrast it with standard practices in inferential statistics, which focus on producing good parameter estimates. Against the backdrop of this apparent incompatibility of goals, we discuss the potential synergies between the two fields. We discuss fundamental concepts in predictive modeling, such as out-of-sample model validation, variable and model selection, generalization, and hyperparameter tuning procedures. We provide a hands-on introduction to predictive modeling for a quantitative social science audience while aiming to demystify computer science jargon. We use the illustrative example of patent quality estimation, which should be a familiar topic of interest in the Scientometrics community, to guide the reader through various model classes and procedures for data pre-processing, modeling, and validation. We start with more familiar, easy-to-interpret model classes (Logit and Elastic Nets), continue with less familiar non-parametric approaches (Classification Trees, Random Forest, Gradient Boosted Trees), and finally present artificial neural network architectures, first a simple feed-forward network and then a deep autoencoder geared towards rare-event prediction.
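
To make the idea of out-of-sample validation with a rare outcome concrete, the following is a minimal sketch in plain Python. It is not taken from the paper. The function names (`stratified_folds`, `out_of_sample_splits`) and the toy data (1000 observations, 2% positives) are assumptions made purely for illustration. The sketch splits the observations into folds so that every held-out fold receives its share of the rare positive cases, which a purely random split does not guarantee.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Split indices into k folds, keeping the class ratio roughly equal per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so each fold gets its share of every class.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def out_of_sample_splits(labels, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    folds = stratified_folds(labels, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Toy data (made up for illustration): 1000 observations, 2% rare positives.
labels = [1] * 20 + [0] * 980
splits = list(out_of_sample_splits(labels, k=5))

# Count the rare positives that land in each held-out fold.
positives_per_test_fold = [sum(labels[i] for i in test) for _, test in splits]
print(positives_per_test_fold)
```

A model would then be fitted on each `train` index set and scored only on the matching `test` set, so that the reported performance reflects observations the model has not seen.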
