Paper Title
Pruned Wasserstein Index Generation Model and wigpy Package
Paper Authors
Paper Abstract
The recent proposal of the Wasserstein Index Generation model (WIG) has shown a new direction for automatically generating indices. However, it is challenging in practice to fit large datasets, for two reasons. First, the Sinkhorn distance is notoriously expensive to compute and suffers severely from the curse of dimensionality. Second, it requires computing a full $N\times N$ matrix to be fit into memory, where $N$ is the dimension of the vocabulary. When the dimensionality is too large, the computation may not be feasible at all. I hereby propose a Lasso-based shrinkage method that reduces the dimensionality of the vocabulary as a pre-processing step prior to fitting the WIG model. After obtaining word embeddings from a Word2Vec model, we cluster these high-dimensional vectors by $k$-means clustering and pick the most frequent tokens within each cluster to form the "base vocabulary". Non-base tokens are then regressed on the vectors of the base tokens to obtain transformation weights, so that the whole vocabulary can be represented by the "base tokens" alone. This variant, called pruned WIG (pWIG), enables us to shrink the vocabulary dimension at will while still achieving high accuracy. I also provide the \textit{wigpy} module in Python to carry out the computation in both flavors. An application to the Economic Policy Uncertainty (EPU) index is showcased as a comparison with existing methods of generating time-series sentiment indices.
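The pruning step described above (cluster the Word2Vec embeddings with $k$-means, keep the most frequent token in each cluster as the base vocabulary, and Lasso-regress the remaining tokens on the base vectors) can be sketched as follows. This is a minimal illustration built on scikit-learn, not the actual \textit{wigpy} API; the function name prune_vocabulary and parameters such as n_clusters and alpha are illustrative assumptions.

# Minimal sketch of the pWIG pruning step (illustrative, not the wigpy API).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Lasso

def prune_vocabulary(embeddings, counts, n_clusters=100, alpha=1e-3):
    """embeddings: (N, d) Word2Vec vectors; counts: (N,) token frequencies."""
    # Cluster the word vectors into n_clusters groups.
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)

    # The most frequent token inside each cluster forms the base vocabulary.
    base_idx = np.array([
        np.where(labels == c)[0][np.argmax(counts[labels == c])]
        for c in range(n_clusters)
    ])
    non_base_idx = np.setdiff1d(np.arange(len(counts)), base_idx)

    # Regress each non-base vector on the base vectors with Lasso shrinkage,
    # so every token is expressed as a sparse combination of base tokens.
    X = embeddings[base_idx].T                      # shape (d, n_clusters)
    weights = np.zeros((len(non_base_idx), n_clusters))
    for i, idx in enumerate(non_base_idx):
        weights[i] = Lasso(alpha=alpha, fit_intercept=False).fit(
            X, embeddings[idx]).coef_
    return base_idx, non_base_idx, weights

The returned weights let document counts over the full vocabulary be folded onto the base tokens before the WIG model is fit, which is what shrinks the $N\times N$ cost matrix to a manageable size.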