Paper Title
An Algorithm for Learning Smaller Representations of Models With Scarce Data
Paper Authors
Paper Abstract
We present an algorithm for solving binary classification problems when the dataset is not fully representative of the problem being solved, and obtaining more data is not possible. It relies on a trained model with loose accuracy constraints, an iterative hyperparameter searching-and-pruning procedure over a search space $\Theta$, and a data-generating function. Our algorithm works by reconstructing, up to homology, the manifold on which the support of the underlying distribution lies. We provide an analysis of correctness and runtime complexity under ideal conditions, and an extension to deep neural networks. In the former case, if $|\Theta|$ is the number of hyperparameter sets in the search space, the algorithm returns a solution that is up to $2(1 - 2^{-|\Theta|})$ times better than simply training with an enumeration of $\Theta$ and picking the best model. As part of our analysis we also prove that an open cover of a dataset has the same homology as the manifold on which the support of the underlying probability distribution lies, if and only if the dataset is learnable. This latter result acts as a formal argument explaining the effectiveness of data-expansion techniques.
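
Since the abstract only outlines the procedure, the following minimal Python sketch illustrates one plausible shape of the iterative searching-and-pruning loop it describes. Every name and default below (search_and_prune, train_and_score, generate_data, accuracy_floor, max_rounds) is a hypothetical stand-in; the paper's actual pruning rule, accuracy constraint, and data-generating function are not specified here.

    # Illustrative sketch only, not the paper's algorithm: hypothetical
    # names and defaults throughout.

    def search_and_prune(theta_space, train_and_score, generate_data,
                         accuracy_floor=0.5, max_rounds=10):
        """Train one model per hyperparameter set, prune the sets that
        fail a loose accuracy constraint, and grow the dataset with a
        data-generating function between rounds."""
        dataset = generate_data(None)       # initial (scarce) data
        candidates = list(theta_space)
        best = None
        for _ in range(max_rounds):
            scored = [(train_and_score(theta, dataset), theta)
                      for theta in candidates]
            best = max(scored, key=lambda pair: pair[0])[1]
            # Prune: keep only sets meeting the loose constraint.
            candidates = [theta for score, theta in scored
                          if score >= accuracy_floor]
            if not candidates:
                break
            # Expand the dataset; in the abstract's framing the generated
            # points should approximate the support of the underlying
            # distribution.
            dataset = generate_data(dataset)
        return best

For intuition on the stated bound: with $|\Theta| = 3$ the factor is $2(1 - 2^{-3}) = 1.75$, and it approaches 2 as the search space grows.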