Paper Title

Information FOMO: The unhealthy fear of missing out on information. A method for removing misleading data for healthier models

Authors

Ethan Pickering, Themistoklis P. Sapsis

Abstract

Misleading or unnecessary data can have outsized impacts on the health or accuracy of Machine Learning (ML) models. We present a Bayesian sequential selection method, akin to Bayesian experimental design, that identifies critically important information within a dataset while ignoring data that is either misleading or brings unnecessary complexity to the surrogate model of choice. Our method improves sample-wise error convergence and eliminates instances where more data leads to worse performance and instabilities of the surrogate model, often termed sample-wise "double descent". We find that these instabilities are a result of the complexity of the underlying map and are linked to extreme events and heavy tails. Our approach has two key features. First, the selection algorithm dynamically couples the chosen model and data: data is chosen based on its merits towards improving the selected model, rather than being compared strictly against other data. Second, a natural convergence of the method removes the need for dividing the data into training, testing, and validation sets. Instead, the selection metric inherently assesses testing and validation error through global statistics of the model. This ensures that key information is never wasted in testing or validation. The method is applied using both Gaussian process regression and deep neural network surrogate models.
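
To make the idea of sequential selection from a fixed dataset concrete, here is a minimal sketch in Python with NumPy. It is not the authors' algorithm: the abstract does not state the selection metric, so this sketch assumes a simple stand-in (pick the pool point with the largest Gaussian process predictive variance) and an assumed stopping rule (stop once the largest remaining variance falls below `tol`). The names `rbf_kernel`, `gp_posterior`, `sequential_select`, and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.2):
    """Squared-exponential kernel between 1-D input arrays a and b."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_query, noise=1e-4):
    """Posterior mean and variance of a zero-mean, unit-variance GP."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = rbf_kernel(x_query, x_train)
    mean = Ks @ np.linalg.solve(K, y_train)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mean, np.maximum(var, 0.0)

def sequential_select(x_pool, y_pool, n_init=2, tol=1e-3, seed=0):
    """Greedily pick pool indices; the model is refit after every pick.

    Assumed acquisition: largest predictive variance (a stand-in, not
    the paper's metric). Assumed stopping rule: max variance < tol.
    """
    rng = np.random.default_rng(seed)
    chosen = [int(i) for i in rng.choice(len(x_pool), size=n_init, replace=False)]
    while len(chosen) < len(x_pool):
        _, var = gp_posterior(x_pool[chosen], y_pool[chosen], x_pool)
        var[chosen] = 0.0          # never re-select a point already in use
        best = int(np.argmax(var))
        if var[best] < tol:        # model-wide statistic signals convergence
            break
        chosen.append(best)
    return chosen

# Toy dataset (illustrative only): 200 samples of a smooth 1-D map.
x_pool = np.linspace(0.0, 1.0, 200)
y_pool = np.sin(6.0 * x_pool)
chosen = sequential_select(x_pool, y_pool)
print(f"selected {len(chosen)} of {len(x_pool)} points")
```

In this sketch the choice of the next point depends on the current model fit, and the loop ends on a statistic computed over the whole pool rather than on a held-out test split, which loosely mirrors the two features described in the abstract.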
