Paper Title

AutoSimulate: (Quickly) Learning Synthetic Data Generation

Paper Authors

Harkirat Singh Behl, Atılım Güneş Baydin, Ran Gal, Philip H. S. Torr, Vibhav Vineet

Paper Abstract

Simulation is increasingly being used for generating large labelled datasets in many machine learning problems. Recent methods have focused on adjusting simulator parameters with the goal of maximising accuracy on a validation task, usually relying on REINFORCE-like gradient estimators. However, these approaches are very expensive, as they treat the entire data generation, model training, and validation pipeline as a black box and require multiple costly objective evaluations at each iteration. We propose an efficient alternative for optimal synthetic data generation, based on a novel differentiable approximation of the objective. This allows us to optimize the simulator, which may be non-differentiable, requiring only one objective evaluation at each iteration with little overhead. We demonstrate on a state-of-the-art photorealistic renderer that the proposed method finds the optimal data distribution faster (up to $50\times$), with significantly reduced training data generation (up to $30\times$) and better accuracy ($+8.7\%$) on real-world test datasets than previous methods.
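
To make the cost argument concrete, here is a minimal sketch of a REINFORCE-like (score-function) gradient estimator of the kind the abstract contrasts against. It is an illustration only, not the paper's method or code: `black_box_objective` is a hypothetical toy quadratic standing in for the full generate-data, train, and validate pipeline, and the Gaussian distribution over simulator parameters, the mean baseline, and all hyperparameters are assumptions chosen for this sketch. The point it shows is that each update needs `num_samples` objective evaluations.

```python
import numpy as np

def black_box_objective(psi):
    """Hypothetical stand-in for the data generation / training / validation
    pipeline: returns a validation loss for simulator parameters psi."""
    target = np.array([1.0, -2.0])  # assumed optimum of the toy problem
    return float(np.sum((psi - target) ** 2))

def reinforce_gradient(mu, sigma, num_samples, rng):
    """Score-function estimate of d/dmu E[L(psi)], psi ~ N(mu, sigma^2 I)."""
    samples = mu + sigma * rng.standard_normal((num_samples, mu.size))
    # One black-box evaluation per sample: this is the expensive part.
    losses = np.array([black_box_objective(s) for s in samples])
    baseline = losses.mean()  # simple variance-reduction baseline
    # Gradient of log N(psi; mu, sigma^2 I) with respect to mu.
    scores = (samples - mu) / sigma**2
    grad = ((losses - baseline)[:, None] * scores).mean(axis=0)
    return grad, num_samples

def optimise(steps=200, num_samples=16, lr=0.05, sigma=0.5, seed=0):
    """Gradient descent on mu, counting objective evaluations."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(2)
    evaluations = 0
    for _ in range(steps):
        grad, used = reinforce_gradient(mu, sigma, num_samples, rng)
        mu = mu - lr * grad
        evaluations += used
    return mu, evaluations

mu, evaluations = optimise()
print("estimated parameters:", mu)
print("objective evaluations:", evaluations)
```

In this toy setting the loop spends 16 objective evaluations per iteration; in the setting the abstract describes, each such evaluation would involve generating data, training a model, and validating it, which is the expense the proposed single-evaluation approach aims to avoid.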
