Paper title
A universal synthetic dataset for machine learning on spectroscopic data
Paper authors
Paper abstract
To assist in the development of machine learning methods for automated classification of spectroscopic data, we have generated a universal synthetic dataset that can be used for model validation. This dataset contains artificial spectra designed to represent experimental measurements from techniques including X-ray diffraction, nuclear magnetic resonance, and Raman spectroscopy. The dataset generation process features customizable parameters, such as scan length and peak count, which can be adjusted to fit the problem at hand. As an initial benchmark, we simulated a dataset containing 35,000 spectra based on 500 unique classes. We evaluated eight different machine learning architectures for automated classification of these data. From the results, we shed light on which factors are most critical to achieving optimal performance on the classification task. The scripts used to generate the synthetic spectra, as well as our benchmark dataset and evaluation routines, are made publicly available to aid in the development of improved machine learning models for spectroscopic analysis.
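To make the idea of a parameterized generator concrete, the following is a minimal sketch and not the authors' published scripts. It assumes that each class is defined by a random set of peak positions and heights, that individual spectra are noisy Gaussian-broadened variants of that template, and that classes are balanced (35,000 / 500 would then give 70 spectra per class). All function names, default values, and noise settings here are hypothetical.

```python
import numpy as np


def make_class_template(rng, n_peaks, scan_length):
    """Draw the peak positions and heights that define one class (assumed scheme)."""
    positions = rng.uniform(0, scan_length - 1, size=n_peaks)
    heights = rng.uniform(0.1, 1.0, size=n_peaks)
    return positions, heights


def render_spectrum(rng, template, scan_length,
                    shift_std=1.0, width=3.0, noise_std=0.01):
    """Render one noisy spectrum from a class template using Gaussian peaks."""
    positions, heights = template
    x = np.arange(scan_length)
    # Perturb each peak slightly so spectra of the same class are not identical.
    shifted = positions + rng.normal(0.0, shift_std, size=positions.shape)
    scaled = heights * rng.uniform(0.8, 1.2, size=heights.shape)
    # Broadcast to shape (n_peaks, scan_length), then sum over peaks.
    peaks = scaled[:, None] * np.exp(
        -0.5 * ((x[None, :] - shifted[:, None]) / width) ** 2
    )
    return peaks.sum(axis=0) + rng.normal(0.0, noise_std, size=scan_length)


def make_dataset(n_classes=500, spectra_per_class=70,
                 scan_length=1000, n_peaks=5, seed=0):
    """Build a balanced labeled dataset; every default is illustrative only."""
    rng = np.random.default_rng(seed)
    n_total = n_classes * spectra_per_class
    X = np.empty((n_total, scan_length))
    y = np.empty(n_total, dtype=int)
    i = 0
    for label in range(n_classes):
        template = make_class_template(rng, n_peaks, scan_length)
        for _ in range(spectra_per_class):
            X[i] = render_spectrum(rng, template, scan_length)
            y[i] = label
            i += 1
    return X, y


if __name__ == "__main__":
    # A small configuration keeps memory use low for a quick check.
    X, y = make_dataset(n_classes=3, spectra_per_class=4,
                        scan_length=200, n_peaks=2, seed=1)
    print(X.shape, np.bincount(y))
```

In this sketch, `scan_length` and `n_peaks` play the role of the customizable parameters mentioned in the abstract; the peak shape, shift, and noise model are placeholders that a real generator would tailor to the measurement technique.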