Paper title
A random shuffle method to expand a narrow dataset and overcome the associated challenges in a clinical study: a heart failure cohort example
Authors
Abstract
Heart failure (HF) affects at least 26 million people worldwide, so predicting adverse events in HF patients represents a major target of clinical data science. However, achieving large sample sizes sometimes represents a challenge due to difficulties in patient recruitment and long follow-up times, which also increase the problem of missing data. To overcome the issue of a narrow dataset cardinality (in a clinical dataset, the cardinality is the number of patients in that dataset), population-enhancing algorithms are therefore crucial. The aim of this study was to design a random shuffle method that enhances the cardinality of an HF dataset while remaining statistically legitimate, without the need for specific hypotheses or regression models. The cardinality enhancement was validated against an established random repeated-measures method with regard to the correctness of predicting clinical conditions and endpoints. In particular, machine learning and regression models were employed to highlight the benefits of the enhanced datasets. The proposed random shuffle method was able to enhance the HF dataset cardinality (711 patients before dataset preprocessing) circa 10 times, and circa 21 times when followed by a random repeated-measures approach. We believe that the random shuffle method could be used in the cardiovascular field and in other data science problems where missing data and a narrow dataset cardinality are an issue.
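The abstract does not specify the shuffle algorithm itself, so the following is only a minimal toy sketch of one plausible reading: synthetic patient rows are generated by permuting each feature column independently across patients, which preserves every feature's marginal distribution while multiplying the row count. The function name `random_shuffle_augment` and all parameters are hypothetical illustrations, not the paper's actual implementation, and note that independent column shuffling breaks inter-feature correlations, which the paper's statistically legitimate method presumably handles more carefully.

```python
import random

def random_shuffle_augment(rows, n_copies=9, seed=0):
    """Hypothetical illustration of cardinality enhancement by column-wise
    shuffling: each synthetic copy permutes every feature column independently
    across patients, so per-feature marginals are preserved exactly.
    (This is NOT the paper's published algorithm, only a toy sketch.)"""
    rng = random.Random(seed)
    n_rows = len(rows)
    n_features = len(rows[0])
    augmented = [list(r) for r in rows]  # keep the original cohort
    for _ in range(n_copies):
        # Transpose to columns, shuffle each column, transpose back.
        cols = [[rows[i][j] for i in range(n_rows)] for j in range(n_features)]
        for col in cols:
            rng.shuffle(col)
        augmented.extend(
            [cols[j][i] for j in range(n_features)] for i in range(n_rows)
        )
    return augmented

# Toy cohort: 4 "patients" with 3 numeric features each.
cohort = [[1, 10, 100], [2, 20, 200], [3, 30, 300], [4, 40, 400]]
expanded = random_shuffle_augment(cohort, n_copies=9)
print(len(expanded))  # 40 rows: the original 4 plus 9 shuffled copies (~10x)
```

With `n_copies=9` the expanded dataset is roughly ten times the original cardinality, mirroring the ~10x enhancement reported in the abstract; chaining a repeated-measures step on top would grow it further.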