Paper title
Empirical Analysis of Zipf's Law, Power Law, and Lognormal Distributions in Medical Discharge Reports
Paper authors
Paper abstract
Bayesian modelling and statistical text analysis rely on informed probability priors to encourage good solutions. This paper empirically analyses whether text in medical discharge reports follows Zipf's law, a commonly assumed statistical property of language in which word frequency follows a discrete power law distribution. We examined 20,000 medical discharge reports from the MIMIC-III dataset. Methods included splitting the discharge reports into tokens, counting token frequency, fitting power law distributions to the data, and testing whether alternative distributions (lognormal, exponential, stretched exponential, and truncated power law) provided superior fits to the data. Results show that discharge reports are best fit by the truncated power law and lognormal distributions. Our findings suggest that Bayesian modelling and statistical text analysis of discharge report text would benefit from using truncated power law and lognormal probability priors.
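The first steps of the pipeline described in the abstract (tokenising, counting token frequencies, and fitting a power law exponent) can be illustrated with a minimal sketch. This is not the authors' code. The tokenizer, the function names, the toy reports, and the choice of `xmin` are all assumptions made for illustration, and the exponent uses the well-known closed-form approximation to the discrete maximum-likelihood estimate, alpha ≈ 1 + n / Σ ln(x_i / (xmin − 0.5)), which is only accurate for moderately large `xmin`. Comparing against lognormal or truncated power law alternatives would require likelihood-ratio tests that are not shown here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphanumeric runs (an assumed, simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_frequencies(reports):
    """Count how often each token occurs across all reports."""
    counts = Counter()
    for report in reports:
        counts.update(tokenize(report))
    return counts


def estimate_alpha(frequencies, xmin=1):
    """Approximate MLE of a discrete power law exponent for values >= xmin.

    Uses alpha ~= 1 + n / sum(ln(x / (xmin - 0.5))), a closed-form
    approximation that is rough for small xmin.
    """
    if xmin < 1:
        raise ValueError("xmin must be at least 1")
    tail = [x for x in frequencies if x >= xmin]
    if not tail:
        raise ValueError("no observations at or above xmin")
    log_sum = sum(math.log(x / (xmin - 0.5)) for x in tail)
    return 1.0 + len(tail) / log_sum


# Toy, made-up reports for illustration only (not MIMIC-III data).
toy_reports = [
    "Patient was discharged home. Patient stable on discharge.",
    "Patient admitted with chest pain, discharged after observation.",
]
toy_counts = token_frequencies(toy_reports)
toy_alpha = estimate_alpha(toy_counts.values(), xmin=1)
print(toy_counts.most_common(3))
print(f"estimated exponent: {toy_alpha:.2f}")
```

On a real corpus, the frequency list would come from all 20,000 reports, and `xmin` would normally be chosen by a goodness-of-fit criterion, not fixed in advance.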