Paper Title

TabText: A Flexible and Contextual Approach to Tabular Data Representation

Paper Authors

Kimberly Villalobos Carballo, Liangyuan Na, Yu Ma, Léonard Boussioux, Cynthia Zeng, Luis R. Soenksen, Dimitris Bertsimas

Abstract

Tabular data is essential for machine learning tasks across various industries. However, traditional data processing methods do not fully utilize all the information available in tables, ignoring important contextual information such as column header descriptions. In addition, pre-processing data into a tabular format can remain a labor-intensive bottleneck in model development. This work introduces TabText, a processing and feature extraction framework that extracts contextual information from tabular data structures. TabText addresses processing difficulties by converting the content into language and utilizing pre-trained large language models (LLMs). We evaluate our framework on nine healthcare prediction tasks, including patient discharge, ICU admission, and mortality. We show that 1) applying our TabText framework enables the generation of high-performing and simple machine learning baseline models with minimal data pre-processing, and 2) augmenting pre-processed tabular data with TabText representations improves the average and worst-case AUC performance of standard machine learning models by as much as 6%.
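The abstract's core idea is serializing table content into natural language, enriched with contextual metadata such as column header descriptions, before feeding it to a pre-trained LLM. A minimal sketch of that serialization step is shown below; the template string, the `row_to_text` helper, and the example column descriptions are illustrative assumptions, not the paper's exact method.

```python
import pandas as pd


def row_to_text(row: pd.Series, column_descriptions: dict) -> str:
    """Serialize one tabular row into a natural-language sentence.

    Hypothetical sketch: the abstract only states that TabText converts
    table content into language; the exact template is an assumption here.
    """
    parts = []
    for col, value in row.items():
        if pd.isna(value):
            continue  # skip missing values instead of inventing text for them
        # Use the column's contextual description when available,
        # falling back to the raw header name.
        desc = column_descriptions.get(col, col)
        parts.append(f"{desc} is {value}")
    return "; ".join(parts) + "."


df = pd.DataFrame({"age": [67], "hr": [88]})
descriptions = {"age": "patient age", "hr": "heart rate (bpm)"}
print(row_to_text(df.iloc[0], descriptions))
# → patient age is 67; heart rate (bpm) is 88.
```

The resulting sentence would then be embedded with a pre-trained language model and used as an additional feature representation alongside, or in place of, the original tabular features.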
