Paper Title
Semi-Supervised Dialogue Policy Learning via Stochastic Reward Estimation
Paper Authors
Abstract
Dialogue policy optimization in task-oriented dialogue systems often receives feedback only upon task completion. This is insufficient for training intermediate dialogue turns, since supervision signals (or rewards) are provided only at the end of dialogues. To address this issue, reward learning has been introduced to learn from state-action pairs of an optimal policy and provide turn-by-turn rewards. This approach requires complete state-action annotations of human-to-human dialogues (i.e., expert demonstrations), which is labor intensive. To overcome this limitation, we propose a novel reward learning approach for semi-supervised policy learning. The proposed approach learns a dynamics model as the reward function, which models dialogue progress (i.e., state-action sequences) based on expert demonstrations, either with or without annotations. The dynamics model computes rewards by predicting whether the dialogue progress is consistent with expert demonstrations. We further propose to learn action embeddings for better generalization of the reward function. The proposed approach outperforms competitive policy learning baselines on MultiWOZ, a benchmark multi-domain dataset.
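To make the described idea concrete, below is a minimal sketch (not the paper's actual model) of a dynamics-model reward estimator: it embeds each system action, summarizes the state-action sequence so far with a recurrent encoder, and outputs a per-turn score of how consistent the dialogue progress is with expert demonstrations, which a policy learner could then use as a dense reward. All names (StateActionRewardModel, state_dim, num_actions), dimensions, and the GRU-based architecture are illustrative assumptions.

```python
# A minimal sketch, assuming a PyTorch setup; names and architecture are
# illustrative and not taken from the paper.
import torch
import torch.nn as nn


class StateActionRewardModel(nn.Module):
    """Scores how consistent a partial dialogue (state-action sequence)
    is with expert demonstrations; the score serves as a per-turn reward."""

    def __init__(self, state_dim: int, num_actions: int,
                 action_emb_dim: int = 64, hidden_dim: int = 128):
        super().__init__()
        # Learned action embeddings help the reward function generalize to
        # actions that appear rarely (or never) in annotated demonstrations.
        self.action_emb = nn.Embedding(num_actions, action_emb_dim)
        self.state_proj = nn.Linear(state_dim, hidden_dim)
        self.rnn = nn.GRU(hidden_dim + action_emb_dim, hidden_dim,
                          batch_first=True)
        self.scorer = nn.Linear(hidden_dim, 1)

    def forward(self, states: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        """states: (batch, turns, state_dim); actions: (batch, turns) action ids.
        Returns per-turn rewards in [0, 1] with shape (batch, turns)."""
        x = torch.cat([self.state_proj(states), self.action_emb(actions)], dim=-1)
        h, _ = self.rnn(x)                      # summarize dialogue progress so far
        return torch.sigmoid(self.scorer(h)).squeeze(-1)


# Hypothetical usage: train with a binary objective separating expert
# sequences from sequences produced by the current policy, then feed the
# predicted scores to the policy learner as turn-by-turn rewards.
model = StateActionRewardModel(state_dim=100, num_actions=300)
states = torch.randn(4, 6, 100)                 # 4 dialogues, 6 turns each
actions = torch.randint(0, 300, (4, 6))
rewards = model(states, actions)                # shape: (4, 6)
```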