Semi-Supervised Dialogue Policy Learning via Stochastic Reward Estimation

Xinting Huang; Jianzhong Qi; Yu Sun; Rui Zhang

Semi-Supervised Dialogue Policy Learning via Stochastic Reward Estimation

Xinting Huang, Jianzhong Qi, Yu Sun, Rui Zhang

Abstract Paper Share

Dialogue and Interactive Systems Long Paper

Session 1B: Jul 6 (06:00-07:00 GMT)

Session 3B: Jul 6 (13:00-14:00 GMT)

Abstract: Dialogue policy optimization often obtains feedback until task completion in task-oriented dialogue systems. This is insufficient for training intermediate dialogue turns since supervision signals (or rewards) are only provided at the end of dialogues. To address this issue, reward learning has been introduced to learn from state-action pairs of an optimal policy to provide turn-by-turn rewards. This approach requires complete state-action annotations of human-to-human dialogues (i.e., expert demonstrations), which is labor intensive. To overcome this limitation, we propose a novel reward learning approach for semi-supervised policy learning. The proposed approach learns a dynamics model as the reward function which models dialogue progress (i.e., state-action sequences) based on expert demonstrations, either with or without annotations. The dynamics model computes rewards by predicting whether the dialogue progress is consistent with expert demonstrations. We further propose to learn action embeddings for a better generalization of the reward function. The proposed approach outperforms competitive policy learning baselines on MultiWOZ, a benchmark multi-domain dataset.

You can open the pre-recorded video in a separate window.

NOTE: The SlidesLive video may display a random order of the authors. The correct author list is shown at the top of this webpage.

Semi-Supervised Dialogue Policy Learning via Stochastic Reward Estimation

Xinting Huang, Jianzhong Qi, Yu Sun, Rui Zhang

Similar Papers

Learning Efficient Dialogue Policy from Demonstrations through Shaping

Huimin Wang, Baolin Peng, Kam-Fai Wong,

Modeling Long Context for Task-Oriented Dialogue State Generation

Jun Quan, Deyi Xiong,

Conversational Graph Grounded Policy Learning for Open-Domain Conversation Generation

Jun Xu, Haifeng Wang, Zheng-Yu Niu, Hua Wu, Wanxiang Che, Ting Liu,

CrossWOZ: A Large-Scale Chinese Cross-Domain Task-Oriented Dialogue Dataset

Qi Zhu, Kaili Huang, Zheng Zhang, Xiaoyan Zhu, Minlie Huang,