Dynamic Programming Encoding for Subword Segmentation in Neural Machine Translation

Xuanli He; Gholamreza Haffari; Mohammad Norouzi

Machine Translation Long Paper

Session 6A: Jul 7 (05:00-06:00 GMT)

Session 7A: Jul 7 (08:00-09:00 GMT)

Abstract: This paper introduces Dynamic Programming Encoding (DPE), a new segmentation algorithm for tokenizing sentences into subword units. We view the subword segmentation of output sentences as a latent variable that should be marginalized out for learning and inference. A mixed character-subword transformer is proposed, which enables exact log marginal likelihood estimation and exact MAP inference to find target segmentations with maximum posterior probability. DPE uses a lightweight mixed character-subword transformer as a means of pre-processing parallel data to segment output sentences using dynamic programming. Empirical results on machine translation suggest that DPE is effective for segmenting output sentences and can be combined with BPE dropout for stochastic segmentation of source sentences. DPE achieves an average improvement of 0.9 BLEU over BPE (Sennrich et al., 2016) and an average improvement of 0.55 BLEU over BPE dropout (Provilkov et al., 2019) on several WMT datasets including English <=> (German, Romanian, Estonian, Finnish, Hungarian).
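The dynamic program behind exact marginalization and MAP segmentation can be illustrated with a small sketch. It is not the authors' implementation. It assumes, for illustration, that the score of the next subword depends only on the characters generated so far and not on how they were segmented, which is what makes a DP over character positions possible. The names `segment`, `vocab`, and `log_prob` are hypothetical, and `log_prob` stands in for the mixed character-subword transformer.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a non-empty list."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def segment(text, vocab, log_prob):
    """Return (log marginal likelihood, MAP segmentation) of `text`.

    `log_prob(prefix, piece)` is a hypothetical scoring function: the log
    probability of emitting `piece` after the character prefix `prefix`.
    """
    n = len(text)
    log_marginal = [-math.inf] * (n + 1)  # sum over all segmentations of text[:i]
    best = [-math.inf] * (n + 1)          # score of the best segmentation of text[:i]
    back = [0] * (n + 1)                  # start index of the last piece on the best path
    log_marginal[0] = 0.0
    best[0] = 0.0

    for end in range(1, n + 1):
        scores = []
        for start in range(end):
            piece = text[start:end]
            if piece not in vocab:
                continue
            s = log_prob(text[:start], piece)
            scores.append(log_marginal[start] + s)
            if best[start] + s > best[end]:
                best[end] = best[start] + s
                back[end] = start
        if scores:
            log_marginal[end] = logsumexp(scores)

    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with the given vocabulary")

    # Follow back-pointers from the end to recover the MAP segmentation.
    pieces = []
    end = n
    while end > 0:
        pieces.append(text[back[end]:end])
        end = back[end]
    return log_marginal[n], pieces[::-1]
```

With a toy vocabulary `{"a", "b", "ab"}` and a scorer that returns `math.log(0.5)` for every piece, `segment("ab", ...)` sums the two segmentations `a b` (probability 0.25) and `ab` (probability 0.5) into a marginal of 0.75 and selects `["ab"]` as the MAP segmentation.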

Similar Papers

BPE-Dropout: Simple and Effective Subword Regularization

Ivan Provilkov, Dmitrii Emelianenko, Elena Voita

A Joint Model for Document Segmentation and Segment Labeling

Joe Barrow, Rajiv Jain, Vlad Morariu, Varun Manjunatha, Douglas Oard, Philip Resnik

Robust Encodings: A Framework for Combating Adversarial Typos

Erik Jones, Robin Jia, Aditi Raghunathan, Percy Liang

Generalized Entropy Regularization or: There's Nothing Special about Label Smoothing

Clara Meister, Elizabeth Salesky, Ryan Cotterell