Parallel Sentence Mining by Constrained Decoding

Pinzhen Chen, Nikolay Bogoychev, Kenneth Heafield, Faheem Kirefu

Machine Translation Short Paper

Session 2B: Jul 6 (09:00-10:00 GMT)
Session 3B: Jul 6 (13:00-14:00 GMT)
Abstract: We present a novel method to extract parallel sentences from two monolingual corpora, using neural machine translation. Our method relies on translating sentences in one corpus, but constraining the decoding by a prefix tree built on the other corpus. We argue that a neural machine translation system by itself can be a sentence similarity scorer and it efficiently approximates pairwise comparison with a modified beam search. When benchmarked on the BUCC shared task, our method achieves results comparable to other submissions.
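The idea in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It builds a prefix tree (trie) over a toy target-side corpus and runs a beam search that may only expand tokens the trie allows, so every finished hypothesis is a sentence from that corpus. The names `build_trie`, `constrained_beam_search`, `toy_score`, and the `</s>` end marker are hypothetical, and `toy_score` is an assumed stand-in for the per-token log-probability that a real neural machine translation decoder would supply.

```python
import math

END = "</s>"  # hypothetical end-of-sentence marker


def build_trie(sentences):
    """Build a nested-dict prefix tree over whitespace-tokenized sentences."""
    root = {}
    for sentence in sentences:
        node = root
        for token in sentence.split() + [END]:
            node = node.setdefault(token, {})
    return root


def constrained_beam_search(score, trie, beam_size=2):
    """Beam search that only expands tokens permitted by the trie.

    score(prefix, token) returns a log-probability; here it stands in
    for one decoding step of a translation model (an assumption).
    """
    beams = [(0.0, [], trie)]  # (cumulative log-prob, tokens so far, trie node)
    finished = []
    while beams:
        candidates = []
        for logp, prefix, node in beams:
            for token, child in node.items():
                new_logp = logp + score(prefix, token)
                if token == END:
                    # A complete sentence from the target corpus was reached.
                    finished.append((new_logp, " ".join(prefix)))
                else:
                    candidates.append((new_logp, prefix + [token], child))
        # Keep only the best partial hypotheses for the next step.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_size]
    return sorted(finished, reverse=True)


def toy_score(prefix, token):
    """Toy scorer: pretends the model prefers the tokens of one translation."""
    preferred = {"the", "cat", "sleeps", END}
    return math.log(0.9) if token in preferred else math.log(0.1)


corpus = ["the cat sleeps", "the dog barks", "a cat sleeps"]
results = constrained_beam_search(toy_score, build_trie(corpus))
best_logp, best_sentence = results[0]
print(best_sentence, best_logp)
```

Because the trie prunes every continuation that does not occur in the target corpus, the beam only ever compares the source sentence against a few candidate sentences at a time rather than against all of them. The cumulative log-probability of the best finished hypothesis can then be read as a similarity score for the mined pair.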

Similar Papers

Parallel Corpus Filtering via Pre-trained Language Models
Boliang Zhang, Ajay Nagesh, Kevin Knight
Multi-Task Neural Model for Agglutinative Language Translation
Yirong Pan, Xiao Li, Yating Yang, Rui Dong
Unsupervised Multilingual Sentence Embeddings for Parallel Corpus Mining
Ivana Kvapilíková, Mikel Artetxe, Gorka Labaka, Eneko Agirre, Ondřej Bojar