ParaCrawl: Web-Scale Acquisition of Parallel Corpora

Marta Bañón; Pinzhen Chen; Barry Haddow; Kenneth Heafield; Hieu Hoang; Miquel Esplà-Gomis; Mikel L. Forcada; Amir Kamran; Faheem Kirefu; Philipp Koehn; Sergio Ortiz Rojas; Leopoldo Pla Sempere; Gema Ramírez-Sánchez; Elsa Sarrías; Marek Strelec; Brian Thompson; William Waites; Dion Wiggins; Jaume Zaragoza

Resources and Evaluation Long Paper

Session 8A: Jul 7 (12:00-13:00 GMT)

Session 9A: Jul 7 (17:00-18:00 GMT)

Abstract: We report on methods to create the largest publicly available parallel corpora by crawling the web, using open source software. We empirically compare alternative methods and publish benchmark data sets for sentence alignment and sentence pair filtering. We also describe the parallel corpora released and evaluate their quality and their usefulness for building machine translation systems.

Similar Papers

Parallel Corpus Filtering via Pre-trained Language Models

Boliang Zhang, Ajay Nagesh, Kevin Knight

Neural CRF Model for Sentence Alignment in Text Simplification

Chao Jiang, Mounica Maddela, Wuwei Lan, Yang Zhong, Wei Xu

Unsupervised Multilingual Sentence Embeddings for Parallel Corpus Mining

Ivana Kvapilíková, Mikel Artetxe, Gorka Labaka, Eneko Agirre, Ondřej Bojar

Effectively Aligning and Filtering Parallel Corpora under Sparse Data Conditions

Steinþór Steingrímsson, Hrafn Loftsson, Andy Way