ParaCrawl: Web-Scale Acquisition of Parallel Corpora
Marta Bañón, Pinzhen Chen, Barry Haddow, Kenneth Heafield, Hieu Hoang, Miquel Esplà-Gomis, Mikel L. Forcada, Amir Kamran, Faheem Kirefu, Philipp Koehn, Sergio Ortiz Rojas, Leopoldo Pla Sempere, Gema Ramírez-Sánchez, Elsa Sarrías, Marek Strelec, Brian Thompson, William Waites, Dion Wiggins, Jaume Zaragoza
Resources and Evaluation Long Paper
  Session 8A: Jul 7
  (12:00-13:00 GMT)
  
  
  
  
  Session 9A: Jul 7
  (17:00-18:00 GMT)
  
  
  
  
        Abstract:
        We report on methods to create the largest publicly available parallel corpora by crawling the web, using open source software. We empirically compare alternative methods and publish benchmark data sets for sentence alignment and sentence pair filtering. We also describe the parallel corpora released and evaluate their quality and their usefulness to create machine translation systems.
      
    
    You can open the
    pre-recorded video
    
    in a separate window.
    
  
  
      NOTE: The SlidesLive video may display a random order of the authors.
      The correct author list is shown at the top of this webpage.
      
    Similar Papers
Neural CRF Model for Sentence Alignment in Text Simplification
Chao Jiang, Mounica Maddela, Wuwei Lan, Yang Zhong, Wei Xu,

Unsupervised Multilingual Sentence Embeddings for Parallel Corpus Mining
Ivana Kvapilíková, Mikel Artetxe, Gorka Labaka, Eneko Agirre, Ondřej Bojar,

Effectively Aligning and Filtering Parallel Corpora under Sparse Data Conditions
Steinþór Steingrímsson, Hrafn Loftsson, Andy Way,

