Abu-MaTran at WMT 2014 Translation Task: Two-step Data Selection and RBMT-Style Synthetic Rules

Raphael Rubino (1), Antonio Toral (2), Victor M. Sánchez-Cartagena (1), Jorge Ferrández-Tordera (1), Sergio Ortiz-Rojas (1), Gema Ramírez-Sánchez (1), Felipe Sánchez-Martínez (3), Andy Way (2)
(1) Prompsit Language Engineering, Spain; (2) Dublin City University, Ireland; (3) Universitat d'Alacant, Spain
{rrubino,vmsanchez,jferrandez,sortiz,gramirez}@prompsit.com, {atoral,away}@computing.dcu.ie, [email protected]

Introduction
• Abu-MaTran submission for the English–French translation task
• Parallel data filtering → bilingual cross-entropy difference
• Pseudo out-of-domain data filtering → vocabulary saturation
• Hybrid SMT → automatic translation rule extraction

Datasets and Tools
• Monolingual and parallel data are normalised, tokenised and true-cased; parallel corpora are additionally cleaned

Table 1: Millions of sentences and words after data pre-processing

Corpus                Sentences              Words (en)  Words (fr)
Monolingual Data      120.9 (en), 42.2 (fr)  7,000.6     1,856.0
Parallel Data:
  10^9 Corpus         21.3                   549.0       642.5
  Common Crawl        3.2                    76.0        82.7
  Europarl v7         2.0                    52.5        56.7
  News Commentary v9  0.2                    4.5         5.3
  UN                  12.3                   313.4       356.5

Language Models
• Individual LMs trained per corpus (using all the in-domain monolingual data and the target side of the parallel data)
• 5-gram LMs with modified Kneser-Ney smoothing
• kenlm and srilm to train and interpolate the LMs
• srilm compute-best-mix to interpolate the LMs: perplexity minimisation on the development sets (2008-2012)

Table 2: Millions of n-grams in the interpolated LMs

LM                1-gram  2-gram  3-gram  4-gram  5-gram
English (pruned)  13.4    198.6   381.2   776.3   1,068.7
French            6.0     75.5    353.2   850.8   1,354.0

Translation Models
• Moses v2.1 and mgiza++ to train the models, tmcombine.py for TM interpolation and mira for tuning
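The perplexity-minimising interpolation of the per-corpus LMs (done in practice with srilm's compute-best-mix) can be illustrated with a small EM sketch. Everything below is illustrative: the function names and toy probabilities are assumptions of this sketch, not srilm's actual implementation.

```python
import math

def em_mix_weights(dev_probs, iters=50):
    """Estimate LM interpolation weights by EM, maximising the
    likelihood (i.e. minimising perplexity) of a held-out dev set.
    dev_probs: one tuple per dev word, holding the probability each
    component LM assigns to that word in context."""
    k = len(dev_probs[0])
    w = [1.0 / k] * k  # start from uniform weights
    for _ in range(iters):
        post = [0.0] * k
        for probs in dev_probs:
            mix = sum(wi * pi for wi, pi in zip(w, probs))
            for i in range(k):
                post[i] += w[i] * probs[i] / mix  # posterior of LM i
        w = [p / len(dev_probs) for p in post]   # re-estimate weights
    return w

def perplexity(dev_probs, w):
    """Perplexity of the weighted mixture on the dev words."""
    log_lik = sum(math.log(sum(wi * pi for wi, pi in zip(w, probs)))
                  for probs in dev_probs)
    return math.exp(-log_lik / len(dev_probs))
```

With one LM per corpus, the weights returned here play the role of the compute-best-mix output used to build the interpolated LMs of Table 2.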
Bilingual Cross-Entropy Difference
• Out-of-domain parallel corpora are individually split in two parts based on a threshold on the bilingual cross-entropy difference (bced < 0)
• Data above the threshold is reduced with vocabulary saturation (2/3 of the data is dropped)

Figure 1: 10k sentence-pairs sample distribution (bilingual cross-entropy difference for Common Crawl, Europarl, 10^9 and UN)

BLEU on the dev set when adding pseudo in-domain data (bced < 0):

Corpus         Sentences (k)  BLEUdev
Baseline       181.3          27.76
Common Crawl   1,598.7        29.84
Europarl       461.9          28.87
10^9 Corpus    5,153.0        30.50
UN             1,707.3        29.03
Interpolation                 31.37

BLEU on the dev set when adding pseudo out-of-domain data reduced with vocabulary saturation:

Corpus         Sentences (k)  BLEUdev
Baseline       181.3          27.76
Common Crawl   208.3          27.73
Europarl       142.0          27.63
10^9 Corpus    1,442.4        30.29
UN             642.4          28.91
Interpolation                 30.78

Synthetic Translation Rules
• Based on the development sets (2008-2012) and the Apertium RBMT system
• Morphological analysis and POS disambiguation of the parallel data
• Word and phrase alignments followed by heuristics to generate rules
• Number of rules reduced using an integer linear programming search
• More about synthetic rule generation in Sánchez-Cartagena et al., WMT14
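The two-step data selection above (bced split, then vocabulary saturation on the remainder) can be sketched as follows. The h_* arguments are callables returning the per-word cross-entropy of a sentence under in-/out-of-domain LMs; they stand in for real LM scoring and, like the max_count criterion, are assumptions of this sketch rather than the exact method of the submission.

```python
from collections import Counter

def split_by_bced(pairs, h_in_src, h_out_src, h_in_tgt, h_out_tgt,
                  threshold=0.0):
    """Split a parallel corpus by bilingual cross-entropy difference:
    pairs scoring below the threshold are pseudo in-domain, the rest
    pseudo out-of-domain."""
    pseudo_in, pseudo_out = [], []
    for src, tgt in pairs:
        bced = ((h_in_src(src) - h_out_src(src))
                + (h_in_tgt(tgt) - h_out_tgt(tgt)))
        (pseudo_in if bced < threshold else pseudo_out).append((src, tgt))
    return pseudo_in, pseudo_out

def vocabulary_saturation(pairs, max_count=10):
    """Drop sentence pairs whose source tokens have all been seen at
    least max_count times already (a simple saturation criterion)."""
    seen = Counter()
    kept = []
    for src, tgt in pairs:
        tokens = src.split()
        if any(seen[t] < max_count for t in tokens):
            kept.append((src, tgt))
            seen.update(tokens)
    return kept
```

In the pipeline above, split_by_bced would yield the pseudo in-domain corpora, and vocabulary_saturation would shrink the pseudo out-of-domain part before translation model training.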
Tuning and Decoding
• newstest2013 used as dev to pick the best tuning parameters
• newstest2012 used as dev and newstest2013 used as test to pick the best decoding parameters

Table 3: Scores on newstest2013 with and without synthetic rules

System          BLEUdev
Baseline        27.76
Baseline+Rules  28.06

Table 4: Scores on newstest2013

System                    BLEUdev
Baseline                  27.76
+ pseudo in + pseudo out  31.93
+ OSM                     31.90
+ MERT 200-best           32.21
+ Rules                   32.10

Table 5: Final results

Test set      System                          BLEU13a  TER
newstest2013  Best tuning                     31.02    60.77
newstest2013  cube-pruning (pop-limit 10000)  31.04    60.71
newstest2013  increased table-limit (100)     31.06    60.77
newstest2013  monotonic reordering            31.07    60.69
newstest2013  Best decoding                   31.14    60.66
newstest2014  Best decoding                   34.90    54.70
newstest2014  Best decoding + Rules           34.90    54.80

Conclusion
• Abu-MaTran submission for the English–French constrained translation task:
  - human evaluation: ranked 1st (shared with UEdin and KIT)
  - automatic evaluation: ranked 2nd
• Two-step data selection method leads to a significant improvement
• Impact of vocabulary saturation is clear on model size reduction but unclear on performance
• Investigate why results vary with OSM and decoding parameters

The research leading to these results has received funding from the European Union Seventh Framework Programme FP7/2007-2013 under grant agreement PIAP-GA-2012-324414 (Abu-MaTran).
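For reference, the best decoding configuration reported in the Tuning and Decoding section maps onto standard Moses decoder options roughly as follows. This is a hedged sketch: the flag names are standard Moses options, but moses.ini and the file names are placeholders, not artifacts of the submission.

```shell
# Sketch of the "Best decoding" settings from Table 5:
#   -search-algorithm 1        cube pruning
#   -cube-pruning-pop-limit    pop limit of 10000
#   -ttable-limit              increased table-limit (100)
#   -distortion-limit 0        monotonic reordering
moses -f moses.ini \
      -search-algorithm 1 \
      -cube-pruning-pop-limit 10000 \
      -ttable-limit 100 \
      -distortion-limit 0 \
      < input.tok.true > output.trans
```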