Abu-MaTran at WMT 2014 Translation Task:
Two-step Data Selection and RBMT-Style Synthetic Rules

Raphael Rubino¹, Antonio Toral², Victor M. Sánchez-Cartagena¹, Jorge Ferrández-Tordera¹,
Sergio Ortiz-Rojas¹, Gema Ramírez-Sánchez¹, Felipe Sánchez-Martínez³, Andy Way²
¹Prompsit Language Engineering, Spain, ²Dublin City University, Ireland, ³Universitat d’Alacant, Spain
{rrubino,vmsanchez,jferrandez,sortiz,gramirez}@prompsit.com, {atoral,away}@computing.dcu.ie, [email protected]
Introduction

• Abu-MaTran submission for the English–French translation task
• Parallel data filtering → bilingual cross-entropy difference
• Pseudo out-of-domain data filtering → vocabulary saturation
• Hybrid SMT → automatic translation rule extraction

Datasets and Tools

• Monolingual and parallel data are normalised, tokenised and true-cased;
  parallel corpora are additionally cleaned (a pre-processing sketch follows
  Table 1).
• Moses v2.1, mgiza++, tmcombine.py for TM interpolation and mira for tuning.

Table 1: Millions of sentences and words after data pre-processing

Corpus                 Sentences   Words (en)   Words (fr)
Monolingual Data (en)  120.9       7,000.6      -
Monolingual Data (fr)  42.2        -            1,856.0
Parallel Data:
  10^9 Corpus          21.3        549.0        642.5
  Common Crawl         3.2         76.0         82.7
  Europarl v7          2.0         52.5         56.7
  News Commentary v9   0.2         4.5          5.3
  UN                   12.3        313.4        356.5
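The pre-processing steps above map directly onto the standard Moses support
scripts. Below is a minimal sketch of that pipeline, assuming a Moses checkout
under /opt/moses and pre-trained truecasing models (both are placeholders, as
is the 1-80 token length filter):

    import subprocess

    MOSES = "/opt/moses/scripts"  # assumed install location

    def preprocess(path_in, path_out, lang):
        """Normalise punctuation, tokenise and truecase one side of a corpus."""
        with open(path_in) as fin, open(path_out, "w") as fout:
            norm = subprocess.Popen(
                ["perl", f"{MOSES}/tokenizer/normalize-punctuation.perl", "-l", lang],
                stdin=fin, stdout=subprocess.PIPE)
            tok = subprocess.Popen(
                ["perl", f"{MOSES}/tokenizer/tokenizer.perl", "-l", lang],
                stdin=norm.stdout, stdout=subprocess.PIPE)
            subprocess.run(
                ["perl", f"{MOSES}/recaser/truecase.perl",
                 "--model", f"truecase-model.{lang}"],  # assumed pre-trained model
                stdin=tok.stdout, stdout=fout, check=True)

    for lang in ("en", "fr"):
        preprocess(f"corpus.{lang}", f"corpus.tc.{lang}", lang)

    # Parallel corpora are additionally cleaned (here: 1-80 tokens, an assumption):
    subprocess.run(
        ["perl", f"{MOSES}/training/clean-corpus-n.perl",
         "corpus.tc", "en", "fr", "corpus.clean", "1", "80"], check=True)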
Language Models

• Individual LMs trained per corpus (using all the monolingual data and the
  target side of the parallel data)
• 5-gram LMs with modified Kneser-Ney smoothing
• kenlm and srilm to train and interpolate LMs
• srilm compute-best-mix to interpolate LMs: perplexity minimisation based on
  the dev sets (2008-2012); a sketch of this step follows Table 2

Table 2: Millions of n-grams in the interpolated LMs

                  1-gram   2-gram   3-gram   4-gram   5-gram
English (pruned)  13.4     198.6    381.2    776.3    1,068.7
French            6.0      75.5     353.2    850.8    1,354.0
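compute-best-mix searches for the interpolation weights that minimise dev-set
perplexity using EM. A minimal re-implementation sketch, assuming the per-model
word probabilities on the dev set have already been extracted (e.g. parsed from
srilm's ngram -debug 2 -ppl output):

    import numpy as np

    def best_mix(probs, iters=100):
        """EM re-estimation of LM mixture weights.

        probs[i][j]: probability LM i assigns to dev-set word j.
        """
        probs = np.asarray(probs)            # shape: (n_models, n_words)
        w = np.full(probs.shape[0], 1.0 / probs.shape[0])
        for _ in range(iters):
            mix = w[:, None] * probs         # weighted per-model probabilities
            post = mix / mix.sum(axis=0)     # posterior of each LM per word (E step)
            w = post.mean(axis=1)            # re-estimated weights (M step)
        return w

    # Toy example with two LMs over a three-word dev set:
    print(best_mix([[0.1, 0.2, 0.3], [0.3, 0.1, 0.1]]))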
Translation Models

Bilingual Cross-Entropy Difference
• Out-of-domain parallel corpora are individually split in two parts based on a
  threshold on the bilingual cross-entropy difference (bced < 0): sentence
  pairs below the threshold form the pseudo in-domain data.
• Data above the threshold (pseudo out-of-domain) is reduced with vocabulary
  saturation (2/3 of the data dropped).
• Sketches of both selection steps follow the two tables below.

Pseudo in-domain data (bced < 0):

Corpus         Sentences (k)   BLEUdev
Baseline       181.3           27.76
Common Crawl   1,598.7         29.84
Europarl       461.9           28.87
10^9 Corpus    5,153.0         30.50
UN             1,707.3         29.03
Interpolation  -               31.37
Pseudo out-of-domain data (after vocabulary saturation):

Corpus         Sentences (k)   BLEUdev
Baseline       181.3           27.76
Common Crawl   208.3           27.73
Europarl       142.0           27.63
10^9 Corpus    1,442.4         30.29
UN             642.4           28.91
Interpolation  -               30.78

Figure 1: 10k sentence-pairs sample distribution (bilingual cross-entropy
difference per sentence pair for Common Crawl, Europarl, 10^9 and UN)
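A minimal sketch of the bced scoring, in the style of Axelrod et al. (2011):
each side of a sentence pair is scored against an in-domain and an
out-of-domain LM, and pairs that look more in-domain overall (bced < 0) are
kept. The four LM file names are placeholders; the kenlm Python bindings are
used, where model.score() returns a log10 probability.

    import kenlm

    lm_in_src  = kenlm.Model("in.en.arpa")    # placeholder LM files
    lm_out_src = kenlm.Model("out.en.arpa")
    lm_in_tgt  = kenlm.Model("in.fr.arpa")
    lm_out_tgt = kenlm.Model("out.fr.arpa")

    def xent(lm, sent):
        """Per-word cross-entropy: negated, length-normalised log probability."""
        return -lm.score(sent) / (len(sent.split()) + 1)   # +1 for </s>

    def bced(src, tgt):
        return (xent(lm_in_src, src) - xent(lm_out_src, src)) \
             + (xent(lm_in_tgt, tgt) - xent(lm_out_tgt, tgt))

    # Split sentence pairs on the threshold (bced < 0 = pseudo in-domain):
    with open("corpus.en") as fs, open("corpus.fr") as ft:
        pseudo_in, pseudo_out = [], []
        for s, t in zip(fs, ft):
            (pseudo_in if bced(s.strip(), t.strip()) < 0 else pseudo_out).append((s, t))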
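The vocabulary saturation step can be sketched as a single greedy pass that
keeps a sentence pair only while it still contributes under-observed vocabulary
items; counting plain words and the threshold of 10 below are illustrative
assumptions, not the exact filter used on the poster.

    from collections import Counter

    def saturate(pairs, t=10):
        """Keep a pair only if some token in it has been seen fewer than t times."""
        counts, kept = Counter(), []
        for src, tgt in pairs:
            tokens = src.split() + tgt.split()
            if any(counts[tok] < t for tok in tokens):  # still adds rare vocabulary
                kept.append((src, tgt))
                counts.update(tokens)                   # only kept pairs saturate counts
        return kept

    reduced = saturate(pseudo_out)   # pseudo_out as produced by the bced split above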
Synthetic Translation Rules

• Based on the development sets (2008-2012) and the Apertium RBMT system
• Morphological analysis and POS disambiguation of the parallel data
• Word and phrase alignments followed by heuristics to generate rules
• Number of rules reduced using an integer linear programming search (see the
  sketch after Table 3)
• More about synthetic rule generation in Sánchez-Cartagena et al., WMT14.
Table 3: Scores on newstest2013 with and without synthetic rules

System          BLEUdev
Baseline        27.76
Baseline+Rules  28.06
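The rule-reduction step can be cast as a covering problem: keep as few rules as
possible while every observed phrase pair is still reproduced by some rule. The
sketch below solves an illustrative minimum-set-cover analogue with integer
linear programming via the PuLP library; the toy covers data and the exact
objective are assumptions, and Sánchez-Cartagena et al., WMT14 describe the
actual formulation.

    from pulp import LpBinary, LpMinimize, LpProblem, LpVariable, lpSum

    # covers[i]: phrase-pair indices reproduced by candidate rule i (toy data)
    covers = {0: {0, 1}, 1: {1, 2}, 2: {2, 3}, 3: {0, 3}}
    pairs = set().union(*covers.values())

    prob = LpProblem("rule_reduction", LpMinimize)
    use = {i: LpVariable(f"rule_{i}", cat=LpBinary) for i in covers}
    prob += lpSum(use.values())                 # objective: minimise number of rules
    for p in pairs:                             # constraint: every pair stays covered
        prob += lpSum(use[i] for i in covers if p in covers[i]) >= 1
    prob.solve()

    selected = [i for i, v in use.items() if v.value() > 0.5]
    print("kept rules:", selected)              # e.g. [0, 2]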
Tuning and Decoding

• newstest2013 used as dev to pick the best tuning parameters
• newstest2012 used as dev and newstest2013 used as test to pick the best
  decoding parameters (a decoding sketch follows Table 4)

Table 4: Scores on newstest2013

System                    BLEUdev
Baseline                  27.76
+ pseudo in + pseudo out  31.93
+ OSM                     31.90
+ MERT 200-best           32.21
+ Rules                   32.10
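A hedged sketch of how the decoding variants scored in Table 5 can be switched
on from the Moses command line (these are standard Moses flags, but moses.ini,
the input files and the pairing of flags to table rows are placeholders):

    import subprocess

    variants = {
        "best_tuning":  [],                                  # tuned weights only
        "cube_pruning": ["-search-algorithm", "1",
                         "-cube-pruning-pop-limit", "10000"],
        "table_limit":  ["-ttable-limit", "100"],
        "monotonic":    ["-distortion-limit", "0"],          # monotone reordering
    }

    for name, flags in variants.items():
        with open("newstest2013.en") as fin, open(f"out.{name}.fr", "w") as fout:
            subprocess.run(["moses", "-f", "moses.ini", *flags],
                           stdin=fin, stdout=fout, check=True)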
Table 5: Final results

System                            BLEU (13a)   TER
newstest2013
  Best tuning                     31.02        60.77
  cube-pruning (pop-limit 10000)  31.04        60.71
  increased table-limit (100)     31.06        60.77
  monotonic reordering            31.07        60.69
  Best decoding                   31.14        60.66
newstest2014
  Best decoding                   34.90        54.70
  Best decoding + Rules           34.90        54.80

Conclusion

• Abu-MaTran submission for the English–French constrained translation task:
  • human evaluation: ranked 1st (shared with UEdin and KIT)
  • automatic evaluation: ranked 2nd
• Two-step data selection method leads to significant improvements
• Impact of vocabulary saturation is clear on model size reduction but unclear
  on translation performance
• Investigate why results vary with OSM and decoding parameters

The research leading to these results has received funding from the European Union Seventh Framework Programme FP7/2007-2013 under grant
agreement PIAP-GA-2012-324414 (Abu-MaTran).