The University of Maryland Statistical Machine Translation System
Vladimir Eidelman†‡, Chris Dyer†⋆, Philip Resnik†⋆
‡ Department of Computer Science   ⋆ Department of Linguistics   † Institute for Advanced Computer Studies
University of Maryland, College Park
{vlad,redpony,resnik}@umiacs.umd.edu

Baseline System Description

Overall Task
- German-to-English translation
- Constrained data condition

Translation Model
- Hierarchical phrase-based translation model learned from the provided Europarl and News Commentary parallel training data
- Based on synchronous context-free grammar (SCFG) rules
- Non-terminal span limit of 12 for non-glue grammars
- Grammar extracted using a suffix array rule extractor [3]
- Features:
  1. rule relative frequency P(e|f)
  2. target n-gram language model P(e)
  3. 'pass-through' penalty when passing a source word to the target side untranslated
  4. lexical translation probabilities P_lex(e|f) and P_lex(f|e)
  5. counts of the number of times arity-0, arity-1, and arity-2 SCFG rules were used
  6. count of the total number of rules used
  7. source word penalty
  8. target word penalty
  9. segmentation model cost
  10. count of the number of times the glue rule is used

Language Model
- SRILM 5-gram language model
- Modified Kneser-Ney smoothing, estimated from the English monolingual training data and the non-Europarl portions of the parallel data

Framework
- cdec [2]
- A modular open-source framework for aligning, training, and decoding with a number of different translation models
- Integrates a translation model with different language models, pruning strategies, and inference algorithms
- Accepts input as a string, lattice, or context-free forest
- Produces output as the full translation forest without any pruning

Data Preparation
- Sentence boundaries are explicitly annotated, e.g. [X] → ⟨s⟩ leben ||| ⟨s⟩ lives
- Data description (all results reported on the Test set):

    Set    Corpus          Size
    Dev    news-test2009   2525
    Test   news-test2010   2489

Viterbi Envelope Semiring Training (VEST)

Motivation
- Semiring operations can be overloaded to compute standard quantities in a generic framework
- A semiring is a useful mathematical abstraction defining two general operations, addition and multiplication

Implementation
- System parameters are optimized on the dev set using VEST, an implementation of minimum error rate training (MERT) within a semiring
- The MERT line search is computed with the following semiring (a sketch of these envelope operations appears after the MBR section below):

    Semiring element   VEST instantiation
    K                  set of line segments
    ⊕                  union of line segments
    ⊗                  Minkowski addition of the lines

- The error function is case-insensitive BLEU
- Training uses cube pruning with 100 candidates at each node; at decoding time the limit is raised to 1000

Minimum Bayes Risk Decoding

Motivation
- VEST uses the maximum derivation decision rule
- Rescoring with a minimum risk decision rule, which minimizes expected loss, is empirically useful

Implementation
- The posterior distribution is estimated from a unique 500-best list of hypotheses (sketched below)
- The posterior scaling factor α was optimized for BLEU on a held-out development set
- Performance:

    Language Model   Decoder   BLEU   TER
    RandLM           Max-D     22.4   69.1
    RandLM           MBR       22.7   68.8
    SRILM            Max-D     23.1   68.0
    SRILM            MBR       23.4   67.7
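The line-search semiring above can be made concrete with a minimal Python sketch. This is an illustration under simplifying assumptions, not cdec's implementation; Line, oplus, otimes, and upper_envelope are hypothetical names. Each derivation's model score along the search direction is a line: ⊕ unions two sets of lines (alternative derivations), ⊗ performs Minkowski addition (slopes and intercepts of concatenated fragments add), and reducing a line set to its upper envelope yields the piecewise-linear function whose vertices MERT sweeps.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Line:
        slope: float      # change in model score along the search direction
        intercept: float  # model score at the current weight vector

    def oplus(a, b):
        """Semiring addition: union of line sets (alternative derivations)."""
        return a | b

    def otimes(a, b):
        """Semiring multiplication: Minkowski addition of the lines; scores
        of concatenated derivation fragments add, so slopes and intercepts add."""
        return {Line(x.slope + y.slope, x.intercept + y.intercept)
                for x in a for y in b}

    def upper_envelope(lines):
        """Keep only lines that are maximal somewhere: the piecewise-linear
        upper envelope whose vertices are the candidate points of the search."""
        hull = []
        for l in sorted(lines, key=lambda ln: (ln.slope, ln.intercept)):
            while hull:
                p = hull[-1]
                if p.slope == l.slope:   # parallel lines: keep the higher one
                    hull.pop()
                    continue
                if len(hull) >= 2:
                    q = hull[-2]
                    x_lp = (l.intercept - p.intercept) / (p.slope - l.slope)
                    x_pq = (p.intercept - q.intercept) / (q.slope - p.slope)
                    if x_lp <= x_pq:     # p is never on top, so drop it
                        hull.pop()
                        continue
                break
            hull.append(l)
        return hull

Sweeping the envelope's intersection points while tracking the error statistics in each interval then gives the exact minimum-error point along the chosen direction.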
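Similarly, the minimum risk rescoring can be sketched over a k-best list. Again a hedged illustration rather than the system's code: mbr_decode and sentence_bleu are invented helpers, and the add-one smoothed sentence-level BLEU here merely stands in for the loss actually used. The posterior is p(e) ∝ exp(α · score), and the hypothesis with the highest expected BLEU gain (equivalently, lowest expected loss) is chosen.

    import math
    from collections import Counter

    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def sentence_bleu(hyp, ref, max_n=4):
        """Smoothed sentence-level BLEU of hyp against one pseudo-reference."""
        if not hyp or not ref:
            return 0.0
        log_prec = 0.0
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            match = sum((h & r).values())              # clipped n-gram matches
            log_prec += math.log((match + 1) / (max(sum(h.values()), 1) + 1))
        bp = min(0.0, 1.0 - len(ref) / len(hyp))       # log brevity penalty
        return math.exp(bp + log_prec / max_n)

    def mbr_decode(kbest, alpha=0.5):
        """kbest: list of (tokens, model_score) pairs from a unique k-best list.
        Costs O(k^2) sentence-BLEU evaluations (k = 500 in the system above)."""
        m = max(score for _, score in kbest)
        weights = [math.exp(alpha * (score - m)) for _, score in kbest]
        z = sum(weights)
        posteriors = [w / z for w in weights]          # p(e) ∝ exp(alpha * score)
        best, best_gain = None, float("-inf")
        for hyp, _ in kbest:
            gain = sum(p * sentence_bleu(hyp, e)
                       for (e, _), p in zip(kbest, posteriors))
            if gain > best_gain:
                best, best_gain = hyp, gain
        return best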
Compound Segmentation Lattices

Motivation
- German possesses a rich inflectional morphology, productive compounding, and significant word reordering
- Instead of translating a single representation of the input, we encode alternative ways of segmenting compound words in a word lattice
- The decoder automatically chooses which segmentation is best for translation, leading to markedly improved results [1]
- Segmentation variants of the raw input are encoded in the word lattice:

    Raw text:   die Mülltonnenanzahl für ...
    Lattice:    alternative paths such as
                die müll tonnen anzahl für ...
                die mülltonnen anzahl für ...
                die mülltonnenanzahl für ...
                (with further variants using tonne and mülltonne)
    Reference:  the number of rubbish bins for ...

Lattice Construction
- A maximum entropy model of compound word splitting is created
- Features: frequency of hypothesized morphemes as separate units, number of predicted morphemes, number of letters in a predicted morpheme
- Parameters are learned by maximizing conditional log-likelihood on a small amount of manually created reference lattices
- Dev/test lattices are created with segmentations of words longer than 6 letters, and unlikely paths are pruned using max-marginals (a candidate-splitting sketch appears after the references)

Bloom Filter LM

Motivation
- LM complexity causes a trade-off between translation quality and decoder memory usage and speed
- Delays are caused when the LM's size necessitates a remote language model server
- Randomized language models (RandLM) [4], based on Bloom filters, can be used locally

Implementation
- The existing SRILM is converted directly into a RandLM using the default settings (a minimal Bloom filter sketch appears after the references)
- Performance:

    Language Model   BLEU   TER
    RandLM           22.4   69.1
    SRILM            23.1   68.0

Grammar Extraction

Motivation
- SCFGs are expressive at the cost of a large number (millions) of rules
- Memory has been required either for the grammar, when extracted beforehand, or for the corpus, when using suffix arrays

Implementation
- Sentence-specific grammars are extracted and loaded on an as-needed basis by the decoder (sketched after the references)
- Performance:

    Language Model   Grammar    Decoder memory (GB)   Decoder time (sec/sent)
    Local SRILM      corpus     14.293 ± 1.228         5.254 ± 3.768
    Local SRILM      sentence   10.964 ± 0.964         5.517 ± 3.884
    Remote SRILM     corpus      3.771 ± 0.235        15.252 ± 10.878
    Remote SRILM     sentence    0.443 ± 0.235        14.751 ± 10.370
    RandLM           corpus      7.901 ± 0.721         9.398 ± 6.965
    RandLM           sentence    4.612 ± 0.699         9.561 ± 7.149

- Sentence grammars massively reduce the memory footprint with no impact on decoding speed
- There is a marked trade-off between memory usage and decoding speed:
  - Remote SRILM reduces the memory footprint substantially, but at the cost of significantly slower decoding
  - Local SRILM decodes faster but accordingly increases memory usage
  - RandLM reduces memory and decodes (relatively) fast at the price of somewhat decreased translation quality

References

[1] C. Dyer. Using a maximum entropy model to build segmentation lattices for MT. In Proceedings of NAACL-HLT, 2009.
[2] C. Dyer, A. Lopez, J. Ganitkevitch, J. Weese, F. Ture, P. Blunsom, H. Setiawan, V. Eidelman, and P. Resnik. cdec: A decoder, alignment, and learning framework for finite-state and context-free translation models. In Proceedings of ACL System Demonstrations, 2010.
[3] A. Lopez. Hierarchical phrase-based translation with suffix arrays. In Proceedings of EMNLP, pages 976–985, 2007.
[4] D. Talbot and M. Osborne. Randomised language modelling for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, June 2007.

http://www.umiacs.umd.edu/labs/CLIP/
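Returning to the compound segmentation lattices above: the following sketch enumerates candidate splits of a compound and scores them with a simple frequency-based stand-in for the trained maximum entropy model. The lexicon, the weight, and the function names are invented for illustration; in the system the scored variants are combined into a lattice whose unlikely paths are pruned with max-marginals.

    import math

    def candidate_splits(word, lexicon, min_len=3):
        """Enumerate segmentations of `word` whose parts all occur in
        `lexicon` (a dict from morpheme to corpus frequency)."""
        if not word:
            return [[]]
        splits = []
        for i in range(min_len, len(word) + 1):
            prefix = word[:i]
            if prefix in lexicon:
                for rest in candidate_splits(word[i:], lexicon, min_len):
                    splits.append([prefix] + rest)
        return splits

    def score_split(parts, lexicon):
        """Stand-in for the maxent score: log frequency of each hypothesized
        morpheme plus a penalty on the number of predicted morphemes."""
        morpheme_penalty = -1.0   # hypothetical feature weight
        return (sum(math.log(lexicon[p]) for p in parts)
                + morpheme_penalty * len(parts))

    # Toy frequencies; only words longer than 6 letters are split, as above.
    lexicon = {"müll": 500, "tonnen": 300, "tonne": 280, "anzahl": 400,
               "mülltonne": 80, "mülltonnen": 60, "mülltonnenanzahl": 1}
    word = "mülltonnenanzahl"
    if len(word) > 6:
        for parts in candidate_splits(word, lexicon):
            print(parts, round(score_split(parts, lexicon), 2))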
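The memory savings of the randomized LM come from Bloom filters, which the minimal sketch below illustrates. It is deliberately simplified: RandLM stores quantized n-gram statistics in a Bloom-filter-like structure rather than plain set membership, and this class is not its API. k salted hashes set k bits per key; lookups may return false positives but never false negatives, which is what buys the constant, small memory footprint.

    import hashlib

    class BloomFilter:
        def __init__(self, num_bits, num_hashes):
            self.num_bits, self.num_hashes = num_bits, num_hashes
            self.bits = bytearray(num_bits // 8 + 1)

        def _positions(self, key):
            # Derive num_hashes bit positions by salting one hash function.
            for i in range(self.num_hashes):
                h = hashlib.md5(f"{i}:{key}".encode()).digest()
                yield int.from_bytes(h[:8], "big") % self.num_bits

        def add(self, key):
            for pos in self._positions(key):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, key):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(key))

    bf = BloomFilter(num_bits=1 << 20, num_hashes=3)
    bf.add("the number of")
    assert "the number of" in bf      # never a false negative
    # An unseen n-gram is probably rejected, but may collide (false positive):
    print("für die anzahl" in bf)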
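Finally, the sentence-specific grammar strategy amounts to streaming one small grammar per input sentence instead of holding the full rule table (or the corpus plus suffix array) in memory. A sketch, with a hypothetical decoder object and file layout rather than cdec's actual mechanism:

    import gzip, os

    def decode_corpus(sentences, grammar_dir, decoder):
        """Load each sentence's pre-extracted grammar just before decoding it,
        so only one sentence's rules are ever resident in memory."""
        for i, sent in enumerate(sentences):
            path = os.path.join(grammar_dir, "grammar.%d.gz" % i)
            with gzip.open(path, "rt", encoding="utf-8") as f:
                rules = [line.rstrip("\n") for line in f]
            yield decoder.decode(sent, rules)   # rules freed on next iteration

Holding only one sentence's rules at a time matches the table above: decoder memory drops sharply while per-sentence decoding time is essentially unchanged.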