The University of Maryland Statistical Machine Translation System
Vladimir Eidelman†‡, Chris Dyer†⋆, Philip Resnik†⋆
‡ Department of Computer Science   ⋆ Department of Linguistics   † Institute for Advanced Computer Studies
University of Maryland, College Park
{vlad,redpony,resnik}@umiacs.umd.edu

Baseline System Description

Overall Task
- German-to-English translation
- Constrained data condition

Translation Model
- Hierarchical phrase-based translation model learned from the provided Europarl and News Commentary parallel training data
- Based on synchronous context-free grammar (SCFG) rules
- Non-terminal span limit of 12 for non-glue grammars
- Grammar extracted using a suffix array rule extractor [3]
- Features:
  1. rule relative frequency P(e|f)
  2. target n-gram language model P(e)
  3. 'pass-through' penalty when passing a source word to the target side untranslated
  4. lexical translation probabilities P_lex(e|f) and P_lex(f|e)
  5. counts of the number of times arity-0, arity-1, and arity-2 SCFG rules were used
  6. count of the total number of rules used
  7. source word penalty
  8. target word penalty
  9. segmentation model cost
  10. count of the number of times the glue rule is used

Language Model
- SRILM 5-gram language model
- Modified Kneser-Ney smoothing, estimated from the English monolingual training data and the non-Europarl portions of the parallel data

Framework
- cdec [2]
- A modular open-source framework for aligning, training, and decoding with a number of different translation models
- Integrates a translation model with different language models, pruning strategies, and inference algorithms
- Accepts input as a string, lattice, or context-free forest
- Produces output as the full translation forest without any pruning

Data Preparation
- Sentence boundaries are explicitly annotated, e.g. [X] → ⟨s⟩ leben ||| ⟨s⟩ lives
- Data description (all results reported on the Test set):

    Set    Corpus          Size
    Dev    news-test2009   2525
    Test   news-test2010   2489

Viterbi Envelope Semiring Training (VEST)

Motivation
- Semiring operations can be overloaded to compute standard quantities in a generic framework
- A semiring is a useful mathematical abstraction defining two general operations, addition and multiplication

Implementation
- System parameters are optimized on the dev set using VEST, an implementation of minimum error rate training (MERT) within a semiring
- The MERT line search is computed with the following semiring (a sketch of these envelope operations appears after the MBR section below):

    Semiring element   VEST instantiation
    K                  set of line segments
    ⊕                  union of line segments
    ⊗                  Minkowski addition of the lines

- The error function is case-insensitive BLEU
- Training uses cube pruning with 100 candidates at each node; at decoding time the limit is raised to 1000

Minimum Bayes Risk Decoding

Motivation
- VEST uses the maximum derivation decision rule
- Rescoring with a minimum risk decision rule, which minimizes expected loss, is empirically useful

Implementation
- The posterior distribution is estimated from a unique 500-best list of hypotheses (sketched below)
- The posterior scaling factor α was optimized for BLEU on a held-out development set
- Performance:

    Language Model   Decoder   BLEU   TER
    RandLM           Max-D     22.4   69.1
    RandLM           MBR       22.7   68.8
    SRILM            Max-D     23.1   68.0
    SRILM            MBR       23.4   67.7
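The line-search semiring above can be made concrete with a minimal Python sketch. This is an illustration under simplifying assumptions, not cdec's implementation; Line, oplus, otimes, and upper_envelope are hypothetical names. Each derivation's model score along the search direction is a line: ⊕ unions two sets of lines (alternative derivations), ⊗ performs Minkowski addition (slopes and intercepts of concatenated fragments add), and reducing a line set to its upper envelope yields the piecewise-linear function whose vertices MERT sweeps.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Line:
        slope: float      # change in model score along the search direction
        intercept: float  # model score at the current weight vector

    def oplus(a, b):
        """Semiring addition: union of line sets (alternative derivations)."""
        return a | b

    def otimes(a, b):
        """Semiring multiplication: Minkowski addition of the lines; scores
        of concatenated derivation fragments add, so slopes and intercepts add."""
        return {Line(x.slope + y.slope, x.intercept + y.intercept)
                for x in a for y in b}

    def upper_envelope(lines):
        """Keep only lines that are maximal somewhere: the piecewise-linear
        upper envelope whose vertices are the candidate points of the search."""
        hull = []
        for l in sorted(lines, key=lambda ln: (ln.slope, ln.intercept)):
            while hull:
                p = hull[-1]
                if p.slope == l.slope:   # parallel lines: keep the higher one
                    hull.pop()
                    continue
                if len(hull) >= 2:
                    q = hull[-2]
                    x_lp = (l.intercept - p.intercept) / (p.slope - l.slope)
                    x_pq = (p.intercept - q.intercept) / (q.slope - p.slope)
                    if x_lp <= x_pq:     # p is never on top, so drop it
                        hull.pop()
                        continue
                break
            hull.append(l)
        return hull

Sweeping the envelope's intersection points while tracking the error statistics in each interval then gives the exact minimum-error point along the chosen direction.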
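Similarly, the minimum risk rescoring can be sketched over a k-best list. Again a hedged illustration rather than the system's code: mbr_decode and sentence_bleu are invented helpers, and the add-one smoothed sentence-level BLEU here merely stands in for the loss actually used. The posterior is p(e) ∝ exp(α · score), and the hypothesis with the highest expected BLEU gain (equivalently, lowest expected loss) is chosen.

    import math
    from collections import Counter

    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def sentence_bleu(hyp, ref, max_n=4):
        """Smoothed sentence-level BLEU of hyp against one pseudo-reference."""
        if not hyp or not ref:
            return 0.0
        log_prec = 0.0
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            match = sum((h & r).values())              # clipped n-gram matches
            log_prec += math.log((match + 1) / (max(sum(h.values()), 1) + 1))
        bp = min(0.0, 1.0 - len(ref) / len(hyp))       # log brevity penalty
        return math.exp(bp + log_prec / max_n)

    def mbr_decode(kbest, alpha=0.5):
        """kbest: list of (tokens, model_score) pairs from a unique k-best list.
        Costs O(k^2) sentence-BLEU evaluations (k = 500 in the system above)."""
        m = max(score for _, score in kbest)
        weights = [math.exp(alpha * (score - m)) for _, score in kbest]
        z = sum(weights)
        posteriors = [w / z for w in weights]          # p(e) ∝ exp(alpha * score)
        best, best_gain = None, float("-inf")
        for hyp, _ in kbest:
            gain = sum(p * sentence_bleu(hyp, e)
                       for (e, _), p in zip(kbest, posteriors))
            if gain > best_gain:
                best, best_gain = hyp, gain
        return best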
Compound Segmentation Lattices

Motivation
- German possesses a rich inflectional morphology, productive compounding, and significant word reordering
- Instead of translating a single representation of the input, we encode alternative ways of segmenting compound words in a word lattice
- The decoder automatically chooses which segmentation is best for translation, leading to markedly improved results [1]
- Segmentation variants of the raw input are encoded in the word lattice:

    Raw text:   die Mülltonnenanzahl für ...
    Lattice:    alternative paths such as
                die müll tonnen anzahl für ...
                die mülltonnen anzahl für ...
                die mülltonnenanzahl für ...
                (with further variants using tonne and mülltonne)
    Reference:  the number of rubbish bins for ...

Lattice Construction
- A maximum entropy model of compound word splitting is created
- Features: frequency of hypothesized morphemes as separate units, number of predicted morphemes, number of letters in a predicted morpheme
- Parameters are learned by maximizing conditional log-likelihood on a small amount of manually created reference lattices
- Dev/test lattices are created with segmentations of words longer than 6 letters, and unlikely paths are pruned using max-marginals (a candidate-splitting sketch appears after the references)

Bloom Filter LM

Motivation
- LM complexity causes a trade-off between translation quality and decoder memory usage and speed
- Delays are caused when the LM's size necessitates a remote language model server
- Randomized language models (RandLM) [4], based on Bloom filters, can be used locally

Implementation
- The existing SRILM is converted directly into a RandLM using the default settings (a minimal Bloom filter sketch appears after the references)
- Performance:

    Language Model   BLEU   TER
    RandLM           22.4   69.1
    SRILM            23.1   68.0

Grammar Extraction

Motivation
- SCFGs are expressive at the cost of a large number (millions) of rules
- Memory has been required either for the grammar, when extracted beforehand, or for the corpus, when using suffix arrays

Implementation
- Sentence-specific grammars are extracted and loaded on an as-needed basis by the decoder (sketched after the references)
- Performance:

    Language Model   Grammar    Decoder memory (GB)   Decoder time (sec/sent)
    Local SRILM      corpus     14.293 ± 1.228         5.254 ± 3.768
    Local SRILM      sentence   10.964 ± 0.964         5.517 ± 3.884
    Remote SRILM     corpus      3.771 ± 0.235        15.252 ± 10.878
    Remote SRILM     sentence    0.443 ± 0.235        14.751 ± 10.370
    RandLM           corpus      7.901 ± 0.721         9.398 ± 6.965
    RandLM           sentence    4.612 ± 0.699         9.561 ± 7.149

- Sentence grammars massively reduce the memory footprint with no impact on decoding speed
- There is a marked trade-off between memory usage and decoding speed:
  - Remote SRILM reduces the memory footprint substantially, but at the cost of significantly slower decoding
  - Local SRILM decodes faster but accordingly increases memory usage
  - RandLM reduces memory and decodes (relatively) fast at the price of somewhat decreased translation quality

References

[1] C. Dyer. Using a maximum entropy model to build segmentation lattices for MT. In Proceedings of NAACL-HLT, 2009.
[2] C. Dyer, A. Lopez, J. Ganitkevitch, J. Weese, F. Ture, P. Blunsom, H. Setiawan, V. Eidelman, and P. Resnik. cdec: A decoder, alignment, and learning framework for finite-state and context-free translation models. In Proceedings of ACL System Demonstrations, 2010.
[3] A. Lopez. Hierarchical phrase-based translation with suffix arrays. In Proceedings of EMNLP, pages 976–985, 2007.
[4] D. Talbot and M. Osborne. Randomised language modelling for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, June 2007.

http://www.umiacs.umd.edu/labs/CLIP/
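Returning to the compound segmentation lattices above: the following sketch enumerates candidate splits of a compound and scores them with a simple frequency-based stand-in for the trained maximum entropy model. The lexicon, the weight, and the function names are invented for illustration; in the system the scored variants are combined into a lattice whose unlikely paths are pruned with max-marginals.

    import math

    def candidate_splits(word, lexicon, min_len=3):
        """Enumerate segmentations of `word` whose parts all occur in
        `lexicon` (a dict from morpheme to corpus frequency)."""
        if not word:
            return [[]]
        splits = []
        for i in range(min_len, len(word) + 1):
            prefix = word[:i]
            if prefix in lexicon:
                for rest in candidate_splits(word[i:], lexicon, min_len):
                    splits.append([prefix] + rest)
        return splits

    def score_split(parts, lexicon):
        """Stand-in for the maxent score: log frequency of each hypothesized
        morpheme plus a penalty on the number of predicted morphemes."""
        morpheme_penalty = -1.0   # hypothetical feature weight
        return (sum(math.log(lexicon[p]) for p in parts)
                + morpheme_penalty * len(parts))

    # Toy frequencies; only words longer than 6 letters are split, as above.
    lexicon = {"müll": 500, "tonnen": 300, "tonne": 280, "anzahl": 400,
               "mülltonne": 80, "mülltonnen": 60, "mülltonnenanzahl": 1}
    word = "mülltonnenanzahl"
    if len(word) > 6:
        for parts in candidate_splits(word, lexicon):
            print(parts, round(score_split(parts, lexicon), 2))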
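The memory savings of the randomized LM come from Bloom filters, which the minimal sketch below illustrates. It is deliberately simplified: RandLM stores quantized n-gram statistics in a Bloom-filter-like structure rather than plain set membership, and this class is not its API. k salted hashes set k bits per key; lookups may return false positives but never false negatives, which is what buys the constant, small memory footprint.

    import hashlib

    class BloomFilter:
        def __init__(self, num_bits, num_hashes):
            self.num_bits, self.num_hashes = num_bits, num_hashes
            self.bits = bytearray(num_bits // 8 + 1)

        def _positions(self, key):
            # Derive num_hashes bit positions by salting one hash function.
            for i in range(self.num_hashes):
                h = hashlib.md5(f"{i}:{key}".encode()).digest()
                yield int.from_bytes(h[:8], "big") % self.num_bits

        def add(self, key):
            for pos in self._positions(key):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, key):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(key))

    bf = BloomFilter(num_bits=1 << 20, num_hashes=3)
    bf.add("the number of")
    assert "the number of" in bf      # never a false negative
    # An unseen n-gram is probably rejected, but may collide (false positive):
    print("für die anzahl" in bf)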
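Finally, the sentence-specific grammar strategy amounts to streaming one small grammar per input sentence instead of holding the full rule table (or the corpus plus suffix array) in memory. A sketch, with a hypothetical decoder object and file layout rather than cdec's actual mechanism:

    import gzip, os

    def decode_corpus(sentences, grammar_dir, decoder):
        """Load each sentence's pre-extracted grammar just before decoding it,
        so only one sentence's rules are ever resident in memory."""
        for i, sent in enumerate(sentences):
            path = os.path.join(grammar_dir, "grammar.%d.gz" % i)
            with gzip.open(path, "rt", encoding="utf-8") as f:
                rules = [line.rstrip("\n") for line in f]
            yield decoder.decode(sent, rules)   # rules freed on next iteration

Holding only one sentence's rules at a time matches the table above: decoder memory drops sharply while per-sentence decoding time is essentially unchanged.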