Data-Driven MT

Andy Way, School of Computing
Email: [email protected]

February 11, 2011

A sentence-aligned corpus

E1: Often, in the textile industry, businesses close their plant in Montreal to move to the Eastern townships.
F1: Dans le domaine du textile souvent, dans Montréal, on ferme et on va s'installer dans les Cantons de l'Est.

E2: There is no legislation to prevent them from doing so, for it is a matter of internal economy.
F2: Il n'y a aucune loi pour empêcher cela, c'est de la régie interne.

E3: That is serious.
F3: C'est grave.
1 Why use Statistics?

Why data-driven over rule-based techniques?

• the (relative) failure of rule-based approaches
• the increasing availability of machine-readable text
• the increase in capability of hardware (CPU, memory, disk space) with decrease in cost ...

A prerequisite for Data-Driven MT,¹ that is:

• Example-Based MT (EBMT),
• Statistical MT (SMT),
• Hybrid Models which use some probabilistic processing,

is a parallel corpus (or bitext) of aligned sentences. We could construct one manually, which avoids the considerable problem of trying to align source and target texts. But what if we already have bilingual corpora which are not aligned?! Automation!

¹ And also Translation Memory, which is not MT, but rather CAT.

1.1 Automatic Alignment

Most alignments are one-to-one ... But not always:

A non-exact alignment

E1: Hon. members opposite scoff at the freeze suggested by this party; to them it is laughable.
F1: Les députés d'en face se moquent du gel qu'a proposé notre parti.
F2: Pour eux, c'est une mesure risible.

So some (like this) have one sentence correspond to two, or more, or none. Or there may be a two-to-two sentence alignment without there being a one-to-one relation between the component sentences.

1.2 Good Language Models come from Good Corpora

Any statistical approach to MT requires the availability of aligned bilingual corpora which are:

• large;
• good-quality;
• representative.
Class Questions

A Small Corpus

Mary and John have two children.
The children that Mary and John have are aged 3 and 4.
John has blue eyes.

Question 1: what's P(have) vs. P(has) in a corpus?
Question 2: what's P(have | John) vs. P(has | John) in a corpus?
Question 3: what's P(have) vs. P(has) in this corpus? What's their relative probability?
Question 4: what's P(have | John) vs. P(has | John) in this corpus?

Now assume the following (small, again) corpus:

Another Small Corpus

Am I right, or am I wrong?
Peter and I are seldom wrong.
I am sometimes right.
Sam and I are often mistaken.

Question 5: What two generalisations would a probabilistic language model (based on bigrams, say) infer from this data, which are not true of English as a whole? Are there any other generalisations that could be inferred?

Question 6: Try to think of some trigrams (and 4-grams, if you can) that cannot be 'discovered' by a bigram model. What you're looking for here is a phrase where the third (or subsequent) word depends on the first word, which in a bigram model is 'too far away' ...

1.3 Bilingual Corpora

This is all monolingual, but the same applies w.r.t. bilingual corpora. We'll look more specifically at these issues when we address EBMT and SMT.

One issue that we can think about now is what corpora we're going to extract our probabilistic language (and translation) models from. What sorts of large, good-quality, representative bilingual corpora exist?

• Canadian Hansards
• Proceedings of the Hong Kong parliament
• Dáil Proceedings

i.e. while the statistical techniques can be applied to any pair of languages, this approach is currently limited to only a few language pairs.

W.r.t. 'representativeness', consider the following from a 9,000-word experiment (Brown et al., 1990) on the Canadian Hansards ...

French        Probability
???           .808
entendre      .079
entendu       .026
entends       .024
entendons     .013

£1,000 question: for what English word are these candidate translations?
£32,000 question: what's ???

Beware of sparse data and unrepresentative corpora!! Ditto poor quality language ... though I'll come back to this one!

If the corpora are small, or of poor quality, or are unrepresentative, then our statistical language models will be poor, so any results we achieve will be poor.
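The corpus counts behind Questions 3 and 4 are easy to check; a minimal sketch over the first toy corpus (whitespace tokenisation and lowercasing are assumptions of the sketch, not part of the questions):

```python
from collections import Counter

corpus = [
    "Mary and John have two children .",
    "The children that Mary and John have are aged 3 and 4 .",
    "John has blue eyes .",
]
sents = [s.lower().split() for s in corpus]

unigrams = Counter(w for s in sents for w in s)
bigrams = Counter(b for s in sents for b in zip(s, s[1:]))

total = sum(unigrams.values())
p_have = unigrams["have"] / total       # 2/25
p_has = unigrams["has"] / total         # 1/25

# Maximum-likelihood bigram estimates: P(w2 | w1) = count(w1 w2) / count(w1)
p_have_after_john = bigrams[("john", "have")] / unigrams["john"]   # 2/3
p_has_after_john = bigrams[("john", "has")] / unigrams["john"]     # 1/3
```

Note how different the conditional estimates are from the unigram ones — and how fragile all four numbers are, resting on a 25-token corpus.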
Note that all the sentences in these corpora are well-formed. If, on the other hand, the corpus contains ill-formed input, then that too will skew our probability models ...

1.4 Is the World Wide Web a good corpus to use?

Let's imagine you want to find the correct translation of the French compound groupe de travail. For each of the two French nouns here, there are several translations:
• groupe: cluster, group, grouping, concern, collective
• travail: work, labor, labour

If groupe de travail is not in our dictionary (cf. the proliferation of compounds), but the two composite nouns are, then any compositional translation is multiply ambiguous.

Now here's the trick: let's search for all 15 possible pairs on the WWW. We might find:

• labour cluster: 2
• labour concern: 9
• .....
• work group: 66593

There are at least two ways in which we could attempt to resolve the potential ambiguity:

• simply take the most often occurring term as the translation;
• see which candidate translations surpass some threshold (in terms of relative frequency, say).

cf. share price, stock price: assuming these two massively outrank all other candidates, we can keep both as synonymous translations.

So, we can use the WWW to create a very large bilingual lexicon ...
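The selection step can be sketched as follows; the hit counts are stand-ins — only the three figures quoted above are from the slide, the rest default to 0, and the 5% relative-frequency threshold is an arbitrary choice:

```python
from itertools import product

groupe = ["cluster", "group", "grouping", "concern", "collective"]
travail = ["work", "labor", "labour"]

# Hypothetical web hit counts: only these three figures are from the slide;
# unlisted pairs default to 0 for illustration.
hits = {"labour cluster": 2, "labour concern": 9, "work group": 66593}

# English compounds are head-final, so 'groupe de travail' pairs as
# <travail-translation> <groupe-translation>.
candidates = [f"{t} {g}" for g, t in product(groupe, travail)]  # 15 pairs
counts = {c: hits.get(c, 0) for c in candidates}

# Strategy 1: the most frequent candidate wins.
best = max(counts, key=counts.get)

# Strategy 2: keep every candidate above a relative-frequency threshold,
# which lets near-synonyms (cf. share price / stock price) survive together.
total = sum(counts.values())
kept = [c for c, n in counts.items() if total and n / total > 0.05]
```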
1.5 Can we use the WWW to extract parallel corpora?

What are the problems? There are lots of pages where you can click on a link to get a version of that page in a different language — how do we find these?

Query a search engine for a string "English" 'not too far away' from the string "Spanish". This might get you things like:

• Click here for English version vs. Click here for Spanish version

but also:

• English literature vs. Spanish literature

What then? We need a process to evaluate these candidates. Using sthg. like diff, you could align each case and manually remove poor candidates:

    sunsite{away}691: diff a b
    1c1
    < Mary and John have two children.
    ---
    > Maria und Johannes haben zwei Kinder.
    3c3
    < The children that Mary and John have are aged 3 and 4.
    ---
    > Die Kinder, die Maria und Johannes haben, sind 3 und 4 Jahre alt.
    5c5
    < John has blue eyes.
    ---
    > Johannes hat blaue Augen.

Or automate it fully using string compare routines measured against some % threshold ....
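One way to cash out the '% threshold' idea is a crude structural comparison of two candidate pages: same number of non-empty lines, and similar line lengths. Everything here — the per-line length ratio and the 0.6 threshold — is an illustrative assumption, not a method from the text:

```python
def looks_parallel(text_a, text_b, threshold=0.6):
    """Crude check that two candidate pages might be translations:
    same number of non-empty lines, and each line pair similar in length.
    The 0.6 average-ratio threshold is an arbitrary illustrative choice."""
    lines_a = [l for l in text_a.splitlines() if l.strip()]
    lines_b = [l for l in text_b.splitlines() if l.strip()]
    if not lines_a or len(lines_a) != len(lines_b):
        return False
    ratios = [min(len(a), len(b)) / max(len(a), len(b))
              for a, b in zip(lines_a, lines_b)]
    return sum(ratios) / len(ratios) >= threshold

en = "Mary and John have two children.\nJohn has blue eyes."
de = "Maria und Johannes haben zwei Kinder.\nJohannes hat blaue Augen."
```

Length-based screening like this anticipates the sentence-alignment algorithms of the next section, which make the same bet — that translations are not dissimilar in length — but probabilistically.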
2 Sentential Alignment

Manual construction of aligned corpora, a sine qua non for probabilistic techniques, also avoids the considerable problem of trying to align source and target texts. But what if we already have bilingual corpora which are not aligned?! Automation!

2.1 Automatic Alignment

We've just seen one (novel) way of automating this process.

Most alignments are one-to-one ... but some are not, as we saw earlier. Some have one sentence correspond to two, or more, or none. Or there may be a two-to-two sentence alignment without there being a one-to-one relation between the component sentences.

Let's look at (some of) the major algorithms for aligning pairs of sentences ...

2.1.1 Gale & Church's Algorithm

Gale & Church (1993): 'A program for aligning sentences in bilingual corpora', in Computational Linguistics 19(1):75–102.

All sentence alignment algorithms are premised on the notion that good candidates are not dissimilar w.r.t. length.

G&C use this length-based notion, together with the probability that a typical translation is one of various many-to-many sentence relations (0-1, 1-0, 1-1, 2-1, 1-2, 2-2 etc).

Distance measure: P(match | δ), where match = 1-1, 0-1 etc., and δ = difference in length.

How do we calculate text length?

• word tokens;
• characters.
G&C use characters. No language maps character by character to another language, so first calculate the ratio of chars. in L1 to chars. in L2.

e.g. English text = 100,000 chars; Spanish text = 110,000 chars; then the scaling ratio = 1.1.

Question: what's the difference in length betw. an English text of 50 chars. and a Spanish text of 56 chars. given this scaling ratio?

Typically, the longer the text, the bigger the difference, so we need to normalize the difference in length to take account of longer texts:

    δ = (lt − c·ls) / √(lm · s²)

where lt and ls are the lengths of the target and source texts respectively, lm is the average of the two lengths, c is the scaling factor, and s² is the variance for all values of δ in a typical aligned corpus, which G&C suggest is 6.8.

Terms like P(match | δ) are called conditional probabilities.

Question: P(throwing at least 7 with 2 dice)?
Question: P(throwing at least 7 | 1st die thrown was a 6)?

Bayes' Theorem is used to relate these probabilities:

    P(match | δ) = P(δ | match) · P(match) / P(δ)

P(δ) is a normalizing constant, so it can be ignored. P(match) is estimated from correctly aligned corpora — G&C give one-to-one alignments a probability of 0.89, with the rest of the probability space assigned to other possible alignments. P(δ | match) has to be estimated (I'll spare you the details, cf. Trujillo, 1999:71 if you're feeling masochistic!!).

Comment: Aligning complete texts is computationally expensive. Luckily, we can use paragraphs, sections, headings, chapters etc. to identify chunks of text and align these smaller elements. That is, we're looking for anchors which can identify text chunks in both languages. Other anchors can be tags, proper names, acronyms, cognates, figures, dates ....

Results? G&C report error rates of 4.2% on 1316 alignments. Most errors occur in non-1-1 alignments. Selection of the best 80% of alignments reduces the error rate to 0.7%. Interestingly, for European languages at least, the algorithm is not sensitive to different values for c and s², but 'noisy' texts or texts from very different languages can degrade performance.
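These two formulas fit together in a few lines. In the sketch below, the 0.89 one-to-one prior, the 1.1 scaling ratio and the variance s² = 6.8 are from the text; the split of the remaining prior mass and the standard-normal stand-in for P(δ | match) are illustrative assumptions (G&C estimate the latter empirically):

```python
import math

# Prior P(match): 0.89 for 1-1 is from the text; the split of the
# remaining 0.11 over the other relations is an illustrative guess.
P_MATCH = {"1-1": 0.89, "1-2": 0.04, "2-1": 0.04,
           "2-2": 0.02, "1-0": 0.005, "0-1": 0.005}

def delta(ls, lt, c=1.1, s2=6.8):
    """Normalized length difference for a source text of ls chars and a
    target text of lt chars (c = scaling ratio, s2 = variance of delta)."""
    lm = (ls + lt) / 2
    return (lt - c * ls) / math.sqrt(lm * s2)

def match_score(match_type, d):
    """Unnormalized P(match | delta) ∝ P(delta | match) · P(match).
    A standard-normal density stands in for P(delta | match) here."""
    p_delta_given_match = math.exp(-d * d / 2) / math.sqrt(2 * math.pi)
    return p_delta_given_match * P_MATCH[match_type]
```

For the 50-vs-56-character question above, `delta(50, 56)` comes out near zero (≈ 0.05): after scaling, the Spanish text is almost exactly as long as expected, so a 1-1 match scores highest.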
2.1.2 Brown et al.'s Algorithm

Brown et al. (1991): 'Aligning sentences in parallel corpora', in Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, University of California, Berkeley, pp.169–176.

Brown et al. measure the length of a text in terms of the number of word tokens. Their method assumes that alignments of text fragments are produced by a Hidden Markov Model (HMM), and the correct alignment is that which maximizes the output of the HMM.

The HMM models alignment by determining probabilities for eight types of alignments, or beads: s, t, st, sst, stt, ¶s, ¶t, ¶s¶t — i.e. sst indicates that two source strings align with one target string, and ¶s signifies that a source paragraph delimiter matches nothing in the target language, i.e. it is deleted.

The s bead has probability P(ls) of emitting a sentence of length ls, which is equal to the probability of finding a sentence of length ls in the source text as a whole.

For st, P(ls, lt) depends on P(ls) and on P(lt | ls). This latter term depends on the ratio r = log(lt/ls):

    P(lt | ls) = α · e^(−(r − µ)² / (2σ²))

µ is the average of r and σ² is its variance. α is a constant which normalizes the result so that the values of P(lt | ls) sum to 1.

For stt or sst, r is calculated by adding the lengths of the two target or the two source strings respectively. The probability of both sentences is calculated by P(lt1)·P(lt2) for stt and P(ls1)·P(ls2) for sst. Calculation of P(ls) and P(lt) comes from a corpus of 1000 sentences, max. length 81 words.

Results? Reported error rates for st beads for English–French are 0.9%.
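The length model can be sketched directly; note that µ, σ² and the example lengths below are invented for illustration (in the paper they are estimated from an aligned corpus):

```python
import math

def p_lt_given_ls(lt, ls, mu, sigma2, alpha=1.0):
    """Brown et al.-style length model:
    P(lt | ls) = alpha * exp(-(r - mu)^2 / (2*sigma2)), with r = log(lt/ls).
    mu and sigma2 come from an aligned corpus; alpha normalizes the
    distribution. The parameter values used below are illustrative only."""
    r = math.log(lt / ls)
    return alpha * math.exp(-((r - mu) ** 2) / (2 * sigma2))

# Assume target sentences run ~10% longer on average (mu = log 1.1 ≈ 0.1).
# A 22-word target for a 20-word source then scores higher than a 30-word one:
near = p_lt_given_ls(22, 20, mu=0.1, sigma2=0.05)
far = p_lt_given_ls(30, 20, mu=0.1, sigma2=0.05)
```

Working with the log of the length ratio means the model penalizes a target twice as long and a target half as long symmetrically, which a raw ratio would not.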
2.1.3 Kay & Röscheisen's Algorithm

Kay & Röscheisen (1993): 'Text-translation alignment', in Computational Linguistics 19(1):121–142.

Text-based alignment assumes a set of anchors (cf. above). Consider a parallel text and an algorithm with access to an English–French dictionary. Two sentences will be aligned iff word translation pairs from the bilingual dictionary are found with sufficient frequency between the two sentences, relative to their size and to other sentences in the corpus.

cf. length-based algorithms. Disadvantage: it requires a bilingual dictionary — always available?!

K&R use the notion of relative sentence position, on the English–French Canadian Hansards.

Example: assume 10 source and 12 target sentences. The initial estimate alignment maps ⟨s1, t1⟩ and ⟨s10, t12⟩. Now let's look at (say) s3. Assume this is aligned with t3 ± 3 sentences, i.e. s3 is estimated to align with t1–t6.
Now produce a set of word alignments by selecting ⟨s, t⟩ word pairs in currently aligned sentences. For example, if file occurs in s3 and s4, and fichier occurs in t2 and t5, then sentence alignments ⟨s3, t2⟩ and ⟨s4, t5⟩ are made and they become further instances of anchors.

Caveat: sentence anchors may not cross, e.g. if we know ⟨s1, t1⟩ and ⟨s3, t2⟩, then s2 can only align with t1 or t2, but nothing after t2, as this is already aligned with s3.

Results? The algorithm converges after 4 or 5 iterations, and on the first 1000 sentences of the Canadian Hansards, all but 7 sentences were aligned correctly. But either time or space complexity was high.
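One induction step of this scheme might be sketched as follows; the toy sentences, the two-entry dictionary, the position window and the 'sufficient frequency' threshold are all assumptions for illustration:

```python
# One induction step in the K&R spirit: among sentence pairs whose relative
# positions are close enough, count word pairs licensed by a bilingual
# dictionary; pairs sharing enough of them become new anchors.

dictionary = {"file": "fichier", "open": "ouvrir"}   # toy dictionary

src = {3: ["open", "the", "file"], 4: ["the", "file", "is", "new"]}
tgt = {2: ["ouvrir", "le", "fichier"], 5: ["le", "fichier", "est", "nouveau"]}

def induce_anchors(src, tgt, window=3, min_hits=2):
    anchors = []
    for i, s_words in src.items():
        for j, t_words in tgt.items():
            if abs(i - j) > window:        # only consider nearby sentences
                continue
            hits = sum(1 for w in s_words if dictionary.get(w) in t_words)
            if hits >= min_hits:           # toy 'sufficient frequency' test
                anchors.append((i, j))
    return anchors
```

On this toy data only ⟨s3, t2⟩ clears the threshold; in the real algorithm the surviving anchors constrain the next iteration's sentence-position estimates, which is what drives convergence.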
2.2 Exact Matches

In Translation Memory (TM) and EBMT systems, we would like to look first for exact matches.

Some non-exact matches:

Different spelling:
    Change the color of the font.
    Change the colour of the font.

Different punctuation:
    Open the file and select the text.
    Open the file, and select the text.

Different inflection:
    Delete the document.
    Delete the documents.

Different numbers:
    Use version 1.1.
    Use version 1.2.

Different formatting:
    Click on OK.
    Click on OK.

2.3 Fuzzy Matches

A Fuzzy Match

New Source Segment:
    The specified operation failed because it requires the Character to be active.

Stored TM Unit:
    EN: The specified language for the Character is not supported on the computer.
    FR: La langue spécifiée pour le Compagnon n'est pas prise en charge par cet ordinateur.

Multiple Fuzzy Matches in Ranking Order

New Source Segment:
    The operation was interrupted because the Character was hidden.

Best Match:
    EN: The operation was interrupted because the Listening key was pressed.
    FR: L'opération a été interrompue car la touche d'écoute a été enfoncée.

2nd Best Match:
    EN: The specified method failed because the Character is hidden.
    FR: La méthode spécifiée a échoué car le Compagnon est masqué.

3rd Best Match:
    EN: The operation was interrupted by the application.
    FR: L'opération a été interrompue par l'application.
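Fuzzy ranking can be sketched with a generic character-based similarity — difflib's SequenceMatcher here, an illustrative stand-in for whatever scoring a real TM engine uses, so the ranking it produces need not match the slide's exactly:

```python
from difflib import SequenceMatcher

# The three TM units from the slide above.
tm = [
    ("The operation was interrupted because the Listening key was pressed.",
     "L'opération a été interrompue car la touche d'écoute a été enfoncée."),
    ("The specified method failed because the Character is hidden.",
     "La méthode spécifiée a échoué car le Compagnon est masqué."),
    ("The operation was interrupted by the application.",
     "L'opération a été interrompue par l'application."),
]

def fuzzy_matches(segment, tm):
    """Rank TM units by surface similarity of their EN side to the new
    source segment, best first."""
    scored = [(SequenceMatcher(None, segment, en).ratio(), en, fr)
              for en, fr in tm]
    return sorted(scored, reverse=True)

new_seg = "The operation was interrupted because the Character was hidden."
ranking = fuzzy_matches(new_seg, tm)
```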
3 Subsentential Alignment

• Terminology alignment
• Word alignment

Words and terms have similar distributions within two texts, and can be expected to occur in the same aligned sentences.

Word alignment −→ bilingual dictionaries.

e.g. if computer occurs in 30 English strings and computadora in 29 Spanish strings and nowhere else, there is a strong chance they are translations of each other. We need to measure co-occurrence. One such measure is Mutual Information:

    I(s; t) = log [ P(s, t) / (P(s) · P(t)) ]

e.g. in an aligned corpus of 100 strings, if store occurs in 30 aligned English sentences, speichern occurs in 20 aligned German sentences, and these alignments coincide in 19 cases, then in this corpus (with logs to base 2):

    I(store; speichern) = log [ (19/100) / ((30/100) · (20/100)) ] = 1.66

• Positive values indicate that the occurrence of one word strongly predicts that the other word will occur;
• a value near 0 indicates that they occur independently;
• Negative values predict that if one word occurs, the other does not.

Problem with MI: it doesn't provide probabilities, so it is difficult to integrate into a statistical MT system.
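The store/speichern figure can be reproduced directly (with logs to base 2):

```python
import math

def mutual_information(n_st, n_s, n_t, n, base=2):
    """I(s; t) = log [ P(s,t) / (P(s) * P(t)) ], with the probabilities
    estimated from co-occurrence counts over n aligned sentence pairs."""
    return math.log((n_st / n) / ((n_s / n) * (n_t / n)), base)

# Counts from the text: 19 joint occurrences, marginals 30 and 20,
# over 100 aligned strings.
mi = mutual_information(19, 30, 20, 100)   # ≈ 1.66
```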
A trick: use MI to align ⟨source, target⟩ words, and then, if a threshold can be determined empirically, rank them. If a Ws has a higher MI score with Wt1 than with Wt2, then ⟨Ws, Wt2⟩ is deleted as a less likely translation.

Another technique relies on the fact that the translation of a source word can be found in 'about the same position' in the target sentence to which it is aligned, or if not then within a fixed distance.

A Simple Aligned Text

EN: Start the operating system.              ES: Comenzar el sistema operativo.
EN: Launch the program via the keyboard.     ES: Empezar el programa mediante el teclado.

Algorithm:

1. estimate word translation and offset probabilities;
2. find the most probable alignment for the given text.

Naïve alignment: for a source word Ws located 25% of the way through a source text, its translation Wt is located 25% of the way through a target text, plus or minus 20 words (say).

In the aligned example text, bold words indicate the beginning and end of an aligned text segment (using a character-based algorithm).

Ignoring function words (in a stoplist), our naïve alignment program might produce the following alignments:

• comenzar ←→ start
• * sistema ←→ start
• * empezar ←→ system
• teclado ←→ keyboard

Two are right, two (the starred pairs) are wrong ... but their correct translations lie close by, and certainly within a window of ±20 words. Assuming enough text, we assume that the pairings ⟨sistema, system⟩ and ⟨empezar, launch⟩ can be confirmed as correct alignments. This leaves fewer candidates to be aligned. For European languages at least, we can assume that words which are close together in the source translate as words which are close together in the target. Also, words at the end of source sentences map to words at the end of target sentences, and ditto for the beginning of strings.

Other techniques?

• Assume that the distance (in terms of characters) between the occurrences of a source word mirrors the distance between the occurrences of a target word;
• Confirm aligned word pairs w.r.t. a third corpus.
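A minimal sketch of such a naïve positional aligner over the example text. The stoplists are assumed, and its mistakes differ from the starred pairs above, but it makes the same point — wrong pairings land close to the right ones:

```python
# Pair each content word at relative position p in the source with the
# content word at the closest relative position in the target.

en_stop = {"the", "via"}     # assumed English stoplist
es_stop = {"el", "mediante"} # assumed Spanish stoplist

en = "start the operating system launch the program via the keyboard".split()
es = "comenzar el sistema operativo empezar el programa mediante el teclado".split()

def content_positions(tokens, stop):
    """(relative position, word) for every non-stoplist token."""
    return [(i / len(tokens), w) for i, w in enumerate(tokens) if w not in stop]

def naive_align(src, tgt, src_stop, tgt_stop):
    pairs = []
    tgt_pos = content_positions(tgt, tgt_stop)
    for p, w in content_positions(src, src_stop):
        _, best = min(tgt_pos, key=lambda pt: abs(pt[0] - p))
        pairs.append((w, best))
    return pairs

pairs = naive_align(en, es, en_stop, es_stop)
```

Here start→comenzar and keyboard→teclado come out right, while operating is wrongly paired with sistema — its correct translation operativo is the very next word, exactly the near-miss behaviour the window-based confirmation step exploits.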
4 Final Remarks

TMs have become popular tools in the translation profession, especially in the localization industry, where manuals and technical documentation need to be translated and updated (leveraged) on a regular basis. Some vendors include alignment programs which allow the generation of TMs from legacy material. Word and terminology alignment technologies are beginning to make an impact in the areas of bilingual lexicography and terminology management.