Data-Driven MT
Andy Way
School of Computing
Email: [email protected]
February 11, 2011

A sentence-aligned corpus

E1: Often, in the textile industry, businesses close their plant in Montreal to move to the Eastern townships.
F1: Dans le domaine du textile souvent, dans Montréal, on ferme et on va s'installer dans les Cantons de l'Est.
E2: There is no legislation to prevent them from doing so, for it is a matter of internal economy.
F2: Il n'y a aucune loi pour empêcher cela, c'est de la régie interne.
E3: That is serious.
F3: C'est grave.

1 Why use Statistics?

Why data-driven over rule-based techniques?
• the (relative) failure of rule-based approaches
• the increasing availability of machine-readable text
• the increase in capability of hardware (CPU, memory, disk space) with decrease in cost
• ...

A prerequisite for Data-Driven MT,1 that is:
• Example-Based MT (EBMT),
• Statistical MT (SMT),
• Hybrid Models which use some probabilistic processing,
is a parallel corpus (or bitext) of aligned sentences.

We could construct such a corpus manually, which avoids the considerable problem of trying to align source and target texts automatically. But what if we already have bilingual corpora which are not aligned?! Automation!

1 And also Translation Memory, which is not MT, but rather CAT.

1.1 Automatic Alignment

Most alignments are one-to-one ... but not always:

A non-exact alignment

E1: Hon. members opposite scoff at the freeze suggested by this party; to them it is laughable.
F1: Les députés d'en face se moquent du gel qu'a proposé notre parti.
F2: Pour eux, c'est une mesure risible.

So some alignments (like this one) have one sentence correspond to two, or more, or none. Or there may be a two-to-two sentence alignment without there being a one-to-one relation between the component sentences.

1.2 Good Language Models come from Good Corpora

Any statistical approach to MT requires the availability of aligned bilingual corpora which are:
• large;
• good-quality;
• representative.
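The one-to-one and one-to-two alignments discussed in 1.1 can be captured with a simple 'bead' structure pairing lists of sentences. A minimal Python sketch using the examples above (the representation itself is my own illustration, not a standard format):

```python
# Each bead pairs a list of source sentences with a list of target
# sentences, so 1-1, 1-2, 2-2 and even 1-0 alignments share one shape.
bitext = [
    (["That is serious."], ["C'est grave."]),                            # 1-1
    (["Hon. members opposite scoff at the freeze suggested by this "
      "party; to them it is laughable."],
     ["Les députés d'en face se moquent du gel qu'a proposé notre parti.",
      "Pour eux, c'est une mesure risible."]),                           # 1-2
]

def alignment_type(bead):
    """Return the alignment pattern of a bead, e.g. '1-1' or '1-2'."""
    src, tgt = bead
    return f"{len(src)}-{len(tgt)}"

print([alignment_type(b) for b in bitext])  # → ['1-1', '1-2']
```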
1.3 Bilingual Corpora

This is all monolingual, but the same applies w.r.t. bilingual corpora. We'll look more specifically at these issues when we address EBMT and SMT. One issue that we can think about now is what corpora we're going to extract our probabilistic language (and translation) models from. What sorts of large, good-quality, representative bilingual corpora exist?
• Canadian Hansards
• Proceedings of the Hong Kong parliament
• Dáil Proceedings
i.e. while the statistical techniques can be applied to any pair of languages, this approach is currently limited to only a few language pairs.

W.r.t. 'representativeness', consider the following from a 9,000-word experiment (Brown et al., 1990) on the Canadian Hansards ...

Am I right, or am I wrong?

French      Probability
???         .808
entendre    .079
entendu     .026
entends     .024
entendons   .013

£1,000 question: for what English word are these candidate translations?
£32,000 question: what's ???

Beware of sparse data and unrepresentative corpora!! Ditto poor-quality language ... though I'll come back to this one!

Class Questions

Assume the following, small corpus:

A Small Corpus

Mary and John have two children.
The children that Mary and John have are aged 3 and 4.
John has blue eyes.

Question 1: what's P(have) vs. P(has) in a corpus?
Question 2: what's P(have | John) vs. P(has | John) in a corpus?
Question 3: what's P(have) vs. P(has) in this corpus? What's their relative probability?
Question 4: what's P(have | John) vs. P(has | John) in this corpus?

Now assume the following (small, again) corpus:

Another Small Corpus

Peter and I are seldom wrong.
I am sometimes right.
Sam and I are often mistaken.

Question 5: What two generalisations would a probabilistic language model (based on bigrams, say) infer from this data, which are not true of English as a whole?
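Questions 3 and 4 can be checked mechanically with relative-frequency estimates. A minimal sketch (lowercased, whitespace tokenisation, no sentence-boundary markers — those choices are my own):

```python
from collections import Counter

# The small corpus from the Class Questions slide.
corpus = [
    "Mary and John have two children",
    "The children that Mary and John have are aged 3 and 4",
    "John has blue eyes",
]

sents = [s.lower().split() for s in corpus]
unigrams = Counter(w for s in sents for w in s)
bigrams = Counter(b for s in sents for b in zip(s, s[1:]))

total = sum(unigrams.values())
# Unigram estimates: P(have), P(has)
print(unigrams["have"] / total, unigrams["has"] / total)
# Bigram estimates: P(have | John), P(has | John)
print(bigrams[("john", "have")] / unigrams["john"],
      bigrams[("john", "has")] / unigrams["john"])
```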
Are there any other generalisations that could be inferred?

Question 6: Try to think of some trigrams (and 4-grams, if you can) that cannot be 'discovered' by a bigram model. What you're looking for here is a phrase where the third (or subsequent) word depends on the first word, which in a bigram model is 'too far away' ...

Note that all the sentences in these corpora are well-formed. If, on the other hand, the corpus contains ill-formed input, then that too will skew our probability models ...

If the corpora are small, or of poor quality, or are unrepresentative, then our statistical language models will be poor, so any results we achieve will be poor.

1.4 Is the World Wide Web a good corpus to use?

Let's imagine you want to find the correct translation of the French compound groupe de travail. For each of the two French nouns here, there are several translations:
• groupe: cluster, group, grouping, concern, collective
• travail: work, labor, labour

If groupe de travail is not in our dictionary (cf. the proliferation of compounds), but the two composite nouns are, then any compositional translation is multiply ambiguous. Now here's the trick: let's search for all 15 possible pairs on the WWW. We might find:
• labour cluster: 2
• labour concern: 9
• .....
• work group: 66593
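One obvious way to use such hit counts: take the pair with the most hits, or keep every pair above some relative-frequency threshold. A minimal sketch (only the counts quoted in the notes are included; the remaining candidate pairs, and the 0.5 threshold, are my own assumptions):

```python
# Hit counts for candidate translations of "groupe de travail".
# A real run would query all 15 candidate pairs.
hits = {
    "labour cluster": 2,
    "labour concern": 9,
    "work group": 66593,
}

def pick_translations(hits, rel_threshold=0.5):
    """Keep every candidate whose count reaches rel_threshold of the top
    count; near-synonymous pairs (cf. share price / stock price) both
    survive, otherwise only the most frequent pair does."""
    top = max(hits.values())
    return [t for t, c in hits.items() if c >= rel_threshold * top]

print(pick_translations(hits))  # → ['work group']
```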
There are at least two ways in which we could attempt to resolve the potential ambiguity:
• simply take the most often occurring term as the translation;
• see which candidate translations surpass some threshold (in terms of relative frequency, say).

cf. share price, stock price: assuming these two massively outrank all other candidates, we can keep both as synonymous translations.

So, we can use the WWW to create a very large bilingual lexicon ...

1.5 Can we use the WWW to extract parallel corpora?

We've just seen one (novel) way of automating this process. What are the problems? There are lots of pages where you can click on a link to get a version of that page in a different language – how do we find these? Query a search engine for a string "English" 'not too far away' from the string "Spanish". This might get you things like:
• Click here for English version vs. Click here for Spanish version
but also:
• English literature vs. Spanish literature

What then? We need a process to evaluate these candidates. Using something like diff, you could align each case and manually remove poor candidates:

sunsite{away}691: diff a b
1c1
< Mary and John have two children.
---
> Maria und Johannes haben zwei Kinder.
3c3
< The children that Mary and John have are aged 3 and 4.
---
> Die Kinder, die Maria und Johannes haben, sind 3 und 4 Jahre alt.
5c5
< John has blue eyes.
---
> Johannes hat blaue Augen.

Or automate it fully using string-compare routines measured against some % threshold ...

2 Sentential Alignment

Manual construction of aligned corpora, a sine qua non for probabilistic techniques, also avoids the considerable problem of trying to align source and target texts. But what if we already have bilingual corpora which are not aligned?! Automation!

2.1 Automatic Alignment

Most alignments are one-to-one ... but some are not, as we saw in section 1.1. Some have one sentence correspond to two, or more, or none. Or there may be a two-to-two sentence alignment without there being a one-to-one relation between the component sentences. Let's look at (some of) the major algorithms for aligning pairs of sentences ...

2.1.1 Gale & Church's Algorithm

Gale & Church (1993): 'A program for aligning sentences in bilingual corpora', Computational Linguistics 19(1):75–102.

All sentence alignment algorithms are premised on the notion that good candidates are not dissimilar w.r.t. length. G&C use this length-based notion, together with the probability that a typical translation is one of various many-to-many sentence relations (0-1, 1-0, 1-1, 2-1, 1-2, 2-2 etc.).

Distance measure: P(match | δ), where match = 1-1, 0-1 etc., and δ = difference in length.

How to calculate text length?
• word tokens;
• characters.

G&C use characters. No language maps character by character to another language, so we first calculate the ratio of characters in L1 to characters in L2, e.g. if an English text = 100,000 chars and the corresponding Spanish text = 110,000 chars, then the scaling ratio c = 1.1.

Question: what's the difference in length between an English text of 50 chars and a Spanish text of 56 chars, given this scaling ratio?

Typically, the longer the text, the bigger the difference, so we need to normalize the difference in length to take longer texts into account:

    δ = (lt − c·ls) / √(lm · s²)

where lt and ls are the lengths of the target and source texts respectively, lm is the average of the two lengths, c is the scaling factor, and s² is the variance for all values of δ in a typical aligned corpus, which G&C suggest is 6.8.

Terms like P(match | δ) are called conditional probabilities.
Question: P(throwing at least 7 with 2 dice)?
Question: P(throwing at least 7 | the 1st die thrown was a 6)?

Bayes' Theorem is used to relate these probabilities:

    P(match | δ) = P(δ | match) · P(match) / P(δ)

P(δ) is a normalizing constant, so it can be ignored. P(match) is estimated from correctly aligned corpora – G&C give one-to-one alignments a probability of 0.89, with the rest of the probability space assigned to the other possible alignments. P(δ | match) has to be estimated (I'll spare you the details; cf. Trujillo, 1999:71 if you're feeling masochistic!!).

Comment: Aligning complete texts is computationally expensive. Luckily, we can use paragraphs, sections, headings, chapters etc. to identify chunks of texts and align these smaller elements. That is, we're looking for anchors which can identify text chunks in both languages. Other anchors can be tags, proper names, acronyms, cognates, figures, dates ...

Results? G&C report error rates of 4.2% on 1316 alignments. Most errors occur in non-1-1 alignments. Selection of the best 80% of alignments reduces the error rate to 0.7%. Interestingly, for European languages at least, the algorithm is not sensitive to different values for c and s², but 'noisy' texts or texts from very different languages can degrade performance.

2.1.2 Brown et al.'s Algorithm

Brown et al. (1991): 'Aligning sentences in parallel corpora', in Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, University of California, Berkeley, pp.169–176.

Brown et al. measure the length of a text in terms of the number of word tokens. Their method assumes that alignments of text fragments are produced by a Hidden Markov Model (HMM), and the correct alignment is that which maximizes the output of the HMM. The HMM models alignment by determining probabilities for eight types of alignments, or beads:

    s, t, st, sst, stt, ¶s, ¶t, ¶s¶t

i.e. sst indicates that two source strings align with one target string, and ¶s signifies that a source paragraph delimiter matches nothing in the target language, i.e. it is deleted.

The s bead has probability P(ls) of emitting a sentence of length ls, which is equal to the probability of finding a sentence of length ls in the source text as a whole. For st, P(ls, lt) depends on P(ls) and on P(lt | ls). This latter term depends on the ratio r = log(lt/ls):

    P(lt | ls) = α e^(−(r−µ)² / (2σ²))

where µ is the mean of r and σ² is its variance. α is a constant which normalizes the result so that Σ P(lt | ls) = 1. For sst or stt, r is calculated by adding the lengths of the two source or target strings respectively, and the probability of the two sentences is calculated as P(ls1)·P(ls2) or P(lt1)·P(lt2). Calculation of P(ls) and P(lt) comes from a corpus of 1000 sentences, with a maximum length of 81 words.

Results? Reported error rates for st beads for English–French are 0.9%.

2.1.3 Kay & Röscheisen's Algorithm

Kay & Röscheisen (1993): 'Text-translation alignment', Computational Linguistics 19(1):121–142.

Text-based alignment assumes a set of anchors (cf. above). Consider a parallel text and an algorithm with access to an English–French dictionary. Two sentences will be aligned iff word translation pairs from the bilingual dictionary are found with sufficient frequency between the two sentences, relative to their size and to other sentences in the corpus; cf. the length-based algorithms. Disadvantage: this requires a bilingual dictionary – is one always available?!

K&R use the notion of relative sentence position, applied to the English–French Canadian Hansards. Example: assume 10 source and 12 target sentences. The initial estimate alignment maps ⟨s1, t1⟩ and ⟨s10, t12⟩. Now let's look at (say) s3. Assume this is aligned with t3 ± 3 sentences, i.e. s3 is estimated to align with t1−6. Now produce a set of word alignments by selecting ⟨s, t⟩ word pairs in currently aligned sentences. For example, if file occurs in s3 and s4, and fichier occurs in t2 and t5, then sentence alignments ⟨s3, t2⟩ and ⟨s4, t5⟩ are made, and they become further instances of anchors.

Caveat: sentence anchors may not cross, e.g. if we know ⟨s1, t1⟩ and ⟨s3, t2⟩, then s2 can only align with t1 or t2, but nothing after t2, as this is already aligned with s3.
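The relative-position estimate in the worked example (10 source, 12 target sentences) can be sketched as follows. This is my own illustration: the notes simply take t3 ± 3 for s3, whereas this sketch interpolates the centre linearly between the two corpus-end anchors, so the windows only roughly match:

```python
def target_window(i, n_src, n_tgt, radius=3):
    """Estimate which target sentences source sentence i (1-based) may
    align with: interpolate its relative position into the target text,
    then allow a slack of `radius` sentences either side."""
    centre = round(i * n_tgt / n_src)
    return max(1, centre - radius), min(n_tgt, centre + radius)

# The corpus-initial and corpus-final anchors fall out directly:
print(target_window(1, 10, 12), target_window(10, 10, 12))
```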
Results? The algorithm converges after 4 or 5 iterations, and on the first 1000 sentences of the Canadian Hansards, all but 7 sentences were aligned correctly. But either time or space complexity was high.

2.2 Exact Matches

In Translation Memory (TM) and EBMT systems, we would like to look first for exact matches.

Some non-exact matches:

Different spelling:     Change the color of the font.
                        Change the colour of the font.
Different punctuation:  Open the file and select the text.
                        Open the file, and select the text.
Different inflection:   Delete the document.
                        Delete the documents.
Different numbers:      Use version 1.1.
                        Use version 1.2.
Different formatting:   Click on OK.
                        Click on OK.
                        (identical here; the two differ only in formatting, e.g. bold, which is lost in this plain-text rendering)

2.3 Fuzzy Matches

A Fuzzy Match

New Source Segment:
  The specified operation failed because it requires the Character to be active.
Stored TM Unit:
  EN: The specified language for the Character is not supported on the computer.
  FR: La langue spécifiée pour le Compagnon n'est pas prise en charge par cet ordinateur.

Multiple Fuzzy Matches in Ranking Order

New Source Segment:
  The operation was interrupted because the Character was hidden.
Best Match:
  EN: The operation was interrupted because the Listening key was pressed.
  FR: L'opération a été interrompue car la touche d'écoute a été enfoncée.
2nd Best Match:
  EN: The specified method failed because the Character is hidden.
  FR: La méthode spécifiée a échoué car le Compagnon est masqué.
3rd Best Match:
  EN: The operation was interrupted by the application.
  FR: L'opération a été interrompue par l'application.

3 Subsentential Alignment

• Terminology alignment
• Word alignment

Words and terms have similar distributions within two texts, and can be expected to occur in the same aligned sentences.

Algorithm:
1. estimate word translation and offset probabilities;
2. find the most probable alignment for the given text.

Naïve alignment: for a source word Ws located 25% of the way through a source text, its translation Wt is located 25% of the way through the target text, plus or minus 20 words (say).
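A minimal sketch of this naïve positional scheme (function and parameter names are my own; it returns the range of target positions to search rather than a single alignment):

```python
def candidate_positions(i_src, len_src, len_tgt, window=20):
    """For the source word at (0-based) position i_src in a source text of
    len_src words, return the half-open range of target positions to
    search: the same relative position in the target text, +/- window."""
    centre = round(i_src / len_src * len_tgt)
    return max(0, centre - window), min(len_tgt, centre + window)

# A word 25% of the way through 1000 source words, against 1200 target words:
print(candidate_positions(250, 1000, 1200))  # → (280, 320)
```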
Word alignment −→ bilingual dictionaries.

e.g. if computer occurs in 30 English strings and computadora in 29 Spanish strings and nowhere else, there is a strong chance that they are translations of each other. We need to measure co-occurrence. One such measure is Mutual Information:

    I(s; t) = log [ P(s,t) / (P(s)·P(t)) ]

e.g. in an aligned corpus of 100 strings, if store occurs in 30 aligned English sentences, speichern occurs in 20 aligned German sentences, and these alignments coincide in 19 cases, then in this corpus:

    I(store; speichern) = log [ (19/100) / ((30/100)·(20/100)) ] = 1.66

(taking logs to base 2).

• Positive values indicate that the occurrence of one word strongly predicts that the other word will occur;
• a value near 0 indicates that the two words occur independently;
• negative values predict that if one word occurs, the other does not.

Problem with MI: it doesn't provide probabilities, so it is difficult to integrate into a statistical MT system. A trick: use MI to rank candidate ⟨source, target⟩ word pairings: if a word Ws has a higher MI score with Wt1 than with Wt2, then ⟨Ws, Wt2⟩ is deleted as a less likely translation; a threshold can be determined empirically.

In the aligned example text below ('A Simple Aligned Text'), bold words indicate the beginning and end of an aligned text segment (using a character-based algorithm). Ignoring function words (in a stoplist), our naïve alignment program might produce the following alignments:
• comenzar ←→ start
• * sistema ←→ start
• * empezar ←→ system
• teclado ←→ keyboard
Two are right, two are wrong ... but the correct translations lie close by, and certainly within a window of ± 20 words. Assuming enough text, the pairings ⟨sistema, system⟩ and ⟨empezar, launch⟩ can be confirmed as correct alignments. This leaves fewer candidates to be aligned. For European languages at least, we can assume that words which are close together in the source translate as words which are close together in the target. Also, words at the end of source sentences map to words at the end of target sentences, and ditto for the beginning of strings.
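The store/speichern calculation above can be reproduced directly. A minimal sketch (base-2 logs are assumed, which is what yields the 1.66 in the notes):

```python
import math

def mutual_information(n_s, n_t, n_st, n_total):
    """Mutual information (base 2) between a source word and a candidate
    target translation.

    n_s: aligned sentence pairs whose source side contains the word;
    n_t: pairs whose target side contains the candidate translation;
    n_st: pairs where the two co-occur; n_total: corpus size."""
    p_s, p_t, p_st = n_s / n_total, n_t / n_total, n_st / n_total
    return math.log2(p_st / (p_s * p_t))

# store / speichern in a 100-sentence-pair corpus:
mi = mutual_information(30, 20, 19, 100)
print(round(mi, 2))  # → 1.66
```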
Another technique relies on the fact that the translation of a source word can be found in 'about the same position' in the target sentence to which it is aligned, or if not then within a fixed distance.

A Simple Aligned Text

EN: Start the operating system.            ES: Comenzar el sistema operativo.
EN: Launch the program via the keyboard.   ES: Empezar el programa mediante el teclado.

Other techniques?
• Assume that the distance (in terms of characters) between the occurrences of a source word mirrors the distance between the occurrences of a target word;
• Confirm aligned word pairs w.r.t. a third corpus.

4 Final Remarks

TMs have become popular tools in the translation profession, especially in the localization industry, where manuals and technical documentation need to be translated and updated (leveraged) on a regular basis. Some vendors include alignment programs which allow the generation of TMs from legacy material. Word and terminology alignment technologies are beginning to make an impact in the areas of bilingual lexicography and terminology management.