An Introduction to Machine Translation Andy Way, DCU
The Rise & Fall of Different MT Paradigms

Three main approaches to RBMT
[Figure: The Vauquois Pyramid. Labels: source text, ANALYSIS, direct translation, TRANSFER, language-neutral interlingua, GENERATION, target text.]

System Design: Concerns
- Multilingual vs. Bilingual
  - Multilingual:
    - Extreme: Eurotra, i.e. 72 language pairs
    - Modest: EN → DE, FR, ES, i.e. 3 language pairs
    - Intermediate: EN, FR, DE, ES, JP, but not all combinations
  - Bilingual:
    - Unidirectional vs. Bidirectional: EN → FR or FR → EN
    - Reversible vs. Non-reversible:
      - EN ↔ FR: same EN, FR components for Analysis & Generation, and reversible transfer module
      - EN → FR & FR → EN: different EN, FR components for Analysis & Generation, and different transfer modules; NB lack of modularity …
- Direct vs. Transfer vs. Interlingua
- Batch vs. Interactive

Advantages/Disadvantages of Direct Systems
Advantages
- Engine's competence lies in its comparative grammar.
- Highly robust. Does not break down or stop when it encounters unknown words, unknown grammatical constructs, or ill-formed Input.
- Designed for unidirectional translation between one pair of langs. Not conducive to genuine multilingual MT design.
Disadvantages
- 'Word-for-word' translation + local reordering = poor translation, using cheap bilingual dictionary & rudimentary knowledge of target language.
- Linguistically, computationally naive.
- No analysis of internal structure of Input, especially w.r.t. the grammatical relationships between the main parts of sentences.

Advantages/Disadvantages of Interlingual Systems
Advantages
- Intermediate representation (IR) fully specified, i.e. no need to 'look back' at Source in order to generate Target.
- Easy to extend to other langs.
- Built-in back translation: useful for testing.
Disadvantages
- How to define an Interlingua for closely related languages?
- Truly universal Interlingua possible?
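The "easy to extend" point can be made concrete with a little counting. The sketch below is an illustration only, not part of the original slides. It assumes that covering every translation direction with pairwise systems needs one system per ordered language pair, and that an interlingual design needs one analysis and one generation module per language. Both function names are hypothetical.

```python
def ordered_pairs(n: int) -> int:
    """Translation directions among n languages (EN->FR and FR->EN count separately)."""
    return n * (n - 1)


def interlingua_modules(n: int) -> int:
    """Assumed interlingual cost: one analysis plus one generation module per language."""
    return 2 * n


if __name__ == "__main__":
    for n in (2, 4, 9):
        print(n, ordered_pairs(n), interlingua_modules(n))
```

With n = 9, `ordered_pairs` gives 72, the number of language pairs quoted for Eurotra above, against 18 modules under the interlingual assumption.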
Advantages/Disadvantages of Transfer Systems
Advantages
- No language-independent representations: source IR specific to a particular lang., as is the target lang. IR.
- So Complexity of Analysis & Generation components much reduced …
- Also, no necessary equivalence between source and target IRs for the same language!
Disadvantages
- Not so easy to extend to other languages: n analysis modules, n generation modules, n × (n − 1) transfer modules, i.e. not much less than n² …
- No guaranteed built-in back translation.

Direct, or Indirect?
- Direct: From manufacturer's viewpoint, better, as it's more robust …
- Indirect: Falls over more easily. Development phase can be trying. Commercially, must be supplemented with techniques for dealing with unseen Input.

What about Translation Quality?
- Indirect systems clearly better in principle. However, constructing MT engine requires considerable effort.
- Direct Systems can achieve good performance.

Summary
- Research: mostly Transfer-based, with rules automatically acquired from data
- Industrially: we can expect highly-developed Direct Systems to survive for some years to come …

Other Material
- Arnold, D. et al. (1994): Machine Translation - An Introductory Guide; NCC Blackwell, Oxford.
- Hutchins, J. & H. Somers (1992): An Introduction to MT; Academic Press, London.
- Trujillo, A. (1999): Translation Engines; Springer, London.
Newer books include:
- Bowker, L. (2002): Computer-Aided Translation Technology; U. of Ottawa Press.
- Somers, H. (2003): Computers and Translation: A translator's guide; John Benjamins.
- Bond, F. (2005): Translating the Untranslatable; CSLI.
- Quah, C. (2006): Translation and Technology; Palgrave MacMillan.

Why Corpus-Based MT?
- the (relative) failure of rule-based approaches
- the increasing availability of machine-readable text
- the increase in capability of hardware (CPU, memory, disk space) with associated decrease in cost

Corpus-Based MT is here to stay
These approaches are now mainstream:
- Most researchers are developing corpus-based systems;
- First company to use SMT now exists: http://www.languageweaver.com;
- CNGL partner Traslán uses EBMT/SMT hybrid;
- In recent large-scale evaluations, corpus-based MT systems come first.
Two caveats:
- Most industrial systems are still rule-based (but cf. Google's systems now all SMT);
- Current mainstream evaluation metrics favour n-gram-based systems (i.e. bias towards SMT).

Thanks to Kevin Knight …
- Centauri/Arcturan Exercise
- Slides already on CA446 webpage …

Centauri/Arcturan [Knight, 1997]
Your assignment, put these words in order: { jjat, arrat, mat, bat, oloat, at-yurp }
- There are 6! different orders possible, so 720 different translations.
- Best order (according to placement in the TL side of the corpus) is as given above: not just unigrams, but n-grams also …

It's Really Spanish-English!
Clients do not sell pharmaceuticals in Europe => Clientes no venden medicinas en Europa
1a. Garcia and associates .
1b. Garcia y asociados .
2a. Carlos Garcia has three associates .
2b. Carlos Garcia tiene tres asociados .
3a. his associates are not strong .
3b. sus asociados no son fuertes .
4a. Garcia has a company also .
4b. Garcia tambien tiene una empresa .
5a. its clients are angry .
5b. sus clientes estan enfadados .
6a. the associates are also angry .
6b. los asociados tambien estan enfadados .
7a. the clients and the associates are enemies .
7b. los clientes y los asociados son enemigos .
8a. the company has three groups .
8b. la empresa tiene tres grupos .
9a. its groups are in Europe .
9b. sus grupos estan en Europa .
10a. the modern groups sell strong pharmaceuticals .
10b. los grupos modernos venden medicinas fuertes .
11a. the groups do not sell zenzanine .
11b. los grupos no venden zanzanina .
12a. the small groups are not modern .
12b. los grupos pequenos no son modernos .

Some more to try …
- iat lat pippat eneat hilat oloat at-yurp.
- totat nnat forat arrat mat bat.
- wat dat quat cat uskrat at-drubel.
… if you have trouble sleeping at nights!

What have we just seen?
- what parallel corpora look like;
- how relevant parallel corpora are for MT;
- how to build bilingual dictionaries from parallel corpora;
- how cognate information may be useful in MT;
- how to do word alignment.

What else do we need to know?
- about word alignment on a larger scale;
- about phrasal alignment, the norm in real translation data;
- about unknown words;
- the importance of knowing the target language (vs. source) in making fluent translations;
- about locality in word order shifts;
- how to guess the meanings/translations of unknown words;
- about how much uncertainty the machine faces in working with limited data;
- about working on different domains;
- …

Do such methods scale to 'real' MT?
- Availability of monolingual and bilingual corpora?
- Possibility of sentence-aligning bilingual corpora?
- Can we write an algorithm to extract the translation dictionary?
- Can we write an algorithm to extract the monolingual word pair counts?
- Can we write an algorithm to generate translations using our translation dictionary and word pair counts?
WILL THE TRANSLATIONS PRODUCED BE ANY GOOD?
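The questions about extracting a translation dictionary and word pair counts can be explored on the twelve Spanish-English pairs from the exercise. What follows is a minimal sketch under stated assumptions, not the method used in the lecture. The dictionary step scores word pairs with the Dice coefficient over sentence co-occurrence, a simple heuristic substituted here for proper word alignment. The ordering step tries every permutation of a bag of words and counts how many adjacent pairs were seen in the English side. All function names are hypothetical, the text is lower-cased with the final full stop dropped, and sentence 7b is spelled with "clientes" on the assumption that "clients" in the transcript is a typo.

```python
from collections import Counter
from itertools import permutations

# The twelve sentence pairs from the exercise (English, Spanish).
PAIRS = [
    ("garcia and associates", "garcia y asociados"),
    ("carlos garcia has three associates", "carlos garcia tiene tres asociados"),
    ("his associates are not strong", "sus asociados no son fuertes"),
    ("garcia has a company also", "garcia tambien tiene una empresa"),
    ("its clients are angry", "sus clientes estan enfadados"),
    ("the associates are also angry", "los asociados tambien estan enfadados"),
    ("the clients and the associates are enemies",
     "los clientes y los asociados son enemigos"),  # "clientes": assumed typo fix
    ("the company has three groups", "la empresa tiene tres grupos"),
    ("its groups are in europe", "sus grupos estan en europa"),
    ("the modern groups sell strong pharmaceuticals",
     "los grupos modernos venden medicinas fuertes"),
    ("the groups do not sell zenzanine", "los grupos no venden zanzanina"),
    ("the small groups are not modern", "los grupos pequenos no son modernos"),
]


def dice_dictionary(pairs):
    """Map each source word to the target word with the highest Dice score."""
    src_n, tgt_n, both_n = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_n.update(src_words)   # sentences containing each source word
        tgt_n.update(tgt_words)   # sentences containing each target word
        both_n.update((s, t) for s in src_words for t in tgt_words)
    best = {}
    for (s, t), c in both_n.items():
        score = 2 * c / (src_n[s] + tgt_n[t])
        if s not in best or score > best[s][1]:  # ties keep the first candidate
            best[s] = (t, score)
    return {s: t for s, (t, _) in best.items()}


def bigram_counts(sentences):
    """Count adjacent word pairs in monolingual text."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        counts.update(zip(words, words[1:]))
    return counts


def seen_pairs(order, counts):
    """Number of adjacent pairs in `order` that occur in the corpus."""
    return sum(1 for pair in zip(order, order[1:]) if counts[pair] > 0)


def best_orders(bag, counts):
    """Score every permutation of the bag; return the top score and all tied orders."""
    scored = [(seen_pairs(p, counts), p) for p in permutations(bag)]
    top = max(score for score, _ in scored)
    return top, [p for score, p in scored if score == top]


if __name__ == "__main__":
    lexicon = dice_dictionary(PAIRS)
    print(lexicon["groups"], lexicon["sell"], lexicon["associates"])
    counts = bigram_counts(en for en, _ in PAIRS)
    bag = ["sell", "clients", "not", "do", "pharmaceuticals", "in", "europe"]
    top, ties = best_orders(bag, counts)
    print(top, len(ties))
```

On this corpus the heuristic pairs "groups" with "grupos" and "sell" with "venden", but words that occur only once, such as "in" and "europe", score equally well against "en" and "europa". The ordering step shows the same uncertainty: several orders of the bag tie with "clients do not sell pharmaceuticals in europe", because only "do not", "not sell" and "in europe" were ever observed. Exhaustive permutation search is also factorial in the sentence length, so it is only workable for toy inputs like this one.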
Parallel Corpora
Hugely important … but not available in a wide range of language pairs:
- Chinese-English: Hong Kong data
- French-English: Canadian Hansards
- Older EU pairs: Europarl [Koehn 04]
- Newer EU pairs: JRC-Acquis Communautaire, very recently distributed updated Europarl
- Arabic-English: LDC Data
- NIST, IWSLT, TC-STAR Evaluations …

Caveat interpres!
- Beware of sparse data!
- Beware of unrepresentative corpora!
- Beware of poor quality language!
If the corpora are small, or of poor quality, or are unrepresentative, then our statistical language models will be poor, so any results we achieve will be poor.
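The sparse-data warning can be seen directly in a tiny bigram language model. The sketch below is an illustration under assumptions, not something taken from the slides. It trains on four English sentences from the exercise corpus, and it uses add-one smoothing as one simple, swapped-in way of avoiding zero probabilities. The names `train` and `sentence_prob` are hypothetical, and out-of-vocabulary words are not handled properly.

```python
from collections import Counter

# English side of four sentences from the exercise corpus (lower-cased).
CORPUS = [
    "the groups do not sell zenzanine",
    "the modern groups sell strong pharmaceuticals",
    "its clients are angry",
    "its groups are in europe",
]


def train(sentences):
    """Collect unigram and bigram counts, with <s> and </s> marking sentence edges."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def sentence_prob(sentence, unigrams, bigrams, k=0.0):
    """Bigram probability of a sentence: k=0 is the raw relative frequency,
    k=1 is add-one smoothing."""
    vocab = len(unigrams) - 1  # every type except <s> can appear as the next word
    words = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        denom = unigrams[prev] + k * vocab
        if denom == 0:
            return 0.0  # unknown context and no smoothing
        prob *= (bigrams[(prev, word)] + k) / denom
    return prob


if __name__ == "__main__":
    uni, bi = train(CORPUS)
    for s in ("its clients are angry", "its clients sell pharmaceuticals"):
        print(s, sentence_prob(s, uni, bi), sentence_prob(s, uni, bi, k=1.0))
```

Without smoothing, the plausible sentence "its clients sell pharmaceuticals" gets probability zero, simply because "clients sell" never occurred in the training text. Add-one smoothing gives it a small non-zero score, but no smoothing scheme makes up for a corpus that is too small or unrepresentative.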