Counting collocations in translated text
Collocational properties of translated language
Silvia Bernardini
School for Translators and Interpreters, University of Bologna at Forlì
30 July 07
[email protected]

Overview
Collocations
  Brief overview
  Frequency vs. Phraseology
  A note on statistics
Translation studies
  Theory
  Methodology
  Collocation
  Limits
Current study
  Aim
  Method
  Results
  Discussion
  Limits
  Ways forward

What is a collocation?
“[…] I would like to put forward the concept of collocation which I have introduced in my own work. This is the study of key-words, pivotal words, leading words, by presenting them in the company they usually keep – that is to say, an element of their meaning is indicated when their habitual word accompaniments are shown.” (Firth 1956:106-107)
E.g.: English: the English people, English literature, English reserve, English manners, English countryside, the English and all that can be said about them, the English public schools, English Universities (!)
Collocations

Frequency-oriented views
“Significant” collocation is regular collocation between items, such that they occur more often than their respective frequencies and the length of the text in which they occur would predict (Jones and Sinclair 1974:19)
A collocation is a sequence of words that occurs more than once in identical form and is grammatically well-structured (Kjellmer 1987: 133)

Phraseology-oriented views
Restricted collocations are fully institutionalised phrases, memorized as wholes and used as conventional form-meaning pairings (Howarth 1996: 37)

Frequency vs.
Phraseology
Frequency-oriented view:
  Sum of many occurrences in texts
  Position important
  Number of words involved important
  Syntactic relationship can be important
  Frequency/statistics important
Phraseology-oriented view:
  An abstract entity with instantiations in texts (PERFORM + TASK)
  Position/number of constituents not central
  Different restrictions distinguished (DOG+BARK not a collocation)
  Main criterion: semantic unpredictability

2 ways of finding collocations
Starting from a (set of) keyword(s) and looking left and right
  Gledhill (2000): phraseology surrounding “keywords” in different sections of cancer research articles
Selecting all sequences of N words that recur a certain number of times
  Kjellmer (1994): All two-word sequences appearing more than two times in the Brown corpus

A note on statistics
Frequency (Danielsson 2001)
Statistics: pointwise Mutual Information (MI)
Compares the probability of observing x and y together (the joint probability) with the probabilities of observing x and y independently (chance). (Church and Hanks 1990: 77)
Formula:
$$\mathrm{MI}(x;y) = \log_2 \frac{p(xy) \cdot N}{p(x) \cdot p(y)}$$
Limits of MI

Corpus-based TS
Theoretical background
Methodological background
Studies of collocation within TS
Limits

Theoretical background 1
Baker (1993: 243)
The most important task that awaits the application of corpus techniques in translation studies […] is the elucidation of the nature of translated text as a mediated communicative event.
Corpus-based Translation Studies

Theoretical background 2
Toury (1995)
Translation as norm-governed behaviour: ‘translatorship’ amounts first and foremost to being able to play a social role, i.e.
to fulfil a function allotted by a community […] in a way which is deemed appropriate in its own terms of reference (ibid.: 53)

Operationalising it
Studies should be carried out focusing on the nature of translational norms as compared to those governing non-translational kinds of text production (Toury 1995: 61).
Corpus research in TS should focus on the identification of universal features of translation, that is features which typically occur in translated text rather than original utterances and which are not the result of interference from specific linguistic systems. (Baker 1993:243).

“Universals”
Explicitation/explicitness
Simplification
Disambiguation
Levelling out (homogeneity)
Preference for conventional grammar
Avoidance of repetition
Exaggeration of features of the target language
Normalisation/sanitisation
Absence of TL-specific “unique items”
“Shining-through”

Methodological background
Monolingual comparable corpora
  Originals in Language A and comparable translations into Language A
  They should make visible “patterning which is specific to translated texts, irrespective of the source or target languages involved” (Baker 1995: 234).
Parallel corpora
  Originals in Language A and their translations into Language B, usually combined with reference corpora

TS: research on collocation
Olohan (2004): Collocation and moderation
Quite, rather, pretty and fairly in translated vs.
original English fiction
Pretty and rather, and more marginally quite, “are used a lot less in [TEC-Fiction] but, when they are, there is usually more variation in usage than in [BNC-fiction] and less repetition of common collocates”

TS: research on collocation
Øverås (1998): Collocation and explicitation
First 50 sentences of 40 novel extracts (English + Norwegian)
Additions enriching the text with a common target language collocation
  ST: Det var en blanding av vill dristighet og en frøkenaktig, fornem finhet i hans slekt. (a mixture)
  TT: There was a strange mixture of wild boldness and dignified gentility in the family.
A collocational clash in the ST is rendered with a conventional TL combination
  ST: the cook's fat son would play plump tunes on his accordion.
  TT: kokkens fete sønn spille trivelige melodier på trekkspillet sitt. (pleasant tunes)

TS: research on collocation
Kenny (2001): Collocation and sanitisation
Three-way comparison: a parallel corpus (English/German) and reference corpora of SL/TL
Treatment of lexical creativity in translation
Starting points: collocation hapaxes and clusters that are repeated in the work of a single author but not attested in any other texts
Augen ~ trinken
  ich trinke mit gierigen Augen (literally: I drink with greedy eyes)
  translated as: “my avid eyes drank in…”

TS: research on collocation
Baroni and Bernardini (2003): Collocation in MCC
Monolingual comparable corpus of Italian original and translated articles from a single geopolitics journal.
All bigrams from the translated sub-corpus and from the original sub-corpus
Ranked according to their log-likelihood ratio value
“Translated language is repetitive, possibly more repetitive than original language.
Yet the two differ in what they tend to repeat: translations show a tendency to repeat structural patterns and strongly topic-dependent sequences, whereas originals show a higher incidence of topic-independent sequences, i.e. the more usual lexicalised collocations in the language”

TS: research on collocation
Danielsson (2001): Collocation: monolingual & translational
Units of meaning in two large corpora of English and Swedish
Words occurring ≥200 times
Collocates (≥5)
  plugs sockets (6 occurrences)
  headphone sockets (7 occurrences)
  sunken sockets (6 occurrences)
  bulging their sockets (5 occurrences)
Data-sparseness problem: only 2 units of meaning (of the 12,099 previously identified) occur five times or more in the ST component of the parallel fiction corpus (Swedish into English, ~400,000 words per component)

Limits
General limits of MCC
  Variables
  Tools and methods: too crude?
  Excessive downplay of the source text
  Over-generalisation of translation universals
Specific difficulty of collocational studies
  Data-sparseness in relatively small corpora

Collocations: a new approach
Aim and method
Results (monolingual and parallel)
Discussion, limits, ways forward

Research questions
Are translated texts more/less collocational than original texts in the same language
  i.e., are their collocation types overall more/less frequently attested and significant?
If so, is this a consequence of the translation process?
  i.e., can we identify shifts that could account for the observed overall differences?
Aim and method

Intuition
The point is not to look for collocations that repeat themselves frequently within small and hardly comparable “translation-driven” corpora, but to identify those collocations that are frequent and/or significant in the language as a whole.
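The intuition of scoring word pairs found in the small study corpora against frequencies taken from a large reference corpus can be sketched roughly as follows. This is only an illustration under stated assumptions, not the pipeline used in the study (the slides mention the UCS toolkit for that step). The function names, the dictionary layout and every count are hypothetical. Reading the cut-off as MI > 2 with a joint frequency of at least 2 is also an assumption, since the exact frequency threshold is not recoverable from the transcript. The score is the count-based form of pointwise MI, log2 of (f(xy) * N) / (f(x) * f(y)).

```python
import math


def pointwise_mi(f_xy, f_x, f_y, n):
    """Pointwise MI from raw reference-corpus counts.

    f_xy: joint frequency of the pair, f_x / f_y: frequencies of each word,
    n: reference corpus size in tokens. Returns log2(observed / expected).
    """
    return math.log2((f_xy * n) / (f_x * f_y))


def select_candidates(pairs, ref_counts, n, min_mi=2.0, min_fq=2):
    """Keep study-corpus pairs that pass the (assumed) cut-off in the reference corpus.

    pairs: word pairs extracted from the study corpus.
    ref_counts: hypothetical lookup {pair: (f_xy, f_x, f_y)} from the reference corpus.
    Returns (pair, mi, f_xy) tuples sorted by descending MI.
    """
    kept = []
    for pair in pairs:
        counts = ref_counts.get(pair)
        if counts is None:
            continue  # pair not attested in the reference corpus
        f_xy, f_x, f_y = counts
        if f_xy < min_fq or f_x == 0 or f_y == 0:
            continue  # too rare to score reliably
        mi = pointwise_mi(f_xy, f_x, f_y, n)
        if mi > min_mi:
            kept.append((pair, mi, f_xy))
    return sorted(kept, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Invented counts, for illustration only.
    ref = {("spicchi", "aglio"): (16, 1000, 2000), ("basso", "muro"): (1, 1000, 2000)}
    for pair, mi, fq in select_candidates(list(ref), ref, 1_000_000):
        print(pair, round(mi, 3), fq)
```

Keeping the thresholds as parameters makes it easy to check how sensitive a ranking is to the cut-off, which the slides themselves describe as arbitrary.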
2 sets of corpus resources
Study corpora
  Small monolingual comparable corpora of fiction texts (English => Italian; Italian => English)
Reference corpora
  The British National Corpus (100 million words from a variety of sources)
  The Repubblica Corpus (340 million words from a single Italian newspaper)
  The English and Italian Web via Google/Yahoo automatic API queries

Study corpora (fiction)
1. M. Atwood/C. Penati Il racconto dell’ancella
2. M. Atwood/M. Papi Occhio di gatto
3. M. Cruz Smith/P. F. Paolini Gorky Park
4. C. Fowler/S. Bini Nozze di sangue
5. N. Gordimer/F. Cavagnoli Storia di mio figlio
6. G. Greene/B. Oddera Il decimo uomo
7. D. Leavitt/A. Cossiga Un luogo dove non sono mai stato
8. R. Rendell/H. Brinis Oltre il cancello

1. F. Camon La malattia chiamata uomo
2. G. Celati I narratori delle pianure
3. C. Comencini Le pagine strappate
4. L. Blissett Q
5. D. Maraini Donna in guerra
6. G. Pontiggia Il giocatore invisibile
7. G. Tomasi di Lampedusa Il Gattopardo

Corpus preparation
Scanning in
Tokenisation
Tagging (part-of-speech)
Lemmatisation (treetagger)
Metadata annotation
Alignment (easyalign)
Indexing (CorpusWorkBench, CWB)

Extraction of candidates 1
Target sequences
  Lexical collocations
  Made of two words
  Contiguous
Pos-based extraction from study corpora
  Based on literature, e.g.
JN, NN, VN, V * N, N * * N
Collection of frequencies from reference corpora

Extraction of candidates 2
Calculate MI
Rank sequences
Take top
UCS (Evert 2004-2006)
Arbitrary cut-off point: MI>2 and fq2
Calculate significance of difference btw original and translated
Mann-Whitney significance tests

Results (MCC, Mann-Whitney)
J N lit eng (MI; higher in original, p=.08)
N V lit ita (MI; p=.008)
N V lit eng (FQ; p=.05)
V N lit ita (MI; p=.01)
J * J lit ita (MI; p=.06)
N prep/conj N lit ita (MI; p=.007)
N * N lit eng (FQ; p=.06)
N * * N lit ita (FQ; p=.07)
Results

Results for N prep/conj N (lit ita)
          MI original   MI translated
min       2.001         2.000
q1        2.381         2.425
median    2.736         2.853
q3        3.392         3.590
max       5.757         6.059
mean      2.954         3.069
[Box plot: MI for original vs. translated, showing min, q1, median, q3 and max; y-axis from 2 to 6.5]

Results (MCC, quantitative)
                            Translated   Original
Types with MI>2 and fq2     855          691
Total number of types       3853         3971
Tokens (randomly-sampled)   4222         4222

Results (parallel, summary)
Shifts leading to increased “collocativeness”
Shift type                                           Occurrences
Creative => collocational                            7
Collocational => collocational (different meaning)   7
Free => collocational                                11
More explicit                                        86
More formal/precise                                  16
Marginal cases (additions, changes)                  9
Total shifts observed                                136
Total concordance lines analysed                     1,061

Creative => collocational (7)
TT: Ricordo l’odore della terra smossa, il <senso di pienezza> che davano le forme tonde dei bulbi chiusi nella mano
LIT: I remember the smell of the turned earth, the sense of fullness that gave the round shapes of the bulbs held in the hand
ST: I can remember the smell of the turned earth, the plump shapes of bulbs held in the hands, fullness
The handmaid’s tale
TT: Il <rumore dei tacchi> risuonò sulle piastrelle del corridoio.
LIT: the noise of the heels resounded on the tiles of the corridor
ST: Her heels clicked on the hall tiles.
Red bride

Different meaning (7)
TT: Fa collezione di <cartine di sigarette> con disegni di aeroplani, e ne conosce tutti i nomi.
ST: He collects cigarette cards with pictures of airplanes on them, and knows the names of all the planes.
Cat’s eye
Cigarette cards
  Occurrences: BNC: 16; Google: 491,000
  Meaning: Collectible cards found in cigarette packs
Cartine da/per/di/delle sigarette
  Occurrences: Repubblica: 3; Google: 726
  Meaning: Rolling papers, i.e. small sheets of paper which are sold for rolling one's own cigarettes
Figurine da/per/di/delle sigarette
  Occurrences: Repubblica: 0; Google: 2
  Meaning: (by analogy with other products) collectible cards found in cigarette packs

Free => collocational (11)
TT: decorazioni di <spicchi d' aglio>, si rende conto che
ST: handpainted by Alex with purple garlic bulbs, she sees that
A place I’ve never been
Web data
  garlic: 34,300,000 (100%); aglio: 2,580,000 (100%)
  garlic bulbs + bulbs of garlic: 109,600 (0.31%); bulbi d’aglio + bulbi di aglio: 1305 (0.05%)
  garlic heads + heads of garlic: 59,300 (0.17%); teste d’aglio + teste di aglio: 612 (0.02%)
  garlic cloves + cloves of garlic: 2,207,000 (6.43%); spicchi d’aglio + spicchi di aglio: 229,100 (8.87%)

Explicitation (86) - general
TT: All'apertura nel basso <muro di cinta> l'autista esitò, poi accelerò
LIT: At the opening in the low perimeter wall the driver hesitated, then accelerated
ST: He hesitated at the gap in the low wall, then accelerated and went ahead
A place I’ve never been
TT: schiacciato sotto il <tacco della scarpa>, seppellito
LIT: ground away under the heel of the shoe, buried
ST: ground away under my heel, buried
My son’s story

Explicitation (86) - partitives
TT: Non riuscivo a prendere sonno, così sono sceso a bere un <sorso d'acqua>
LIT: I couldn’t sleep, so I came down to drink a gulp of water
ST: I couldn't sleep, so I came down for water
The tenth man
TT: i <raggi del sole> filtrano dalla lunetta sulla porta
LIT: the rays of the sun filter through the fanlight
ST: Sun comes through the
fanlight
The handmaid’s tale

Explicitation (86) - head nouns
TT: manifesti di Bon Jovi e dei Guns' n Roses attaccati con le <puntine da disegno> sul grande mare della parete
ST: Bon Jovi and Guns' n Roses posters thumbtacked into the great sea wall
A place I’ve never been
TT: Osserviamo il <cerume delle orecchie>, il muco del naso e lo sporco tra le dita dei piedi
ST: We look at ear-wax, or snot, or dirt from our toes
Cat’s eye

More formal/more exact (16)
TT: Spostando col piede i <capi di vestiario> sul pavimento, non trovò traccia della prova incriminante.
LIT: items of clothing
ST: Kicking around among the clothes on the floor, he found no trace of the incriminating article.
Red bride
TT: Si stava frugando tra le <pieghe dell'abito>, per prendere il lasciapassare
LIT: folds of the robe
ST: She was fumbling in her robe, for her pass
The handmaid’s tale

Other cases (9)
Adverbs
  TT: Dal <punto di vista> domestico, si adattarono l' uno all' altra
  ST: Domestically they adjusted to one another
  My son’s story
Domestication
  TT: Il cadavere era stato fatto a fettine da una lama larga e pesante, non trovata sul <luogo del delitto>
  ST: The corpse had been slashed to ribbons by a large, heavy blade, not found on the premises.
  Red bride
Gratuitous changes
  TT: del greco c'era anche qualche tavolino con sudici <vasetti di fiori> artificiali e bottiglie di ketchup
  ST: the Greek had a few tables set out with flyspotted artificial flowers and tomato sauce bottles
  My son’s story

Discussion - MCC
Are Italian translated texts more/less collocational than originals?
  Translated texts would seem to be more collocational than originals
A single exception: JN into Eng
  Translated less collocational than original, why?
  Probable shining-through
  Over-representation of collocations with common words
Discussion, limits, ways forward

JN in Eng: shining-through?
Delicate fingers
TT: I put some soft golden apricots as big as eggs on his plate, and watch him split them open, hardly moving his long, <delicate fingers>.
ST: Gli ho messo nel piatto delle albicocche grandi come uova, morbide, dorate, e l'ho osservato mentre le spaccava, muovendo appena le dita lunghe e delicate.
Donna in guerra
Collocation        fq1    fq2    fq1-2   MI       LL
delicate fingers   1646   5346   5       2.7545   53.4624
gentle fingers     2477   5346   12      2.9572   139.5338
slender fingers    701    5346   15      3.6023   219.2139
nimble fingers     101    5346   15      4.4437   279.3528

JN in Eng: common words
TRANSLATED ENGLISH: few days, few evenings, few feet, few followers, few friends, few hours, few jokes, few kilos, few minutes, few months, few paces, few pages, few passengers, few phrases, few rays, few scraps, few seconds, few sentences, few spots, few steps, few stones, few survivors, few weeks, few words, few yards, few years
ORIGINAL ENGLISH: few days, few feet, few hours, few inches, few miles, few minutes, few moments, few months, few seconds, few steps, few tables
Overall frequency of few: translated 133, original 39

Discussion - parallel
Is higher collocativeness a consequence of the translation process?
  Probably…
NB: shifts towards higher collocativeness would appear to be
  partly independent: free => collocational, different meaning (normalisation)
  partly related to other strategies/procedures: explicitation, shining-through

Limits
Just how certain are we that these shifts are the cause of the observed differences?
  Shifts are no doubt observable also in non-significant rankings…
(To what extent) could single author or translator preferences account for these differences?
  The corpora are very small…

Further work
Bottom-up search for regularities
Source-oriented approach
BNC, WWW, ukwac / Repubblica, WWW, itwac
Collocation extraction
Starting from ST collocations
Role of reference corpora
Other genres?
Evaluation of method: no hands!
Creative exploitation of collocations
Can it be automatised?

Thank you
[email protected]

References
Baker, M. 1993. “Corpus linguistics and translation studies”. In Baker et al. (eds) Text and Technology. Benjamins.
Baker, M. 1995. “Corpora in translation studies: An overview and some suggestions for future research”. Target 7, 2.
Baroni, M. and S. Bernardini. 2003. “A preliminary analysis of collocational differences in monolingual comparable corpora”. In Archer et al. (eds) Proceedings of CL 2003. UCREL.
Danielsson, P. 2001. The Automatic identification of meaningful units in language. PhD Thesis. Göteborg University.
Evert, S. 2004-2006. The UCS Toolkit [http://www.collocations.de/software.html]
Firth, J.R. 1956 (1968). “Descriptive linguistics and the study of English”. In Palmer (ed) Selected papers of J.R. Firth 1952-1959. Longman.
Gledhill, C. 2000. Collocations in science writing. Gunter Narr.
Howarth, P. 1996. Phraseology in English academic writing. Max Niemeyer.
Kenny, D. 2001. Lexis and creativity in translation. St. Jerome.
Kjellmer, G. 1987. “Aspects of English collocations”. In Meijs (ed) Corpus Linguistics and Beyond. Rodopi.
Kjellmer, G. 1994. A Dictionary of English collocations. Clarendon Press.
Olohan, M. 2004. Introducing corpora in translation studies. Routledge.
Øverås, L. 1998. “In search of the third code: An investigation of norms in literary translation”. Meta 43, 4.
Sinclair, J. McH. 1991. Corpus, concordance, collocation. Oxford University Press.
Sinclair, J. McH. and S. Jones. 1974. “English lexical collocations”. Cahiers de Lexicologie 24.
Toury, G. 1995. Descriptive translation studies and beyond. Benjamins.
Results of OSS significance testing
Pattern    W          p value     MI/LOG FQ   Higher in
2JN ita    122618     0.002261    MI          Original
2JN eng    165680.5   0.05237     MI          Original
2NJ ita    78109.5    0.001134    MI          Original
2NN eng    19142.5    0.005172    (LOG)FQ     Translated
2RJ eng    7609       0.06921     MI          Original
2RV eng    10458      0.04767     MI          Original
2VR eng    2907       0.01517     (LOG)FQ     Original
3NN ita    21683      0.02607     MI          Original
3NN ita    22066.5    0.01029     (LOG)FQ     Original
3VN eng    11904      0.05694     MI          Original
3NN eng    1910.5     0.0429      (LOG)FQ     Original
4VN eng    1027       0.06974     (LOG)FQ     Original

POS patterns originally searched in the corpora / Rankings selected for significance testing (columns: 2-grams, 3-grams, 4-grams; rows: English, Italian; the cell layout is not recoverable from the transcript)
JJ JN JV NN NV RJ JJ NN NV VN VV RN NN VN VV JJ JN N N N V JJ NN VN
JJ JN NJ NN NV JJ JN NN NV JN NN NV VN VV JN NJ NV VN JJ NN VN RV VJ VN VV VR RR VJ VN VR VJ VN VV RJ RV VN VR NN VN NN VN
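For readers who want to see what a Mann-Whitney comparison of two score distributions involves, here is a minimal pure-Python sketch. It is an illustration only, not the procedure behind the figures in the table: it uses the normal approximation without tie or continuity correction, so its p values are rough for small or heavily tied samples, and the sample MI scores are invented. The function name is hypothetical.

```python
import math


def mann_whitney_u(sample_a, sample_b):
    """Two-sided Mann-Whitney test via the normal approximation.

    Returns (u_a, p), where u_a is the U statistic of sample_a.
    Tied values receive the average of the ranks they span.
    """
    # Pool both samples, remembering which group each value came from.
    pooled = sorted(
        [(value, 0) for value in sample_a] + [(value, 1) for value in sample_b]
    )
    rank_sum_a = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        rank_sum_a += average_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n_a, n_b = len(sample_a), len(sample_b)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u_a - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u_a, p


if __name__ == "__main__":
    # Invented MI scores, for illustration only.
    original = [2.1, 2.4, 2.7, 3.0, 3.4]
    translated = [2.3, 2.9, 3.1, 3.6, 4.0]
    print(mann_whitney_u(original, translated))
```

A rank-based test of this kind is a natural fit for MI and frequency rankings because it makes no assumption that the scores are normally distributed.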