Counting collocations in translated text
Collocational properties of translated language
Silvia Bernardini
School for Translators and Interpreters
University of Bologna at Forlì
30 July 07
Overview
Collocations
  Brief overview
  Frequency vs. Phraseology
  A note on statistics
Translation studies
  Theory
  Methodology
  Collocation
  Limits
Current study
  Aim
  Method
  Results
  Discussion
  Limits
Ways forward

What is a collocation?

“[…] I would like to put forward the concept of collocation
which I have introduced in my own work. This is the
study of key-words, pivotal words, leading words, by
presenting them in the company they usually keep – that
is to say, an element of their meaning is indicated when
their habitual word accompaniments are shown.” (Firth
1956:106-107)

E.g.: English: the English people, English literature, English
reserve, English manners, English countryside, the English and
all that can be said about them, the English public schools,
English Universities (!)
Collocations
Frequency-oriented views

“Significant” collocation is regular collocation
between items, such that they occur more
often than their respective frequencies and
the length of the text in which they occur
would predict (Jones and Sinclair 1974:19)

A collocation is a sequence of words that
occurs more than once in identical form and
is grammatically well-structured
(Kjellmer 1987: 133)
Phraseology-oriented views

Restricted collocations are fully
institutionalised phrases, memorized as
wholes and used as conventional form-meaning pairings (Howarth 1996: 37)
Frequency vs. Phraseology
Frequency
  Sum of many occurrences in texts
  Position important
  Number of words involved important
  Syntactic relationship can be important
  Frequency/statistics important
Phraseology
  An abstract entity with instantiations in texts (PERFORM + TASK)
  Position/number of constituents not central
  Different restrictions distinguished (DOG+BARK not a collocation)
  Main criterion: semantic unpredictability
2 ways of finding collocations
Starting from a (set of) keyword(s) and looking left and right
  Gledhill (2000): phraseology surrounding “keywords” in different sections of cancer research articles
Selecting all sequences of N words that recur a certain number of times
  Kjellmer (1994): all two-word sequences appearing more than two times in the Brown corpus
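The two approaches above can be sketched in a few lines of Python. This is only a minimal illustration, not the procedure used by Gledhill or Kjellmer: the function names `window_collocates` and `recurrent_ngrams`, the `span` and `min_freq` parameters and the toy sentence are all assumptions introduced here.

```python
from collections import Counter

def window_collocates(tokens, keyword, span=2):
    """Keyword approach: count words within `span` tokens left or right of `keyword`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            counts.update(left + right)
    return counts

def recurrent_ngrams(tokens, n=2, min_freq=2):
    """Sequence approach: keep contiguous n-word sequences seen at least `min_freq` times."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {gram: freq for gram, freq in grams.items() if freq >= min_freq}

# Toy input (hypothetical), already lower-cased and tokenised
tokens = "the english people read english literature and the english people talk".split()

print(window_collocates(tokens, "english", span=1))
print(recurrent_ngrams(tokens, n=2, min_freq=2))
```

Calling `recurrent_ngrams(tokens, n=2, min_freq=3)` would correspond to a "more than two times" threshold for two-word sequences.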
A note on statistics


Frequency (Danielsson 2001)
Statistics: pointwise Mutual Information (MI)


Compares the probability of observing x and y together
(the joint probability) with the probabilities of observing x
and y independently (chance).
(Church and Hanks 1990: 77)
Formula
$$\mathrm{MI}(x;y) = \log_2 \frac{p(xy) \cdot N}{p(x) \cdot p(y)}$$
Limits of MI
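A worked sketch of the MI formula may help here. It assumes that p(x), p(y) and p(xy) are read as observed frequencies (counts) and N as the corpus size, which is the reading under which the factor N makes sense. The function name `pointwise_mi` and all counts are hypothetical. The second call illustrates one commonly noted limit of MI: pairs of very rare words receive very high scores.

```python
import math

def pointwise_mi(f_xy, f_x, f_y, n):
    """log2 of the observed joint frequency over the frequency expected by chance."""
    expected = f_x * f_y / n
    return math.log2(f_xy / expected)

# Hypothetical counts in a corpus of one million tokens
moderate = pointwise_mi(f_xy=8, f_x=100, f_y=200, n=1_000_000)

# Two words that each occur once, and occur together (hypothetical)
rare = pointwise_mi(f_xy=1, f_x=1, f_y=1, n=1_000_000)

print(round(moderate, 4), round(rare, 4))
```

When the observed joint frequency equals the expected one, the score is 0; positive scores indicate co-occurrence above chance.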
Corpus-based TS




Theoretical background
Methodological background
Studies of collocation within TS
Limits
Theoretical background 1
Baker (1993: 243)
The most important task that awaits the
application of corpus techniques in translation
studies […] is the elucidation of the nature
of translated text as a mediated
communicative event.
Corpus-based Translation Studies
Theoretical background 2
Toury (1995)
Translation as norm-governed behaviour:
‘translatorship’ amounts first and foremost to
being able to play a social role, i.e. to fulfil a
function allotted by a community […] in a way
which is deemed appropriate in its own terms of
reference (ibid.: 53)
Operationalising it


Studies should be carried out focusing on the nature
of translational norms as compared to those
governing non-translational kinds of text production
(Toury 1995: 61).
Corpus research in TS should focus on the
identification of universal features of translation, that
is features which typically occur in translated text
rather than original utterances and which are not the
result of interference from specific linguistic
systems. (Baker 1993:243).
“Universals”










Explicitation/explicitness
Simplification
Disambiguation
Levelling out (homogeneity)
Preference for conventional grammar
Avoidance of repetition
Exaggeration of features of the target language
Normalisation/sanitisation
Absence of TL-specific “unique items”
“Shining-through”
Methodological background

Monolingual comparable corpora



Originals in Language A and comparable translations into
Language A
They should make visible “patterning which is specific to
translated texts, irrespective of the source or target
languages involved” (Baker 1995: 234).
Parallel corpora

Originals in Language A and their translations into
Language B, usually combined with reference corpora
TS: research on collocation
Olohan (2004): Collocation and moderation


Quite, rather, pretty and fairly in translated vs.
original English fiction
Pretty and rather, and more marginally quite, “are
used a lot less in [TEC-Fiction] but, when they
are, there is usually more variation in usage than
in [BNC-fiction] and less repetition of common
collocates”
TS: research on collocation
Øverås (1998): Collocation and explicitation
 First 50 sentences of 40 novel extracts (English +
Norwegian)
 Additions enriching the text with a common target
language collocation
ST: Det var en blanding av vill dristighet og en frøkenaktig, fornem finhet i
hans slekt.
(a mixture)
TT: There was a strange mixture of wild boldness and dignified gentility in
the family.

A collocational clash in the ST is rendered with a
conventional TL combination
ST: the cook's fat son would play plump tunes on his accordion.
TT: kokkens fete sønn spille trivelige melodier på trekkspillet sitt.
(pleasant tunes)
TS: research on collocation
Kenny (2001): Collocation and sanitisation



Three-way comparison: a parallel corpus
(English/German) and reference corpora of SL/TL
Treatment of lexical creativity in translation
Starting points: collocation hapaxes and clusters
that are repeated in the work of a single author but
not attested in any other texts
Augen ~ trinken
ich trinke mit gierigen Augen
(literally: I drink with greedy eyes)
translated as: “my avid eyes drank in…”
TS: research on collocation
Baroni and Bernardini (2003): Collocation in MCC




Monolingual comparable corpus of Italian original and
translated articles from a single geopolitics journal.
All bigrams from the translated sub-corpus and from the
original sub-corpus
Ranked according to their log-likelihood ratio value
“Translated language is repetitive, possibly more repetitive
than original language. Yet the two differ in what they tend
to repeat: translations show a tendency to repeat structural
patterns and strongly topic-dependent sequences, whereas
originals show a higher incidence of topic-independent
sequences, i.e. the more usual lexicalised collocations in
the language”
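Ranking bigrams by log-likelihood ratio can be sketched as follows. This uses the standard G2 statistic over a 2x2 contingency table and is an assumption about the general technique, not a reproduction of the computation in Baroni and Bernardini (2003). The function name `log_likelihood` and all counts are hypothetical.

```python
import math

def log_likelihood(f_xy, f_x, f_y, n):
    """G2 statistic for a bigram (x, y), from observed vs. expected cell counts."""
    observed = [
        f_xy,                   # x followed by y
        f_x - f_xy,             # x followed by another word
        f_y - f_xy,             # another word followed by y
        n - f_x - f_y + f_xy,   # neither x nor y
    ]
    row_totals = [f_x, f_x, n - f_x, n - f_x]
    col_totals = [f_y, n - f_y, f_y, n - f_y]
    g2 = 0.0
    for o, r, c in zip(observed, row_totals, col_totals):
        expected = r * c / n
        if o > 0:               # an empty cell contributes nothing
            g2 += o * math.log(o / expected)
    return 2 * g2

# Hypothetical bigram counts: (joint frequency, frequency of x, frequency of y)
counts = {
    ("pair", "a"): (50, 100, 200),
    ("pair", "b"): (5, 100, 200),
    ("pair", "c"): (2, 100, 200),
}
n = 10_000
ranked = sorted(counts, key=lambda bg: log_likelihood(*counts[bg], n), reverse=True)
print(ranked)
```

Unlike MI, this statistic grows with the amount of evidence, so frequent pairs tend to rank higher than rare ones with the same ratio of observed to expected frequency.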
TS: research on collocation
Danielsson (2001): Collocation: monolingual & translational



Units of meaning in two large corpora of English and Swedish
Words occurring ≥200 times
Collocates (≥5)
plugs sockets (6 occurrences)
headphone sockets (7 occurrences)
sunken sockets (6 occurrences)
bulging their sockets (5 occurrences)

Data-sparseness problem: only 2 units of meaning (of the 12,099
previously identified) occur five times or more in the ST
component of the parallel fiction corpus (Swedish into English,
~400,000 words per component)
Limits

General limits of MCC





Variables
Tools and methods: too crude?
Excessive downplay of the source text
Over-generalisation of translation universals
Specific difficulty of collocational studies

Data-sparseness in relatively small corpora
Collocations: a new approach



Aim and method
Results (monolingual and parallel)
Discussion, limits, ways forward
Research questions

Are translated texts more/less collocational than
original texts in the same language?


i.e., are their collocation types overall more/less
frequently attested and significant?
If so, is this a consequence of the translation
process?

i.e., can we identify shifts that could account for the
observed overall differences?
Aim and method
Intuition

The point is not to look for collocations that repeat
themselves frequently within small and hardly
comparable “translation-driven” corpora, but to
identify those collocations that are frequent and/or
significant in the language as a whole.
2 sets of corpus resources
Study corpora


Small monolingual comparable corpora of fiction
texts (English => Italian; Italian => English)
Reference corpora
  The British National Corpus (100 million words from a variety of sources)
  The Repubblica Corpus (340 million words from a single Italian newspaper)
  The English and Italian Web via Google/Yahoo automatic API queries
Study corpora (fiction)
Translations into Italian (author/translator, Italian title)
  1. M. Atwood/C. Penati, Il racconto dell’ancella
  2. M. Atwood/M. Papi, Occhio di gatto
  3. M. Cruz Smith/P. F. Paolini, Gorky Park
  4. C. Fowler/S. Bini, Nozze di sangue
  5. N. Gordimer/F. Cavagnoli, Storia di mio figlio
  6. G. Greene/B. Oddera, Il decimo uomo
  7. D. Leavitt/A. Cossiga, Un luogo dove non sono mai stato
  8. R. Rendell/H. Brinis, Oltre il cancello
Italian originals (author, title)
  1. F. Camon, La malattia chiamata uomo
  2. G. Celati, I narratori delle pianure
  3. C. Comencini, Le pagine strappate
  4. L. Blissett, Q
  5. D. Maraini, Donna in guerra
  6. G. Pontiggia, Il giocatore invisibile
  7. G. Tomasi di Lampedusa, Il Gattopardo
Corpus preparation
Scanning in
Tokenisation, tagging (part-of-speech) and lemmatisation (TreeTagger)
Metadata annotation
Alignment (easyalign)
Indexing (CorpusWorkBench, CWB)
Extraction of candidates 1
Target sequences
  Lexical collocations
  Made of two words
  Contiguous
POS-based extraction from study corpora
  Based on literature, e.g. JN, NN, VN, V * N, N * * N
Collection of frequencies from reference corpora
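The extraction step can be sketched as a sliding window over POS-tagged, lemmatised tokens. This is an illustrative assumption, not the CWB queries actually used: the function name `extract_candidates`, the simplified tag set, the toy sentence and the reference-corpus count are all hypothetical. In each pattern, `*` stands for any intervening word, and only the two content words are kept as the candidate pair.

```python
from collections import Counter

PATTERNS = ["J N", "N N", "V N", "V * N", "N * * N"]

def extract_candidates(tagged, pattern):
    """Return (first_lemma, last_lemma) for every span matching the POS pattern."""
    slots = pattern.split()
    hits = []
    for i in range(len(tagged) - len(slots) + 1):
        window = tagged[i:i + len(slots)]
        if all(slot == "*" or slot == pos for slot, (_, pos) in zip(slots, window)):
            hits.append((window[0][0], window[-1][0]))
    return hits

# Toy (lemma, tag) sequence with simplified tags: J adjective, N noun, V verb
tagged = [("she", "PRO"), ("drink", "V"), ("a", "D"), ("sip", "N"),
          ("of", "P"), ("cold", "J"), ("water", "N")]

candidates = Counter()
for pattern in PATTERNS:
    for pair in extract_candidates(tagged, pattern):
        candidates[(pattern, pair)] += 1

# Frequencies are then looked up in a reference corpus, not in the study corpus.
reference_freq = {("cold", "water"): 310}   # hypothetical count
for (pattern, pair) in candidates:
    print(pattern, pair, reference_freq.get(pair, 0))
```

Looking the pairs up in a large reference corpus is what sidesteps the data-sparseness problem of counting within the small study corpora themselves.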
Extraction of candidates 2
Calculate MI
  UCS (Evert 2004-2006)
Rank sequences
Take top
  Arbitrary cut-off point: MI>2 and fq2
Calculate significance of difference between original and translated
  Mann-Whitney significance tests
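The cut-off and the significance test can be sketched in plain Python. This is a minimal version under stated assumptions, not the UCS or statistics-package routine used in the study: the test below uses the normal approximation with tie correction and no continuity correction, and it assumes the values are not all identical. The frequency threshold `min_fq` is a parameter because the exact operator in the cut-off is not legible in the source. Function names and data are hypothetical.

```python
import math

def apply_cutoff(scored, min_mi=2.0, min_fq=2):
    """Keep MI values of (mi, fq) pairs with mi above min_mi and fq at or above min_fq."""
    return [mi for mi, fq in scored if mi > min_mi and fq >= min_fq]

def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test; returns (U for sample a, p value)."""
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    n1, n2 = len(a), len(b)
    n = n1 + n2
    rank_sum_a = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2            # mean of ranks i+1 .. j
        ties = j - i
        tie_term += ties ** 3 - ties
        rank_sum_a += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u_a = rank_sum_a - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u_a - mean_u) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2))      # two-sided p from the normal curve
    return u_a, p

# Hypothetical (MI, frequency) pairs for one POS pattern
original = apply_cutoff([(2.1, 4), (2.4, 9), (2.7, 3), (3.0, 12), (1.5, 40)])
translated = apply_cutoff([(2.3, 5), (2.9, 7), (3.1, 2), (3.6, 6), (4.2, 1)])

u, p = mann_whitney_u(original, translated)
print(u, round(p, 3))
```

The test compares the two rankings as wholes, so a significant result says that MI values tend to be higher in one sub-corpus, not that any particular collocation differs.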
Results (MCC, Mann-Whitney)








J N lit eng (MI; higher in original, p=.08)
N V lit ita (MI; p=.008)
N V lit eng (FQ; p=.05)
V N lit ita (MI; p=.01)
J * J lit ita (MI; p=.06)
N prep/conj N lit ita (MI; p=.007)
N * N lit eng (FQ; p=.06)
N * * N lit ita (FQ; p=.07)
Results
Results for N prep/conj N (lit ita)
MI        original   translated
min       2.001      2.000
q1        2.381      2.425
median    2.736      2.853
q3        3.392      3.590
max       5.757      6.059
mean      2.954      3.069
[Box plot of MI values, original vs. translated, marking min, q1, median, q3 and max]
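A summary of this kind can be computed with the standard library. The sketch below is an assumption about the general procedure: the source does not say which quantile method was used, so the `inclusive` method is an arbitrary choice, and the function name `summarise` and the MI values are hypothetical.

```python
import statistics

def summarise(values):
    """Five-number summary plus mean for a list of MI scores."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "mean": statistics.mean(values),
    }

# Hypothetical MI values for one sub-corpus
mi_values = [2.0, 2.4, 2.7, 3.4, 5.8]
for label, value in summarise(mi_values).items():
    print(f"{label:<8}{value:.3f}")
```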
Results (MCC, quantitative)
                             Translated   Original
Types with MI>2 and fq2      855          691
Total number of types        3853         3971
Tokens (randomly-sampled)    4222         4222
Results (parallel, summary)
Shift type                                          Occurrences
Creative => collocational                           7
Collocational => collocational (different meaning)  7
Free => collocational                               11
More explicit                                       86
More formal/precise                                 16
Marginal cases (additions, changes)                 9
Total shifts observed                               136
Total concordance lines analysed                    1,061
Shifts leading to increased “collocativeness”
Creative => collocational (7)
TT: Ricordo l’odore della terra smossa, il <senso di pienezza> che
davano le forme tonde dei bulbi chiusi nella mano
LIT: I remember the smell of the turned earth, the sense of
fullness that gave the round shapes of the bulbs held in the
hand
ST: I can remember the smell of the turned earth, the plump shapes
of bulbs held in the hands, fullness
The handmaid’s tale
TT: Il <rumore dei tacchi> risuonò sulle piastrelle del corridoio.
LIT: the noise of the heels resounded on the tiles of the corridor
ST: Her heels clicked on the hall tiles.
Red bride
Different meaning (7)
TT: Fa collezione di <cartine di sigarette> con disegni di
aeroplani, e ne conosce tutti i nomi.
ST: He collects cigarette cards with pictures of airplanes on
them, and knows the names of all the planes.
Cat’s eye
Cigarette cards
  Occurrences: BNC: 16; Google: 491,000
  Meaning: collectible cards found in cigarette packs
Cartine da/per/di/delle sigarette
  Occurrences: Repubblica: 3; Google: 726
  Meaning: rolling papers, i.e. small sheets of paper which are sold for rolling one's own cigarettes
Figurine da/per/di/delle sigarette
  Occurrences: Repubblica: 0; Google: 2
  Meaning: (by analogy with other products) collectible cards found in cigarette packs
Free => collocational (11)
TT: decorazioni di <spicchi d'aglio>, si rende conto che
ST: handpainted by Alex with purple garlic bulbs, she sees that
A place I’ve never been
Web data
English                            Hits                Italian                              Hits
garlic                             34,300,000 (100%)   aglio                                2,580,000 (100%)
garlic bulbs + bulbs of garlic     109,600 (0.31%)     bulbi d’aglio + bulbi di aglio       1305 (0.05%)
garlic heads + heads of garlic     59,300 (0.17%)      teste d’aglio + teste di aglio       612 (0.02%)
garlic cloves + cloves of garlic   2,207,000 (6.43%)   spicchi d’aglio + spicchi di aglio   229,100 (8.87%)
Explicitation (86) - general
TT: All'apertura nel basso <muro di cinta> l'autista esitò, poi
accelerò
LIT: At the opening in the low perimeter wall the driver hesitated,
then accelerated
ST: He hesitated at the gap in the low wall, then accelerated and
went ahead
A place I’ve never been
TT: schiacciato sotto il <tacco della scarpa>, seppellito
LIT: ground away under the heel of the shoe, buried
ST: ground away under my heel, buried
My son’s story
Explicitation (86) - partitives
TT: Non riuscivo a prendere sonno, così sono sceso a
bere un <sorso d'acqua>
LIT: I couldn’t sleep, so I came down to drink a gulp of
water
ST: I couldn't sleep, so I came down for water
The tenth man
TT: i <raggi del sole> filtrano dalla lunetta sulla porta
LIT: the rays of the sun filter through the fanlight
ST: Sun comes through the fanlight
The handmaid’s tale
Explicitation (86) - head nouns
TT: manifesti di Bon Jovi e dei Guns' n Roses attaccati con
le <puntine da disegno> sul grande mare della parete
ST: Bon Jovi and Guns' n Roses posters thumbtacked into
the great sea wall
A place I’ve never been
TT: Osserviamo il <cerume delle orecchie>, il muco del
naso e lo sporco tra le dita dei piedi
ST: We look at ear-wax, or snot, or dirt from our toes
Cat’s eye
More formal/more exact (16)
TT: Spostando col piede i <capi di vestiario> sul
pavimento, non trovò traccia della prova incriminante.
LIT: items of clothing
ST: Kicking around among the clothes on the floor, he
found no trace of the incriminating article.
Red bride
TT: Si stava frugando tra le <pieghe dell'abito>, per
prendere il lasciapassare
LIT: folds of the robe
ST: She was fumbling in her robe, for her pass
The handmaid’s tale
Other cases (9)
Adverbs
TT: Dal <punto di vista> domestico, si adattarono l' uno all' altra
ST: Domestically they adjusted to one another
My son’s story

Domestication
TT: Il cadavere era stato fatto a fettine da una lama larga e pesante,
non trovata sul <luogo del delitto>
ST: The corpse had been slashed to ribbons by a large, heavy blade,
not found on the premises.
Red bride

Gratuitous changes
TT: del greco c'era anche qualche tavolino con sudici <vasetti di fiori>
artificiali e bottiglie di ketchup
ST: the Greek had a few tables set out with flyspotted artificial flowers
and tomato sauce bottles
My son’s story

Discussion - MCC

Are Italian translated texts more/less collocational
than originals?


Translated texts would seem to be more
collocational than originals
A single exception: JN into Eng

Translated less collocational than original, why?


Probable shining-through
Over-representation of collocations with common words
Discussion, limits, ways forward
JN in Eng: shining-through?
Delicate fingers
TT: I put some soft golden apricots as big as eggs on his plate,
and watch him split them open, hardly moving his long,
<delicate fingers>.
ST: Gli ho messo nel piatto delle albicocche grandi come uova,
morbide, dorate, e l'ho osservato mentre le spaccava, muovendo
appena le dita lunghe e delicate.
Donna in guerra
Collocation        fq1    fq2    fq1-2   MI       LL
delicate fingers   1646   5346   5       2.7545   53.4624
gentle fingers     2477   5346   12      2.9572   139.5338
slender fingers    701    5346   15      3.6023   219.2139
nimble fingers     101    5346   15      4.4437   279.3528
JN in Eng: common words
TRANSLATED ENGLISH
few days, few evenings, few feet, few followers, few friends, few hours,
few jokes, few kilos, few minutes, few months, few paces, few pages,
few passengers, few phrases, few rays, few scraps, few seconds,
few sentences, few spots, few steps, few stones, few survivors,
few weeks, few words, few yards, few years
ORIGINAL ENGLISH
few days, few feet, few hours, few inches, few miles, few minutes,
few moments, few months, few seconds, few steps, few tables
Overall frequency of few: translated 133, original 39
Discussion - parallel
Is higher collocativeness a consequence of the translation process?
  Probably…
  NB: shifts towards higher collocativeness would appear to be
    partly independent: free => collocational, different meaning (normalisation)
    partly related to other strategies/procedures: explicitation, shining-through
Limits

Just how certain are we that these shifts are
the cause of the observed differences?


Shifts are no doubt observable also in non-significant rankings…
(To what extent) could single author or
translator preferences account for these
differences?

The corpora are very small…
Further work

Bottom-up search for regularities


Source-oriented approach


BNC, WWW, ukwac / Repubblica, WWW, itwac
Collocation extraction


Starting from ST collocations
Role of reference corpora


Other genres?
Evaluation of method: no hands!
Creative exploitation of collocations

Can it be automatised?
Thank you
References
Baker, M. 1993. “Corpus linguistics and translation studies”. In Baker et al. (eds) Text
and Technology. Benjamins.
Baker, M. 1995. “Corpora in translation studies: An overview and some suggestions for
future research”. Target 7, 2.
Baroni, M. and S. Bernardini. 2003. “A preliminary analysis of collocational differences
in monolingual comparable corpora”. In Archer et al. (eds), Proceedings of CL 2003.
UCREL.
Danielsson, P. 2001. The automatic identification of meaningful units in language. PhD
Thesis. Göteborg University.
Evert, S. 2004-2006. The UCS Toolkit [http://www.collocations.de/software.html]
Firth, J.R. 1956 (1968). “Descriptive linguistics and the study of English”. In Palmer (ed)
Selected papers of J.R. Firth 1952-1959. Longman.
Gledhill, C. 2000. Collocations in science writing. Gunter Narr.
Howarth, P. 1996. Phraseology in English academic writing. Max Niemeyer.
Kenny, D. 2001. Lexis and creativity in translation. St. Jerome.
Kjellmer, G. 1987. “Aspects of English collocations”. In Meijs (ed) Corpus Linguistics
and Beyond. Rodopi.
Kjellmer, G. 1994. A Dictionary of English collocations. Clarendon Press.
Olohan, M. 2004. Introducing corpora in translation studies. Routledge.
Øverås, L. 1998. “In search of the third code: An investigation of norms in literary
translation”. Meta 43, 4.
Sinclair, J. McH. 1991. Corpus, concordance, collocation. Oxford University Press.
Sinclair, J. McH. and S. Jones 1974. “English lexical collocations”. Cahiers de
Lexicologie 24.
Toury, G. 1995. Descriptive translation studies and beyond. Benjamins.
Results of OSS significance testing
Pattern    W          p value    MI/LOG FQ   Higher in
2JN ita    122618     0.002261   MI          Original
2JN eng    165680.5   0.05237    MI          Original
2NJ ita    78109.5    0.001134   MI          Original
2NN eng    19142.5    0.005172   (LOG)FQ     Translated
2RJ eng    7609       0.06921    MI          Original
2RV eng    10458      0.04767    MI          Original
2VR eng    2907       0.01517    (LOG)FQ     Original
3NN ita    21683      0.02607    MI          Original
3NN ita    22066.5    0.01029    (LOG)FQ     Original
3VN eng    11904      0.05694    MI          Original
3NN eng    1910.5     0.0429     (LOG)FQ     Original
4VN eng    1027       0.06974    (LOG)FQ     Original
POS patterns originally searched for in the corpora / Rankings selected for significance testing (2-grams, 3-grams, 4-grams)
JJ JN JV NN NV RJ JJ NN NV VN VV RN NN VN VV JJ JN N N N V JJ NN VN
JJ JN NJ NN NV JJ JN NN NV JN NN NV VN VV JN NJ NV VN JJ NN VN RV VJ VN VV VR RR VJ VN VR VJ VN VV RJ RV VN VR NN VN NN VN
English
Italian