Statistical Machine Translation, Week 4: Language Modelling. Lecturer: Qun Liu
CA446 Statistical Machine Translation
Week 4: Language Modelling
Lecturer: Qun Liu
Lab Tutors: Xiaofeng Wu, Iacer Calixto
2nd Semester, 2014-2015 Academic Year
http://computing.dcu.ie/~qliu/CA446

Content
• What's a Language Model?
• N-gram Language Model
• Language Model Smoothing
• Language Model Evaluation
• Managing Very Large Language Models
• Other Language Modelling Approaches

What is a language?
Can we define a language mathematically?
• Deterministic definition: a language is the set of all the sentences we can say.
• Probabilistic definition: a language is a probability distribution over all possible sentences.

Deterministic Definition
• Given a vocabulary V = {v_1, …, v_n},
• we use V* to denote the set of all word sequences over V:
  V* = { W | W = w_1 … w_l, l ≥ 1, w_i ∈ V }
• A language can then be defined as a subset of these sequences: L ⊆ V*.
• Any element of L is called a sentence of the language.

Formal Language
• Languages may contain infinitely many sentences, so they cannot be defined by enumeration.
• One way to define a formal language is to give a grammar that generates all the sentences of the language.
• Another way is to define a machine (automaton) that recognises whether a string is a valid sentence of the language.

Chomsky Hierarchy
• Type-0: recursively enumerable languages; Turing machine; production rules α → β (no restrictions)
• Type-1: context-sensitive languages; linear-bounded non-deterministic Turing machine; rules of the form αAβ → αγβ
• Type-2: context-free languages; non-deterministic pushdown automaton; rules of the form A → α
• Type-3: regular languages; finite state automaton; rules of the form A → a and A → aB

Chomsky Hierarchy: examples
• Regular languages:
  – [+-]?[0-9]+(\.[0-9]*)?  (decimal numbers)
  – a^m b^n
• Context-free languages:
  – a^n b^n
  – the syntax of programming languages
• Context-sensitive languages:
  – a^n b^n c^n
• Recursively enumerable languages:
  – semi-decidable problems

Chomsky Hierarchy
[Figure: set inclusions of the formal languages described by the Chomsky hierarchy]

What is a language? (revisited)
• Deterministic definition: a language is the set of all the sentences we can say.
• Probabilistic definition: a language is a probability distribution over all possible sentences.

Probabilistic Definition
• The deterministic definition is not suitable for natural language in most cases.
• For some utterances it is not easy to say whether they are legal sentences of a given language:
  – child language
  – tweets
  – slang
  – …

Statistical Language Model
A statistical language model assigns a probability to every sentence:
  p(s = w_1 … w_n), for all i: w_i ∈ V
with the normalisation condition:
  Σ_{s ∈ V*} p(s) = 1

Noisy Channel Model Revisited
  ê = argmax_e p(e) p(f|e)
Here p(e) is the language model, p(f|e) is the translation model, and the decoder carries out the argmax search.

Statistical Language Model
• How can we estimate the probability of a sentence in a specific language?
• Unlike estimating the probability distribution of a die, we cannot exhaust all possible sentences with a limited sample.
• Idea: break sentences down
  – break all sentences down into a limited set of substrings
  – estimate the probability of a sentence from these substrings
  – if a string has lots of reasonable/plausible/likely substrings (n-grams), then it might be a reasonable sentence.

The Simplest Language Model
• The simplest way to break down a sentence is to split it into words.
• This gives the simplest language model:
  p(s = w_1 … w_n) = ∏_{i=1}^{n} p(w_i)
• Here the probability of a sentence is just the product of the probabilities of the words it contains.
• This model is called a unigram language model.
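To make the unigram model concrete, here is a minimal Python sketch (not from the lecture; the toy corpus and function names are purely illustrative). It estimates p(w) by relative frequency and scores a sentence as the product of its word probabilities, and it exposes the two weaknesses discussed on the following slides: unseen words receive probability zero, and word order is ignored.

```python
from collections import Counter

def train_unigram(corpus):
    """Estimate p(w) = count(w) / number of tokens from a tokenised corpus."""
    counts = Counter(w for sentence in corpus for w in sentence)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def unigram_prob(sentence, p):
    """p(s) = product of p(w_i); unseen words contribute probability 0."""
    prob = 1.0
    for w in sentence:
        prob *= p.get(w, 0.0)
    return prob

# Illustrative toy corpus (not from the lecture)
corpus = [["i", "am", "a", "student", "."],
          ["i", "am", "a", "teacher", "."]]
p = train_unigram(corpus)
print(unigram_prob(["i", "am", "a", "student", "."], p))  # ~1.6e-4
print(unigram_prob(["a", "i", ".", "student", "am"], p))  # same value: word order is ignored
print(unigram_prob(["i", "am", "a", "zdwi", "."], p))     # 0.0: unseen word
```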
Word Frequency
• p(w) is the word's relative frequency:
  p(w) = (occurrences of word w in the corpus) / (number of tokens in the corpus)

  Type              Occurrences   Rank
  the               3,789,654     1st
  he                2,098,762     2nd
  king              57,897        1,356th
  boy               56,975        1,357th
  stringyfy         5             34,589th
  transducionalify  1             123,567th

Word Frequency
[two figure-only slides]

Unigram Language Model
p("I am a student.") = p("I") × p("am") × p("a") × p("student") × p(".")
• Problems of the unigram language model:
  – unseen words: p("I am a zdwi.") = ?
  – no word order: p("I am a student.") = p("a I . student am")

Content (next: N-gram Language Model)

N-gram Language Model
By the chain rule,
  p(s = w_1 … w_n) = p(w_1) × p(w_2|w_1) × … × p(w_n|w_1 … w_{n−1}) = ∏_{i=1}^{n} p(w_i | w_1 … w_{i−1})

Markov Assumption
It is assumed that the probability of observing the i-th word w_i given the history of the preceding i−1 words can be approximated by the probability of observing it given only the shortened history of the preceding N−1 words (an (N−1)th-order Markov property).

N-gram Language Model
  p(s = w_1 … w_n) = ∏_{i=1}^{n} p(w_i | w_1 … w_{i−1})           (the whole history: i−1 words)
                   ≈ ∏_{i=1}^{n} p(w_i | w_{i−(N−1)} … w_{i−1})   (the shortened history: N−1 words)

N-gram Language Model
• 1-gram (unigram) model: p(s = w_1 … w_n) = p(w_1) × p(w_2) × … × p(w_n)
• 2-gram (bigram) model: p(s = w_1 … w_n) = p(w_1) × p(w_2|w_1) × … × p(w_n|w_{n−1})
• 3-gram (trigram) model: p(s = w_1 … w_n) = p(w_1) × p(w_2|w_1) × p(w_3|w_1 w_2) × … × p(w_n|w_{n−2} w_{n−1})

N-gram Language Model
bigram:
  p(I don't like spiders that are poisonous)
    = p(I) × p(don't|I) × p(like|don't) × p(spiders|like) × p(that|spiders) × p(are|that) × p(poisonous|are)
trigram:
  p(I don't like spiders that are poisonous)
    = p(I) × p(don't|I) × p(like|I don't) × p(spiders|don't like) × p(that|like spiders) × p(are|spiders that) × p(poisonous|that are)

Dealing with Sentence Boundaries
• To better estimate the probabilities of words at sentence boundaries, we usually add <s> and </s> before and after each sentence.

N-gram Language Model
bigram:
  p(<s> I don't like spiders that are poisonous </s>)
    = p(I|<s>) × p(don't|I) × p(like|don't) × p(spiders|like) × p(that|spiders) × p(are|that) × p(poisonous|are) × p(</s>|poisonous)
trigram:
  p(<s> I don't like spiders that are poisonous </s>)
    = p(I|<s>) × p(don't|<s> I) × p(like|I don't) × p(spiders|don't like) × p(that|like spiders) × p(are|spiders that) × p(poisonous|that are) × p(</s>|are poisonous)

Probability of an n-gram
Suppose we have the phrase "x y" (i.e. word x followed by word y). p(y|x) is the probability that word y follows word x, and it can be estimated from a corpus as follows:
  p(y|x) = count("x y") / count("x")          (bigram)
Similarly, suppose we have the phrase "x y z". p(z|x y) is the probability that word z follows words x and y:
  p(z|x y) = count("x y z") / count("x y")    (trigram)
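As a rough illustration of these count-based estimates, here is a small Python sketch (illustrative only; the toy corpus and function names are invented, not the lecture's code). It collects bigram counts over sentences padded with <s> and </s>, estimates p(y|x) = count("x y") / count("x"), and multiplies the conditional probabilities to score a sentence.

```python
from collections import Counter

def train_bigram(corpus):
    """Collect the counts needed for the MLE estimate p(y|x) = count("x y") / count("x")."""
    history_counts, bigram_counts = Counter(), Counter()
    for sentence in corpus:
        words = ["<s>"] + sentence + ["</s>"]   # sentence boundary markers, as on the slides
        history_counts.update(words[:-1])       # every token occurs as a history except </s>
        bigram_counts.update(zip(words, words[1:]))
    return history_counts, bigram_counts

def bigram_prob(y, x, history_counts, bigram_counts):
    """MLE p(y|x); unseen events get 0.0 (smoothing is covered later in the lecture)."""
    return bigram_counts[(x, y)] / history_counts[x] if history_counts[x] else 0.0

def sentence_prob(sentence, history_counts, bigram_counts):
    """p(<s> w_1 ... w_n </s>) = product over i of p(w_i | w_{i-1})."""
    words = ["<s>"] + sentence + ["</s>"]
    prob = 1.0
    for x, y in zip(words, words[1:]):
        prob *= bigram_prob(y, x, history_counts, bigram_counts)
    return prob

# Illustrative toy corpus
corpus = [["i", "don't", "like", "spiders"], ["i", "like", "tea"]]
hist, bi = train_bigram(corpus)
# p(i|<s>) * p(like|i) * p(spiders|like) * p(</s>|spiders) = 1 * 1/2 * 1/2 * 1 = 0.25
print(sentence_prob(["i", "like", "spiders"], hist, bi))
```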
Size of an N-gram LM
• The size of an n-gram LM is the number of parameters in the model.
• For an n-gram LM, the maximal number of parameters is |V|^n, where V is the vocabulary.

  LM size (|V| = 100K):  unigram 100K   bigram 10^10 (= 10G)   trigram 10^15 (= 1T)   4-gram 10^20

A Character-Based Bigram LM
  p(t-i-p) = p(t) × p(i|t) × p(p|i) = 1.0 × 0.3 × 0.6 = 0.18

Bag Translation
• In bag translation we take a sentence, cut it up into words, place the words in a bag, and then try to recover the sentence given the bag.
• We use the n-gram model to rank different arrangements of the words in the bag.
• Thus, we treat an arrangement S as better than another arrangement S' if Pr(S) is greater than Pr(S').
• We tried this scheme on a random sample of sentences.
This example comes from: Brown, P. F., Cocke, J., Della Pietra, S. A., et al. A statistical approach to machine translation. Computational Linguistics, 1990, 16(2): 79-85.

Bag Translation
From a collection of 100 sentences, we considered the 38 sentences with fewer than 11 words each. (Why?)

Unseen n-grams
• We have seen "i like to" in our corpus.
• We have never seen "i like to smooth" in our corpus:
  p(smooth | i like to) = 0
• Any sentence that includes "i like to smooth" will be assigned probability 0.
• Why is this a bad thing?

Content (next: Language Model Smoothing)

Language Model Smoothing
• Smoothing in language modelling is the process by which unseen events are assigned a non-zero probability.
• This is achieved by some combination of:
  – count adjustment
  – interpolation
  – back-off

Count Adjustment
  count(tea) = 10
  count(tea cup) = 3
  count(tea drinker) = 4
  count(tea time) = 3

  p(x|tea) = count(tea x) / count(tea)

  p(cup|tea) = .3
  p(drinker|tea) = .4
  p(time|tea) = .3

Count Adjustment
What happens when we want to estimate the probability of the phrase "tea cosy"?
  p(cosy|tea) = 0/10 = 0
Just because we haven't seen an event in our sample, it doesn't mean that it can't occur. The smaller our sample size, the less we trust counts derived from it. Therefore, we need to assign p(cosy|tea) a non-zero probability.

Count Adjustment
Remember, for our model to be sound:
  Σ_x p(x|tea) = 1

Count Adjustment
So, for p(cosy|tea) not to be zero, we have to adjust downwards the probabilities of our seen events:
• p(time|tea)
• p(cup|tea)
• p(drinker|tea)
How do we do that? With
• add-one smoothing,
• add-α smoothing, or
• Good-Turing smoothing.

Add-One Smoothing
For all possible events, add a count of one:
  p = (c + 1) / (n + v)
In the case of an n-gram language model:
  p = p(w_n | w_1 … w_{n−1})
  c = count of the n-gram w_1 … w_{n−1} w_n in the corpus
  n = count of the history w_1 … w_{n−1} in the corpus
  v = vocabulary size
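A minimal sketch of the add-one estimate, assuming the n-gram and history counts have already been collected (the function name is illustrative; the counts are those of the tea example above, which is worked through on the following slides).

```python
def add_one_prob(history, word, ngram_counts, history_counts, vocab_size):
    """Add-one (Laplace) estimate: p = (c + 1) / (n + v)."""
    c = ngram_counts.get((history, word), 0)   # count of the n-gram "history word"
    n = history_counts.get(history, 0)         # count of the history
    return (c + 1) / (n + vocab_size)

# Bigram counts from the tea example; vocabulary size v = 5
ngram_counts = {("tea", "time"): 3, ("tea", "cup"): 3, ("tea", "drinker"): 4}
history_counts = {"tea": 10}
print(add_one_prob("tea", "cup", ngram_counts, history_counts, 5))   # 4/15: seen event, adjusted down
print(add_one_prob("tea", "cosy", ngram_counts, history_counts, 5))  # 1/15: unseen event, no longer zero
```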
Add-One Smoothing
Words in the language = {tea, time, cup, drinker, cosy}; vocabulary size = 5.
Observed events (10 bigrams): "tea time" × 3, "tea cup" × 3, "tea drinker" × 4.

Bigram counts (row = first word, column = second word):

           tea  time  cup  drinker  cosy | SUM
  tea       0    3     3     4       0   |  10
  time      0    0     0     0       0   |   0
  cup       0    0     0     0       0   |   0
  drinker   0    0     0     0       0   |   0
  cosy      0    0     0     0       0   |   0

Add-One Smoothing
After adding one to every count:

           tea    time   cup    drinker  cosy  | SUM
  tea      0+1=1  3+1=4  3+1=4  4+1=5    0+1=1 |  15
  time     0+1=1  0+1=1  0+1=1  0+1=1    0+1=1 |   5
  cup      0+1=1  0+1=1  0+1=1  0+1=1    0+1=1 |   5
  drinker  0+1=1  0+1=1  0+1=1  0+1=1    0+1=1 |   5
  cosy     0+1=1  0+1=1  0+1=1  0+1=1    0+1=1 |   5

  p(cup|tea) = (3 + 1)/(10 + 5) = 4/15        p(cup|cup) = 1/5
  p(time|tea) = (3 + 1)/(10 + 5) = 4/15       p(time|cup) = 1/5
  p(drinker|tea) = (4 + 1)/(10 + 5) = 5/15    p(drinker|cup) = 1/5
  p(cosy|tea) = (0 + 1)/(10 + 5) = 1/15       p(cosy|cup) = 1/5
  p(tea|tea) = (0 + 1)/(10 + 5) = 1/15        p(tea|cup) = 1/5

Add-One Smoothing
  p(tea cosy) = p(tea) p(cosy|tea) = 10/20 × 1/15 = 1/30
Here we use a unigram language model to estimate p(tea).

Dealing with Sentence Boundaries
If we add sentence boundaries to the observed events:
Observed events = "<s> tea time </s>" × 3, "<s> tea cup </s>" × 3, "<s> tea drinker </s>" × 4

           tea  time  cup  drinker  cosy  </s> | SUM
  <s>      10    0     0     0       0     0   |  10
  tea       0    3     3     4       0     0   |  10
  time      0    0     0     0       0     3   |   3
  cup       0    0     0     0       0     3   |   3
  drinker   0    0     0     0       0     4   |   4
  cosy      0    0     0     0       0     0   |   0

Dealing with Sentence Boundaries
Add-one smoothing (the event space now also contains </s>, so v = 6):

           tea      time   cup    drinker  cosy   </s>  | SUM
  <s>      10+1=11  0+1=1  0+1=1  0+1=1    0+1=1  0+1=1 |  16
  tea      0+1=1    3+1=4  3+1=4  4+1=5    0+1=1  0+1=1 |  16
  time     0+1=1    0+1=1  0+1=1  0+1=1    0+1=1  3+1=4 |   9
  cup      0+1=1    0+1=1  0+1=1  0+1=1    0+1=1  3+1=4 |   9
  drinker  0+1=1    0+1=1  0+1=1  0+1=1    0+1=1  4+1=5 |  10
  cosy     0+1=1    0+1=1  0+1=1  0+1=1    0+1=1  0+1=1 |   6

  p(cup|tea) = (3 + 1)/(10 + 6) = 4/16        p(cup|cup) = 1/6
  p(time|tea) = (3 + 1)/(10 + 6) = 4/16       p(time|cup) = 1/6
  p(drinker|tea) = (4 + 1)/(10 + 6) = 5/16    p(drinker|cup) = 1/6
  p(cosy|tea) = (0 + 1)/(10 + 6) = 1/16       p(cosy|cup) = 1/6
  p(tea|tea) = (0 + 1)/(10 + 6) = 1/16        p(tea|cup) = 1/6

Dealing with Sentence Boundaries
  p(<s> tea cosy </s>) = p(tea|<s>) p(cosy|tea) p(</s>|cosy) = 11/16 × 1/16 × 1/6 = 11/1536

Add-One Smoothing
But there are many more unseen n-grams than seen n-grams.
Example: Europarl bigrams:
• 86,700 distinct words
• 86,700² = 7,516,890,000 possible bigrams (v)
• but only about 30,000,000 words (and bigrams) in the corpus
Seen n-grams are assigned too low a probability!

Add-α Smoothing
Instead of adding one to the count, we add a value α (< 1):
  p = (c + α) / (n + αv)
How do we determine a value for α?

Example: Bigrams in Europarl
Adjusted counts for bigrams with actual count r:

  actual count r | add-one  | add-α
  0              | 0.00378  | 0.00016
  1              | 0.00755  | 0.95725
  2              | 0.01133  | 1.91433
  3              | 0.01511  | 2.87141
  4              | 0.01888  | 3.82850
  5              | 0.02266  | 4.78558
  6              | 0.02644  | 5.74266
  8              | 0.03399  | 7.65683
  10             | 0.04155  | 9.57100
  20             | 0.07931  | 19.14183

The add-α counts are more realistic than the add-one counts.

Good-Turing Smoothing
Adjust actual counts r to expected counts r* using:
  r* = (r + 1) · N_{r+1} / N_r
where N_r is the number of n-grams with count r in the corpus (the "count of counts"), and N_0 is the number of possible n-grams that never occur in the corpus.
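A minimal sketch of the Good-Turing adjustment, assuming the counts of counts N_r have already been tabulated; the figures below are the Europarl bigram counts of counts from the table on the next slide. In practice the adjustment is normally applied only for small r, with larger counts left unchanged.

```python
def good_turing(count_of_counts):
    """Good-Turing adjusted counts: r* = (r + 1) * N_{r+1} / N_r,
    where N_r is the number of distinct n-grams observed exactly r times
    (N_0 = number of possible n-grams never observed)."""
    adjusted = {}
    for r, n_r in count_of_counts.items():
        n_next = count_of_counts.get(r + 1, 0)  # 0 if N_{r+1} is not available
        adjusted[r] = (r + 1) * n_next / n_r
    return adjusted

# Counts of counts for Europarl bigrams (figures from the table on the next slide)
N = {0: 7_514_941_065, 1: 1_132_844, 2: 263_611}
r_star = good_turing(N)
print(round(r_star[0], 5))  # 0.00015 = 1 * 1,132,844 / 7,514,941,065
print(round(r_star[1], 4))  # 0.4654  = 2 * 263,611 / 1,132,844 (the table reports 0.46539)
```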
Good-Turing: Europarl Bigram Counts

  count r | count of counts N_r | adjusted count r*
  0       | 7,514,941,065       | 0.00015
  1       | 1,132,844           | 0.46539
  2       | 263,611             | 1.40679
  3       | 123,615             | 2.38767
  4       | 73,788              | 3.33753
  5       | 49,254              | 4.36967
  6       | 35,869              | 5.32928
  8       | 21,693              | 7.43798
  10      | 14,880              | 9.31304
  20      | 4,546               | 19.54487

Smoothing: the story so far
• The smoothing techniques we have seen so far work by adjusting the counts of seen n-grams so that unseen n-grams get assigned some of the probability mass.
• Other techniques are interpolation and backoff.
• These are based on the idea of combining language models of different orders.

Combining language models
• In a given corpus, we may never observe
  – Scottish beer drinkers
  – Scottish beer eaters
• Both have count 0.
• The smoothing methods we have looked at so far will assign them the same probability.
• It would be useful here to look at the bigram frequencies of
  – beer drinkers
  – beer eaters

Interpolation
Higher- and lower-order n-gram models have different strengths and weaknesses:
– high-order n-grams are sensitive to more context, but have sparse counts
– low-order n-grams consider only very limited context, but have robust counts

Interpolation
Combine them:
  p(w_3 | w_1, w_2) = λ_1 p_1(w_3) + λ_2 p_2(w_3 | w_2) + λ_3 p_3(w_3 | w_1, w_2)
where λ_1 + λ_2 + λ_3 = 1.

Backoff
• Backoff is slightly different.
• The basic idea is to trust the highest-order language model that contains the n-gram.

Backoff
Given a trigram, bigram and unigram language model:
– if the trigram count > 0: use it
– else if the bigram count > 0: use it
– else: use the unigram count

Other Considerations
• Consider the English words spite and constant:
  – both occur 993 times in the Europarl corpus
  – only 9 different words follow spite
  – spite is almost always followed by of (979 times), due to the expression in spite of
  – 415 different words follow constant
  – the most frequent are and (42 times), concern (27 times) and pressure (26 times), but there is a huge tail of singletons: 268 different words

Other Considerations
• We are more likely to see a new bigram that starts with constant than one that starts with spite.
• Witten-Bell smoothing takes this into account.

Other Considerations
• Consider the word York:
  – a fairly frequent word in the Europarl corpus, occurring 477 times
  – as frequent as foods, indicates and providers
  – in a unigram language model it therefore gets a respectable probability

Other Considerations
• However, York almost always directly follows New (473 times).
• Recall: the unigram model is only used if the bigram model is inconclusive.
• York is an unlikely second word in an unseen bigram.
• In a backoff unigram model, York should therefore have a low probability.
• Kneser-Ney smoothing takes this into account.

Content (next: Language Model Evaluation)

Evaluating Language Models
The quality of a language model can be estimated using perplexity:
  PP = 2^{H(p_LM)}
which in turn is based on the notion of cross-entropy:
  H(p_LM) = −(1/n) log₂ p_LM(w_1 … w_n)
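A minimal sketch of this computation, assuming we already have the model's per-word probabilities. The list below uses the rounded trigram probabilities from the worked example on the next slide, so the printed values differ slightly from the averages reported there.

```python
import math

def cross_entropy(word_probs):
    """H = -(1/n) * sum of log2 p_i over the per-word model probabilities."""
    return -sum(math.log2(p) for p in word_probs) / len(word_probs)

def perplexity(word_probs):
    """PP = 2 ** H."""
    return 2 ** cross_entropy(word_probs)

# Rounded trigram probabilities from the worked example that follows
probs = [0.109, 0.144, 0.489, 0.905, 0.002, 0.472,
         0.147, 0.056, 0.194, 0.089, 0.290, 0.99999]
print(round(cross_entropy(probs), 3))  # 2.649 (the slides report an average of about 2.64)
print(round(perplexity(probs), 1))     # 6.3   (the slides report a perplexity of about 6.2)
```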
Example
Say we wanted to measure how well a language model p_LM predicts the following sentence:
  I would like to commend the rapporteur on his work.
Here is how we would proceed (trigram model):

  prediction                      | p_LM    | −log₂ p_LM
  p_LM(I | </s><s>)               | 0.109   | 3.197
  p_LM(would | <s> I)             | 0.144   | 2.791
  p_LM(like | I would)            | 0.489   | 1.031
  p_LM(to | would like)           | 0.905   | 0.144
  p_LM(commend | like to)         | 0.002   | 8.794
  p_LM(the | to commend)          | 0.472   | 1.084
  p_LM(rapporteur | commend the)  | 0.147   | 2.763
  p_LM(on | the rapporteur)       | 0.056   | 4.150
  p_LM(his | rapporteur on)       | 0.194   | 2.367
  p_LM(work | on his)             | 0.089   | 3.498
  p_LM(. | his work)              | 0.290   | 1.785
  p_LM(</s> | work .)             | 0.99999 | 0.000014
  average                         |         | 2.643

N-Gram Comparison
−log₂ p_LM per word for models of different orders:

  word       | unigram | bigram | trigram | 4-gram
  i          | 6.684   | 3.197  | 3.197   | 3.197
  would      | 8       | 2.884  | 2.791   | 2.791
  like       | 9       | 2.026  | 1.031   | 1.29
  to         | 5       | 0.402  | 0.144   | 0.113
  commend    | 15      | 12.335 | 8.794   | 8.633
  the        | 4       | 1.402  | 1.084   | 0.88
  rapporteur | 11      | 7.319  | 2.763   | 2.35
  on         | 7       | 4.14   | 4.15    | 1.862
  his        | 11      | 7.316  | 2.367   | 1.978
  work       | 10      | 4.816  | 3.498   | 2.394
  .          | 5       | 3.02   | 1.785   | 1.51
  </s>       | 4.828   | 0.005  | 0       | 0
  average    | 8.051   | 4.072  | 2.634   | 2.251
  perplexity | 265.136 | 16.817 | 6.206   | 4.758

Content (next: Managing Very Large Language Models)

Managing very large language models
• Millions to billions of words are easy to get (trillions of English words are available on the web).
• But: huge language models do not fit into RAM.

Delete singletons
Europarl data:

  order   | unique n-grams | singletons
  unigram | 86,700         | 33,447 (38.6%)
  bigram  | 1,948,935      | 1,132,844 (58.1%)
  trigram | 8,092,798      | 6,022,286 (74.4%)
  4-gram  | 15,303,847     | 13,081,621 (85.5%)
  5-gram  | 19,882,175     | 18,324,577 (92.2%)

Delete the singletons of the higher-order n-grams.

Other Techniques
• Store the language model using an appropriate data structure, e.g. a trie.
• Reduce the vocabulary size, e.g. by mapping numbers to the symbol NUM.
• Store the language model on disk and load only what is necessary into memory.
• Filter the language model to only those words that appear in the translation options being explored by the decoder.

Content (next: Other Language Modelling Approaches)

Other Language Modelling Approaches
• Syntax-based language models
• Neural network language models

Discussion

Acknowledgement
Parts of the content of this lecture are taken from previous lectures and presentations given by Jennifer Foster, Declan Groves, Yvette Graham, Kevin Knight, Josef van Genabith and Andy Way.