CMSC 723/LING 723: Computational Linguistics I
N-gram Language Models
Lecture 9
October 31, 2007
Professor: Bonnie Dorr
Co-instructor: Nitin Madnani
TAs: Hamid Shahri, Alexandros Tzannes
1
Administrivia
• Take-home midterm due now!
• Assignment 3 graded; Assignment 4 out
• Assignment Submission
  • Use the subject “CMSC 723: Assignment N”
  • Print out the written portion and submit it in class
  • Submit zips/tarballs, not individual files
• Forum open for use
2
Modeling English
• Assign probability values to utterances in English
• Applications
  • Statistical Machine Translation (1st lecture)
  • Speech/Handwriting Recognition
  • Spelling Correction
  • Predictive Text Input
3
N-grams
This is a sentence
N=1 (Unigrams)
This, is, a, sentence (4)
No overlap/dependence
Partially overlapping word sequences of length N
4
N-grams
This is a sentence
N=2 (Bigrams)
This is, is a, a sentence (3)
Partially overlapping word sequences of length N
5
N-grams
This is a sentence
N=3 (Trigrams)
This is a, is a sentence (2)
Partially overlapping word sequences of length N
6
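As a concrete illustration of the extraction step above, here is a minimal sketch; the function name `ngrams` and the whitespace tokenization are our own assumptions, not part of the lecture.

```python
def ngrams(sentence, n):
    """Return the partially overlapping word sequences of length n."""
    words = sentence.split()  # naive whitespace tokenization
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

sent = "This is a sentence"
print(ngrams(sent, 1))  # 4 unigrams
print(ngrams(sent, 2))  # 3 bigrams: ('This', 'is'), ('is', 'a'), ('a', 'sentence')
print(ngrams(sent, 3))  # 2 trigrams
```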
Computing Probabilities
P(w_1, w_2, ..., w_T) = P(w_1) P(w_2 | w_1) P(w_3 | w_1, w_2) ... P(w_T | w_1, ..., w_{T-1})
[By the chain rule]
Is this practical?
No! Can’t keep track of all possible histories of all words!
7
Approximating Probabilities
Basic idea: limit history to fixed number of words (N)
(Markov Assumption)
P(w_k | w_1, ..., w_{k-1}) ≈ P(w_k | w_{k-N+1}, ..., w_{k-1})

N=1
P(w_k | w_1, ..., w_{k-1}) ≈ P(w_k)
⇒ P(w_1, w_2, ..., w_T) ≈ P(w_1) P(w_2) ... P(w_T)

Unigram Language Model, or 0th order Markov Model
8
Approximating Probabilities
Basic idea: limit history to fixed number of words (N)
(Markov Assumption)
P(w_k | w_1, ..., w_{k-1}) ≈ P(w_k | w_{k-N+1}, ..., w_{k-1})

N=2
P(w_k | w_1, ..., w_{k-1}) ≈ P(w_k | w_{k-1})
⇒ P(w_1, w_2, ..., w_T) ≈ P(w_1 | <s>) P(w_2 | w_1) ... P(w_T | w_{T-1})

Bigram Language Model, or 1st order Markov Model
9
Approximating Probabilities
Basic idea: limit history to fixed number of words (N)
(Markov Assumption)
P(w_k | w_1, ..., w_{k-1}) ≈ P(w_k | w_{k-N+1}, ..., w_{k-1})

N=3
P(w_k | w_1, ..., w_{k-1}) ≈ P(w_k | w_{k-2}, w_{k-1})
⇒ P(w_1, w_2, ..., w_T) ≈ P(w_1 | <s> <s>) ... P(w_T | w_{T-2}, w_{T-1})

Trigram Language Model, or 2nd order Markov Model
10
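To make the Markov approximation concrete, here is a minimal sketch of scoring a sentence with a bigram model; the probability values in `bigram_prob` are made-up placeholders, not estimates from any corpus.

```python
# Hypothetical bigram probabilities P(w_k | w_{k-1}); "<s>" pads the initial history.
bigram_prob = {("<s>", "this"): 0.3, ("this", "is"): 0.5,
               ("is", "a"): 0.4, ("a", "sentence"): 0.1}

def sentence_prob(words, bigram_prob):
    """P(w_1 ... w_T) under the bigram (1st order Markov) approximation."""
    history = "<s>"
    p = 1.0
    for w in words:
        p *= bigram_prob.get((history, w), 0.0)  # unseen bigram -> 0 (see data sparsity later)
        history = w
    return p

print(sentence_prob(["this", "is", "a", "sentence"], bigram_prob))  # 0.3*0.5*0.4*0.1 = 0.006
```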
Building N-gram Models
• Use existing English sentences to compute n-gram probability estimates (training)
• Terminology:
  • N = total number of words in training data (tokens)
  • C(w_1, ..., w_k) = frequency of n-gram w_1, ..., w_k in training data
  • V = vocabulary size / number of unique words in training data (types)
  • P(w_1, ..., w_k) = probability estimate for n-gram w_1 ... w_k
  • P(w_k | w_1, ..., w_{k-1}) = conditional probability of producing w_k given the history w_1, ..., w_{k-1}
11
Building N-gram Models
• Start with what’s easiest
• Compute maximum likelihood estimates for individual n-gram probabilities
  • Unigram: P(w_i) = C(w_i) / N
  • Bigram: P(w_i, w_j) = C(w_i w_j) / N
  • P(w_j | w_i) = P(w_i, w_j) / P(w_i) = C(w_i w_j) / Σ_w C(w_i w) = C(w_i w_j) / C(w_i)
• Uses relative frequencies as estimates
• Maximizes the likelihood of the data given the model, P(D|M)
12
Building N-gram Models
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
Training Corpus
P(I | <s>)    = 2/3 = 0.67    P(Sam | <s>) = 1/3 = 0.33
P(am | I)     = 2/3 = 0.67    P(do | I)    = 1/3 = 0.33
P(</s> | Sam) = 1/2 = 0.50    P(Sam | am)  = 1/2 = 0.50
...
Bigram Probability Estimates
Example: Bigram Language Model
13
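The estimates above can be reproduced with a few lines of counting. This is only a sketch; the function and variable names are ours, not from the slides.

```python
from collections import Counter

corpus = ["<s> I am Sam </s>",
          "<s> Sam I am </s>",
          "<s> I do not like green eggs and ham </s>"]

unigram_counts, bigram_counts = Counter(), Counter()
for line in corpus:
    tokens = line.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def p_mle(w, history):
    """Maximum likelihood estimate P(w | history) = C(history w) / C(history)."""
    return bigram_counts[(history, w)] / unigram_counts[history]

print(p_mle("I", "<s>"))     # 2/3 ≈ 0.67
print(p_mle("do", "I"))      # 1/3 ≈ 0.33
print(p_mle("</s>", "Sam"))  # 1/2 = 0.50
```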
More context, More work
• Larger N = Larger Context
  • Lexical co-occurrences
  • Local syntactic relations
• However, larger N also means a more complex model
  • For example, assume that we have 20,000 unique words in our training data (see the table below)
• Larger N has another, more serious problem!

Model     Number of parameters to estimate
Unigram   20000
Bigram    (20000)^2
Trigram   (20000)^3
14
Data Sparsity
P(I | <s>)    = 2/3 = 0.67    P(Sam | <s>) = 1/3 = 0.33
P(am | I)     = 2/3 = 0.67    P(do | I)    = 1/3 = 0.33
P(</s> | Sam) = 1/2 = 0.50    P(Sam | am)  = 1/2 = 0.50
...
Bigram Probability Estimates

P(I like ham) = P(I | <s>) P(like | I) P(ham | like) P(</s> | ham) = 0
(one of the factors, e.g. P(like | I), is 0 because that bigram was never seen in the training data)
15
Data Sparsity
• Serious problem for practical language modeling
• Becomes more severe as N increases (precision/recall tradeoff)
• Solution 1: Use larger training data
  • Can’t always work; blame Zipf’s Law
• Solution 2: Assign some non-zero probability to unseen n-grams
  • Known as Discounting
16
Discounting
• Zeros are bad for any statistical estimator
• Need better estimators because MLEs give us a lot of zeros
• Basic idea: Take from the rich (seen n-grams) and give to the poor (unseen n-grams)
• Also called Smoothing; a distribution without zeros is smoother
17
Laplace’s Law
• Simplest and oldest smoothing technique
• Just add 1 to all collected n-gram counts
• So, what do the revised estimates look like?
18
Laplace’s Law: Probabilities
Unigrams:
  P_MLE(w_i) = C(w_i) / N
  P_LAP(w_i) = (C(w_i) + 1) / (N + V)

Bigrams:
  P_MLE(w_i, w_j) = C(w_i w_j) / N
  P_LAP(w_i, w_j) = (C(w_i w_j) + 1) / (N + V^2)
  P_LAP(w_j | w_i) = P_LAP(w_i, w_j) / P_LAP(w_i) = (C(w_i w_j) + 1) / (C(w_i) + V)
19
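A sketch of the add-one (Laplace) bigram estimate on the earlier toy corpus. Variable names are ours, and counting <s> and </s> as vocabulary types is an assumption made for illustration.

```python
from collections import Counter

corpus = ["<s> I am Sam </s>", "<s> Sam I am </s>",
          "<s> I do not like green eggs and ham </s>"]
unigram_counts, bigram_counts = Counter(), Counter()
for line in corpus:
    toks = line.split()
    unigram_counts.update(toks)
    bigram_counts.update(zip(toks, toks[1:]))
V = len(unigram_counts)  # number of word types

def p_laplace(w, history):
    """P_LAP(w | history) = (C(history w) + 1) / (C(history) + V)."""
    return (bigram_counts[(history, w)] + 1) / (unigram_counts[history] + V)

print(p_laplace("like", "I"))  # unseen bigram now gets non-zero probability
print(p_laplace("am", "I"))    # seen bigram is discounted below its MLE of 2/3
```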
Laplace’s Law: Frequencies
Expected Frequency Estimates:
  C_LAP(w_i) = P_LAP(w_i) * N
  C_LAP(w_i w_j) = P_LAP(w_i, w_j) * N

Relative Discount:
  d_1 = C_LAP(w_i) / C(w_i)
  d_2 = C_LAP(w_i w_j) / C(w_i w_j)
20
Laplace’s Law
• Bayesian estimator with uniform priors (all n-grams equally likely)
• Moves too much mass over to unseen n-grams
• What if we added a fraction of 1 instead?
21
Lidstone’s Law of Succession
• Add 0 < γ < 1 to each count instead
  • The case of γ = 0.5 is known as the Jeffreys-Perks Law or Expected Likelihood Estimation
• Still two problems:
  • How to find the right value of γ? (The smaller γ is, the less mass is moved to the unseen n-grams)
  • This law gives probabilities linear in the MLE frequency, which does not match our observations
22
Good-Turing Estimator
• Estimate frequencies of items based on the assumption that they are binomially distributed
• Works well for words and n-grams despite the assumption not being satisfied
• Intuition: Use n-grams seen once to estimate n-grams never seen, and so on
23
Good-Turing Estimator
• Pre-compute N_r, the frequency of frequency r:

  N_r = Σ_{w_i w_j : C(w_i w_j) = r} 1

• For each r = C(w_i w_j), compute an expected frequency estimate:

  r* = C_GT(w_i w_j) = (r + 1) N_{r+1} / N_r

• Replace MLE counts of seen bigrams with the expected frequency estimates and use those for probabilities:

  P_GT(w_i, w_j) = C_GT(w_i, w_j) / N
  P_GT(w_j | w_i) = C_GT(w_i, w_j) / C(w_i)
24
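A sketch of the count adjustment above; it does no smoothing of the N_r curve, so it is only reliable for small r (a limitation discussed a few slides later). The function name and the fallback for N_{r+1} = 0 are our own choices.

```python
from collections import Counter

def good_turing_counts(bigram_counts):
    """Map each raw count r to r* = (r + 1) * N_{r+1} / N_r."""
    freq_of_freq = Counter(bigram_counts.values())  # N_r
    adjusted = {}
    for bigram, r in bigram_counts.items():
        if freq_of_freq[r + 1] > 0:
            adjusted[bigram] = (r + 1) * freq_of_freq[r + 1] / freq_of_freq[r]
        else:
            adjusted[bigram] = float(r)  # N_{r+1} = 0: keep the raw (MLE) count
    return adjusted
```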
Good-Turing Estimator
• What about an unseen bigram?

  r* = C_GT = (0 + 1) N_1 / N_0 = N_1 / N_0
  P_GT = C_GT / N

• Do we know N_0? Can we compute it?

  N_0 = V^2 - (number of bigrams we have seen)
25
Good-Turing Estimator: Example
r    N_r
1    138741
2    25413
3    10531
4    5997
5    3565

V = 14585
Seen bigrams = 199252
C(person she) = 2
C(person) = 223

N_0 = (14585)^2 - 199252
C_unseen = N_1 / N_0 = 0.00065
P_unseen = N_1 / (N_0 * N) = 1.06 x 10^-9
C_GT(person she) = (2 + 1) * 10531 / 25413 = 1.243
P(she|person) = C_GT(person she) / 223 = 0.0056
26
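The arithmetic in this example can be checked directly; this is a sanity-check snippet, not part of the original slides.

```python
N1, N2, N3 = 138741, 25413, 10531
V, seen_bigrams = 14585, 199252

N0 = V ** 2 - seen_bigrams
print(N1 / N0)                  # ≈ 0.00065  (expected count of an unseen bigram)
print((2 + 1) * N3 / N2)        # ≈ 1.243    (C_GT for "person she", r = 2)
print((2 + 1) * N3 / N2 / 223)  # ≈ 0.0056   (P(she | person), with C(person) = 223)
```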
Good-Turing Estimator
• Can’t replace all MLE counts
  • Unreliable for high values of r, e.g. N_r = 0 for r = r_max + 1
• Two solutions:
  • Only replace counts for r <= k (k ~ 10)
  • Fit a curve S through the observed (r, N_r) values and use S(r) instead
• Not used by itself but in combination with other techniques
27
Combining Estimators
• Combining n-gram probability estimates from different models
• Leverage different information sources of prediction and the N-gram hierarchy
• Three major combination techniques:
  • Simple Linear Interpolation of MLEs
  • Katz Backoff
  • Kneser-Ney Smoothing
28
Linear MLE Interpolation
• Mix a trigram model with bigram and unigram models to offset sparsity
• Mix = Weighted Linear Combination

  P(w_k | w_{k-2} w_{k-1}) = λ_1 P(w_k | w_{k-2} w_{k-1}) + λ_2 P(w_k | w_{k-1}) + λ_3 P(w_k)

  where 0 <= λ_i <= 1 and Σ_i λ_i = 1
29
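A sketch of the weighted combination; `p_uni`, `p_bi`, and `p_tri` stand for previously estimated MLE models, and the example weights are placeholders (in practice the λ's are tuned on held-out data, as the next slide notes).

```python
def interpolated_prob(w, u, v, p_uni, p_bi, p_tri, lambdas=(0.6, 0.3, 0.1)):
    """P(w | u v) ≈ l1*P_tri(w | u, v) + l2*P_bi(w | v) + l3*P_uni(w), with the l's summing to 1."""
    l1, l2, l3 = lambdas
    return l1 * p_tri(w, u, v) + l2 * p_bi(w, v) + l3 * p_uni(w)
```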
Linear MLE Interpolation
• May want to condition λ_i on the context
  • Trigrams with more accurate bigrams automatically get weighted higher
• The λ_i are estimated on some held-out data set (not training, not test)
• Estimation is usually done via an EM variant
30
Backoff Models
• Consult different models in order depending on specificity (instead of all at the same time)
• Use the most detailed model for the current context first and, if that doesn’t work, back off to a lower-order model
• Continue backing off until you reach a model that has some counts
• Needs to incorporate discounting as an integral part of the algorithm
31
Katz Backoff
Given a trigram “x y z”:

P_katz(z | x, y) = { P_GT(z | x, y)           if C(x, y, z) > 0
                   { α(x, y) P_katz(z | y)    otherwise

P_katz(z | y)    = { P_GT(z | y)              if C(y, z) > 0
                   { α(y) P_GT(z)             otherwise

Important: α(.) = 1 if the immediate history is unseen
32
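A two-level sketch of the backoff decision (a bigram backing off to a unigram); the callables `p_gt_bigram`, `p_gt_unigram`, and `alpha` are assumed to come from a separate Good-Turing/discounting step, and the names are ours.

```python
def p_katz_bigram(z, y, bigram_counts, p_gt_bigram, p_gt_unigram, alpha):
    """Use the discounted bigram estimate if the bigram was seen, else back off to the unigram."""
    if bigram_counts.get((y, z), 0) > 0:
        return p_gt_bigram(z, y)           # discounted, so some mass is left over
    return alpha(y) * p_gt_unigram(z)      # leftover mass alpha(y) is spread over unseen successors
```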
Katz Backoff
• Why the P_GTs and not the P_MLEs directly?
  • Can’t save any probability mass for lower order models without discounting
• Why the α’s?
  • To ensure that the total mass from all lower order models sums exactly to what we got from the discounting
• In short, we need the P_GTs and α’s to make sure that Σ_z' P_katz(z'|x,y) = 1
33
Backoff Models
• Simple to implement and work well
• Use discounting to tell us how much mass to set aside, and backoff to distribute it
• Criticized for sudden changes in probability estimates upon adding more data
• Hard to detect genuine ‘grammatical zeros’
34
Kneser-Ney Smoothing
• Average Good-Turing discount for r >= 3 is largely constant over r*
• Subtract a fixed discount D (<= 1) from non-zero counts (Absolute Discounting)
• Kneser-Ney: Interpolate this discounted model with a unigram model that’s special in nature

*Church & Gale (1991)
35
Kneser-Ney Smoothing
• Intuition
  • The lower-order model is important only when the higher-order model is sparse
  • It should be optimized to perform in such situations
• Example
  • C(Los Angeles) = M, C(Angeles) = M; M is very large
  • Angeles always and only occurs after Los
  • The unigram MLE for Angeles will be high, and a normal smoothing algorithm will likely pick it in any context
  • It shouldn’t, because Angeles occurs with only a single context in the entire training data
36
Kneser-Ney Smoothing
P_KN(w_k | w_{k-1}) = (C(w_{k-1} w_k) - D) / C(w_{k-1}) + β(w_{k-1}) P_CONT(w_k)

P_CONT(w_k) = N(• w_k) / Σ_{w'} N(• w')

where N(• w_k) = number of different contexts w_k has appeared in

• Uses a clever lower-order model
• The discount D is estimated on a held-out dataset
• Excellent performance; the default technique
37
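A sketch of an interpolated Kneser-Ney bigram model built from raw counts. The fixed discount D = 0.75, the use of the unigram count as C(w_{k-1}), and all names are assumptions for illustration, not the lecture's implementation.

```python
from collections import Counter, defaultdict

def make_kneser_ney_bigram(bigram_counts, unigram_counts, D=0.75):
    """Return p_kn(w, h) ≈ (C(h w) - D)/C(h) + beta(h) * P_CONT(w), for histories h seen in training."""
    left_contexts = defaultdict(set)  # N(. w): distinct histories that w has followed
    fanout = Counter()                # number of distinct words seen after h
    for (h, w), c in bigram_counts.items():
        if c > 0:
            left_contexts[w].add(h)
            fanout[h] += 1
    total_continuations = sum(len(s) for s in left_contexts.values())

    def p_kn(w, h):
        p_cont = len(left_contexts[w]) / total_continuations
        beta = D * fanout[h] / unigram_counts[h]  # mass freed by discounting the row C(h, .)
        return max(bigram_counts.get((h, w), 0) - D, 0) / unigram_counts[h] + beta * p_cont
    return p_kn
```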
Unknown words
• What if our vocabulary is not closed?
• Fix the vocabulary at some reasonable number of words
• During training:
  • Consider any words that don’t occur in this list as unknown or out of vocabulary (OOV) words
  • Replace all OOVs with the special word <UNK>
  • Treat <UNK> as any other word and count and estimate probabilities
• During testing:
  • Replace unknown words with <UNK> and use the LM
  • Test set characterized by OOV rate (percentage of OOVs)
38
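A sketch of the vocabulary-closing step during training; the 20,000-word cap and the function name are illustrative choices, not from the slides.

```python
from collections import Counter

def close_vocabulary(sentences, max_vocab=20000):
    """Keep the most frequent word types; map everything else to <UNK>."""
    counts = Counter(w for s in sentences for w in s.split())
    vocab = {w for w, _ in counts.most_common(max_vocab)}
    return [" ".join(w if w in vocab else "<UNK>" for w in s.split())
            for s in sentences]
```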
Evaluating Language Models
• Information theoretic criteria used
• Most common: Perplexity assigned by the trained LM to a test set
• Perplexity: How surprised are you on average by what comes next?
  • If you (the LM) are good at knowing what comes next in a sentence ⇒ low perplexity
  • Lower perplexity is better
39
Computing Perplexity
• Given a test set T with sentences t_1, ..., t_n and W_T total words
• Using our LM, we can calculate

  P(T) = Π_{i=1}^{n} P(t_i)

• Mathematically, perplexity (PP) is defined as the reciprocal of the geometric average probability assigned to T:

  PP(T) = 1 / P(T)^{1/W_T} = ( Π_{i=1}^{n} 1 / P(t_i) )^{1/W_T}
40
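A small sketch of the computation; working with log-probabilities is an implementation choice to avoid numerical underflow and is equivalent to the formula above. The example numbers are made up.

```python
import math

def perplexity(sentence_logprobs, num_words):
    """PP(T) = exp(-(1/W_T) * sum_i log P(t_i))."""
    return math.exp(-sum(sentence_logprobs) / num_words)

# e.g. two test sentences with natural-log probabilities -12.3 and -9.1, W_T = 10 words
print(perplexity([-12.3, -9.1], num_words=10))  # ≈ 8.5
```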
Practical Evaluation
• Count </s> but not <s> in W_T
• Typical range of perplexities on English text is 50-1000
• Closed vocabulary testing yields much lower perplexities
• Testing across genres yields higher perplexities
• Can only compare perplexities if the LMs use the same vocabulary

Order     Unigram   Bigram   Trigram
PP        962       170      109

Training: N = 38 million, V ~ 20000, open vocabulary, Katz backoff where applicable
Test: 1.5 million words, same genre as training
41
Today’s LMs*
• Training
  • N = 1.2 billion words, V = 20k words
  • Trigram model with Kneser-Ney smoothing
• Testing
  • 25 million words, OOV rate 3.8%
  • Perplexity 79.3

*More like last year’s
42