CMSC 723/LING 723 Computational Linguistics I
N-gram Language Models
Lecture 9, October 31, 2007
Professor: Bonnie Dorr
Co-instructor: Nitin Madnani
TAs: Hamid Shahri, Alexandros Tzannes

Administrivia
• Take-home midterm due now!
• Assignment 3 graded; Assignment 4 out
• Assignment submission:
  • Use the subject “CMSC 723: Assignment N”
  • Print out the written portion and submit it in class
  • Submit zips/tarballs, not individual files
• Forum open for use

Modeling English
• Assign probability values to utterances in English
• Applications:
  • Statistical Machine Translation (1st lecture)
  • Speech/Handwriting Recognition
  • Spelling Correction
  • Predictive Text Input

N-grams
N-grams are partially overlapping word sequences of length N. For the sentence “This is a sentence”:
• N=1 (Unigrams): This, is, a, sentence (4; no overlap/dependence)
• N=2 (Bigrams): This is, is a, a sentence (3)
• N=3 (Trigrams): This is a, is a sentence (2)

Computing Probabilities
P(w1, w2, ..., wT) = P(w1) P(w2|w1) P(w3|w1, w2) ... P(wT|w1, ..., wT−1)   [by the chain rule]
Is this practical? No! We can’t keep track of all possible histories of all words.

Approximating Probabilities
Basic idea: limit the history to a fixed number of words N (the Markov Assumption):
P(wk|w1, ..., wk−1) ≈ P(wk|wk−N+1, ..., wk−1)
• N=1: P(wk|w1, ..., wk−1) ≈ P(wk)
  ⇒ P(w1, w2, ..., wT) ≈ P(w1) P(w2) ... P(wT)
  Unigram language model, or 0th-order Markov model
• N=2: P(wk|w1, ..., wk−1) ≈ P(wk|wk−1)
  ⇒ P(w1, w2, ..., wT) ≈ P(w1|<S>) P(w2|w1) ... P(wT|wT−1)
  Bigram language model, or 1st-order Markov model
• N=3: P(wk|w1, ..., wk−1) ≈ P(wk|wk−2, wk−1)
  ⇒ P(w1, w2, ..., wT) ≈ P(w1|<S><S>) ... P(wT|wT−2 wT−1)
  Trigram language model, or 2nd-order Markov model

Building N-gram Models
• Use existing English sentences to compute n-gram probability estimates (training)
• Terminology:
  • N = total number of words in the training data (tokens)
  • V = vocabulary size / number of unique words in the training data (types)
  • C(w1, ..., wk) = frequency of the n-gram w1, ..., wk in the training data
  • P(w1, ..., wk) = probability estimate for the n-gram w1, ..., wk
  • P(wk|w1, ..., wk−1) = conditional probability of producing wk given the history w1, ..., wk−1

Building N-gram Models
• Start with what’s easiest: compute maximum likelihood estimates (MLEs) for individual n-gram probabilities
  • Unigram: P(wi) = C(wi) / N
  • Bigram: P(wi, wj) = C(wi wj) / N
    P(wj|wi) = P(wi, wj) / P(wi) = C(wi wj) / C(wi) = C(wi wj) / Σw C(wi w)
• Uses relative frequencies as estimates
• Maximizes the likelihood of the data given the model, P(D|M)

Building N-gram Models: Example (Bigram Language Model)
Training corpus:
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

Bigram probability estimates:
P(I|<s>) = 2/3 = 0.67      P(Sam|<s>) = 1/3 = 0.33
P(am|I) = 2/3 = 0.67       P(do|I) = 1/3 = 0.33
P(</s>|Sam) = 1/2 = 0.50   P(Sam|am) = 1/2 = 0.50
...
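As a concrete illustration of the MLE bigram estimates above, here is a minimal Python sketch (not part of the original slides; the function name and corpus variable are my own) that reproduces the numbers from the toy training corpus:

```python
from collections import Counter

def estimate_bigrams(sentences):
    """Maximum likelihood bigram estimates: P(wj|wi) = C(wi wj) / C(wi)."""
    bigram_counts = Counter()
    context_counts = Counter()
    for tokens in sentences:
        for w1, w2 in zip(tokens, tokens[1:]):
            bigram_counts[(w1, w2)] += 1
            context_counts[w1] += 1
    return {(w1, w2): c / context_counts[w1]
            for (w1, w2), c in bigram_counts.items()}

corpus = [
    "<s> I am Sam </s>".split(),
    "<s> Sam I am </s>".split(),
    "<s> I do not like green eggs and ham </s>".split(),
]
probs = estimate_bigrams(corpus)
print(probs[("<s>", "I")])     # 2/3 ≈ 0.67
print(probs[("I", "am")])      # 2/3 ≈ 0.67
print(probs[("Sam", "</s>")])  # 1/2 = 0.50
```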
More Context, More Work
• Larger N = larger context
  • Lexical co-occurrences
  • Local syntactic relations
• However, larger N also means a more complex model. For example, assume that we have 20,000 unique words in our training data:

  Model     Number of parameters to estimate
  Unigram   20000
  Bigram    (20000)^2
  Trigram   (20000)^3

• Larger N has another, more serious problem!

Data Sparsity
Using the bigram probability estimates from the toy corpus above:
P(I like ham) = P(I|<s>) P(like|I) P(ham|like) P(</s>|ham) = 0
(bigrams such as “I like” and “like ham” were never seen in the training data)

Data Sparsity
• A serious problem for practical language modeling
• Becomes more severe as N increases (precision/recall tradeoff)
• Solution 1: Use larger training data
  • Can’t always work; blame Zipf’s Law
• Solution 2: Assign some non-zero probability to unseen n-grams
  • Known as discounting

Discounting
• Zeros are bad for any statistical estimator
  • We need better estimators because MLEs give us a lot of zeros
• Basic idea: take from the rich (seen n-grams) and give to the poor (unseen n-grams)
• Also called smoothing; a distribution without zeros is smoother

Laplace’s Law
• The simplest and oldest smoothing technique
• Just add 1 to all collected n-gram counts
• So, what do the revised estimates look like?

Laplace’s Law: Probabilities
Unigrams:
  P_MLE(wi) = C(wi) / N
  P_LAP(wi) = (C(wi) + 1) / (N + V)
Bigrams:
  P_MLE(wi, wj) = C(wi wj) / N
  P_LAP(wi, wj) = (C(wi wj) + 1) / (N + V^2)
  P_LAP(wj|wi) = P_LAP(wi, wj) / P_LAP(wi) = (C(wi wj) + 1) / (C(wi) + V)

Laplace’s Law: Frequencies
Expected frequency estimates:
  C_LAP(wi) = P_LAP(wi) · N
  C_LAP(wi wj) = P_LAP(wi, wj) · N
Relative discounts:
  d1 = C_LAP(wi) / C(wi)
  d2 = C_LAP(wi wj) / C(wi wj)

Laplace’s Law
• A Bayesian estimator with uniform priors (all n-grams equally likely)
• Moves too much mass over to unseen n-grams
• What if we added a fraction of 1 instead?

Lidstone’s Law of Succession
• Add 0 < γ < 1 to each count instead
  • The smaller γ is, the less mass is moved to the unseen n-grams
• The case γ = 0.5 is known as the Jeffreys-Perks Law, or Expected Likelihood Estimation
• Still two problems:
  • How do we find the right value of γ?
  • This law gives probabilities linear in the MLE frequency, which does not match our observations

Good-Turing Estimator
• Estimates frequencies of items based on the assumption that they are binomially distributed
• Works well for words and n-grams despite the assumption not being satisfied
• Intuition: use n-grams seen once to estimate n-grams never seen, and so on

Good-Turing Estimator
• Pre-compute Nr, the frequency of frequency r:
  Nr = Σ_{wi wj : C(wi wj) = r} 1
• For each r = C(wi wj), compute an expected frequency estimate:
  r* = C_GT(wi wj) = (r + 1) N_{r+1} / N_r
• Replace the MLE counts of seen bigrams with the expected frequency estimates and use those for probabilities:
  P_GT(wi, wj) = C_GT(wi, wj) / N
  P_GT(wj|wi) = C_GT(wi, wj) / C(wi)

Good-Turing Estimator
• What about an unseen bigram?
  r* = C_GT = (0 + 1) N1 / N0 = N1 / N0
  P_GT = C_GT / N = N1 / (N0 · N)
• Do we know N0? Can we compute it?
  N0 = V^2 − the number of bigrams we have seen
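To make the bookkeeping concrete, here is a small Python sketch (my own illustration, not lecture code) that computes the frequencies of frequencies Nr and the Good-Turing expected counts r* = (r + 1) N_{r+1} / N_r for bigrams, including the unseen case via N0 = V^2 − number of seen bigrams. Falling back to the raw count when N_r is zero is my own choice, anticipating the reliability issue discussed on a later slide:

```python
from collections import Counter

def good_turing_counts(bigram_counts, vocab_size):
    """Good-Turing expected counts r* = (r + 1) * N_{r+1} / N_r for bigrams."""
    # N_r: how many distinct bigram types occur exactly r times
    freq_of_freq = Counter(bigram_counts.values())
    # N_0: bigram types that were never observed
    freq_of_freq[0] = vocab_size ** 2 - len(bigram_counts)

    def c_gt(r):
        n_r, n_r_plus_1 = freq_of_freq[r], freq_of_freq[r + 1]
        if n_r == 0:       # undefined for high r where N_r = 0; keep the MLE count
            return float(r)
        return (r + 1) * n_r_plus_1 / n_r

    adjusted = {bigram: c_gt(r) for bigram, r in bigram_counts.items()}
    return adjusted, c_gt(0)   # adjusted counts for seen bigrams, count for any unseen bigram

# Hypothetical usage, given bigram counts from some training corpus:
# adjusted, unseen_count = good_turing_counts(bigram_counts, vocab_size=14585)
```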
Good-Turing Estimator: Example

  r    Nr
  1    138741
  2    25413
  3    10531
  4    5997
  5    3565

V = 14585; seen bigrams = 199252
N0 = (14585)^2 − 199252
C_unseen = N1/N0 = 0.00065
P_unseen = N1/(N0 · N) = 1.06 × 10^−9

C(person she) = 2
C_GT(person she) = (2 + 1) · 10531/25413 = 1.243
C(person) = 223
P(she|person) = C_GT(person she)/223 = 0.0056

Good-Turing Estimator
• Can’t replace all MLE counts
  • Unreliable for high values of r, e.g. Nr = 0 for r = rmax + 1
• Two solutions:
  • Only replace counts for r <= k (k ≈ 10)
  • Fit a curve S through the observed (r, Nr) values and use S(r) instead
• Not used by itself, but in combination with other techniques

Combining Estimators
• Combine n-gram probability estimates from different models
  • Leverages different information sources of prediction and the n-gram hierarchy
• Three major combination techniques:
  • Simple linear interpolation of MLEs
  • Katz backoff
  • Kneser-Ney smoothing

Linear MLE Interpolation
• Mix a trigram model with bigram and unigram models to offset sparsity
• Mix = weighted linear combination:
  P_interp(wk|wk−2 wk−1) = λ1 P(wk|wk−2 wk−1) + λ2 P(wk|wk−1) + λ3 P(wk)
  where 0 <= λi <= 1 and Σi λi = 1

Linear MLE Interpolation
• May want to condition the λi on the context
  • Trigrams with more accurate bigrams then automatically get weighted higher
• The λi are estimated on some held-out data set (not training, not test)
• Estimation is usually done via an EM variant

Backoff Models
• Consult different models in order, depending on specificity (instead of all at the same time)
• Try the most detailed model for the current context first and, if that doesn’t work, back off to a lower-order model
• Continue backing off until you reach a model that has some counts
• Needs to incorporate discounting as an integral part of the algorithm

Katz Backoff
Given a trigram “x y z”:
  P_katz(z|x, y) = P_GT(z|x, y)          if C(x, y, z) > 0
                 = α(x, y) P_katz(z|y)   otherwise
  P_katz(z|y)    = P_GT(z|y)             if C(y, z) > 0
                 = α(y) P_GT(z)          otherwise
Important: α(·) = 1 if the immediate history is unseen

Katz Backoff
• Why the P_GTs and not the P_MLEs directly?
  • We can’t save any probability mass for lower-order models without discounting
• Why the α’s?
  • To ensure that the total mass given to all lower-order models sums exactly to what we set aside by discounting
• In short, we need the P_GTs and the α’s to make sure that Σz′ P_katz(z′|x, y) = 1

Backoff Models
• Simple to implement and work well
• Use discounting to tell us how much mass to set aside, and backoff to distribute it
• Criticized for sudden changes in probability estimates upon adding more data
• Hard to detect genuine ‘grammatical zeros’

Kneser-Ney Smoothing
• The average Good-Turing discount for r >= 3 is largely constant over r (Church & Gale, 1991)
• So subtract a fixed discount D (<= 1) from non-zero counts (absolute discounting)
• Kneser-Ney: interpolate this discounted model with a unigram model that is special in nature

Kneser-Ney Smoothing
• Intuition
  • The lower-order model is important only when the higher-order model is sparse
  • It should be optimized to perform well in such situations
• Example
  • C(Los Angeles) = M, C(Angeles) = M; M is very large
  • “Angeles” always and only occurs after “Los”
  • The unigram MLE for “Angeles” will be high, and a normal smoothing algorithm will likely pick it in any context
  • It shouldn’t, because “Angeles” occurs with only a single context in the entire training data

Kneser-Ney Smoothing
  P_KN(wk|wk−1) = (C(wk−1 wk) − D) / C(wk−1) + β(wk−1) P_CONT(wk)
  P_CONT(wk) = N(• wk) / Σ_wk N(• wk)
  N(• wk) = the number of different contexts wk has appeared in
• Uses a clever lower-order model, estimated on a held-out dataset
• Excellent performance; the default technique
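To ground the interpolated formula above, here is a minimal Python sketch of a Kneser-Ney bigram model (my own illustration, not lecture code). It assumes a fixed discount D, uses max(·, 0) so the discounted term stays non-negative for unseen bigrams, and, unlike the held-out estimation mentioned on the slide, computes the continuation probabilities from the same training counts:

```python
from collections import Counter, defaultdict

def train_bigram_kneser_ney(sentences, D=0.75):
    """Interpolated Kneser-Ney bigram model with a fixed discount D (a sketch)."""
    bigram_counts = Counter()
    context_counts = Counter()        # C(w_{k-1})
    left_contexts = defaultdict(set)  # distinct words seen immediately before w
    right_words = defaultdict(set)    # distinct words seen immediately after a context
    for tokens in sentences:
        for w1, w2 in zip(tokens, tokens[1:]):
            bigram_counts[(w1, w2)] += 1
            context_counts[w1] += 1
            left_contexts[w2].add(w1)
            right_words[w1].add(w2)
    num_bigram_types = len(bigram_counts)

    def p_continuation(w):
        # P_CONT(w) = N(• w) / Σ_w N(• w); the denominator equals the number of bigram types
        return len(left_contexts[w]) / num_bigram_types

    def p_kn(w, prev):
        c_prev = context_counts[prev]
        if c_prev == 0:               # unseen context: fall back to the continuation probability
            return p_continuation(w)
        discounted = max(bigram_counts[(prev, w)] - D, 0.0) / c_prev
        beta = D * len(right_words[prev]) / c_prev   # mass set aside by discounting in this context
        return discounted + beta * p_continuation(w)

    return p_kn
```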
Unknown Words
• What if our vocabulary is not closed?
• Fix the vocabulary at some reasonable number of words
• During training:
  • Consider any words that don’t occur in this list as unknown or out-of-vocabulary (OOV) words
  • Replace all OOVs with the special word <UNK>
  • Treat <UNK> like any other word: count it and estimate probabilities for it
• During testing:
  • Replace unknown words with <UNK> and use the LM
  • The test set is characterized by its OOV rate (percentage of OOVs)

Evaluating Language Models
• Information-theoretic criteria are used
• Most common: the perplexity the trained LM assigns to a test set
• Perplexity: how surprised are you, on average, by what comes next?
  • If you (the LM) are good at knowing what comes next in a sentence → low perplexity
  • Lower perplexity is better

Computing Perplexity
• Given a test set T with sentences t1, ..., tn and a total of WT words
• Using our LM, we can calculate
  P(T) = Π_{i=1}^{n} P(ti)
• Mathematically, perplexity (PP) is defined as the reciprocal of the geometric average probability assigned to T:
  PP(T) = P(T)^(−1/WT) = ( Π_{i=1}^{n} 1/P(ti) )^(1/WT)

Practical Evaluation
• Count </s> but not <s> in WT
• The typical range of perplexities on English text is 50-1000
• Closed-vocabulary testing yields much lower perplexities
• Testing across genres yields higher perplexities
• Perplexities can only be compared if the LMs use the same vocabulary

  Order     PP
  Unigram   962
  Bigram    170
  Trigram   109

  Training: N = 38 million words, V ≈ 20000, open vocabulary, Katz backoff where applicable
  Test: 1.5 million words, same genre as training

Today’s LMs*
• Training
  • N = 1.2 billion words, V = 20k words
  • Trigram model with Kneser-Ney smoothing
• Testing
  • 25 million words, OOV rate 3.8%
  • Perplexity 79.3
* More like last year’s
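A minimal Python sketch of the perplexity calculation above (my own illustration; sentence_logprob stands in for whatever trained language model is being evaluated):

```python
import math

def perplexity(test_sentences, sentence_logprob):
    """PP(T) = P(T)^(-1/WT), computed in log space for numerical stability.

    test_sentences: lists of tokens, each including </s> but not <s>,
    so that WT counts </s> but not <s>, as on the slide above.
    sentence_logprob: a function returning the natural log of P(t_i) under the LM.
    """
    total_logprob = sum(sentence_logprob(tokens) for tokens in test_sentences)
    total_words = sum(len(tokens) for tokens in test_sentences)
    return math.exp(-total_logprob / total_words)
```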