Statistical Machine Translation, Week 4: Language Modelling. Lecturer: Qun Liu
CA446 Statistical Machine Translation
Week 4: Language Modelling
Lecturer: Qun Liu
Lab Tutors: Xiaofeng Wu, Iacer Calixto
2nd Semester, 2014-2015 Academic Year
http://computing.dcu.ie/~qliu/CA446

Content
• What's a Language Model?
• N-gram Language Model
• Language Model Smoothing
• Language Model Evaluation
• Managing Very Large Language Models
• Other Language Modelling Approaches

What is a language?
Can we define a language mathematically?
• Deterministic definition: a language is the set of all the sentences we can say.
• Probabilistic definition: a language is a probability distribution over all possible sentences.

Deterministic Definition
• Given a vocabulary V = {v_1, …, v_n},
• we use V* to denote the set of all word sequences over V:
  V* = { W | W = w_1 … w_l, l ≥ 1, w_i ∈ V }
• A language can then be defined as a subset of these sequences: L ⊆ V*.
• Any element of L is called a sentence of the language.

Formal Language
• Languages may contain infinitely many sentences, so they cannot be defined by enumeration.
• One way to define a formal language is to give a grammar that generates all the sentences of the language.
• Another way is to define a machine (automaton) that recognises whether a string is a valid sentence of the language.

Chomsky Hierarchy
• Type-0: recursively enumerable languages; Turing machine; production rules α → β (no restrictions)
• Type-1: context-sensitive languages; linear-bounded non-deterministic Turing machine; rules of the form αAβ → αγβ
• Type-2: context-free languages; non-deterministic pushdown automaton; rules of the form A → α
• Type-3: regular languages; finite state automaton; rules of the form A → a and A → aB

Chomsky Hierarchy: examples
• Regular languages:
  – [+-]?[0-9]+(\.[0-9]*)?  (decimal numbers)
  – a^m b^n
• Context-free languages:
  – a^n b^n
  – the syntax of programming languages
• Context-sensitive languages:
  – a^n b^n c^n
• Recursively enumerable languages:
  – semi-decidable problems

Chomsky Hierarchy
[Figure: set inclusions of the formal languages described by the Chomsky hierarchy]

What is a language? (revisited)
• Deterministic definition: a language is the set of all the sentences we can say.
• Probabilistic definition: a language is a probability distribution over all possible sentences.

Probabilistic Definition
• The deterministic definition is not suitable for natural language in most cases.
• For some utterances it is not easy to say whether they are legal sentences of a given language:
  – child language
  – tweets
  – slang
  – …

Statistical Language Model
A statistical language model assigns a probability to every sentence:
  p(s = w_1 … w_n), for all i: w_i ∈ V
with the normalisation condition:
  Σ_{s ∈ V*} p(s) = 1

Noisy Channel Model Revisited
  ê = argmax_e p(e) p(f|e)
Here p(e) is the language model, p(f|e) is the translation model, and the decoder carries out the argmax search.

Statistical Language Model
• How can we estimate the probability of a sentence in a specific language?
• Unlike estimating the probability distribution of a die, we cannot exhaust all possible sentences with a limited sample.
• Idea: break sentences down
  – break all sentences down into a limited set of substrings
  – estimate the probability of a sentence from these substrings
  – if a string has lots of reasonable/plausible/likely substrings (n-grams), then it might be a reasonable sentence.

The Simplest Language Model
• The simplest way to break down a sentence is to split it into words.
• This gives the simplest language model:
  p(s = w_1 … w_n) = ∏_{i=1}^{n} p(w_i)
• Here the probability of a sentence is just the product of the probabilities of the words it contains.
• This model is called a unigram language model.
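To make the unigram model concrete, here is a minimal Python sketch (not from the lecture; the toy corpus and function names are purely illustrative). It estimates p(w) by relative frequency and scores a sentence as the product of its word probabilities, and it exposes the two weaknesses discussed on the following slides: unseen words receive probability zero, and word order is ignored.

```python
from collections import Counter

def train_unigram(corpus):
    """Estimate p(w) = count(w) / number of tokens from a tokenised corpus."""
    counts = Counter(w for sentence in corpus for w in sentence)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def unigram_prob(sentence, p):
    """p(s) = product of p(w_i); unseen words contribute probability 0."""
    prob = 1.0
    for w in sentence:
        prob *= p.get(w, 0.0)
    return prob

# Illustrative toy corpus (not from the lecture)
corpus = [["i", "am", "a", "student", "."],
          ["i", "am", "a", "teacher", "."]]
p = train_unigram(corpus)
print(unigram_prob(["i", "am", "a", "student", "."], p))  # ~1.6e-4
print(unigram_prob(["a", "i", ".", "student", "am"], p))  # same value: word order is ignored
print(unigram_prob(["i", "am", "a", "zdwi", "."], p))     # 0.0: unseen word
```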
Word Frequency
• p(w) is the word's relative frequency:
  p(w) = (occurrences of word w in the corpus) / (number of tokens in the corpus)

  Type              Occurrences   Rank
  the               3,789,654     1st
  he                2,098,762     2nd
  king              57,897        1,356th
  boy               56,975        1,357th
  stringyfy         5             34,589th
  transducionalify  1             123,567th

Word Frequency
[two figure-only slides]

Unigram Language Model
p("I am a student.") = p("I") × p("am") × p("a") × p("student") × p(".")
• Problems of the unigram language model:
  – unseen words: p("I am a zdwi.") = ?
  – no word order: p("I am a student.") = p("a I . student am")

Content (next: N-gram Language Model)

N-gram Language Model
By the chain rule,
  p(s = w_1 … w_n) = p(w_1) × p(w_2|w_1) × … × p(w_n|w_1 … w_{n−1}) = ∏_{i=1}^{n} p(w_i | w_1 … w_{i−1})

Markov Assumption
It is assumed that the probability of observing the i-th word w_i given the history of the preceding i−1 words can be approximated by the probability of observing it given only the shortened history of the preceding N−1 words (an (N−1)th-order Markov property).

N-gram Language Model
  p(s = w_1 … w_n) = ∏_{i=1}^{n} p(w_i | w_1 … w_{i−1})           (the whole history: i−1 words)
                   ≈ ∏_{i=1}^{n} p(w_i | w_{i−(N−1)} … w_{i−1})   (the shortened history: N−1 words)

N-gram Language Model
• 1-gram (unigram) model: p(s = w_1 … w_n) = p(w_1) × p(w_2) × … × p(w_n)
• 2-gram (bigram) model: p(s = w_1 … w_n) = p(w_1) × p(w_2|w_1) × … × p(w_n|w_{n−1})
• 3-gram (trigram) model: p(s = w_1 … w_n) = p(w_1) × p(w_2|w_1) × p(w_3|w_1 w_2) × … × p(w_n|w_{n−2} w_{n−1})

N-gram Language Model
bigram:
  p(I don't like spiders that are poisonous)
    = p(I) × p(don't|I) × p(like|don't) × p(spiders|like) × p(that|spiders) × p(are|that) × p(poisonous|are)
trigram:
  p(I don't like spiders that are poisonous)
    = p(I) × p(don't|I) × p(like|I don't) × p(spiders|don't like) × p(that|like spiders) × p(are|spiders that) × p(poisonous|that are)

Dealing with Sentence Boundaries
• To better estimate the probabilities of words at sentence boundaries, we usually add <s> and </s> before and after each sentence.

N-gram Language Model
bigram:
  p(<s> I don't like spiders that are poisonous </s>)
    = p(I|<s>) × p(don't|I) × p(like|don't) × p(spiders|like) × p(that|spiders) × p(are|that) × p(poisonous|are) × p(</s>|poisonous)
trigram:
  p(<s> I don't like spiders that are poisonous </s>)
    = p(I|<s>) × p(don't|<s> I) × p(like|I don't) × p(spiders|don't like) × p(that|like spiders) × p(are|spiders that) × p(poisonous|that are) × p(</s>|are poisonous)

Probability of an n-gram
Suppose we have the phrase "x y" (i.e. word x followed by word y). p(y|x) is the probability that word y follows word x, and it can be estimated from a corpus as follows:
  p(y|x) = count("x y") / count("x")          (bigram)
Similarly, suppose we have the phrase "x y z". p(z|x y) is the probability that word z follows words x and y:
  p(z|x y) = count("x y z") / count("x y")    (trigram)
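As a rough illustration of these count-based estimates, here is a small Python sketch (illustrative only; the toy corpus and function names are invented, not the lecture's code). It collects bigram counts over sentences padded with <s> and </s>, estimates p(y|x) = count("x y") / count("x"), and multiplies the conditional probabilities to score a sentence.

```python
from collections import Counter

def train_bigram(corpus):
    """Collect the counts needed for the MLE estimate p(y|x) = count("x y") / count("x")."""
    history_counts, bigram_counts = Counter(), Counter()
    for sentence in corpus:
        words = ["<s>"] + sentence + ["</s>"]   # sentence boundary markers, as on the slides
        history_counts.update(words[:-1])       # every token occurs as a history except </s>
        bigram_counts.update(zip(words, words[1:]))
    return history_counts, bigram_counts

def bigram_prob(y, x, history_counts, bigram_counts):
    """MLE p(y|x); unseen events get 0.0 (smoothing is covered later in the lecture)."""
    return bigram_counts[(x, y)] / history_counts[x] if history_counts[x] else 0.0

def sentence_prob(sentence, history_counts, bigram_counts):
    """p(<s> w_1 ... w_n </s>) = product over i of p(w_i | w_{i-1})."""
    words = ["<s>"] + sentence + ["</s>"]
    prob = 1.0
    for x, y in zip(words, words[1:]):
        prob *= bigram_prob(y, x, history_counts, bigram_counts)
    return prob

# Illustrative toy corpus
corpus = [["i", "don't", "like", "spiders"], ["i", "like", "tea"]]
hist, bi = train_bigram(corpus)
# p(i|<s>) * p(like|i) * p(spiders|like) * p(</s>|spiders) = 1 * 1/2 * 1/2 * 1 = 0.25
print(sentence_prob(["i", "like", "spiders"], hist, bi))
```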
Size of an N-gram LM
• The size of an n-gram LM is the number of parameters in the model.
• For an n-gram LM, the maximal number of parameters is |V|^n, where V is the vocabulary.

  LM size (|V| = 100K):  unigram 100K   bigram 10^10 (= 10G)   trigram 10^15 (= 1T)   4-gram 10^20

A Character-Based Bigram LM
  p(t-i-p) = p(t) × p(i|t) × p(p|i) = 1.0 × 0.3 × 0.6 = 0.18

Bag Translation
• In bag translation we take a sentence, cut it up into words, place the words in a bag, and then try to recover the sentence given the bag.
• We use the n-gram model to rank different arrangements of the words in the bag.
• Thus, we treat an arrangement S as better than another arrangement S' if Pr(S) is greater than Pr(S').
• We tried this scheme on a random sample of sentences.
This example comes from: Brown, P. F., Cocke, J., Della Pietra, S. A., et al. A statistical approach to machine translation. Computational Linguistics, 1990, 16(2): 79-85.

Bag Translation
From a collection of 100 sentences, we considered the 38 sentences with fewer than 11 words each. (Why?)

Unseen n-grams
• We have seen "i like to" in our corpus.
• We have never seen "i like to smooth" in our corpus:
  p(smooth | i like to) = 0
• Any sentence that includes "i like to smooth" will be assigned probability 0.
• Why is this a bad thing?

Content (next: Language Model Smoothing)

Language Model Smoothing
• Smoothing in language modelling is the process by which unseen events are assigned a non-zero probability.
• This is achieved by some combination of:
  – count adjustment
  – interpolation
  – back-off

Count Adjustment
  count(tea) = 10
  count(tea cup) = 3
  count(tea drinker) = 4
  count(tea time) = 3

  p(x|tea) = count(tea x) / count(tea)

  p(cup|tea) = .3
  p(drinker|tea) = .4
  p(time|tea) = .3

Count Adjustment
What happens when we want to estimate the probability of the phrase "tea cosy"?
  p(cosy|tea) = 0/10 = 0
Just because we haven't seen an event in our sample, it doesn't mean that it can't occur. The smaller our sample size, the less we trust counts derived from it. Therefore, we need to assign p(cosy|tea) a non-zero probability.

Count Adjustment
Remember, for our model to be sound:
  Σ_x p(x|tea) = 1

Count Adjustment
So, for p(cosy|tea) not to be zero, we have to adjust downwards the probabilities of our seen events:
• p(time|tea)
• p(cup|tea)
• p(drinker|tea)
How do we do that? With
• add-one smoothing,
• add-α smoothing, or
• Good-Turing smoothing.

Add-One Smoothing
For all possible events, add a count of one:
  p = (c + 1) / (n + v)
In the case of an n-gram language model:
  p = p(w_n | w_1 … w_{n−1})
  c = count of the n-gram w_1 … w_{n−1} w_n in the corpus
  n = count of the history w_1 … w_{n−1} in the corpus
  v = vocabulary size
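A minimal sketch of the add-one estimate, assuming the n-gram and history counts have already been collected (the function name is illustrative; the counts are those of the tea example above, which is worked through on the following slides).

```python
def add_one_prob(history, word, ngram_counts, history_counts, vocab_size):
    """Add-one (Laplace) estimate: p = (c + 1) / (n + v)."""
    c = ngram_counts.get((history, word), 0)   # count of the n-gram "history word"
    n = history_counts.get(history, 0)         # count of the history
    return (c + 1) / (n + vocab_size)

# Bigram counts from the tea example; vocabulary size v = 5
ngram_counts = {("tea", "time"): 3, ("tea", "cup"): 3, ("tea", "drinker"): 4}
history_counts = {"tea": 10}
print(add_one_prob("tea", "cup", ngram_counts, history_counts, 5))   # 4/15: seen event, adjusted down
print(add_one_prob("tea", "cosy", ngram_counts, history_counts, 5))  # 1/15: unseen event, no longer zero
```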
Add-One Smoothing
Words in the language = {tea, time, cup, drinker, cosy}; vocabulary size = 5.
Observed events (10 bigrams): "tea time" × 3, "tea cup" × 3, "tea drinker" × 4.

Bigram counts (row = first word, column = second word):

           tea  time  cup  drinker  cosy | SUM
  tea       0    3     3     4       0   |  10
  time      0    0     0     0       0   |   0
  cup       0    0     0     0       0   |   0
  drinker   0    0     0     0       0   |   0
  cosy      0    0     0     0       0   |   0

Add-One Smoothing
After adding one to every count:

           tea    time   cup    drinker  cosy  | SUM
  tea      0+1=1  3+1=4  3+1=4  4+1=5    0+1=1 |  15
  time     0+1=1  0+1=1  0+1=1  0+1=1    0+1=1 |   5
  cup      0+1=1  0+1=1  0+1=1  0+1=1    0+1=1 |   5
  drinker  0+1=1  0+1=1  0+1=1  0+1=1    0+1=1 |   5
  cosy     0+1=1  0+1=1  0+1=1  0+1=1    0+1=1 |   5

  p(cup|tea) = (3 + 1)/(10 + 5) = 4/15        p(cup|cup) = 1/5
  p(time|tea) = (3 + 1)/(10 + 5) = 4/15       p(time|cup) = 1/5
  p(drinker|tea) = (4 + 1)/(10 + 5) = 5/15    p(drinker|cup) = 1/5
  p(cosy|tea) = (0 + 1)/(10 + 5) = 1/15       p(cosy|cup) = 1/5
  p(tea|tea) = (0 + 1)/(10 + 5) = 1/15        p(tea|cup) = 1/5

Add-One Smoothing
  p(tea cosy) = p(tea) p(cosy|tea) = 10/20 × 1/15 = 1/30
Here we use a unigram language model to estimate p(tea).

Dealing with Sentence Boundaries
If we add sentence boundaries to the observed events:
Observed events = "<s> tea time </s>" × 3, "<s> tea cup </s>" × 3, "<s> tea drinker </s>" × 4

           tea  time  cup  drinker  cosy  </s> | SUM
  <s>      10    0     0     0       0     0   |  10
  tea       0    3     3     4       0     0   |  10
  time      0    0     0     0       0     3   |   3
  cup       0    0     0     0       0     3   |   3
  drinker   0    0     0     0       0     4   |   4
  cosy      0    0     0     0       0     0   |   0

Dealing with Sentence Boundaries
Add-one smoothing (the event space now also contains </s>, so v = 6):

           tea      time   cup    drinker  cosy   </s>  | SUM
  <s>      10+1=11  0+1=1  0+1=1  0+1=1    0+1=1  0+1=1 |  16
  tea      0+1=1    3+1=4  3+1=4  4+1=5    0+1=1  0+1=1 |  16
  time     0+1=1    0+1=1  0+1=1  0+1=1    0+1=1  3+1=4 |   9
  cup      0+1=1    0+1=1  0+1=1  0+1=1    0+1=1  3+1=4 |   9
  drinker  0+1=1    0+1=1  0+1=1  0+1=1    0+1=1  4+1=5 |  10
  cosy     0+1=1    0+1=1  0+1=1  0+1=1    0+1=1  0+1=1 |   6

  p(cup|tea) = (3 + 1)/(10 + 6) = 4/16        p(cup|cup) = 1/6
  p(time|tea) = (3 + 1)/(10 + 6) = 4/16       p(time|cup) = 1/6
  p(drinker|tea) = (4 + 1)/(10 + 6) = 5/16    p(drinker|cup) = 1/6
  p(cosy|tea) = (0 + 1)/(10 + 6) = 1/16       p(cosy|cup) = 1/6
  p(tea|tea) = (0 + 1)/(10 + 6) = 1/16        p(tea|cup) = 1/6

Dealing with Sentence Boundaries
  p(<s> tea cosy </s>) = p(tea|<s>) p(cosy|tea) p(</s>|cosy) = 11/16 × 1/16 × 1/6 = 11/1536

Add-One Smoothing
But there are many more unseen n-grams than seen n-grams.
Example: Europarl bigrams:
• 86,700 distinct words
• 86,700² = 7,516,890,000 possible bigrams (v)
• but only about 30,000,000 words (and bigrams) in the corpus
Seen n-grams are assigned too low a probability!

Add-α Smoothing
Instead of adding one to the count, we add a value α (< 1):
  p = (c + α) / (n + αv)
How do we determine a value for α?

Example: Bigrams in Europarl
Adjusted counts for bigrams with actual count r:

  actual count r | add-one  | add-α
  0              | 0.00378  | 0.00016
  1              | 0.00755  | 0.95725
  2              | 0.01133  | 1.91433
  3              | 0.01511  | 2.87141
  4              | 0.01888  | 3.82850
  5              | 0.02266  | 4.78558
  6              | 0.02644  | 5.74266
  8              | 0.03399  | 7.65683
  10             | 0.04155  | 9.57100
  20             | 0.07931  | 19.14183

The add-α counts are more realistic than the add-one counts.

Good-Turing Smoothing
Adjust actual counts r to expected counts r* using:
  r* = (r + 1) · N_{r+1} / N_r
where N_r is the number of n-grams with count r in the corpus (the "count of counts"), and N_0 is the number of possible n-grams that never occur in the corpus.
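A minimal sketch of the Good-Turing adjustment, assuming the counts of counts N_r have already been tabulated; the figures below are the Europarl bigram counts of counts from the table on the next slide. In practice the adjustment is normally applied only for small r, with larger counts left unchanged.

```python
def good_turing(count_of_counts):
    """Good-Turing adjusted counts: r* = (r + 1) * N_{r+1} / N_r,
    where N_r is the number of distinct n-grams observed exactly r times
    (N_0 = number of possible n-grams never observed)."""
    adjusted = {}
    for r, n_r in count_of_counts.items():
        n_next = count_of_counts.get(r + 1, 0)  # 0 if N_{r+1} is not available
        adjusted[r] = (r + 1) * n_next / n_r
    return adjusted

# Counts of counts for Europarl bigrams (figures from the table on the next slide)
N = {0: 7_514_941_065, 1: 1_132_844, 2: 263_611}
r_star = good_turing(N)
print(round(r_star[0], 5))  # 0.00015 = 1 * 1,132,844 / 7,514,941,065
print(round(r_star[1], 4))  # 0.4654  = 2 * 263,611 / 1,132,844 (the table reports 0.46539)
```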
Good-Turing: Europarl Bigram Counts

  count r | count of counts N_r | adjusted count r*
  0       | 7,514,941,065       | 0.00015
  1       | 1,132,844           | 0.46539
  2       | 263,611             | 1.40679
  3       | 123,615             | 2.38767
  4       | 73,788              | 3.33753
  5       | 49,254              | 4.36967
  6       | 35,869              | 5.32928
  8       | 21,693              | 7.43798
  10      | 14,880              | 9.31304
  20      | 4,546               | 19.54487

Smoothing: the story so far
• The smoothing techniques we have seen so far work by adjusting the counts of seen n-grams so that unseen n-grams get assigned some of the probability mass.
• Other techniques are interpolation and backoff.
• These are based on the idea of combining language models of different orders.

Combining language models
• In a given corpus, we may never observe
  – Scottish beer drinkers
  – Scottish beer eaters
• Both have count 0.
• The smoothing methods we have looked at so far will assign them the same probability.
• It would be useful here to look at the bigram frequencies of
  – beer drinkers
  – beer eaters

Interpolation
Higher- and lower-order n-gram models have different strengths and weaknesses:
– high-order n-grams are sensitive to more context, but have sparse counts
– low-order n-grams consider only very limited context, but have robust counts

Interpolation
Combine them:
  p(w_3 | w_1, w_2) = λ_1 p_1(w_3) + λ_2 p_2(w_3 | w_2) + λ_3 p_3(w_3 | w_1, w_2)
where λ_1 + λ_2 + λ_3 = 1.

Backoff
• Backoff is slightly different.
• The basic idea is to trust the highest-order language model that contains the n-gram.

Backoff
Given a trigram, bigram and unigram language model:
– if the trigram count > 0: use it
– else if the bigram count > 0: use it
– else: use the unigram count

Other Considerations
• Consider the English words spite and constant:
  – both occur 993 times in the Europarl corpus
  – only 9 different words follow spite
  – spite is almost always followed by of (979 times), due to the expression in spite of
  – 415 different words follow constant
  – the most frequent are and (42 times), concern (27 times) and pressure (26 times), but there is a huge tail of singletons: 268 different words

Other Considerations
• We are more likely to see a new bigram that starts with constant than one that starts with spite.
• Witten-Bell smoothing takes this into account.

Other Considerations
• Consider the word York:
  – a fairly frequent word in the Europarl corpus, occurring 477 times
  – as frequent as foods, indicates and providers
  – in a unigram language model it therefore gets a respectable probability

Other Considerations
• However, York almost always directly follows New (473 times).
• Recall: the unigram model is only used if the bigram model is inconclusive.
• York is an unlikely second word in an unseen bigram.
• In a backoff unigram model, York should therefore have a low probability.
• Kneser-Ney smoothing takes this into account.

Content (next: Language Model Evaluation)

Evaluating Language Models
The quality of a language model can be estimated using perplexity:
  PP = 2^{H(p_LM)}
which in turn is based on the notion of cross-entropy:
  H(p_LM) = −(1/n) log₂ p_LM(w_1 … w_n)
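A minimal sketch of this computation, assuming we already have the model's per-word probabilities. The list below uses the rounded trigram probabilities from the worked example on the next slide, so the printed values differ slightly from the averages reported there.

```python
import math

def cross_entropy(word_probs):
    """H = -(1/n) * sum of log2 p_i over the per-word model probabilities."""
    return -sum(math.log2(p) for p in word_probs) / len(word_probs)

def perplexity(word_probs):
    """PP = 2 ** H."""
    return 2 ** cross_entropy(word_probs)

# Rounded trigram probabilities from the worked example that follows
probs = [0.109, 0.144, 0.489, 0.905, 0.002, 0.472,
         0.147, 0.056, 0.194, 0.089, 0.290, 0.99999]
print(round(cross_entropy(probs), 3))  # 2.649 (the slides report an average of about 2.64)
print(round(perplexity(probs), 1))     # 6.3   (the slides report a perplexity of about 6.2)
```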
Example
Say we wanted to measure how well a language model p_LM predicts the following sentence:
  I would like to commend the rapporteur on his work.
Here is how we would proceed (trigram model):

  prediction                      | p_LM    | −log₂ p_LM
  p_LM(I | </s><s>)               | 0.109   | 3.197
  p_LM(would | <s> I)             | 0.144   | 2.791
  p_LM(like | I would)            | 0.489   | 1.031
  p_LM(to | would like)           | 0.905   | 0.144
  p_LM(commend | like to)         | 0.002   | 8.794
  p_LM(the | to commend)          | 0.472   | 1.084
  p_LM(rapporteur | commend the)  | 0.147   | 2.763
  p_LM(on | the rapporteur)       | 0.056   | 4.150
  p_LM(his | rapporteur on)       | 0.194   | 2.367
  p_LM(work | on his)             | 0.089   | 3.498
  p_LM(. | his work)              | 0.290   | 1.785
  p_LM(</s> | work .)             | 0.99999 | 0.000014
  average                         |         | 2.643

N-Gram Comparison
−log₂ p_LM per word for models of different orders:

  word       | unigram | bigram | trigram | 4-gram
  i          | 6.684   | 3.197  | 3.197   | 3.197
  would      | 8       | 2.884  | 2.791   | 2.791
  like       | 9       | 2.026  | 1.031   | 1.29
  to         | 5       | 0.402  | 0.144   | 0.113
  commend    | 15      | 12.335 | 8.794   | 8.633
  the        | 4       | 1.402  | 1.084   | 0.88
  rapporteur | 11      | 7.319  | 2.763   | 2.35
  on         | 7       | 4.14   | 4.15    | 1.862
  his        | 11      | 7.316  | 2.367   | 1.978
  work       | 10      | 4.816  | 3.498   | 2.394
  .          | 5       | 3.02   | 1.785   | 1.51
  </s>       | 4.828   | 0.005  | 0       | 0
  average    | 8.051   | 4.072  | 2.634   | 2.251
  perplexity | 265.136 | 16.817 | 6.206   | 4.758

Content (next: Managing Very Large Language Models)

Managing very large language models
• Millions to billions of words are easy to get (trillions of English words are available on the web).
• But: huge language models do not fit into RAM.

Delete singletons
Europarl data:

  order   | unique n-grams | singletons
  unigram | 86,700         | 33,447 (38.6%)
  bigram  | 1,948,935      | 1,132,844 (58.1%)
  trigram | 8,092,798      | 6,022,286 (74.4%)
  4-gram  | 15,303,847     | 13,081,621 (85.5%)
  5-gram  | 19,882,175     | 18,324,577 (92.2%)

Delete the singletons of the higher-order n-grams.

Other Techniques
• Store the language model using an appropriate data structure, e.g. a trie.
• Reduce the vocabulary size, e.g. by mapping numbers to the symbol NUM.
• Store the language model on disk and load only what is necessary into memory.
• Filter the language model to only those words that appear in the translation options being explored by the decoder.

Content (next: Other Language Modelling Approaches)

Other Language Modelling Approaches
• Syntax-based language models
• Neural network language models

Discussion

Acknowledgement
Parts of the content of this lecture are taken from previous lectures and presentations given by Jennifer Foster, Declan Groves, Yvette Graham, Kevin Knight, Josef van Genabith and Andy Way.