Università di Milano-Bicocca – Laurea Magistrale in Informatica
Corso di APPRENDIMENTO AUTOMATICO
Prof. Giancarlo Mauri
Lezione 10-11 – Apprendimento statistico (Statistical Learning)

Outline
- Bayes theorem
- MAP and ML hypotheses
- Minimum Description Length principle
- Optimal Bayes classifier
- Naive Bayes classifier
- Expectation Maximization (EM) algorithm

Goals
- Learn a "probabilistic theory" of the world, for making decisions under uncertain knowledge
- View learning as a probabilistic inference process

Two roles for Bayesian learning
- It provides practical learning algorithms:
  - Naive Bayes learning
  - Bayesian belief network learning
  - It combines prior knowledge (prior probabilities) with the observed data
  - It requires prior probabilities
- It provides a useful conceptual framework:
  - a "gold standard" for evaluating other learning algorithms
  - additional insight into Occam's razor

Bayesian learning
Advantages:
- No hypothesis is eliminated, even if it is inconsistent with the data
- Each hypothesis h is given a probability P(h) of being the correct one
- P(h) is incrementally updated after seeing each example
- Hypotheses that make probabilistic predictions (e.g., for medical diagnosis) are allowed
Drawbacks:
- Prior probabilities are not easy to estimate
- Huge computational cost in the general case

Bayesian learning
- View learning as the updating of a probability distribution over the hypothesis space H
- H is a (random) hypothesis variable, with values h1, h2, …
- The jth observation gives the outcome dj of the random variable Dj; the training data are D = {d1, d2, …, dN}
- Start with the prior distribution P(H)
- Use Bayes theorem to compute the posterior probability of each hypothesis

Probability refresher
- Product rule: P(A∧B) = P(A|B)P(B) = P(B|A)P(A)
- Sum rule: P(A∨B) = P(A) + P(B) − P(A∧B)
- Theorem of total probability: if the events Ai are mutually exclusive and Σi P(Ai) = 1, then P(B) = Σi P(B|Ai) P(Ai)

Bayes theorem
It lets us compute the probabilities of hypotheses, given a training sample D and the prior probabilities:
P(h|D) = P(D|h) P(h) / P(D)
where
- P(h) = prior probability of hypothesis h
- P(D) = probability of the training set D
- P(h|D) = posterior probability of h given D
- P(D|h) = likelihood of D given h
NB: P(h|D) increases with P(D|h) and P(h), and decreases as P(D) increases.

Bayesian prediction
- Bayesian learning computes the conditional probability of each hypothesis given the observed data
- The prediction of the next value of X is a weighted average of the likelihoods under all hypotheses, weighted by their posterior probabilities:
  P(X|D) = Σi P(X|D,hi) P(hi|D) = Σi P(X|hi) P(hi|D)
  assuming that each hypothesis determines a probability distribution over X
- No need to pick one best-guess hypothesis!

Example
A patient takes a lab test for a tumour marker. We know that:
- the incidence of the disease in the whole population is 8 per thousand (prior probability);
- the test is positive in 98% of the cases in which the disease is present (hence 2% false negatives);
- the test is negative in 97% of the cases in which the disease is absent (hence 3% false positives).

So we have the following prior and conditional probabilities:
P(c) = 0.008, P(¬c) = 0.992
P(+|c) = 0.98, P(−|c) = 0.02
P(+|¬c) = 0.03, P(−|¬c) = 0.97
If our patient's test comes back positive (D = +), what are the posterior probabilities that he does or does not have cancer?

P(+|c) P(c) = 0.98 × 0.008 = 0.0078 ∝ P(c|+)
P(+|¬c) P(¬c) = 0.03 × 0.992 = 0.0298 ∝ P(¬c|+)
Dividing by 0.0078 + 0.0298 to normalize to 1, we obtain:
P(c|+) = 0.21, P(¬c|+) = 0.79
This is a counterintuitive result; it is explained by the fact that, in a population that is overwhelmingly healthy, the false positives become very numerous.
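As a quick check of the arithmetic, here is a minimal sketch (not part of the original slides) that applies Bayes' theorem, with the theorem of total probability in the denominator, to the numbers of the tumour-marker example:

```python
# Minimal sketch: Bayes' theorem for the tumour-marker example above.
prior = {"cancer": 0.008, "no_cancer": 0.992}         # P(c), P(not c)
likelihood_pos = {"cancer": 0.98, "no_cancer": 0.03}  # P(+ | c), P(+ | not c)

joint = {h: likelihood_pos[h] * prior[h] for h in prior}  # P(+ | h) P(h)
evidence = sum(joint.values())                            # P(+) by total probability
posterior = {h: joint[h] / evidence for h in joint}       # P(h | +)

print(posterior)   # approx {'cancer': 0.21, 'no_cancer': 0.79}
```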
Example
Suppose there are five kinds of bags of candies:
- 10% are h1: 100% cherry candies
- 20% are h2: 75% cherry candies + 25% lime candies
- 40% are h3: 50% cherry candies + 50% lime candies
- 20% are h4: 25% cherry candies + 75% lime candies
- 10% are h5: 100% lime candies
Then we observe candies drawn from some bag:
- What kind of bag is it?
- What flavour will the next candy be?

[Figure: posterior probability of the hypotheses vs. the number of samples in d]
[Figure: prediction probability vs. the number of samples in d]

Posterior probability of hypotheses
- In the limit, the correct hypothesis will dominate the prediction, independently of the prior distribution, provided that the correct hypothesis is not given probability 0
- The Bayesian prediction is optimal, i.e. it will be correct more often than any other prediction
- This optimality has a high computational cost (computing Σi P(X|hi) P(hi|D))
⇒ need for approximate methods

MAP approximation
Summing over the hypothesis space is often intractable (e.g., there are 18,446,744,073,709,551,616 Boolean functions of 6 attributes).
Maximum a posteriori (MAP) learning: choose the hypothesis that is most probable given the training data:
hMAP = arg max_{h∈H} P(h|D) = arg max_{h∈H} P(D|h) P(h) / P(D)
     = arg max_{h∈H} P(D|h) P(h) = arg max_{h∈H} (log P(D|h) + log P(h))
N.B. The log terms can be viewed as (the negative of) the number of bits needed to encode the data given the hypothesis, plus the bits needed to encode the hypothesis: this is the basic idea of minimum description length (MDL) learning.
For deterministic hypotheses, P(D|h) is 1 if h is consistent with D and 0 otherwise, so MAP = simplest consistent hypothesis.

Example
[Figure omitted; x-axis: number of samples in d]
After three (lime) samples, h5 is the MAP hypothesis and will be used for prediction with probability 1.

Brute-force MAP learner
1. For each h in H, calculate the posterior probability
   P(h|D) = P(D|h) P(h) / P(D)
2. Output the hypothesis hMAP with the highest posterior probability
   hMAP = arg max_{h∈H} P(h|D)

Relation to concept learning
Consider our usual concept learning task: instance space X, hypothesis space H, training examples D.
Consider the FIND-S learning algorithm (it outputs the most specific hypothesis from the version space VS_{H,D}).
- What would Bayes rule produce as the MAP hypothesis?
- Does FIND-S output a MAP hypothesis?

Relation to concept learning
Assume a fixed set of instances ‹x1,…,xm› and let D be the set of their classifications: D = ‹c(x1),…,c(xm)›.
Let P(D|h) = 1 if h is consistent with D, and 0 otherwise.
Choose P(h) to be the uniform distribution: P(h) = 1/|H| for all h in H.
Then P(h|D) = 1/|VS_{H,D}| if h is consistent with D, and 0 otherwise.

Evolution of posterior probabilities
[Figure omitted]

ML approximation
Choose the hypothesis h that maximizes the likelihood of D:
hML = arg max_{h∈H} P(D|h)
i.e., simply get the best fit to the data.
If the prior probabilities are uniform (P(hi) = P(hj) for all i, j), then
hMAP = arg max P(h|D) = arg max P(D|h) P(h) = arg max P(D|h) = hML
For large data sets the prior becomes irrelevant.
ML is the "standard" (non-Bayesian) statistical learning method.

ML parameter learning
A bag from a new manufacturer: what is the fraction θ of cherry candies?
- Any θ is possible: a continuum of hypotheses hθ
- θ is a parameter for this simple (binomial) family of models
Suppose we unwrap N candies, c cherries and ℓ = N − c limes. These are i.i.d. (independent, identically distributed) observations, so
P(D|hθ) = Π_{j=1..N} P(dj|hθ) = θ^c (1−θ)^ℓ
Maximize this w.r.t. θ, which is easier for the log-likelihood:
L(D|hθ) = log P(D|hθ) = Σ_{j=1..N} log P(dj|hθ) = c log θ + ℓ log(1−θ)
dL(D|hθ)/dθ = c/θ − ℓ/(1−θ) = 0  ⇒  θ = c/(c+ℓ) = c/N
Seems sensible, but it causes problems with 0 counts!
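A small numerical sketch of this maximum-likelihood estimate (not from the slides; the sample of N = 10 candies with c = 3 cherries is hypothetical): the closed-form estimate θ = c/N is also what a grid search over the log-likelihood finds.

```python
import math

def log_likelihood(theta, c, n):
    """c log(theta) + (N - c) log(1 - theta), the log-likelihood derived above."""
    return c * math.log(theta) + (n - c) * math.log(1 - theta)

c, n = 3, 10                 # hypothetical sample: 3 cherries out of N = 10 candies
theta_ml = c / n             # the closed-form maximizer theta = c/N from the slide

# Numerical check: no theta on a fine grid scores higher than the closed form
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=lambda t: log_likelihood(t, c, n))
print(theta_ml, best)        # 0.3 0.3
```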
Learning a real-valued function
Consider any real-valued target function f. The training examples are ‹xi, di›, where di is a noisy training value:
di = f(xi) + ei
with ei a random variable (noise) drawn independently for each xi from a Gaussian distribution with mean 0.
Then the maximum likelihood hypothesis hML is the one that minimizes the sum of squared errors:
hML = arg min_{h∈H} Σ_{i=1..m} (di − h(xi))²

[Figure: learning a real-valued function]

Minimum Description Length Principle
hMAP = arg max P(D|h) P(h)
     = arg max (log2 P(D|h) + log2 P(h))
     = arg min (−log2 P(D|h) − log2 P(h))          (1)
An interesting fact from information theory: the optimal (shortest coding length) code for an event with probability p uses −log2 p bits.
So we can interpret (1) as follows:
- −log2 P(h) is the length of h under the optimal code
- −log2 P(D|h) is the length of D given h under the optimal code
⇒ prefer the hypothesis that minimizes length(h) + length(misclassifications).

Minimum Description Length Principle
Occam's razor: prefer the shortest hypothesis.
MDL: prefer the hypothesis h that minimizes
hMDL = arg min_{h∈H} (L_C1(h) + L_C2(D|h))
where L_C(x) is the description length of x under encoding C.

Minimum Description Length Principle: example
H = decision trees, D = training data labels
- L_C1(h) is the number of bits needed to describe the tree h
- L_C2(D|h) is the number of bits needed to describe D given h
Note that L_C2(D|h) = 0 if the examples are classified perfectly by h: we need only describe the exceptions.
Hence hMDL trades off tree size against training errors.

Most probable classification of new instances
So far we have sought the most probable hypothesis given the data D (i.e., hMAP).
Given a new instance x, what is its most probable classification? hMAP(x) is not necessarily the most probable classification!
Consider three possible hypotheses:
P(h1|D) = 0.4, P(h2|D) = 0.3, P(h3|D) = 0.3
Given a new instance x: h1(x) = +, h2(x) = −, h3(x) = −
What is the most probable classification of x?

Bayes optimal classifier
Bayes optimal classification:
arg max_{vj∈V} Σ_{hi∈H} P(vj|hi) P(hi|D)
Example:
P(h1|D) = 0.4, P(h2|D) = 0.3, P(h3|D) = 0.3
P(+|h1) = 1, P(+|h2) = 0, P(+|h3) = 0
P(−|h1) = 0, P(−|h2) = 1, P(−|h3) = 1
Therefore
Σ_{hi∈H} P(+|hi) P(hi|D) = 0.4
Σ_{hi∈H} P(−|hi) P(hi|D) = 0.6
and
arg max_{vj∈V} Σ_{hi∈H} P(vj|hi) P(hi|D) = −
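The Bayes optimal classification for this three-hypothesis example can be spelled out in a few lines; this is an illustrative sketch, not part of the slides. Since the hypotheses are deterministic, P(vj|hi) is 1 when hi predicts vj and 0 otherwise.

```python
# Minimal sketch: Bayes optimal classification for the three-hypothesis example above.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}    # P(hi | D)
prediction = {"h1": "+", "h2": "-", "h3": "-"}   # hi(x); deterministic hypotheses

def bayes_optimal(values=("+", "-")):
    # For each class value v, sum P(v|hi) P(hi|D); P(v|hi) is 1 if hi predicts v, else 0.
    scores = {v: sum(p for h, p in posterior.items() if prediction[h] == v)
              for v in values}
    return max(scores, key=scores.get), scores

print(bayes_optimal())   # ('-', {'+': 0.4, '-': 0.6})
```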
Gibbs classifier
The Bayes optimal classifier gives the best result, but it can be expensive when there are many hypotheses.
Gibbs algorithm:
1. Choose one hypothesis at random, according to P(h|D)
2. Use it to classify the new instance
Surprising fact: assume the target concepts are drawn at random from H according to the priors on H. Then:
E(errorGibbs) ≤ 2 E(errorBayesOptimal)
So, given a correct, uniform prior distribution over H, pick any hypothesis from the version space VS with uniform probability: its expected error is no worse than twice that of the Bayes optimal classifier.

Naive Bayes classifier
Along with decision trees, neural networks and nearest neighbour, it is one of the most practical learning methods.
When to use it:
- a moderate or large training set is available
- the attributes that describe the instances are conditionally independent given the classification
Successful applications:
- diagnosis
- classifying text documents

Naive Bayes classifier
Assume a target function f : X → V, where each instance x is described by the attributes ‹a1, a2, …, an›.
The most probable value of f(x) is:
vMAP = arg max_{vj∈V} P(vj|a1, a2, …, an)
     = arg max_{vj∈V} P(a1, a2, …, an|vj) P(vj) / P(a1, a2, …, an)
     = arg max_{vj∈V} P(a1, a2, …, an|vj) P(vj)
Naive Bayes assumption:
P(a1, a2, …, an|vj) = Πi P(ai|vj)
which gives the naive Bayes classifier:
vNB = arg max_{vj∈V} P(vj) Πi P(ai|vj)

Naive Bayes algorithm
Naive_Bayes_Learn(examples):
  For each target value vj:
    P̂(vj) := estimate of P(vj)
    For each value ai of each attribute a:
      P̂(ai|vj) := estimate of P(ai|vj)
Classify_New_Instance(x):
  vNB = arg max_{vj∈V} P̂(vj) Πi P̂(ai|vj)

Naive Bayes: example
Consider PlayTennis again and the new instance
‹Outlk = sun, Temp = cool, Humid = high, Wind = strong›
We want to compute vNB = arg max_{vj} P(vj) Πi P(ai|vj). Using the training set below we get
P(y) P(sun|y) P(cool|y) P(high|y) P(strong|y) = .005
P(n) P(sun|n) P(cool|n) P(high|n) P(strong|n) = .021
so vNB = n.

The tennis example – training set

Day  Outlook   Temp.  Humidity  Wind    Play Tennis
D1   Sunny     Hot    High      Weak    No
D2   Sunny     Hot    High      Strong  No
D3   Overcast  Hot    High      Weak    Yes
D4   Rain      Mild   High      Weak    Yes
D5   Rain      Cool   Normal    Weak    Yes
D6   Rain      Cool   Normal    Strong  No
D7   Overcast  Cool   Normal    Weak    Yes
D8   Sunny     Mild   High      Weak    No
D9   Sunny     Cool   Normal    Weak    Yes
D10  Rain      Mild   Normal    Strong  Yes
D11  Sunny     Mild   Normal    Strong  Yes
D12  Overcast  Mild   High      Strong  Yes
D13  Overcast  Hot    Normal    Weak    Yes
D14  Rain      Mild   High      Strong  No

Naive Bayes: subtleties
The conditional independence assumption
P(a1, a2, …, an|vj) = Πi P(ai|vj)
is often violated… but the classifier works surprisingly well anyway. Note that the estimated posteriors P̂(vj|x) need not be correct; we only need
arg max_{vj} P̂(vj) Πi P̂(ai|vj) = arg max_{vj} P(vj) P(a1, a2, …, an|vj)
Naive Bayes posteriors are often unrealistically close to 1 or 0.

Naive Bayes: subtleties
What if none of the training instances with target value vj has attribute value ai?
Then P̂(ai|vj) = 0 and P̂(vj) Πi P̂(ai|vj) = 0.
The typical solution is a Bayesian (m-)estimate for P(ai|vj):
P̂(ai|vj) = (nc + mp) / (n + m)
where:
- n is the number of training examples for which v = vj
- nc is the number of examples for which v = vj and a = ai
- p is a prior estimate for P(ai|vj)
- m is the weight given to the prior (i.e. the number of "virtual" examples)
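For concreteness, here is a minimal sketch (not part of the slides) of the naive Bayes classifier on the PlayTennis data from the table above, using plain frequency estimates for P(vj) and P(ai|vj); the m-estimate just described could be swapped in to avoid zero counts. On the instance of the earlier example it reproduces the scores .021 (no) and .005 (yes).

```python
from collections import Counter, defaultdict

# (Outlook, Temp, Humidity, Wind) -> PlayTennis, copied from the training table above
data = [
    (("Sunny", "Hot", "High", "Weak"), "No"),
    (("Sunny", "Hot", "High", "Strong"), "No"),
    (("Overcast", "Hot", "High", "Weak"), "Yes"),
    (("Rain", "Mild", "High", "Weak"), "Yes"),
    (("Rain", "Cool", "Normal", "Weak"), "Yes"),
    (("Rain", "Cool", "Normal", "Strong"), "No"),
    (("Overcast", "Cool", "Normal", "Weak"), "Yes"),
    (("Sunny", "Mild", "High", "Weak"), "No"),
    (("Sunny", "Cool", "Normal", "Weak"), "Yes"),
    (("Rain", "Mild", "Normal", "Strong"), "Yes"),
    (("Sunny", "Mild", "Normal", "Strong"), "Yes"),
    (("Overcast", "Mild", "High", "Strong"), "Yes"),
    (("Overcast", "Hot", "Normal", "Weak"), "Yes"),
    (("Rain", "Mild", "High", "Strong"), "No"),
]

class_counts = Counter(v for _, v in data)   # n: number of examples per class
attr_counts = defaultdict(Counter)           # nc: per class, counts of (position, value)
for attrs, v in data:
    for i, a in enumerate(attrs):
        attr_counts[v][(i, a)] += 1

def naive_bayes(x):
    """Return arg max_v P(v) * prod_i P(a_i|v), plus the per-class scores."""
    scores = {}
    for v, n in class_counts.items():
        score = n / len(data)                          # P(v)
        for i, a in enumerate(x):
            score *= attr_counts[v][(i, a)] / n        # P(a_i | v), frequency estimate
        scores[v] = score
    return max(scores, key=scores.get), scores

print(naive_bayes(("Sunny", "Cool", "High", "Strong")))
# -> ('No', ...): score for 'No' is about 0.021, for 'Yes' about 0.005, as on the slide
```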
Learning to classify text
Why?
- Learn which news articles are of interest
- Learn to classify web pages by topic
Naive Bayes is among the most effective algorithms for this task.
What attributes shall we use to represent text documents?

Learning to classify text
Target concept Interesting? : Document → {+, −}
Represent each document by a vector of words: one attribute per word position in the document.
Learning: use the training examples to estimate P(+), P(−), P(doc|+), P(doc|−).
Naive Bayes conditional independence assumption:
P(doc|vj) = Π_{i=1,…,length(doc)} P(ai = wk|vj)
where P(ai = wk|vj) is the probability that the word in position i is wk, given vj.
One more assumption: P(ai = wk|vj) = P(am = wk|vj) ∀ i, m.

Learn_Naive_Bayes_Text(Examples, V)
1. Collect all words and other tokens that occur in Examples:
   Vocabulary := all distinct words and other tokens in Examples
2. Calculate the required P(vj) and P(wk|vj) probability terms. For each target value vj in V do:
   - docsj := subset of Examples for which the target value is vj
   - P(vj) := |docsj| / |Examples|
   - Textj := a single document created by concatenating all members of docsj
   - n := total number of words in Textj (counting duplicate words multiple times)
   - For each word wk in Vocabulary:
     - nk := number of times word wk occurs in Textj
     - P(wk|vj) := (nk + 1) / (n + |Vocabulary|)

Classify_Naive_Bayes_Text(Doc)
- positions := all word positions in Doc that contain tokens found in Vocabulary
- Return vNB = arg max_{vj∈V} P(vj) Π_{i∈positions} P(ai|vj)

Example: Twenty Newsgroups
Given 1000 training documents from each group, learn to classify a new document according to the newsgroup it came from:
comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x, misc.forsale, rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey, alt.atheism, soc.religion.christian, talk.religion.misc, talk.politics.mideast, talk.politics.misc, talk.politics.guns, sci.space, sci.crypt, sci.electronics, sci.med
Naive Bayes achieves 89% classification accuracy.

Bayesian networks
- The Bayes optimal classifier makes no independence assumptions among the variables and is computationally heavy.
- The naive Bayes classifier is efficient thanks to the very restrictive assumption that all the attributes are conditionally independent given the target value v.
- Bayesian networks describe conditional independence among subsets of variables.

Bayesian networks
DEF. Let X, Y, Z be discrete random variables. X is conditionally independent of Y given Z if
∀ xi, yj, zk   P(X = xi | Y = yj, Z = zk) = P(X = xi | Z = zk)
This is easily extended to sets of variables:
P(X1,…,Xl | Y1,…,Ym, Z1,…,Zn) = P(X1,…,Xl | Z1,…,Zn)

Bayesian networks – representation
- A directed acyclic graph
- Nodes represent random variables
- Edges represent the conditional dependence structure: each variable is conditionally independent of its non-descendants, given its parents
- A local conditional probability table for each node, giving the probability distribution of the associated variable given its immediate predecessors
One can then compute:
P(y1,…,yn) = Πi P(yi | Parents(Yi))

Bayesian networks – representation (example)
Boolean-valued variables: Storm (T), Trip (G), Campfire (F), Lightning, Thunder, Forest fire.
[Figure: the network over these variables]
Local conditional probability table for Campfire given Storm and Trip:

      T,G   T,¬G   ¬T,G   ¬T,¬G
F     0.4   0.1    0.8    0.2
¬F    0.6   0.9    0.2    0.8
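To make the factorization concrete, here is a minimal sketch (not part of the slides) that computes a joint probability over the Storm/Trip/Campfire fragment. The Campfire table is the one above; the priors for Storm and Trip are made-up numbers, since the slides do not give them.

```python
# Minimal sketch: the factorization P(y1,...,yn) = prod_i P(yi | Parents(Yi))
# on the Storm / Trip / Campfire fragment. The Campfire CPT is the table above;
# the priors for Storm and Trip are assumed values, not taken from the slides.
p_storm = 0.1                       # assumed P(Storm = true)
p_trip = 0.3                        # assumed P(Trip = true)

# P(Campfire = true | Storm, Trip), indexed by (storm, trip)
cpt_campfire = {(True, True): 0.4, (True, False): 0.1,
                (False, True): 0.8, (False, False): 0.2}

def joint(storm: bool, trip: bool, campfire: bool) -> float:
    """P(Storm, Trip, Campfire) = P(Storm) P(Trip) P(Campfire | Storm, Trip)."""
    p = (p_storm if storm else 1 - p_storm) * (p_trip if trip else 1 - p_trip)
    p_campfire = cpt_campfire[(storm, trip)]
    return p * (p_campfire if campfire else 1 - p_campfire)

print(joint(True, False, True))     # 0.1 * 0.7 * 0.1, approx 0.007 with the assumed priors
```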