Università di Milano-Bicocca
Laurea Magistrale in Informatica
Course: APPRENDIMENTO AUTOMATICO (Machine Learning)
Prof. Giancarlo Mauri
Lecture 10-11 – Statistical learning
1
Outline

Bayes theorem

MAP, ML hypotheses

Minimum Description Length principle

Optimal Bayes classifier

Naive Bayes classifier

Expectation Maximization (EM) algorithm
2
Goals

Learn a “probabilistic theory” of the world for
decisions with uncertain knowledge

Learning as a probabilistic inference process
3
Two roles for Bayesian learning

Provides practical learning algorithms
 Naive Bayes learning
 Bayesian belief network learning
 Combines prior knowledge (prior probabilities) with observed
data
 Requires prior probabilities

Provides useful conceptual framework
 “Gold standard” for evaluating other learning algorithms
 Additional insight into Occam’s razor
4
Bayesian learning

Advantages
 No hypothesis is eliminated, even if inconsistent with the data
 Each hypothesis h is given a probability P(h) of being the correct
one
 P(h) is incrementally updated after seeing each example
 Hypotheses that make probabilistic predictions (e.g., for
medical diagnosis) are allowed

Drawbacks
 Not easy to estimate prior probabilities
 Huge computational cost in the general case
5
Bayesian learning

View as updating of a probability distribution over the
hypothesis space H


H (random) hypothesis variable, values h1, h2, …

The jth observation gives the outcome dj of the random
variable Dj; training data D = {d1, d2, …, dN}

Use Bayes theorem to compute posterior probability
of each hypothesis
Start with prior distribution P(H)
6
Probability refresher

Product rule
P(A∧B) = P(A|B)P(B) = P(B|A)P(A)

Sum rule
P(A∨B) = P(A) + P(B) - P(A∧B)

Total probability theorem
If the events Ai are mutually exclusive and ΣP(Ai) = 1, then
P(B) = Σ P(B|Ai) P(Ai)
7
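As a quick sanity check, the three rules can be verified numerically on a small joint distribution (a minimal Python sketch; the numbers are illustrative, not from the slides):

```python
# Illustrative joint distribution P(A∧B) over two Boolean events A, B.
joint = {(True, True): 0.2, (True, False): 0.3,
         (False, True): 0.1, (False, False): 0.4}

p_a = joint[(True, True)] + joint[(True, False)]   # marginal P(A) = 0.5
p_b = joint[(True, True)] + joint[(False, True)]   # marginal P(B) = 0.3

# Product rule: P(A∧B) = P(B|A) P(A)
p_b_given_a = joint[(True, True)] / p_a
assert abs(p_b_given_a * p_a - joint[(True, True)]) < 1e-12

# Sum rule: P(A∨B) = P(A) + P(B) - P(A∧B)
p_a_or_b = p_a + p_b - joint[(True, True)]         # 0.6
```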
Bayes theorem

Allows computing the probability of each hypothesis, given the training
sample D and the prior probabilities:

P(h|D) = P(D|h) P(h) / P(D)

P(h) = prior probability of hypothesis h
P(D) = probability of training set D
P(h|D) = posterior probability of h given D
P(D|h) = likelihood of D given h
NB: P(h|D) increases with P(D|h) and P(h), and decreases as P(D)
increases
8
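Over a finite hypothesis space, the theorem can be sketched directly in code, with P(D) obtained by total probability (function and variable names are ours, not from the slides):

```python
def posteriors(priors, likelihoods):
    """P(h|D) = P(D|h) P(h) / P(D), with P(D) = Σh P(D|h) P(h)."""
    evidence = sum(p * l for p, l in zip(priors, likelihoods))
    return [p * l / evidence for p, l in zip(priors, likelihoods)]

# Two equally likely hypotheses; the data is impossible under the second.
print(posteriors([0.5, 0.5], [1.0, 0.0]))  # → [1.0, 0.0]
```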
Bayesian prediction

Bayesian learning computes the conditional probability
of each hypothesis, wrt observed data

Prediction of the next value of X is based on a weighted
average of the likelihoods w.r.t. all hypothesis probabilities
P(X|D) = Σi P(X|D,hi)P(hi|D) = Σi P(X|hi)P(hi|D)
assuming that each hypothesis determines a
probability distribution over X

No need to pick one best-guess hypothesis!
9
Example

A patient takes a laboratory test for a tumor marker. We know that:
 The incidence of the disease over the whole population is 8 per
thousand (prior probability)
 The test is positive in 98% of the cases in which the disease is
present (hence 2% false negatives)
 It is negative in 97% of the cases in which the disease is not
present (hence 3% false positives)
10
Example

We have the following prior and conditional probabilities:
P(c) = 0.008
P(¬c) = 0.992
P(+|c) = 0.98
P(-|c) = 0.02
P(+|¬c) = 0.03
P(-|¬c) = 0.97

If the test for our patient is positive (D = +), what are the
posterior probabilities that he does or does not have cancer?
P(+|c)P(c) = 0.98 × 0.008 = 0.0078 ∝ P(c|+)
P(+|¬c)P(¬c) = 0.03 × 0.992 = 0.0298 ∝ P(¬c|+)
11
Example

Dividing by 0.0078 + 0.0298 to normalize to 1, we obtain:
P(c|+) = 0.21
P(¬c|+) = 0.79

This is a counterintuitive result, explained by the fact that the
false positives, over a population that is overwhelmingly healthy,
become very numerous
12
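The computation above, reproduced in Python (the numbers come straight from the slide):

```python
p_c, p_not_c = 0.008, 0.992          # prior: disease incidence
p_pos_c, p_pos_not_c = 0.98, 0.03    # P(+|c), P(+|¬c)

u_c = p_pos_c * p_c                  # 0.00784, unnormalized P(c|+)
u_not_c = p_pos_not_c * p_not_c      # 0.02976, unnormalized P(¬c|+)

p_c_pos = u_c / (u_c + u_not_c)      # normalize
print(round(p_c_pos, 2), round(1 - p_c_pos, 2))  # → 0.21 0.79
```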
Example

Suppose there are five kinds of bags of candies:
10% are h1: 100% cherry candies
20% are h2: 75% cherry candies + 25% lime candies
40% are h3: 50% cherry candies + 50% lime candies
20% are h4: 25% cherry candies + 75% lime candies
10% are h5: 100% lime candies

Then we observe candies drawn from some bag:

What kind of bag is it? What flavour will the next candy be?
13
Posterior probability of hypotheses
[Figure: posterior probability of each hypothesis h1–h5 as a function of
the number of samples in d]
14
Prediction probability
[Figure: probability that the next candy is lime as a function of the
number of samples in d, from 0 to 10]
15
Posterior probability of hypotheses

In the limit, the correct hypothesis will dominate the
prediction, independently of the prior distribution,
provided that the correct hypothesis is not given 0
probability

The Bayesian prediction is optimal, i.e. it will be
correct more often than any other prediction

The optimality has a high computational cost (to
compute Σi P(X|hi)P(hi|D))
⇓
Need for approximate methods
16
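The candy-bag example can be run end to end: update the posterior after n lime observations, then average the per-hypothesis predictions (a sketch; function names are ours):

```python
priors = [0.1, 0.2, 0.4, 0.2, 0.1]    # P(h1)…P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]  # P(lime|hi)

def posterior_after_limes(n):
    """P(hi|D) after observing n lime candies in a row."""
    unnorm = [p * l ** n for p, l in zip(priors, p_lime)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def p_next_lime(n):
    """Bayesian prediction P(lime|D) = Σi P(lime|hi) P(hi|D)."""
    return sum(l * p for l, p in zip(p_lime, posterior_after_limes(n)))
```

With no data the prediction is the prior average 0.5; after ten limes the posterior mass has shifted to h5 and the prediction approaches 1.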
MAP approximation

Summing over the hypothesis space is often intractable (e.g.,
18,446,744,073,709,551,616 Boolean functions of 6 attributes)

Maximum a posteriori (MAP) learning: choose the most probable
hypothesis wrt training data
hMAP = arg maxh∈H P(h|D) = arg maxh∈H P(D|h) P(h) / P(D)
     = arg maxh∈H P(D|h) P(h)
     = arg maxh∈H (log P(D|h) + log P(h))
N.B. Log terms can be viewed as (negative of) bits to encode data given
hypothesis + bits to encode hypothesis (basic idea of minimum
description length (MDL) learning)

For deterministic hypotheses, P(D|h) is 1 if consistent, 0 otherwise
 MAP = simplest consistent hypothesis
17
Example
[Figure: posterior probability of each hypothesis vs. number of samples in d]

After three samples, h5 will be used for prediction with
probability 1
18
Brute force MAP learner
1. For each h in H, calculate the posterior probability
   P(h|D) = P(D|h) P(h) / P(D)
2. Output the hypothesis hMAP with the highest
   posterior probability
   hMAP = arg maxh∈H P(h|D)
19
Relation to Concept Learning

Consider our usual concept learning task
 instance space X, hypothesis space H, training examples D

Consider the FIND-S learning algorithm (outputs the most
specific hypothesis from the version space VSH,D)

What would Bayes rule produce as the MAP
hypothesis?

Does Find-S output a MAP hypothesis?
20
Relation to Concept Learning





Assume a fixed set of instances ⟨x1,…,xm⟩
Assume D is the set of classifications
D = ⟨c(x1),…,c(xm)⟩
Let P(D|h) = 1 if h consistent with D
           = 0 otherwise
Choose P(h) to be the uniform distribution
P(h) = 1/|H| for all h in H
Then
P(h|D) = 1/|VSH,D| if h consistent with D
       = 0 otherwise
21
Evolution of posterior probabilities
22
ML approximation

Choose h that maximizes the likelihood of D
hML = arg maxh∈H P(D|h)
I.e., simply get the best fit to the data

If prior probabilities are uniform (P(hi) = P(hj) ∀i,j), then
hMAP = arg max P(h|D) = arg max P(D|h) P(h) = arg max P(D|h) = hML

For large data sets, prior becomes irrelevant

ML is the “standard” (non-Bayesian) statistical learning method
23
ML parameter learning





Bag from a new manufacturer; fraction θ of cherry candies?
Any θ is possible: continuum of hypotheses hθ
θ is a parameter for this simple (binomial) family of models
Suppose we unwrap N candies, c cherries and ℓ = N - c limes
These are i.i.d. (independent, identically distributed) observations, so
P(D|hθ) = Πj=1..N P(dj|hθ) = θ^c (1-θ)^ℓ
Maximize this w.r.t. θ, which is easier for the log-likelihood:
L(D|hθ) = log P(D|hθ) = Σj=1..N log P(dj|hθ) = c log θ + ℓ log(1-θ)
dL(D|hθ)/dθ = c/θ - ℓ/(1-θ) = 0  ⇒  θML = c/(c+ℓ) = c/N
Seems sensible, but causes problems with 0 counts!
24
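The closed-form maximizer is easy to check numerically (a sketch, assuming the binomial model from the slide; function names are ours):

```python
import math

def log_likelihood(theta, c, ell):
    """L(D|hθ) = c log θ + ℓ log(1-θ)."""
    return c * math.log(theta) + ell * math.log(1 - theta)

def theta_ml(c, ell):
    """Setting dL/dθ = c/θ - ℓ/(1-θ) = 0 gives θ = c/(c+ℓ) = c/N."""
    return c / (c + ell)

# 3 cherries and 7 limes: θML = 0.3, and nearby values of θ do worse.
t = theta_ml(3, 7)
assert all(log_likelihood(t, 3, 7) > log_likelihood(t + d, 3, 7)
           for d in (-0.05, 0.05))
```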
Learning a Real Valued Function



Consider any real-valued target function f
Training examples ⟨xi, di⟩, where di is a noisy training value
di = f(xi) + ei
with ei a random variable (noise) drawn independently for each xi
according to some Gaussian distribution with mean 0
Then the maximum likelihood hypothesis hML is the one that
minimizes the sum of squared errors:
hML = arg minh∈H Σi=1..m (di - h(xi))²
25
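Under the Gaussian-noise assumption, maximizing likelihood reduces to least squares; for a linear hypothesis h(x) = w0 + w1·x the minimizer has a closed form (a sketch, not from the slides):

```python
def least_squares_line(xs, ds):
    """hML for h(x) = w0 + w1*x: minimizes Σi (di - h(xi))²."""
    n = len(xs)
    mx, md = sum(xs) / n, sum(ds) / n
    w1 = (sum((x - mx) * (d - md) for x, d in zip(xs, ds))
          / sum((x - mx) ** 2 for x in xs))
    w0 = md - w1 * mx
    return w0, w1

# Noise-free data on the line d = 1 + 2x is recovered exactly.
print(least_squares_line([0, 1, 2, 3], [1, 3, 5, 7]))  # → (1.0, 2.0)
```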
Learning a real valued function
26
Minimum Description Length Principle

hMAP = arg max P(D|h) P(h)
     = arg max (log P(D|h) + log P(h))
     = arg min (-log P(D|h) - log P(h))     (1)
Interesting fact from information theory:
 The optimal (shortest coding length) code for an event with
probability p is -log2p bits
So interpret (1):
 -log2P(h) is length of h under optimal code
 -log2P(D|h) is length of D given h under optimal code
Prefer the hypothesis that minimizes
Length(h) + length(misclassifications)
27
Minimum Description Length Principle


Occam’s razor: prefer the shortest hypothesis
MDL: prefer the hypothesis h that minimizes
hMDL = arg minh∈H (LC1(h) + LC2(D|h))
where LC(x) is the description length of x under
encoding C
28
Minimum Description Length Principle

Example
 H = decision trees, D = training data labels
 LC1(h) is # bits to describe tree h
 LC2(D|h) is # bits to describe D given h
 Note LC2(D|h) = 0 if examples classified perfectly by h. Need
only describe exceptions
 Hence hMDL trades off tree size for training errors
29
Most Probable Classification of New
Instances




So far we’ve sought the most probable hypothesis
given the data D (i.e., hMAP)
Given a new instance x, what is its most probable
classification?
hMAP(x) is not the most probable classification!
Consider:
 Three possible hypotheses:
P(h1|D) = 0.4
P(h2|D) = 0.3
P(h3|D) = 0.3
 Given new instance x:
h1(x) = +, h2(x) = -, h3(x) = -
What’s the most probable classification of x?
30
Bayes Optimal Classifier

Bayes optimal classification:
arg maxvj∈V Σhi∈H P(vj|hi) P(hi|D)

Example:
P(h1|D) = .4   P(+|h1) = 1   P(-|h1) = 0
P(h2|D) = .3   P(+|h2) = 0   P(-|h2) = 1
P(h3|D) = .3   P(+|h3) = 0   P(-|h3) = 1
Therefore
Σhi∈H P(+|hi) P(hi|D) = 0.4
Σhi∈H P(-|hi) P(hi|D) = 0.6
arg maxvj∈V Σhi∈H P(vj|hi) P(hi|D) = -
31
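The slide's example in code: score each value vj by Σhi P(vj|hi) P(hi|D) and take the arg max (a minimal sketch):

```python
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}       # P(hi|D)
p_v_given_h = {("+", "h1"): 1, ("-", "h1"): 0,      # P(vj|hi)
               ("+", "h2"): 0, ("-", "h2"): 1,
               ("+", "h3"): 0, ("-", "h3"): 1}

def bayes_optimal(values, posterior, p_v_given_h):
    def score(v):
        return sum(p_v_given_h[(v, h)] * ph for h, ph in posterior.items())
    return max(values, key=score)

print(bayes_optimal(["+", "-"], posterior, p_v_given_h))  # → -
```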
Gibbs Classifier




The Bayes optimal classifier provides the best result, but can be
expensive if there are many hypotheses
Gibbs algorithm:
 Choose one hypothesis at random, according to P(h|D)
 Use this to classify the new instance
Surprising fact: assume target concepts are drawn at random
from H according to the priors on H. Then:
E(errorGibbs) ≤ 2E(errorBayesOpt)
Suppose a correct, uniform prior distribution over H; then
 Pick any hypothesis from VS, with uniform probability
 Its expected error is no worse than twice Bayes optimal
32
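The Gibbs algorithm is a one-liner once the posterior is available: sample a single hypothesis with probability P(h|D) and classify with it (a sketch; `random.choices` does the weighted draw):

```python
import random

def gibbs_classify(x, hypotheses, posterior, rng=random):
    """Draw one h ~ P(h|D) and classify x with it."""
    (h,) = rng.choices(hypotheses, weights=posterior)
    return h(x)

# Degenerate posterior: the first hypothesis is always picked.
always_plus = lambda x: "+"
always_minus = lambda x: "-"
print(gibbs_classify(None, [always_plus, always_minus], [1.0, 0.0]))  # → +
```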
Naive Bayes Classifier


Along with decision trees, neural networks, and nearest
neighbor, one of the most practical learning methods
When to use
 Moderate or large training set available
 Attributes that describe instances are conditionally
independent given the classification
Successful applications:
 Diagnosis
 Classifying text documents
33
Naive Bayes Classifier




Assume target function f : X → V, where each instance
x is described by attributes ⟨a1, a2, …, an⟩
Most probable value of f(x) is:
vMAP = arg max P(vj|a1, a2, … an) =
arg max P(a1, a2, … an|vj)P(vj)/P(a1, a2, … an)=
arg max P(a1, a2, … an|vj)P(vj)
Naive Bayes assumption:
P(a1, a2, … an|vj) = Πi P(ai|vj)
Which gives
Naive Bayes classifier:
vNB = arg max P(vj) Πi P(ai|vj)
34
Naive Bayes Algorithm

Naive_Bayes_Learn (examples)
  For each target value vj
    P(vj) ← estimate P(vj)
    For each attribute value ai of each attribute a
      P(ai|vj) ← estimate P(ai|vj)

Classify_New_Instance(x)
  vNB = arg maxvj P(vj) Πi P(ai|vj)
35
Naive Bayes: Example

Consider Playtennis again and new instance
<Outlk = sun, Temp = cool, Humid = high, Wind = strong>

We want to compute:
vNB = arg max P(vj) Πi P(ai|vj)

We have
P(y) P(sun|y) P(cool|y) P(high|y) P(strong|y) = .005
P(n) P(sun|n) P(cool|n) P(high|n) P(strong|n) = .021
vNB = n
36
The tennis example - Training Set

Day  Outlook   Temp.  Humidity  Wind    Play Tennis
D1   Sunny     Hot    High      Weak    No
D2   Sunny     Hot    High      Strong  No
D3   Overcast  Hot    High      Weak    Yes
D4   Rain      Mild   High      Weak    Yes
D5   Rain      Cool   Normal    Weak    Yes
D6   Rain      Cool   Normal    Strong  No
D7   Overcast  Cool   Normal    Weak    Yes
D8   Sunny     Mild   High      Weak    No
D9   Sunny     Cool   Normal    Weak    Yes
D10  Rain      Mild   Normal    Strong  Yes
D11  Sunny     Mild   Normal    Strong  Yes
D12  Overcast  Mild   High      Strong  Yes
D13  Overcast  Hot    Normal    Weak    Yes
D14  Rain      Mild   High      Strong  No
37
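The numbers on slide 36 can be reproduced from this table with a few lines of Python (a sketch; the probabilities are estimated as simple relative frequencies):

```python
data = [  # (Outlook, Temp, Humidity, Wind, PlayTennis)
    ("Sunny", "Hot", "High", "Weak", "No"), ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"), ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Weak", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Mild", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]

def nb_score(x, v):
    """P(v) * Πi P(ai|v), estimated by relative frequencies."""
    rows = [r for r in data if r[-1] == v]
    p = len(rows) / len(data)
    for i, a in enumerate(x):
        p *= sum(r[i] == a for r in rows) / len(rows)
    return p

x = ("Sunny", "Cool", "High", "Strong")
scores = {v: nb_score(x, v) for v in ("Yes", "No")}
print(round(scores["Yes"], 3), round(scores["No"], 3))  # → 0.005 0.021
print(max(scores, key=scores.get))                      # → No
```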
Naive Bayes: Subtleties

Conditional independence assumption is often violated
P(a1, a2, … an|vj) = Πi P(ai|vj)

…but it works surprisingly well anyway. Note that the
estimated posteriors P(vj|x) need not be correct; we need
only that
arg max P(vj) Πi P(ai|vj) = arg max P(vj) P(a1, a2, … an|vj)

Naive Bayes posteriors often unrealistically close to 1
or 0
38
Naive Bayes: Subtleties


What if none of the training instances with target
value vj have attribute value ai? Then
P(ai|vj) = 0, and so
P(vj) Πi P(ai|vj) = 0
Typical solution is Bayesian estimate for P(ai| vj)
P(ai|vj) = (nc+mp)/(n+m)
where:
 n number of training examples for which v = vj
 nc number of examples for which v = vj and a = ai
 p is prior estimate for P(ai| vj)
 m is weight given to prior (i.e. number of “virtual” examples)
39
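The m-estimate in code; with nc = 0 it falls back toward the prior p instead of zeroing the whole product (a minimal sketch):

```python
def m_estimate(nc, n, p, m):
    """Bayesian estimate of P(ai|vj): (nc + m*p) / (n + m)."""
    return (nc + m * p) / (n + m)

# An attribute value never seen with this class no longer gives 0:
print(m_estimate(0, 5, 0.5, 2))  # → 1/7 ≈ 0.143 (instead of 0)
```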
Learning to Classify Text

Why?
 Learn which new articles are of interest
 Learn to classify web pages by topic

Naive Bayes is among the most effective algorithms

What attributes shall we use to represent text
documents?
40
Learning to Classify Text




Target concept
Interesting? : Document → {+, -}
Represent each document by a vector of words
 one attribute per word position in the document
Learning: use training examples to estimate
P(+)
P(-)
P(doc|+)
P(doc|-)
Naive Bayes conditional independence assumption
P(doc|vj) =Πi=1, … ,length(doc) P(ai=wk|vj)

where P(ai=wk|vj) = probability that word in position i is wk given
vj
One more assumption
P(ai=wk|vj) = P(am=wk|vj) ∀i, m
41
Learn_Naive_Bayes_Text (Examples, V)

Collect all words and other tokens that occur in Examples
Vocabulary := all distinct words and other tokens in Examples

Calculate the required P(vj) and P(wk|vj) probability terms
For each target value vj in V do
docsj := subset of Examples for which the target value is vj
P(vj) := |docsj|/|Examples|
Textj := a single document created by concatenating all members of docsj
n := total number of words in Textj (counting duplicate words multiple times)
For each word wk in Vocabulary
  nk := number of times word wk occurs in Textj
  P(wk|vj) := (nk+1)/(n+|Vocabulary|)
42
Classify_Naive_Bayes_Text (Doc)
positions := all word positions in Doc that contain tokens found in Vocabulary
Return
vNB = arg max P(vj) Πi∈positions P(ai|vj)
43
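The two procedures above in compact Python, with the (nk+1)/(n+|Vocabulary|) smoothing from slide 42, run on a toy corpus (a sketch; the corpus and function names are ours):

```python
from collections import Counter

def learn_nb_text(examples):
    """examples: list of (word_list, label). Returns P(vj), P(wk|vj), Vocabulary."""
    vocab = {w for doc, _ in examples for w in doc}
    priors, word_p = {}, {}
    for v in {lab for _, lab in examples}:
        docs_v = [doc for doc, lab in examples if lab == v]
        priors[v] = len(docs_v) / len(examples)
        counts = Counter(w for doc in docs_v for w in doc)
        n = sum(counts.values())
        word_p[v] = {w: (counts[w] + 1) / (n + len(vocab)) for w in vocab}
    return priors, word_p, vocab

def classify_nb_text(doc, priors, word_p, vocab):
    def score(v):
        s = priors[v]
        for w in doc:
            if w in vocab:          # ignore positions outside Vocabulary
                s *= word_p[v][w]
        return s
    return max(priors, key=score)

corpus = [(["good", "great", "movie"], "+"), (["bad", "boring", "movie"], "-")]
model = learn_nb_text(corpus)
print(classify_nb_text(["great", "movie"], *model))  # → +
```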
Example



Twenty NewsGroups
Given 1000 training documents from each group, learn to classify
new document according to which newsgroup it came from
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
misc.forsale
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
alt.atheism
soc.religion.christian
talk.religion.misc
talk.politics.mideast
talk.politics.misc
talk.politics.guns
sci.space
sci.crypt
sci.electronics
sci.med
Naive Bayes: 89% classification accuracy
44
Bayesian networks

The Bayes optimal classifier makes no independence
assumptions among variables and is computationally
expensive
The naive Bayes classifier is efficient thanks to the very
restrictive assumption of conditional independence of all
attributes given the target value v
Bayesian networks describe conditional independence
among subsets of variables
45
Bayesian networks
DEF. Let X, Y, Z be discrete random variables.
X is conditionally independent of Y given Z if:
∀ xi, yj, zk  P(X=xi|Y=yj, Z=zk) = P(X=xi|Z=zk)
Easily extended to sets of variables:
P(X1,…, Xl|Y1,…,Ym,Z1,…, Zn) = P(X1,…, Xl | Z1,…, Zn)
46
Bayesian networks - Representation

Directed acyclic graph
 Nodes represent the random variables
 Edges represent the conditional dependence structure: each
variable is conditionally independent of its non-descendants,
given its parents
A local conditional probability table for each node, with the
probability distribution of the associated variable given its
immediate predecessors
One can compute:
P(y1,…, yn) = Πi P(yi|Parents(Yi))
47
Bayesian networks - Representation

[Figure: Bayesian network over the Boolean variables Storm (T), Trip (G),
Campfire (F), Lightning, Thunder, Forest fire, with the local conditional
probability table for Campfire given its parents Storm and Trip:]

     T,G   T,¬G  ¬T,G  ¬T,¬G
F    0.4   0.1   0.8   0.2
¬F   0.6   0.9   0.2   0.8

(Boolean-valued variables)
48
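Using the Campfire table above plus illustrative priors for Storm and Trip (the priors are our assumption, not given on the slide), the factored joint P(y1,…,yn) = Πi P(yi|Parents(Yi)) can be evaluated directly:

```python
p_storm, p_trip = 0.1, 0.3   # assumed priors (not on the slide)
p_campfire = {(True, True): 0.4, (True, False): 0.1,   # P(F=T | Storm, Trip)
              (False, True): 0.8, (False, False): 0.2}

def joint(storm, trip, campfire):
    """P(storm, trip, campfire) = P(storm) P(trip) P(campfire|storm, trip)."""
    p = (p_storm if storm else 1 - p_storm) * (p_trip if trip else 1 - p_trip)
    pc = p_campfire[(storm, trip)]
    return p * (pc if campfire else 1 - pc)

# A proper distribution: the eight joint probabilities sum to 1.
total = sum(joint(s, t, c) for s in (True, False)
            for t in (True, False) for c in (True, False))
print(round(total, 10))  # → 1.0
```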