Bayesian Nonparametrics and DPMM
Machine Learning: Jordan Boyd-Graber
University of Colorado Boulder
LECTURE 17
Clustering as Probabilistic Inference
• GMM is a probabilistic model (unlike K-means)
• There are several latent variables:
◦ Means
◦ Assignments
◦ (Variances)
• Before, we were doing EM
• Today, new models and new inference methods
Nonparametric Clustering
• What if the number of clusters is not fixed?
• Nonparametric: the model can grow if the data need it
• Puts a probability distribution over the number of clusters
Dirichlet Process
• Distribution over distributions
• Parameterized by α and G:
◦ Concentration parameter α
◦ Base distribution G
• You can then draw x ∼ DP(α, G).
Defining a DP
• Break off sticks
• Draw atoms
• Merge into complete distribution
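This is the stick-breaking construction. Below is a minimal sketch of it in Python, truncated to finitely many sticks (an illustrative implementation, not from the lecture; the function and variable names are my own, and the base distribution G is taken to be a standard normal for concreteness, matching the Gaussian example later in these slides):

import random

def stick_breaking(alpha, num_sticks=100, seed=0):
    """Truncated stick-breaking draw from DP(alpha, G) with G = N(0, 1)."""
    rng = random.Random(seed)
    weights, atoms = [], []
    remaining = 1.0                         # length of stick still unbroken
    for _ in range(num_sticks):
        beta = rng.betavariate(1.0, alpha)  # break off a Beta(1, alpha) fraction
        weights.append(remaining * beta)    # weight of this component
        atoms.append(rng.gauss(0.0, 1.0))   # draw an atom from the base G
        remaining *= 1.0 - beta
    # (weights, atoms) define a discrete distribution: the merged result
    return weights, atoms

weights, atoms = stick_breaking(alpha=1.0)
print(sum(weights))  # close to 1; the small remainder is the truncation error

The truncation at num_sticks is exactly the computational compromise mentioned on the next slide: a full draw would have infinitely many atoms.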
Properties of a DPMM
• Expected value is the same as the base distribution:
E_{DP(α,G)}[x] = E_G[x]    (1)
• As α → ∞, DP(α, G) → G
• Number of components is unbounded
• Impossible to represent fully on a computer (truncation)
• You can nest DPs
Effect of scaling parameter α
[Figure not reproduced: draws from DP(α, G) for different values of α.]
DP as Mixture Model
[Figure not reproduced: the DP used as a mixture model.]
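As a sketch of the generative story behind a DP mixture (a standard formulation; the Gaussian likelihood is an assumption for concreteness, matching the unit-variance example later in these slides):

G' ∼ DP(α, G)      (draw a random mixing distribution)
θ_i ∼ G'           (draw a cluster parameter for observation i)
x_i ∼ N(θ_i, 1)    (draw the observation itself)

Because a draw G' from a DP is discrete, many observations share the same θ, which is what makes this a clustering model.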
The Chinese Restaurant as a Distribution
To generate an observation, you first sit down at a table, choosing each table with probability proportional to the number of people already sitting there.
[Figure not reproduced: three tables seating 2, 3, and 2 of 7 customers, so the seating probabilities are 2/7, 3/7, and 2/7; an observation at table k is then drawn as x ∼ µ_k.]
But this is just Maximum Likelihood
Why are we talking about Chinese Restaurants?
Always can squeeze in one more table . . .
• The posterior predictive of a DP is the CRP
• A new observation starts a new table / cluster with probability proportional to α
• But this must be balanced against the probability of the observation given a cluster
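A minimal simulation of this seating process, including the new-table option (an illustrative sketch; the function and variable names are my own):

import random

def crp_seating(num_customers, alpha, seed=0):
    """Sample table assignments from a Chinese Restaurant Process."""
    rng = random.Random(seed)
    counts = []                            # customers at each open table
    assignments = []
    for i in range(num_customers):         # i customers already seated
        weights = counts + [alpha]         # existing tables, then a new one
        r = rng.random() * (i + alpha)     # total weight is i + alpha
        k, acc = 0, 0.0
        for k, w in enumerate(weights):
            acc += w
            if r < acc:
                break
        if k == len(counts):
            counts.append(1)               # squeeze in one more table
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments, counts

print(crp_seating(20, alpha=1.0))

Larger α opens new tables more readily, matching the effect of the scaling parameter shown earlier.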
Gibbs Sampling
• We want to know the cluster assignment (table) of each observation
• Take a random guess initially
• This provides a mean for each cluster
• Let the number of clusters grow
Gibbs Sampling
• We want to know ~z
• Compute p(z_i | z_1, . . . , z_{i−1}, z_{i+1}, . . . , z_m, x, α, G)
• Update z_i by sampling from that distribution
• Keep going . . .
Notation
p(z_i = k | z_{−i}) ≡ p(z_i | z_1, . . . , z_{i−1}, z_{i+1}, . . . , z_m)    (2)
Gibbs Sampling for DPMM
p(z_i = k | ~z_{−i}, ~x, {θ_k}, α)                                        (3)
= p(z_i = k | ~z_{−i}, x_i, ~x, θ_k, α)      [dropping irrelevant terms]  (4)
= p(z_i = k | ~z_{−i}, α) · p(x_i | θ_k, ~x)      [chain rule]            (5)
= { (n_k / (n_· + α)) ∫_θ p(x_i | θ) p(θ | G, ~x) dθ    (existing)
  { (α / (n_· + α)) ∫_θ p(x_i | θ) p(θ | G) dθ          (new)    [applying CRP]  (6)
= { (n_k / (n_· + α)) · N(x; n x̄ / (n + 1), 1)    (existing)
  { (α / (n_· + α)) · N(x; 0, 1)                  (new)                   (7)
Scary integrals, assuming G is a normal distribution with mean zero and unit variance. (Derived in optional reading.)
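As a concrete reading of Equation (7), here is a sketch that computes the unnormalized assignment probabilities for one observation (an illustrative implementation; the names are my own, Eq. (7)'s n and x̄ are read as the count and mean of cluster k, and the slide's simplified unit-variance predictive densities are used as written):

import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def assignment_probs(x_i, cluster_stats, alpha):
    """Unnormalized p(z_i = k | ...) from Eq. (7).
    cluster_stats: list of (n_k, xbar_k) pairs for existing clusters."""
    n_total = sum(n_k for n_k, _ in cluster_stats)
    probs = []
    for n_k, xbar_k in cluster_stats:
        # existing cluster: CRP weight times posterior predictive of x_i
        probs.append(n_k / (n_total + alpha)
                     * normal_pdf(x_i, n_k * xbar_k / (n_k + 1), 1.0))
    # new cluster: weight alpha times predictive under the base G = N(0, 1)
    probs.append(alpha / (n_total + alpha) * normal_pdf(x_i, 0.0, 1.0))
    return probs

print(assignment_probs(0.5, [(2, -1.0), (3, 1.2)], alpha=1.0))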
Algorithm for Gibbs Sampling
1. Random initial assignment to clusters
2. For each iteration, sweep over the observations:
2.1 “Unassign” observation n
2.2 Choose a new cluster for observation n (possibly a brand-new one)
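Putting the two steps together, a runnable sketch of the whole sampler under the same assumptions as Equation (7) (standard-normal base distribution, unit-variance likelihood; all names are illustrative, not the lecture's code):

import math, random

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def gibbs_dpmm(xs, alpha, iterations=50, seed=0):
    rng = random.Random(seed)
    z = [0] * len(xs)                  # 1. initial assignment (one big cluster)
    for _ in range(iterations):        # 2. sweep repeatedly over the data
        for n, x in enumerate(xs):
            z[n] = -1                  # 2.1 "unassign" observation n
            labels = sorted(set(z) - {-1})
            weights = []
            for k in labels:
                members = [xs[j] for j in range(len(xs)) if z[j] == k]
                n_k, xbar = len(members), sum(members) / len(members)
                weights.append(n_k / (len(xs) - 1 + alpha)
                               * normal_pdf(x, n_k * xbar / (n_k + 1), 1.0))
            # the brand-new-cluster option from Eq. (7)
            weights.append(alpha / (len(xs) - 1 + alpha) * normal_pdf(x, 0.0, 1.0))
            r = rng.random() * sum(weights)    # 2.2 sample a new cluster
            acc = 0.0
            for idx, w in enumerate(weights):
                acc += w
                if r < acc:
                    break
            z[n] = labels[idx] if idx < len(labels) else max(labels, default=-1) + 1
    return z

print(gibbs_dpmm([-2.1, -1.9, -2.0, 1.8, 2.2, 2.0], alpha=1.0))

On this toy input the sampler typically settles on two clusters, one around −2 and one around 2, without the number of clusters ever being fixed in advance.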
Toy Example
[Figure sequence not reproduced: the sampler reassigns observations one at a time.]
New cluster created!
And repeat . . .
Differences between EM and Gibbs
• Gibbs often faster to implement
• EM easier to diagnose convergence
• EM can be parallelized
• Gibbs is more widely applicable
In class and next week
• Walking through DPMM clustering
• Clustering discrete data with more than one cluster per observation