Variational Inference
Machine Learning: Jordan Boyd-Graber
University of Colorado Boulder
LECTURE 19
Variational Inference
• Inferring hidden variables
• Unlike MCMC:
◦ Deterministic
◦ Easy to gauge convergence
◦ Requires dozens of iterations
• Doesn’t require conjugacy
• Slightly hairier math
Setup
• x = x_{1:n}: observations
• z = z_{1:m}: hidden variables
• α: fixed parameters
• Want the posterior distribution

    p(z | x, α) = p(z, x | α) / ∫_z p(z, x | α)    (1)
Motivation
• Can't compute the posterior for many interesting models
GMM (finite)
1. Draw µ_k ∼ N(0, τ²)
2. For each observation i = 1 . . . n:
2.1 Draw z_i ∼ Mult(π)
2.2 Draw x_i ∼ N(µ_{z_i}, σ₀²)
• The posterior is intractable for large n, and we might want to add priors

    p(µ_{1:K}, z_{1:n} | x_{1:n}) = ∏_{k=1}^K p(µ_k) ∏_{i=1}^n p(z_i) p(x_i | z_i, µ_{1:K})
                                    / ( ∫_{µ_{1:K}} Σ_{z_{1:n}} ∏_{k=1}^K p(µ_k) ∏_{i=1}^n p(z_i) p(x_i | z_i, µ_{1:K}) )    (2)

• The denominator is the problem: it considers all means (the integral over µ_{1:K}) and all assignments (the sum over z_{1:n})
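As a concrete illustration, here is a minimal NumPy sketch of the generative process above; K, n, τ, σ₀, and π are assumed example values, not anything specified in the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

K, n = 3, 500             # assumed example sizes
tau, sigma0 = 5.0, 1.0    # assumed prior and likelihood scales
pi = np.full(K, 1.0 / K)  # assumed uniform mixing weights

# 1. Draw component means mu_k ~ N(0, tau^2)
mu = rng.normal(0.0, tau, size=K)

# 2. For each observation, draw z_i ~ Mult(pi), then x_i ~ N(mu_{z_i}, sigma0^2)
z = rng.choice(K, size=n, p=pi)
x = rng.normal(mu[z], sigma0)
```

Inference runs the other way: given only x, recover µ_{1:K} and z_{1:n}; the denominator of (2) is what makes that exact computation intractable.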
Main Idea
• We create a variational distribution over the latent variables

    q(z_{1:m} | ν)    (3)

• Find the settings of ν so that q is close to the posterior
• If q == p, then this is vanilla EM
What does it mean for distributions to be close?
• We measure the closeness of distributions using the Kullback-Leibler divergence

    KL(q || p) ≡ E_q[ log ( q(Z) / p(Z | x) ) ]    (4)

• Characterizing KL divergence
◦ If q and p are both high, we're happy
◦ If q is high but p isn't, we pay a price
◦ If q is low, we don't care
◦ If KL = 0, the distributions are equal
• This behavior is often called "mode splitting": we want a good solution, not every solution
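To make these bullets concrete, here is a small sketch that computes KL(q || p) for two candidate q's against a hypothetical bimodal posterior p; all three distributions are assumed examples.

```python
import numpy as np

def kl(q, p):
    """KL(q || p) = E_q[log q(Z) - log p(Z)] for discrete distributions."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    mask = q > 0                       # terms where q = 0 contribute nothing
    return float(np.sum(q[mask] * (np.log(q[mask]) - np.log(p[mask]))))

p         = np.array([0.500, 0.001, 0.499])   # bimodal "posterior"
q_onemode = np.array([0.980, 0.010, 0.010])   # high only where p is high
q_spread  = np.array([1 / 3, 1 / 3, 1 / 3])   # high where p isn't

print(kl(q_onemode, p))   # modest penalty: q covers one good mode
print(kl(q_spread, p))    # large penalty for putting mass where p is low
print(kl(p, p))           # 0.0: equal distributions
```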
Jensen's Inequality: Concave Functions and Expectations
[Figure: for the concave function log, the chord between (x₁, log x₁) and (x₂, log x₂) lies below the curve, so log(t·x₁ + (1 − t)·x₂) ≥ t·log(x₁) + (1 − t)·log(x₂).]
When f is concave,

    f(E[X]) ≥ E[f(X)]

If you haven't seen this before, spend fifteen minutes to convince yourself that it's true.
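A quick numerical sanity check with f = log; the Gamma distribution below is just an assumed example of a positive random variable.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=1.5, size=100_000)   # any positive random variable

lhs = np.log(x.mean())       # f(E[X]) with f = log, which is concave
rhs = np.log(x).mean()       # E[f(X)]
print(lhs, rhs, lhs >= rhs)  # Jensen: log E[X] >= E[log X]
```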
Evidence Lower Bound (ELBO)
• Apply Jensen's inequality to the log probability of the data

    log p(x) = log ∫_z p(x, z)
             = log ∫_z p(x, z) · q(z) / q(z)          (multiply by a term that is equal to one)
             = log E_q[ p(x, z) / q(z) ]              (use q(z) as the density to form an expectation)
             ≥ E_q[log p(x, z)] − E_q[log q(z)]       (apply Jensen's inequality and split the log of the quotient)

• Fun side effect: the second term, −E_q[log q(z)], is the entropy of q
• Maximizing the ELBO gives as tight a bound as possible on the log probability
Relation to KL Divergence
• Conditional probability definition

    p(z | x) = p(z, x) / p(x)    (5)

• Plug into the KL divergence

    KL(q(z) || p(z | x)) = E_q[ log ( q(z) / p(z | x) ) ]
                         = E_q[log q(z)] − E_q[log p(z | x)]                     (break the quotient into a difference)
                         = E_q[log q(z)] − E_q[log p(z, x)] + log p(x)           (apply the definition of conditional probability)
                         = −( E_q[log p(z, x)] − E_q[log q(z)] ) + log p(x)      (reorganize terms)

• This is the negative of the ELBO (plus a constant); minimizing the KL divergence is the same as maximizing the ELBO
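The identity log p(x) = ELBO(q) + KL(q || p(z | x)) is easy to verify numerically; the joint table and q below are assumed toy values.

```python
import numpy as np

# Toy joint p(z, x) for one observed x and a latent z in {0, 1, 2}
joint = np.array([0.20, 0.05, 0.15])     # p(z, x); its sum is the evidence p(x)
log_px = np.log(joint.sum())
posterior = joint / joint.sum()          # p(z | x)

q = np.array([0.6, 0.2, 0.2])            # an arbitrary variational distribution

elbo = np.sum(q * (np.log(joint) - np.log(q)))       # E_q[log p(x, z)] - E_q[log q(z)]
kl   = np.sum(q * (np.log(q) - np.log(posterior)))   # KL(q || p(z | x))

print(elbo + kl, log_px)   # equal: log p(x) = ELBO + KL
print(elbo <= log_px)      # True, since KL >= 0
```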
Mean field variational inference
• Assume that your variational distribution factorizes

    q(z₁, . . . , z_m) = ∏_{j=1}^m q(z_j)    (6)

• You may want to group some hidden variables together
• This family does not contain the true posterior, because the hidden variables are actually dependent
General Blueprint
• Choose q
• Derive ELBO
• Coordinate ascent of each qi
• Repeat until convergence
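In code, the blueprint is a coordinate-ascent loop. The sketch below is a generic skeleton under the assumption that you have already derived per-factor update functions and an ELBO for your model; `updates` and `elbo` are hypothetical placeholders, not anything defined in the lecture.

```python
def coordinate_ascent_vi(data, qs, updates, elbo, tol=1e-4, max_iters=1000):
    """Generic mean-field coordinate ascent.

    qs:      dict of variational parameters, one entry per factor q_i
    updates: dict of functions; updates[name](data, qs) returns new parameters for q_name
    elbo:    function elbo(data, qs) -> float, used to monitor convergence
    """
    old = float("-inf")
    for _ in range(max_iters):
        for name, update in updates.items():    # update each q_i, holding the others fixed
            qs[name] = update(data, qs)
        bound = elbo(data, qs)
        assert bound >= old - 1e-8, "the ELBO should never decrease"
        if bound - old < tol:                   # deterministic, so convergence is easy to gauge
            break
        old = bound
    return qs
```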
Example: Latent Dirichlet Allocation
• TOPIC 1: computer, technology, system, service, site, phone, internet, machine
• TOPIC 2: sell, sale, store, product, business, advertising, market, consumer
• TOPIC 3: play, film, movie, theater, production, star, director, stage
• Example headlines associated with these topics:
◦ TOPIC 1: "Red Light, Green Light: A 2-Tone L.E.D. to Simplify Screens"
◦ TOPIC 2: "The three big Internet portals begin to distinguish among themselves as shopping malls"
◦ TOPIC 3: "Forget the Bootleg, Just Download the Movie Legally"; "The Shape of Cinema, Transformed At the Click of a Mouse"
◦ Additional headlines from the same figure: "Stock Trades: A Better Deal For Investors Isn't Simple"; "Multiplex Heralded As Linchpin To Growth"; "A Peaceful Crew Puts Muppets Where Its Mouth Is"
• A single document can mix topics: "Hollywood studios are preparing to let people download and buy electronic copies of movies over the Internet, much as record labels now sell songs for 99 cents through Apple Computer's iTunes music store and other online services ..."
LDA Generative Model
[Plate diagram: α → θ_d → z_n → w_n, with w_n also depending on β_k; plates over K topics, N word positions, and M documents.]
• For each topic k ∈ {1, . . . , K}, draw a multinomial distribution β_k
• For each document d ∈ {1, . . . , M}, draw a multinomial distribution θ_d from a Dirichlet distribution with parameter α
• For each word position n ∈ {1, . . . , N}, select a hidden topic z_n from the multinomial distribution parameterized by θ_d
• Choose the observed word w_n from the distribution β_{z_n}
Statistical inference uncovers the unobserved variables given the data.
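A minimal NumPy sketch of this generative story; the vocabulary size V, the corpus sizes, and the Dirichlet prior used here to draw each β_k are all assumed example choices rather than part of the model as stated above.

```python
import numpy as np

rng = np.random.default_rng(0)
K, M, N, V = 3, 10, 50, 20       # topics, documents, words per document, vocabulary size
alpha = np.full(K, 0.5)          # Dirichlet parameter for per-document topic proportions

beta = rng.dirichlet(np.full(V, 0.1), size=K)   # topic k: a distribution beta_k over the vocabulary
theta = rng.dirichlet(alpha, size=M)            # document d: a distribution theta_d over topics

docs = []
for d in range(M):
    z = rng.choice(K, size=N, p=theta[d])                  # hidden topic for each word position
    w = np.array([rng.choice(V, p=beta[k]) for k in z])    # observed word drawn from beta_{z_n}
    docs.append(w)
```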
Deriving Variational Inference for LDA
Joint distribution:

    p(θ, z, w | α, β) = ∏_d p(θ_d | α) ∏_n p(z_{d,n} | θ_d) p(w_{d,n} | β, z_{d,n})    (7)

• p(θ_d | α) = ( Γ(Σ_i α_i) / ∏_i Γ(α_i) ) ∏_k θ_{d,k}^{α_k − 1}    (Dirichlet)
• p(z_{d,n} | θ_d) = θ_{d,z_{d,n}}    (draw from multinomial)
• p(w_{d,n} | β, z_{d,n}) = β_{z_{d,n}, w_{d,n}}    (draw from multinomial)

Variational distribution:

    q(θ, z) = q(θ | γ) q(z | φ)    (8)

ELBO:

    L(γ, φ; α, β) = E_q[log p(θ | α)] + E_q[log p(z | θ)] + E_q[log p(w | z, β)]
                    − E_q[log q(θ)] − E_q[log q(z)]    (9)
What is the variational distribution?

    q(θ, z) = ∏_d q(θ_d | γ_d) ∏_n q(z_{d,n} | φ_{d,n})    (10)

• Variational document distribution over topics, γ_d
◦ Vector of length K for each document
◦ Non-negative
◦ Doesn't sum to 1.0
• Variational token distribution over topic assignments, φ_{d,n}
◦ Vector of length K for every token
◦ Non-negative, sums to 1.0
Expectation of log Dirichlet
• Most expectations are straightforward to compute
• The Dirichlet is harder: for θ ∼ Dir(α),

    E[log θ_i] = Ψ(α_i) − Ψ( Σ_j α_j )    (11)
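Equation (11) is easy to verify by Monte Carlo; the α below is an arbitrary assumed example.

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)
alpha = np.array([2.0, 0.5, 1.5])               # assumed example Dirichlet parameter

samples = rng.dirichlet(alpha, size=200_000)    # theta ~ Dir(alpha)
monte_carlo = np.log(samples).mean(axis=0)      # estimate of E[log theta_i]
closed_form = digamma(alpha) - digamma(alpha.sum())

print(monte_carlo)
print(closed_form)   # the two agree to a few decimal places
```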
Expectation 1

    E_q[log p(θ | α)] = E_q[ log ( ( Γ(Σ_i α_i) / ∏_i Γ(α_i) ) ∏_i θ_i^{α_i − 1} ) ]    (12)
                      = E_q[ log ( Γ(Σ_i α_i) / ∏_i Γ(α_i) ) + Σ_i log θ_i^{α_i − 1} ]
                        (the log of a product becomes a sum of logs)
                      = log Γ(Σ_i α_i) − Σ_i log Γ(α_i) + E_q[ Σ_i (α_i − 1) log θ_i ]
                        (the log of an exponent becomes a product; the expectation of a constant is the constant)
                      = log Γ(Σ_i α_i) − Σ_i log Γ(α_i) + Σ_i (α_i − 1)( Ψ(γ_i) − Ψ(Σ_j γ_j) )
                        (expectation of log Dirichlet, with θ ∼ Dir(γ) under q)
Expectation 2

    E_q[log p(z | θ)] = E_q[ log ∏_n ∏_i θ_i^{1[z_n = i]} ]
                      = E_q[ Σ_n Σ_i 1[z_n = i] log θ_i ]
                        (products to sums)
                      = Σ_n Σ_i E_q[ 1[z_n = i] log θ_i ]
                        (linearity of expectation)
                      = Σ_n Σ_i φ_{n,i} E_q[log θ_i]
                        (z and θ are independent under q, and E_q[1[z_n = i]] = φ_{n,i})
                      = Σ_n Σ_i φ_{n,i} ( Ψ(γ_i) − Ψ(Σ_j γ_j) )
                        (expectation of log Dirichlet)
Expectation 3
For a single token (the full objective sums over d and n):

    E_q[log p(w | z, β)] = E_q[ log β_{z_{d,n}, w_{d,n}} ]
                         = E_q[ log ∏_{v=1}^V ∏_{i=1}^K β_{i,v}^{1[v = w_{d,n}, z_{d,n} = i]} ]
                         = Σ_{v=1}^V Σ_{i=1}^K E_q[ 1[v = w_{d,n}, z_{d,n} = i] ] log β_{i,v}
                         = Σ_{v=1}^V Σ_{i=1}^K φ_{n,i} w_{d,n}^v log β_{i,v}

where w_{d,n}^v = 1[w_{d,n} = v] indicates that token n of document d is vocabulary item v.
Entropies
Entropy of the Dirichlet:

    H_q[γ] = − log Γ( Σ_j γ_j ) + Σ_i log Γ(γ_i) − Σ_i (γ_i − 1)( Ψ(γ_i) − Ψ( Σ_{j=1}^k γ_j ) )

Entropy of the multinomial:

    H_q[φ_{d,n}] = − Σ_i φ_{d,n,i} log φ_{d,n,i}    (22)
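The Dirichlet entropy formula is easy to sanity-check against scipy's implementation; the γ below is an assumed example value.

```python
import numpy as np
from scipy.special import digamma, gammaln
from scipy.stats import dirichlet

gamma_d = np.array([3.0, 1.2, 0.7])   # assumed example variational Dirichlet parameter

closed_form = (-gammaln(gamma_d.sum())
               + gammaln(gamma_d).sum()
               - np.sum((gamma_d - 1.0) * (digamma(gamma_d) - digamma(gamma_d.sum()))))

print(closed_form)
print(dirichlet(gamma_d).entropy())   # matches the formula above
```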
Complete objective function
The full ELBO L(γ, φ; α, β) collects the three expectations above together with −E_q[log q(θ)] − E_q[log q(z)], i.e., the entropies of the variational distributions. Note the entropy terms at the end (negative sign).
Deriving the algorithm
• Compute the partial derivative with respect to the variable of interest
• Set it equal to zero
• Solve for the variable
Update for φ
Derivative of the ELBO (with a Lagrange multiplier λ enforcing Σ_i φ_{n,i} = 1, and v = w_{d,n}):

    ∂L/∂φ_{n,i} = Ψ(γ_i) − Ψ( Σ_j γ_j ) + log β_{i,v} − log φ_{n,i} − 1 + λ    (23)

Solution:

    φ_{n,i} ∝ β_{i,v} exp( Ψ(γ_i) − Ψ( Σ_j γ_j ) )    (24)
Update for γ
Derivative of the ELBO:

    ∂L/∂γ_i = Ψ′(γ_i) ( α_i + Σ_n φ_{n,i} − γ_i ) − Ψ′( Σ_j γ_j ) Σ_j ( α_j + Σ_n φ_{n,j} − γ_j )

Solution:

    γ_i = α_i + Σ_n φ_{n,i}    (25)
Update for β
Slightly more complicated (requires a Lagrange parameter), but the solution is obvious:

    β_{i,j} ∝ Σ_d Σ_n φ_{d,n,i} w_{d,n}^j    (26)
Overall Algorithm
1. Randomly initialize the variational parameters (they can't be uniform)
2. For each iteration (a sketch of this loop follows below):
2.1 For each document, update γ and φ
2.2 For the corpus, update β
2.3 Compute L for diagnostics
3. Return the expectations of the variational parameters as the solution for the latent variables
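A compact sketch of this loop using the updates in (24), (25), and (26). It assumes the corpus is a list of word-id arrays and performs a single φ/γ pass per document per iteration; it is meant to show the structure, not to be an efficient or faithful reproduction of the lecture's implementation (a more careful version would also compute L each iteration for diagnostics).

```python
import numpy as np
from scipy.special import digamma

def lda_cavi(docs, K, V, alpha=0.1, iters=50, seed=0):
    """docs: list of integer arrays of word ids in {0, ..., V-1}; K topics."""
    rng = np.random.default_rng(seed)
    beta = rng.dirichlet(np.ones(V), size=K)                      # random (non-uniform) initialization
    gamma = alpha + np.array([[len(d) / K] * K for d in docs])    # gamma_d initialized to alpha + N_d / K

    for _ in range(iters):
        beta_new = np.full((K, V), 1e-100)                        # tiny floor avoids zero rows
        for d, words in enumerate(docs):
            # phi update (24): phi_{n,i} proportional to beta_{i, w_n} exp(Psi(gamma_i) - Psi(sum_j gamma_j))
            exp_elog_theta = np.exp(digamma(gamma[d]) - digamma(gamma[d].sum()))
            phi = beta[:, words].T * exp_elog_theta               # shape (N_d, K)
            phi /= phi.sum(axis=1, keepdims=True)
            # gamma update (25): gamma_i = alpha_i + sum_n phi_{n,i}
            gamma[d] = alpha + phi.sum(axis=0)
            # accumulate sufficient statistics for the beta update (26)
            for n, w in enumerate(words):
                beta_new[:, w] += phi[n]
        beta = beta_new / beta_new.sum(axis=1, keepdims=True)
    return gamma, beta
```

For example, `gamma, beta = lda_cavi(docs, K=3, V=20)` runs on the toy corpus from the generative sketch earlier; the normalized rows of γ and the rows of β are then read off as the latent-variable estimates.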
Relationship with Gibbs Sampling
• Gibbs sampling: sample each variable from its conditional distribution given all other variables
• Variational inference: each factor is set proportional to the exponentiated expected log of that conditional
• Variational inference is easier to parallelize; Gibbs is faster per step
• Gibbs is typically easier to implement
Implementation Tips
• Match the derivation exactly at first
• Randomize the initialization, but specify the seed
• Use simple languages first . . . then match the implementation
• Try to match variable names to the paper
• Write unit tests for each atomic update
• Monitor the variational bound (with asserts)
• Write out the state (checkpointing and debugging)
• Visualize the variational parameters
• Cache / memoize gamma / digamma functions (a small sketch follows)
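For the last tip, a minimal sketch of what memoizing the digamma function might look like; whether this pays off depends on how often identical arguments recur in your implementation.

```python
from functools import lru_cache
from scipy.special import digamma

@lru_cache(maxsize=None)
def cached_digamma(x: float) -> float:
    """Digamma with memoization, useful when the same counts recur many times."""
    return float(digamma(x))

print(cached_digamma(3.5))   # computed
print(cached_digamma(3.5))   # cache hit
```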
Next class
• Example on toy LDA problem
• Current research in variational inference