Variational Inference. Machine Learning, Lecture 19. Jordan Boyd-Graber, University of Colorado Boulder.
Variational Inference
• Inferring hidden variables
• Unlike MCMC:
  ◦ Deterministic
  ◦ Easy to gauge convergence
  ◦ Requires dozens of iterations
• Doesn't require conjugacy
• Slightly hairier math

Setup
• x = x_{1:n}: observations
• z = z_{1:m}: hidden variables
• α: fixed parameters
• We want the posterior distribution

  p(z | x, α) = p(z, x | α) / ∫_z p(z, x | α) dz    (1)

Motivation
• We can't compute the posterior for many interesting models. For example, a finite Gaussian mixture model (GMM):
  1. Draw µ_k ∼ N(0, τ²) for each component k = 1 … K
  2. For each observation i = 1 … n:
     2.1 Draw z_i ∼ Mult(π)
     2.2 Draw x_i ∼ N(µ_{z_i}, σ₀²)
• The posterior is intractable for large n, and we might want to add priors:

  p(µ_{1:K}, z_{1:n} | x_{1:n}) = ∏_{k=1}^K p(µ_k) ∏_{i=1}^n p(z_i) p(x_i | z_i, µ_{1:K})  /  ∫_{µ_{1:K}} ∑_{z_{1:n}} ∏_{k=1}^K p(µ_k) ∏_{i=1}^n p(z_i) p(x_i | z_i, µ_{1:K})    (2)

  The denominator integrates over all possible means and sums over all possible assignments.

Main Idea
• We create a variational distribution over the latent variables

  q(z_{1:m} | ν)    (3)

• Find the settings of ν so that q is close to the posterior
• If q == p, then this is vanilla EM

What does it mean for distributions to be close?
• We measure the closeness of distributions using the Kullback-Leibler divergence

  KL(q || p) ≡ E_q[ log ( q(Z) / p(Z | x) ) ]    (4)

• Characterizing the KL divergence:
  ◦ If q and p are both high, we're happy
  ◦ If q is high but p isn't, we pay a price
  ◦ If q is low, we don't care
  ◦ If KL = 0, then the distributions are equal
• This behavior is often called "mode splitting": we want a good solution, not every solution.
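To make that asymmetry concrete, here is a minimal numerical sketch (NumPy/SciPy assumed; the distributions p and q below are invented toy examples, not anything from the lecture): KL(q || p) penalizes a q that is high where p is low much more than a q that simply ignores part of p.

```python
import numpy as np
from scipy.special import rel_entr   # elementwise q * log(q / p)

# A hypothetical bimodal "posterior" p and two candidate approximations q.
p = np.array([0.495, 0.010, 0.495])
q_one_mode = np.array([0.980, 0.010, 0.010])   # covers only one mode of p
q_spread   = np.array([1/3, 1/3, 1/3])         # puts mass where p is tiny

def kl(q, p):
    """KL(q || p) = E_q[log q(Z) - log p(Z | x)] for discrete distributions."""
    return rel_entr(q, p).sum()

print(kl(q_one_mode, p))   # about 0.63: ignoring one mode of p is tolerated
print(kl(q_spread, p))     # about 0.9: q high where p is low pays a price
```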
Jensen's Inequality: Concave Functions and Expectations
• When f is concave, f(E[X]) ≥ E[f(X)].
• [Figure: the chord from (x₁, log x₁) to (x₂, log x₂) lies below the log curve, so log(t·x₁ + (1−t)·x₂) ≥ t·log(x₁) + (1−t)·log(x₂).]
• If you haven't seen this before, spend fifteen minutes to convince yourself that it's true.

Evidence Lower Bound (ELBO)
• Apply Jensen's inequality to the log probability of the data:

  log p(x) = log ∫_z p(x, z)
           = log ∫_z p(x, z) · q(z)/q(z)             (multiply by a term equal to one)
           = log E_q[ p(x, z) / q(z) ]               (treat q(z) as the measure, turning the integral into an expectation)
           ≥ E_q[log p(x, z)] − E_q[log q(z)]        (apply Jensen's inequality and split the log of the quotient into a difference)

• Fun side effect: the second term is the entropy of q
• Maximizing the ELBO gives as tight a bound as possible on the log probability of the data
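As a quick sanity check on the bound, here is a minimal sketch on a toy discrete model (NumPy assumed; the joint table and the variational distributions are invented for illustration). It confirms that the ELBO never exceeds log p(x) and is tight exactly when q is the true posterior.

```python
import numpy as np

# Hypothetical toy model: one observation x and a discrete latent z in {0, 1, 2}.
# joint[z] = p(x, z) evaluated at the observed x, so p(x) = joint.sum().
joint = np.array([0.05, 0.20, 0.10])
log_px = np.log(joint.sum())

def elbo(q):
    """E_q[log p(x, z)] - E_q[log q(z)] for a discrete q(z) with no zero entries."""
    return np.sum(q * np.log(joint)) - np.sum(q * np.log(q))

q_arbitrary = np.array([0.6, 0.2, 0.2])   # some arbitrary variational distribution
q_posterior = joint / joint.sum()         # the exact posterior p(z | x)

print(log_px)                 # about -1.05
print(elbo(q_arbitrary))      # about -1.63: strictly below log p(x)
print(elbo(q_posterior))      # equals log p(x): the bound is tight at the posterior
```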
Relation to KL Divergence
• Conditional probability definition:

  p(z | x) = p(z, x) / p(x)    (5)

• Plug this into the KL divergence:

  KL(q(z) || p(z | x)) = E_q[ log ( q(z) / p(z | x) ) ]
                       = E_q[log q(z)] − E_q[log p(z | x)]                 (break the quotient into a difference)
                       = E_q[log q(z)] − E_q[log p(z, x)] + log p(x)        (apply the definition of conditional probability; log p(x) is constant under E_q)
                       = −( E_q[log p(z, x)] − E_q[log q(z)] ) + log p(x)   (reorganize terms)

• This is the negative of the ELBO plus a constant, so minimizing the KL divergence is the same as maximizing the ELBO.

Mean Field Variational Inference
• Assume that your variational distribution factorizes:

  q(z₁, …, z_m) = ∏_{j=1}^m q(z_j)    (6)

• You may want to group some hidden variables together
• This family does not contain the true posterior, because the hidden variables are actually dependent

General Blueprint
• Choose q
• Derive the ELBO
• Coordinate ascent on each q_i
• Repeat until convergence

Example: Latent Dirichlet Allocation
• Topics are distributions over words:
  ◦ TOPIC 1: computer, technology, system, service, site, phone, internet, machine
  ◦ TOPIC 2: sell, sale, store, product, business, advertising, market, consumer
  ◦ TOPIC 3: play, film, movie, theater, production, star, director, stage
• [Figure: news headlines placed near the topics they draw from, e.g. "Red Light, Green Light: A 2-Tone L.E.D. to Simplify Screens", "The three big Internet portals begin to distinguish among themselves as shopping malls", "Stock Trades: A Better Deal For Investors Isn't Simple", "Forget the Bootleg, Just Download the Movie Legally", "The Shape of Cinema, Transformed At the Click of a Mouse", "Multiplex Heralded As Linchpin To Growth", "A Peaceful Crew Puts Muppets Where Its Mouth Is".]
• Documents mix the topics, e.g.: "Hollywood studios are preparing to let people download and buy electronic copies of movies over the Internet, much as record labels now sell songs for 99 cents through Apple Computer's iTunes music store and other online services ..."

LDA Generative Model
• [Plate diagram: α → θ_d → z_n → w_n ← β_k, with k = 1…K topics, n = 1…N word positions per document, d = 1…M documents.]
• For each topic k ∈ {1, …, K}, a multinomial distribution β_k
• For each document d ∈ {1, …, M}, draw a multinomial distribution θ_d from a Dirichlet distribution with parameter α
• For each word position n ∈ {1, …, N}, select a hidden topic z_n from the multinomial distribution parameterized by θ
• Choose the observed word w_n from the distribution β_{z_n}
• Statistical inference uncovers the unobserved variables given the data.
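To ground the generative story, here is a minimal forward-sampling sketch (NumPy assumed; the sizes K, V, M, N and the hyperparameter values α and η are made up, and drawing each β_k from a symmetric Dirichlet with η is my assumption, since the slide leaves β unspecified). This is the model we will invert with variational inference, not the inference algorithm itself.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, M, N = 3, 8, 5, 20        # topics, vocabulary size, documents, words per document (toy sizes)
alpha, eta = 0.5, 0.1           # symmetric Dirichlet hyperparameters (hypothetical values)

beta = rng.dirichlet(eta * np.ones(V), size=K)   # one topic-word distribution beta_k per topic

docs = []
for d in range(M):
    theta = rng.dirichlet(alpha * np.ones(K))    # per-document topic proportions theta_d
    words = []
    for n in range(N):
        z = rng.choice(K, p=theta)               # hidden topic z_n ~ Mult(theta_d)
        words.append(rng.choice(V, p=beta[z]))   # observed word w_n ~ Mult(beta_{z_n})
    docs.append(words)

print(docs[0])   # word ids of the first synthetic document
```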
Deriving Variational Inference for LDA
• Joint distribution:

  p(θ, z, w | α, β) = ∏_d p(θ_d | α) ∏_n p(z_{d,n} | θ_d) p(w_{d,n} | β, z_{d,n})    (7)

  ◦ p(θ_d | α) = [Γ(∑_i α_i) / ∏_i Γ(α_i)] ∏_k θ_{d,k}^{α_k − 1}   (Dirichlet)
  ◦ p(z_{d,n} | θ_d) = θ_{d, z_{d,n}}   (draw from a multinomial)
  ◦ p(w_{d,n} | β, z_{d,n}) = β_{z_{d,n}, w_{d,n}}   (draw from a multinomial)

• Variational distribution:

  q(θ, z) = q(θ | γ) q(z | φ)    (8)

• ELBO:

  L(γ, φ; α, β) = E_q[log p(θ | α)] + E_q[log p(z | θ)] + E_q[log p(w | z, β)] − E_q[log q(θ)] − E_q[log q(z)]    (9)

What is the variational distribution?

  q(θ, z) = ∏_d q(θ_d | γ_d) ∏_n q(z_{d,n} | φ_{d,n})    (10)

• Variational document distribution over topics, γ_d:
  ◦ A vector of length K for each document
  ◦ Non-negative
  ◦ Doesn't sum to 1.0
• Variational token distribution over topic assignments, φ_{d,n}:
  ◦ A vector of length K for every token
  ◦ Non-negative, sums to 1.0

Expectation of log Dirichlet
• Most expectations are straightforward to compute
• The Dirichlet is harder:

  E_Dir(α)[log θ_i] = Ψ(α_i) − Ψ(∑_j α_j)    (11)
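A quick way to convince yourself of identity (11) is to check it against samples. A minimal sketch, assuming NumPy/SciPy and an arbitrary α chosen only for illustration:

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)
alpha = np.array([0.8, 2.0, 3.5])                 # arbitrary Dirichlet parameter

analytic = digamma(alpha) - digamma(alpha.sum())  # Psi(alpha_i) - Psi(sum_j alpha_j)
samples = rng.dirichlet(alpha, size=200_000)
monte_carlo = np.log(samples).mean(axis=0)        # E[log theta_i] estimated from draws

print(analytic)
print(monte_carlo)   # should agree to roughly two decimal places
```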
Expectation 1

  E_q[log p(θ | α)] = E_q[ log ( [Γ(∑_i α_i) / ∏_i Γ(α_i)] ∏_i θ_i^{α_i − 1} ) ]    (12)
                    = E_q[ log ( Γ(∑_i α_i) / ∏_i Γ(α_i) ) + ∑_i log θ_i^{α_i − 1} ]          (log of a product becomes a sum of logs)
                    = log Γ(∑_i α_i) − ∑_i log Γ(α_i) + E_q[ ∑_i (α_i − 1) log θ_i ]           (log of an exponent becomes a product; the expectation of a constant is the constant)
                    = log Γ(∑_i α_i) − ∑_i log Γ(α_i) + ∑_i (α_i − 1) ( Ψ(γ_i) − Ψ(∑_j γ_j) )    (13)   (expectation of log Dirichlet)

Expectation 2

  E_q[log p(z | θ)] = E_q[ log ∏_n ∏_i θ_i^{1[z_n = i]} ]    (14)
                    = E_q[ ∑_n ∑_i 1[z_n = i] log θ_i ]                  (products to sums)
                    = ∑_n ∑_i E_q[ 1[z_n = i] log θ_i ]                  (linearity of expectation)
                    = ∑_n ∑_i φ_{n,i} E_q[log θ_i]                       (the variational factors are independent, so the expectation factors, and E_q[1[z_n = i]] = φ_{n,i})
                    = ∑_n ∑_i φ_{n,i} ( Ψ(γ_i) − Ψ(∑_j γ_j) )    (17)    (expectation of log Dirichlet)

Expectation 3

  E_q[log p(w | z, β)] = E_q[ log β_{z_{d,n}, w_{d,n}} ]    (18)
                       = E_q[ log ∏_{v=1}^V ∏_{i=1}^K β_{i,v}^{1[v = w_{d,n}, z_{d,n} = i]} ]    (19)
                       = ∑_{v=1}^V ∑_{i=1}^K E_q[ 1[v = w_{d,n}, z_{d,n} = i] ] log β_{i,v}    (20)
                       = ∑_{v=1}^V ∑_{i=1}^K φ_{n,i} w_{d,n}^v log β_{i,v}    (21)

  where w_{d,n}^v is the indicator that token n in document d is vocabulary item v.

Entropies
• Entropy of the Dirichlet:

  H_q[γ] = −log Γ(∑_j γ_j) + ∑_i log Γ(γ_i) − ∑_i (γ_i − 1) ( Ψ(γ_i) − Ψ(∑_j γ_j) )

• Entropy of the multinomial:

  H_q[φ_{d,n}] = −∑_i φ_{d,n,i} log φ_{d,n,i}    (22)

Complete objective function
• Summing the expectations (13), (17), and (21) over documents and tokens as appropriate, and adding the entropies H_q[γ_d] and H_q[φ_{d,n}], gives the complete objective L(γ, φ; α, β). Note the entropy terms at the end: they enter with a negative sign when written as −E_q[log q].

Deriving the algorithm
• Compute the partial derivative with respect to the variable of interest
• Set it equal to zero
• Solve for the variable

Update for φ
• Derivative of the ELBO (λ is the Lagrange multiplier enforcing ∑_i φ_{n,i} = 1):

  ∂L/∂φ_{n,i} = Ψ(γ_i) − Ψ(∑_j γ_j) + log β_{i,v} − log φ_{n,i} − 1 + λ    (23)

• Solution (with v = w_{d,n}, the observed word at position n):

  φ_{n,i} ∝ β_{i,v} exp( Ψ(γ_i) − Ψ(∑_j γ_j) )    (24)

Update for γ
• Derivative of the ELBO:

  ∂L/∂γ_i = Ψ′(γ_i) ( α_i + ∑_n φ_{n,i} − γ_i ) − Ψ′(∑_j γ_j) ∑_j ( α_j + ∑_n φ_{n,j} − γ_j )

• Solution:

  γ_i = α_i + ∑_n φ_{n,i}    (25)

Update for β
• Slightly more complicated (it requires a Lagrange parameter), but the solution is obvious:

  β_{i,j} ∝ ∑_d ∑_n φ_{d,n,i} w_{d,n}^j    (26)
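The updates (24)-(26) translate almost line for line into array code. The following is a minimal sketch under stated assumptions (NumPy/SciPy available; documents represented as lists of word ids; names such as `update_document`, `phi`, `gamma`, and `beta` are mine and simply mirror the symbols above), not a production implementation.

```python
import numpy as np
from scipy.special import digamma

def update_document(doc, beta, alpha, iters=50):
    """Coordinate ascent for one document; returns (gamma, phi).

    doc   : array of word ids (length N)
    beta  : K x V topic-word matrix, rows sum to one
    alpha : length-K Dirichlet hyperparameter
    """
    K = beta.shape[0]
    N = len(doc)
    phi = np.full((N, K), 1.0 / K)      # q(z_{d,n} | phi_{d,n}) starts uniform
    gamma = alpha + N / K               # a common starting point for gamma
    for _ in range(iters):
        # Eq. (24): phi_{n,i} proportional to beta_{i, w_n} * exp(Psi(gamma_i) - Psi(sum_j gamma_j))
        exp_log_theta = np.exp(digamma(gamma) - digamma(gamma.sum()))
        phi = beta[:, doc].T * exp_log_theta + 1e-100   # N x K, unnormalised; tiny floor avoids 0/0
        phi /= phi.sum(axis=1, keepdims=True)
        # Eq. (25): gamma_i = alpha_i + sum_n phi_{n,i}
        gamma = alpha + phi.sum(axis=0)
    return gamma, phi

def update_beta(docs, phis, K, V):
    """Eq. (26): beta_{i,j} proportional to sum_d sum_n phi_{d,n,i} * [w_{d,n} = j]."""
    beta = np.zeros((K, V))
    for doc, phi in zip(docs, phis):
        for n, w in enumerate(doc):
            beta[:, w] += phi[n]        # add each token's topic responsibilities to its word
    beta /= beta.sum(axis=1, keepdims=True)
    return beta
```

Running `update_document` over every document and then `update_beta` once per pass is the loop described on the next slide; monitoring the ELBO after each pass (it should not decrease) is the standard convergence diagnostic.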
Overall Algorithm
1. Randomly initialize the variational parameters (they can't be uniform)
2. For each iteration:
   2.1 For each document, update γ and φ
   2.2 For the corpus, update β
   2.3 Compute L for diagnostics
3. Return the expectation of the variational parameters as the solution for the latent variables

Relationship with Gibbs Sampling
• Gibbs sampling: sample each variable from its conditional distribution given all other variables
• Variational inference: each factor is set to the exponentiated expected log of that conditional
• Variational inference is easier to parallelize; Gibbs is faster per step
• Gibbs is typically easier to implement

Implementation Tips
• Match the derivation exactly at first
• Randomize initialization, but specify the seed
• Use simple languages first, then match the implementation
• Try to match variable names with the paper
• Write unit tests for each atomic update
• Monitor the variational bound (with asserts)
• Write out the state (for checkpointing and debugging)
• Visualize the variational parameters
• Cache / memoize gamma / digamma functions (see the sketch below)
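On the last tip: digamma calls dominate the inner loop of the updates above, and in a scalar (non-vectorized) implementation the same arguments often recur. A minimal sketch, assuming SciPy and Python's functools.lru_cache; whether exact arguments actually repeat depends on your parameterization, so this is an illustration of the tip rather than a guaranteed speedup.

```python
from functools import lru_cache
from scipy.special import digamma

@lru_cache(maxsize=100_000)
def cached_digamma(x):
    """Scalar digamma with memoization: repeated arguments skip the SciPy call."""
    return float(digamma(x))

# Arguments such as a fixed hyperparameter plus small counts recur across documents.
values = [cached_digamma(0.1 + n) for n in (3, 7, 3, 3, 7)]
print(values, cached_digamma.cache_info())
```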
Next class
• Example on a toy LDA problem
• Current research in variational inference