Bayesian Nonparametrics and DPMM
Machine Learning, Lecture 17
Jordan Boyd-Graber, University of Colorado Boulder
Clustering as Probabilistic Inference
• GMM is a probabilistic model (unlike K-means)
• There are several latent variables:
  ◦ Means
  ◦ Assignments
  ◦ (Variances)
• Before, we were doing EM
• Today, new models and new methods

Nonparametric Clustering
• What if the number of clusters is not fixed?
• Nonparametric: the number of clusters can grow if the data need it
• Probabilistic distribution over the number of clusters

Dirichlet Process
• A distribution over distributions
• Parameterized by α (the concentration parameter) and G (the base distribution)
• You can then draw observations x ∼ DP(α, G)

Defining a DP
• Break off sticks
• Draw atoms
• Merge into a complete distribution (a code sketch of this construction follows below)
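The three steps above map directly onto code. Here is a minimal stick-breaking sketch in Python with NumPy; the function name stick_breaking, the truncation level, and the standard-normal base distribution are illustrative assumptions, not anything specified in the lecture:

```python
import numpy as np

def stick_breaking(alpha, base_draw, truncation=1000, rng=None):
    """Truncated stick-breaking construction of a draw from DP(alpha, G).

    alpha      -- concentration parameter
    base_draw  -- function drawing one atom from the base distribution G
    truncation -- number of sticks kept (the true DP has infinitely many)
    """
    rng = rng or np.random.default_rng()
    # Break off sticks: beta_k ~ Beta(1, alpha)
    betas = rng.beta(1.0, alpha, size=truncation)
    # Stick lengths: pi_k = beta_k * prod_{j<k} (1 - beta_j)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    weights = betas * remaining
    # Draw atoms from the base distribution G
    atoms = np.array([base_draw(rng) for _ in range(truncation)])
    # Merge: the pair (weights, atoms) is a discrete distribution
    return weights, atoms

# Example with base distribution G = N(0, 1), an assumption for this sketch
rng = np.random.default_rng(0)
weights, atoms = stick_breaking(2.0, lambda r: r.normal(), rng=rng)
x = rng.choice(atoms, p=weights / weights.sum())  # one observation
```

The truncation here is exactly the limitation noted on the next slide: a computer can only keep finitely many sticks, so the leftover mass is folded back in by renormalizing the weights.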
Properties of a DPMM
• Expected value is the same as the base distribution:
  \mathbb{E}_{\mathrm{DP}(\alpha, G)}[x] = \mathbb{E}_{G}[x]   (1)
• As α → ∞, DP(α, G) = G
• Number of components is unbounded
• Impossible to represent fully on a computer (truncation)
• You can nest DPs

Effect of scaling parameter α
[Figure only: draws from the DP under different values of α.]

DP as Mixture Model
[Figure only.]

The Chinese Restaurant as a Distribution
To generate an observation, you first sit down at a table. You sit down at a table with probability proportional to the number of people already sitting at it.
[Figure: three occupied tables seating 2, 3, and 2 customers; the next customer joins them with probabilities 2/7, 3/7, and 2/7, and then draws x ∼ µ1, x ∼ µ2, or x ∼ µ3.]
But this is just maximum likelihood. Why are we talking about Chinese restaurants?

Always can squeeze in one more table . . .
• The posterior of a DP is a CRP
• A new observation gets a new table / cluster with probability proportional to α
• But this must be balanced against the probability of the observation given a cluster
(see the CRP sketch below)
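To make the table dynamics concrete, here is a short CRP sketch in the same style as the snippet above; crp_seating and its arguments are my own illustrative names. Customer i joins an existing table k with probability n_k / (i + α) and opens a new table with probability α / (i + α), so there is always room for one more table:

```python
import numpy as np

def crp_seating(n_customers, alpha, rng=None):
    """Seat customers one at a time by the Chinese Restaurant Process."""
    rng = rng or np.random.default_rng()
    counts = []        # counts[k] = number of people at table k
    assignments = []
    for i in range(n_customers):
        # Existing table k with prob n_k / (i + alpha),
        # a brand-new table with prob alpha / (i + alpha)
        probs = np.array(counts + [alpha], dtype=float) / (i + alpha)
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):
            counts.append(0)   # squeeze in one more table
        counts[table] += 1
        assignments.append(table)
    return assignments, counts

assignments, counts = crp_seating(100, alpha=1.0, rng=np.random.default_rng(0))
print(len(counts), "tables used by 100 customers")
```

The expected number of occupied tables grows roughly logarithmically in the number of customers (scaled by α), which is how the model lets the number of clusters keep growing as data arrive.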
Gibbs Sampling
• We want to know the cluster assignment (table) of each observation
• Take a random guess initially
• This provides a mean for each cluster
• Let the number of clusters grow

Gibbs Sampling
• We want to know \vec{z}
• Compute p(z_i \mid z_1, \ldots, z_{i-1}, z_{i+1}, \ldots, z_m, x, \alpha, G)
• Update z_i by sampling from that distribution
• Keep going . . .

Notation:
p(z_i = k \mid \vec{z}_{-i}) \equiv p(z_i \mid z_1, \ldots, z_{i-1}, z_{i+1}, \ldots, z_m)   (2)

Gibbs Sampling for DPMM
p(z_i = k \mid \vec{z}_{-i}, \vec{x}, \{\theta_k\}, \alpha)   (3)
= p(z_i = k \mid \vec{z}_{-i}, x_i, \theta_k, \alpha)   (4)   [dropping irrelevant terms]
= p(z_i = k \mid \vec{z}_{-i}, \alpha) \, p(x_i \mid \theta_k, \vec{x})   (5)   [chain rule]
= \begin{cases} \frac{n_k}{n_\cdot + \alpha} \int_\theta p(x_i \mid \theta)\, p(\theta \mid G, \vec{x}) & \text{existing} \\ \frac{\alpha}{n_\cdot + \alpha} \int_\theta p(x_i \mid \theta)\, p(\theta \mid G) & \text{new} \end{cases}   (6)   [applying the CRP]
= \begin{cases} \frac{n_k}{n_\cdot + \alpha}\, \mathcal{N}\!\big(x_i;\, \tfrac{n\bar{x}}{n+1},\, 1\big) & \text{existing} \\ \frac{\alpha}{n_\cdot + \alpha}\, \mathcal{N}(x_i;\, 0,\, 1) & \text{new} \end{cases}   (7)
Scary integrals, assuming G is a normal distribution with mean zero and unit variance. (Derived in the optional reading.)

Algorithm for Gibbs Sampling
1. Random initial assignment to clusters
2. For iteration i:
   2.1 "Unassign" observation n
   2.2 Choose a new cluster for that observation
(a runnable sketch follows the toy example below)

Toy Example
[Figure sequence: Gibbs sweeps reassign one observation at a time; partway through, a new cluster is created; then the sweep repeats.]
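Putting Eqs. (3)-(7) and the algorithm together: under G = N(0, 1) and unit-variance components, the posterior over a cluster mean given its n_k members is N(n_k x̄ / (n_k + 1), 1/(n_k + 1)), so the predictive density of a new point is N(n_k x̄ / (n_k + 1), 1 + 1/(n_k + 1)); the slide writes the predictive variance as 1 for simplicity, while the sketch below keeps the exact value. This is a minimal collapsed-Gibbs sketch under those assumptions, and dpmm_gibbs plus all variable names are mine, not from the lecture:

```python
import numpy as np
from scipy.stats import norm

def dpmm_gibbs(x, alpha=1.0, n_iters=50, rng=None):
    """Collapsed Gibbs sampler for a 1-D DPMM, assuming unit-variance
    normal components and base distribution G = N(0, 1) (Eqs. (3)-(7))."""
    rng = rng or np.random.default_rng()
    x = np.asarray(x, dtype=float)
    z = rng.integers(0, 2, size=len(x))      # step 1: random initial assignment
    for _ in range(n_iters):
        for i in range(len(x)):
            z[i] = -1                        # step 2.1: "unassign" observation i
            labels = [k for k in np.unique(z) if k >= 0]
            logp = []
            for k in labels:                 # existing clusters, Eq. (7) top case
                members = x[z == k]
                n_k = len(members)
                mean = n_k * members.mean() / (n_k + 1)
                var = 1.0 + 1.0 / (n_k + 1)  # exact predictive variance
                logp.append(np.log(n_k) + norm.logpdf(x[i], mean, np.sqrt(var)))
            # New cluster, Eq. (7) bottom case: predictive N(0, 1 + 1)
            logp.append(np.log(alpha) + norm.logpdf(x[i], 0.0, np.sqrt(2.0)))
            logp = np.array(logp)
            probs = np.exp(logp - logp.max())
            probs /= probs.sum()
            choice = rng.choice(len(probs), p=probs)   # step 2.2: resample z_i
            z[i] = labels[choice] if choice < len(labels) else max(labels, default=-1) + 1
    return z

# Toy run: two well-separated groups, as in the toy example slides
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 1, 50), rng.normal(2, 1, 50)])
z = dpmm_gibbs(data, alpha=1.0, rng=rng)
print(len(np.unique(z)), "clusters found")
```

Because empty clusters simply disappear from np.unique(z) and a new label is minted whenever the α option is drawn, the number of clusters grows and shrinks as the sampler runs, just as in the toy example.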
Differences between EM and Gibbs
• Gibbs is often faster to implement
• Convergence is easier to diagnose with EM
• EM can be parallelized
• Gibbs is more widely applicable

In class and next week
• Walking through DPMM clustering
• Clustering discrete data with more than one cluster per observation