Latent Dirichlet Allocation (LDA) is a classical probabilistic graphical model. It is a hierarchical Bayes model with discrete latent variables defined on a sparse doc/word graph. This example shows how LDA can be implemented on DGL, where the corpus is represented as a bipartite multi-graph G. There is no back-propagation, because gradient descent is typically considered inefficient on the probability simplex. On the provided small-scale example on the 20 Newsgroups dataset, our DGL-LDA model runs 50% faster on GPU than the sklearn model without joblib parallelism. For larger graphs, thanks to subgraph sampling and a low-memory implementation, we may fit 100 million unique words with 256 topic dimensions on a large multi-GPU machine. (The runtime memory is often less than 2x the parameter storage.)
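As a concrete illustration, the bipartite doc/word multi-graph can be built from a bag-of-words matrix roughly as follows (a minimal sketch; the node/edge type names and the `dgl.heterograph` schema here are our assumptions, not necessarily what the example code uses):

```python
import dgl
import torch
import numpy as np
from scipy.sparse import coo_matrix

# Toy bag-of-words counts: 3 documents x 5 unique words.
counts = coo_matrix(np.array([
    [2, 0, 1, 0, 0],
    [0, 3, 0, 1, 0],
    [1, 0, 0, 0, 4],
]))

# One edge per word occurrence, so repeated (doc, word) pairs form a multi-graph.
src = np.repeat(counts.row, counts.data)   # doc ids
dst = np.repeat(counts.col, counts.data)   # word ids

G = dgl.heterograph(
    {('doc', 'topic', 'word'): (torch.tensor(src, dtype=torch.int64),
                                torch.tensor(dst, dtype=torch.int64))},
    num_nodes_dict={'doc': counts.shape[0], 'word': counts.shape[1]},
)
print(G)  # 3 doc nodes, 5 word nodes, 12 edges
```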
Let k be the topic index variable with one-hot encoded vector representation z. The rest of the variables are:
| | z_d~p(θ_d) | w_k~p(β_k) | z_dw~q(ϕ_dw) |
|---|---|---|---|
| Prior | Dir(α) | Dir(η) | (n/a) |
| Posterior | Dir(γ_d) | Dir(λ_k) | (n/a) |
We overload w with the bold symbol **w**, which represents the entire observed document-word multi-graph. The distinction is shown more clearly in the original paper.
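For reference, the generative process summarized by the table above is the standard LDA model, where n indexes word occurrences within document d:

$$
\theta_d \sim \mathrm{Dir}(\alpha), \qquad
\beta_k \sim \mathrm{Dir}(\eta), \qquad
z_{dn} \sim \mathrm{Multinomial}(\theta_d), \qquad
w_{dn} \mid z_{dn}{=}k \sim \mathrm{Multinomial}(\beta_k).
$$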
## Multinomial PCA
Multinomial PCA is a "latent allocation" model without the "Dirichlet". Its data likelihood sums over the latent topic-index variable k,

$$p(w \mid \theta, \beta) = \prod_{(d,w) \in G} \sum_k \theta_{dk}\, \beta_{kw},$$

where θ_d and β_k are shared within the same document and topic, respectively.
If we perform gradient descent, we may need additional steps to project the parameters onto the probability simplices: $\sum_k \theta_{dk} = 1$ and $\sum_w \beta_{kw} = 1$. Instead, a more efficient solution is to borrow ideas from the evidence lower-bound (ELBO) decomposition:

$$\log p(w \mid \theta, \beta) \ge \sum_{(d,w) \in G} \sum_k \phi_{dwk} \big( \log\theta_{dk} + \log\beta_{kw} - \log\phi_{dwk} \big),$$

with equality when $\phi_{dwk} \propto \theta_{dk}\beta_{kw}$.
The solutions for $\theta_{dk} \propto \sum_{w \in d} \phi_{dwk}$ and $\beta_{kw} \propto \sum_{d \ni w} \phi_{dwk}$ follow from the maximization of the cross-entropy loss. The solution for $\phi_{dwk} \propto \theta_{dk}\beta_{kw}$ follows from Kullback-Leibler divergence. After normalizing $\phi_{dw}$ to $\sum_k \phi_{dwk} = 1$, the difference $\log(\theta_{dk}\beta_{kw}) - \log\phi_{dwk}$ becomes constant in k; this constant, $\log\sum_k \theta_{dk}\beta_{kw}$, is the log-likelihood for the observed document-word pair.
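A dense NumPy sketch of these alternating multinomial-PCA updates on a toy bag-of-words matrix (illustrative only; the actual example uses the Bayesian updates and DGL message passing described below):

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_words, n_topics = 3, 5, 2
counts = rng.integers(1, 5, size=(n_docs, n_words)).astype(float)  # bag of words

theta = rng.dirichlet(np.ones(n_topics), size=n_docs)   # doc-topic, rows sum to 1
beta = rng.dirichlet(np.ones(n_words), size=n_topics)   # topic-word, rows sum to 1

for _ in range(50):
    # E-step-like update: phi[d, w, k] ∝ theta[d, k] * beta[k, w], normalized over k
    phi = theta[:, None, :] * beta.T[None, :, :]         # (d, w, k)
    phi /= phi.sum(axis=-1, keepdims=True)
    # M-step-like update: re-estimate theta and beta from count-weighted phi
    weighted = counts[:, :, None] * phi                  # (d, w, k)
    theta = weighted.sum(axis=1)
    theta /= theta.sum(axis=1, keepdims=True)
    beta = weighted.sum(axis=0).T                        # (k, w)
    beta /= beta.sum(axis=1, keepdims=True)

# Log-likelihood of the observed document-word pairs under the fitted model.
log_lik = (counts * np.log(np.einsum('dk,kw->dw', theta, beta))).sum()
```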
Note that after learning, the document vector θ_d accounts for the correlations among all words in d, and similarly the topic distribution vector β_k accounts for the correlations across all observed documents.
## Variational Bayes
A Bayesian model adds Dirichlet priors to θ_d and β_k, which leads to a similar ELBO if we assume the mean-field independence $q(z, \theta, \beta) = q(z)\,q(\theta)\,q(\beta)$, i.e.:

$$\log p(w \mid \alpha, \eta) \ge \mathbb{E}_q\big[\log p(w, z, \theta, \beta \mid \alpha, \eta)\big] - \mathbb{E}_q\big[\log q(z, \theta, \beta)\big].$$
### Solutions
The solutions to VB subsume the solutions to multinomial PCA when the observation count n goes to infinity. The solution for ϕ is $\phi_{dwk} \propto \exp\big(\mathbb{E}_q[\log\theta_{dk}] + \mathbb{E}_q[\log\beta_{kw}]\big)$, where the additional expectation can be expressed via digamma functions as the gradient of the Dirichlet log-partition function, e.g. $\mathbb{E}_q[\log\theta_{dk}] = \psi(\gamma_{dk}) - \psi(\sum_{k'}\gamma_{dk'})$. The solutions for $\gamma_{dk} = \alpha + \sum_{w \in d}\phi_{dwk}$ and $\lambda_{kw} = \eta + \sum_{d \ni w}\phi_{dwk}$ (with the sums running over the edges of the multi-graph) come from direct gradient calculation. After substituting the optimal solutions, we compute the marginal likelihood by adding the three terms, which are all connected to (the negative of) Kullback-Leibler divergence.
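A minimal dense PyTorch sketch of these update rules, assuming full-batch updates and ignoring the graph structure and mini-batching used by the actual example (all names here are ours):

```python
import torch

n_docs, n_words, n_topics = 3, 5, 2
alpha, eta = 0.1, 0.1
counts = torch.randint(1, 5, (n_docs, n_words)).float()   # bag-of-words counts

gamma = torch.rand(n_docs, n_topics) + 1.0      # Dirichlet posterior over theta_d
lam = torch.rand(n_topics, n_words) + 1.0       # Dirichlet posterior over beta_k

def expect_log(dirichlet_params):
    """E[log x] under Dir(params) = psi(params) - psi(sum(params))."""
    return torch.digamma(dirichlet_params) - torch.digamma(
        dirichlet_params.sum(-1, keepdim=True))

for _ in range(50):
    Elog_theta = expect_log(gamma)              # (d, k)
    Elog_beta = expect_log(lam)                 # (k, w)
    # phi[d, w, k] ∝ exp(E[log theta_dk] + E[log beta_kw]), normalized over k
    log_phi = Elog_theta[:, None, :] + Elog_beta.T[None, :, :]
    phi = torch.softmax(log_phi, dim=-1)
    weighted = counts[:, :, None] * phi         # (d, w, k)
    gamma = alpha + weighted.sum(1)             # E-step: (d, k)
    lam = eta + weighted.sum(0).T               # M-step: (k, w)
```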
The corpus is represented as a bipartite multi-graph G. We use DGL to propagate information through the edges and aggregate the distributions at the doc/word nodes. For scalability, the phi variables are transient and updated during message passing. The gamma/lambda variables are updated after the nodes receive all edge messages. Following the conventions in [1], the gamma update is called the E-step and the lambda update is called the M-step. The lambda variable is further recorded by the trainer. A separate function produces the perplexity, which is based on the ELBO objective divided by the total number of word/doc occurrences.
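A rough sketch of how one phi/gamma/lambda round could be expressed with DGL message passing, continuing from the toy graph G built in the earlier sketch (`apply_edges`, `update_all`, and `dgl.function` are real DGL APIs, but the message functions and names below are our simplified illustration, not the example's exact code):

```python
import dgl
import dgl.function as fn
import torch

n_topics, alpha, eta = 2, 0.1, 0.1
# G is the bipartite doc/word multi-graph from the construction sketch above.
# Node features hold the current digamma expectations (random here for brevity).
G.nodes['doc'].data['Elog_theta'] = torch.randn(G.num_nodes('doc'), n_topics)
G.nodes['word'].data['Elog_beta'] = torch.randn(G.num_nodes('word'), n_topics)

def edge_phi(edges):
    # phi_dwk ∝ exp(E[log theta_dk] + E[log beta_kw]), normalized over topics k
    log_phi = edges.src['Elog_theta'] + edges.dst['Elog_beta']
    return {'phi': torch.softmax(log_phi, dim=-1)}

# Transient phi lives only on the edges.
G.apply_edges(edge_phi, etype='topic')

# Aggregate phi at word nodes: sufficient statistics for the lambda (M-step) update.
G.update_all(fn.copy_e('phi', 'm'), fn.sum('m', 'phi_sum'), etype='topic')
lam = eta + G.nodes['word'].data['phi_sum'].T      # (n_topics, n_words)

# Aggregate the same phi at doc nodes via the reverse graph: the gamma (E-step) update.
rg = dgl.reverse(G, copy_edata=True)
rg.update_all(fn.copy_e('phi', 'm'), fn.sum('m', 'phi_sum'), etype='topic')
gamma = alpha + rg.nodes['doc'].data['phi_sum']    # (n_docs, n_topics)
```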
```
%run example_20newsgroups.py
```
- Approximately matches scikit-learn training perplexity after 10 rounds of training.
- Exactly matches scikit-learn training perplexity if `word_z` is set to `lda.components_.T` (a sketch of the sklearn baseline is given after this list).
- There is a difference in how we compute testing perplexity. We weigh the beta contributions by the training word counts, whereas sklearn weighs them by test word counts.
- The DGL-LDA model runs 50% faster on GPU devices compared with sklearn without joblib parallel.
- Set `0<rho<1` for online learning with `partial_fit`.
- Set `mult["doc"]=100` or `mult["word"]=100` or some large value to disable the corresponding Bayesian priors.
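For reference, a minimal sketch of the scikit-learn baseline that the perplexity comparison above refers to (standard sklearn API; the vectorizer settings and hyperparameters here are illustrative guesses, not necessarily what `example_20newsgroups.py` uses):

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
X = CountVectorizer(max_features=5000, stop_words='english').fit_transform(train.data)

# n_jobs=1 corresponds to "without joblib parallel" in the comparison above.
lda = LatentDirichletAllocation(n_components=20, max_iter=10, n_jobs=1, random_state=0)
lda.fit(X)

print(lda.perplexity(X))        # training perplexity
print(lda.components_.T.shape)  # (n_words, n_topics); cf. the word_z note above
```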
- [1] Matthew Hoffman, Francis Bach, David Blei. Online Learning for Latent Dirichlet Allocation. Advances in Neural Information Processing Systems 23 (NIPS 2010).
- [2] Yingjie Miao. Reactive LDA Library (blog post), describing a similar Gibbs-sampling model.