[Feature Request] Adding Optimization as Inference and extending Optimization to Compositional Inference #1361
Comments
👍 I think this would be 🔥 🔥 🔥 as well. There are some experimental APIs for VI which are much less mature than the references you've provided.
This is an interesting idea, and I think there's some room to nail down how stepping a Markov chain vs. stepping an optimizer for variational parameters should interact. One way off the top of my head could be for something like:

```python
CompositionalInference({
    model.foo: bm.SingleSiteAncestralMetropolisHastings(),
    model.bar: bm.SVI(),
})
```

to mean that each "iteration" consists of one of two possibilities: a Metropolis-Hastings update for `model.foo`, or a gradient step on the variational parameters of `model.bar` (a rough toy of what that interleaving could look like is sketched below).

Is this close to what you had in mind?
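To make that concrete, here is a small, self-contained toy in plain PyTorch (not Bean Machine API): the model, the random-walk proposal, and every name in it are invented purely to illustrate alternating an MH step for one site with an SVI gradient step for the other.

```python
# Toy illustration only (plain PyTorch, no Bean Machine): alternate one
# Metropolis-Hastings step for site "a" with one SVI gradient step for the
# variational approximation of site "b" in x ~ Normal(a + b, 1).
import torch
import torch.distributions as dist
import torch.nn.functional as F

torch.manual_seed(0)
x = dist.Normal(1.5, 1.0).sample((200,))  # toy observations

def log_joint(a, b):
    return (
        dist.Normal(0.0, 1.0).log_prob(a)            # prior on a
        + dist.Normal(0.0, 1.0).log_prob(b)          # prior on b
        + dist.Normal(a + b, 1.0).log_prob(x).sum()  # likelihood
    )

a = torch.tensor(0.0)                            # current MH state for "a"
q_loc = torch.tensor(0.0, requires_grad=True)    # variational params of q(b)
q_raw_scale = torch.tensor(0.0, requires_grad=True)
opt = torch.optim.Adam([q_loc, q_raw_scale], lr=0.02)

for it in range(4000):
    if it % 2 == 0:
        # possibility 1: a single random-walk MH update for "a"
        with torch.no_grad():
            b_sample = dist.Normal(q_loc, F.softplus(q_raw_scale)).sample()
            proposal = a + 0.3 * torch.randn(())
            log_accept = log_joint(proposal, b_sample) - log_joint(a, b_sample)
            if torch.rand(()).log() < log_accept:
                a = proposal
    else:
        # possibility 2: a single gradient step on the variational parameters of "b"
        opt.zero_grad()
        q_b = dist.Normal(q_loc, F.softplus(q_raw_scale))
        b = q_b.rsample()
        loss = -(log_joint(a, b) - q_b.log_prob(b))  # negative single-sample ELBO
        loss.backward()
        opt.step()
```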
Yes, I think this is the most straightforward way to accomplish this, and best for a first iteration. Additionally, maybe we could directly pass in a guide distribution:

```python
CompositionalInference({
    beta_1: bm.SingleSiteAncestralMetropolisHastings(),
    beta_0: bm.SVI(guide=dist.Delta()),  # treat this as a MAP estimate
})
```

This would be useful in the sense that I may not need uncertainty for `beta_0`.

Perhaps a more interesting case would be from the HMM tutorial. In this case, rather than assume the number of hidden states `K` is known up front:

```python
class HiddenMarkovModel:
    def __init__(
        self,
        N: int,
        # K: int,
        concentration: float,
        mu_loc: float,
        mu_scale: float,
        sigma_shape: float,
        sigma_rate: float,
    ) -> None:
        ...

    @bm.random_variable
    def lam(self):  # rate parameter ("lambda" is a reserved word in Python)
        return dist.Exponential(1.)

    @bm.functional
    def K(self):
        # Not sure if this would actually work, but some dist controlling K
        return dist.Poisson(self.lam()) + 1.

    @bm.random_variable
    def Theta(self, k):
        return dist.Dirichlet(torch.ones(self.K()) * self.concentration / self.K())

    ...
```

In which case I would love to be able to call:

```python
compositional = bm.CompositionalInference({
    (model.K): bm.SVI(guide=None),  # just maximize the likelihood of K, Type II MLE / Empirical Bayes
    (model.X): bm.SVI(guide=dist.Delta()),  # MAP estimate
    (model.Sigma, model.Mu): ...,
})
```

And as you suggest, this would invoke a coordinate-wise sampling/optimization mixture that would first optimize over my number of components `K`, and then run MCMC updates for the descendants of `K`. This would even be a better solution to this TFP example! And maybe one additional API design thought: you can change the ratio of optimization to sampling steps just by increasing the learning rate, but maybe we could conversely say, take 10 MCMC samples for every one VI step. I have seen this here, although I believe that was for using MCMC to estimate the ELBO itself.
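One possible shape for that knob, purely hypothetical: `num_sampling_steps_per_opt_step` is an invented keyword and not part of any existing `CompositionalInference` signature, and the `(model.Sigma, model.Mu)` assignment is just filled in for the sake of the example.

```python
# Invented API sketch only: express a fixed ratio of MCMC transitions per
# optimizer step directly in the CompositionalInference constructor.
compositional = bm.CompositionalInference(
    {
        (model.K): bm.SVI(guide=None),
        (model.X): bm.SVI(guide=dist.Delta()),
        (model.Sigma, model.Mu): bm.SingleSiteAncestralMetropolisHastings(),
    },
    num_sampling_steps_per_opt_step=10,  # 10 MCMC samples for every one VI step
)
```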
I took some time to play with the ideas around MLE/MAP as a special case of VI and put together a prototype.
The prototype doesn't include anything about MixedHMC yet, but I think @zaxtax and I will be discussing this in the next week or two.
I think your tutorial solidly covers the first two use cases in my original post 🥳. Since BM natively covers the fully Bayesian side, some thoughts on what a more frequentist-style model could look like:
```python
@bm.model  # maybe defaults a __call__() or predict() method that simply returns the "guts" of self.y(X)
class LinearRegressionFreq:
    @bm.random_variable
    def sigma_beta_1(self):
        return Flat()

    @bm.random_variable
    def beta_1(self):
        return Flat()

    @bm.random_variable
    def beta_0(self):
        return Flat()

    @bm.random_variable
    def epsilon(self):
        return Flat()

    @bm.random_variable
    def y(self, X):
        # This behavior is automatically added to the "predict" method but
        # returns torch.distributions instead of RVIdentifier
        return dist.Normal(
            self.beta_1() * X + self.beta_0(),
            torch.nn.functional.softplus(self.epsilon()),
        )
```
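Purely illustrative usage of the proposed `predict()` convenience: neither the decorator behavior nor `X_new` exists today, this just shows the intent.

```python
# Hypothetical: if @bm.model auto-generated predict(), returning a
# torch.distributions object built from fitted point estimates.
model = LinearRegressionFreq()
pred_dist = model.predict(X_new)  # e.g. Normal(beta_1_hat * X_new + beta_0_hat, sigma_hat)
y_hat = pred_dist.mean            # point predictions for new inputs
```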
Issue Description
TL;DR
Support for MLE, MAP, and variational inference!
Context
In situations where scalability and speed need to be balanced with posterior sample quality, various optimization routines can provide useful approximations to fully Bayesian inference. A good summary of what I am referring to can be found on page 141 of this textbook.
In the simplest form, simply applying a `torch.optim` optimizer to the computational graph of a Bean Machine model should in theory reproduce MLE estimation for that model. Also, MAP estimation can be reproduced by using a Delta distribution as the variational distribution in SVI, or by again directly optimizing the `log_prob` computational graph when priors are included. These types of relaxations could be good if you were interested in "productionizing" a static analysis (maybe initially done with `HMC`) into a service where latency requirements rule out these more expensive Bayesian inference methods.
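A minimal, self-contained sketch of that idea in plain PyTorch; the toy Normal model, parameter names, and optimizer settings here are illustrative, not part of any Bean Machine API.

```python
# Sketch: maximum likelihood by running torch.optim directly on a log_prob.
import torch
import torch.distributions as dist

torch.manual_seed(0)
data = 2.0 * torch.randn(100) + 3.0              # toy observations

loc = torch.tensor(0.0, requires_grad=True)      # parameters to optimize
log_scale = torch.tensor(0.0, requires_grad=True)

opt = torch.optim.Adam([loc, log_scale], lr=0.05)
for _ in range(500):
    opt.zero_grad()
    nll = -dist.Normal(loc, log_scale.exp()).log_prob(data).sum()  # negative log likelihood
    nll.backward()
    opt.step()

# loc and log_scale.exp() now approximate the MLE of the Normal's mean and
# standard deviation; adding the priors' log densities to the objective
# would instead give a MAP estimate.
```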
Steps to Reproduce
N/A
Expected Behavior
Users have the ability to take one functional declaration of a model, i.e.:
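(The particular model below is only an illustration; the issue does not prescribe one, but any Bean Machine model declared with `@bm.random_variable` would do.)

```python
# Illustrative stand-in for "one functional declaration of a model".
import beanmachine.ppl as bm
import torch.distributions as dist

class ToyModel:
    @bm.random_variable
    def mu(self):
        return dist.Normal(0.0, 1.0)

    @bm.random_variable
    def x(self, i):
        return dist.Normal(self.mu(), 1.0)
```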
And apply inference methods such as `SingleMaximumLikelihood()`, `MaximumAPosteriori()`, `MaximumLikelihoodII()`, etc. And these added inference methods can be used with `CompositionalInference` in the usual way. I believe that since this would be moving from a "samples as a model" to a "parameters as a model" paradigm, there would need to be some exposed properties returning the optimized parameters from these routines. So we could access something like:
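A purely illustrative sketch: `MaximumAPosteriori`, `num_steps`, and `optimized_params` are the proposed (not-yet-existing) API surface, and `ToyModel` is the stand-in model sketched above.

```python
# Hypothetical usage of the proposed optimization-as-inference API; none of
# these classes or properties exist yet, and ToyModel is the illustrative
# model from the sketch above.
import torch

model = ToyModel()
x_obs = torch.randn(10) + 2.0

result = bm.MaximumAPosteriori().infer(
    queries=[model.mu()],
    observations={model.x(i): x_obs[i] for i in range(10)},
    num_steps=1000,
)
mu_hat = result.optimized_params[model.mu()]  # a point estimate rather than posterior samples
```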
System Info
N/A
Additional Context
It seems that comparable packages and this one offer these types of "relaxations" from a fully Bayesian inference approach. Offering this functionality, and also allowing users to mix and match these methods with the existing `CompositionalInference` framework, would be 🔥🔥🔥