SBGLM: Sparse Bayes Generalized Linear Models

Hey! I’ve just released a package on GitHub called SBGLM, which stands for Sparse Bayes Generalized Linear Models. The main purpose of the package is to collect code that has been rotting on my laptop but that I still think is relevant and useful. The models follow the Bayesian tradition of the “spike-and-slab” prior for sparsity (Mitchell and Beauchamp 1988), so do not expect to see the so-called “Bayesian Lasso” here, because it doesn’t work. The idea is that, as time permits, I will keep adding models to the package.

The only model supported in this first version is the one I analyzed in the third report of my PhD qualifying course (see my blog post on this): the Non-parametric Sparse Factor Analysis (NSFA) model by Knowles and Ghahramani (2011). You can download the report here. At the risk of appearing arrogant, I would argue that my report does a better job of describing the essential aspects of the NSFA model, while also being more thorough in the derivation of the Gibbs updates.

In brief, NSFA is a non-parametric Bayesian model for performing Factor Analysis (FA): $$ Y = GX + E $$

where $Y$ is the original $D \times N$ data matrix, with each of its $D$ rows being an observation of $N$ variables. Traditionally, Factor Analysis aims to express $Y$ as a linear combination of latent factors. Here, $X$ is a $K \times N$ matrix whose rows are the latent factors, and $G$ is the $D \times K$ weight or loading matrix. The matrix $E$ captures the reconstruction error. The innovation in NSFA is to allow an unbounded number of latent factors to be discovered, by using the Indian Buffet Process (IBP) in the prior for $G$. More specifically, the probabilistic definition of the model is the following: $$ \begin{aligned} \alpha &\sim \text{Gamma}(e, f) \\ Z|\alpha &\sim \text{IBP}(\alpha) \\ d &\sim \text{Gamma}(c_0, d_0) \\ \lambda_{k}|d &\overset{iid}{\sim} \text{Gamma}(c,d) \\ g_{dk}|z_{dk}, \lambda_k &\overset{ind}{\sim} z_{dk} \mathcal{N}(g_{dk}; 0, \lambda_k^{-1}) + (1-z_{dk})\delta_0(g_{dk}) \\ x_{kn} &\overset{iid}{\sim} \mathcal{N}(0, 1) \\ b &\sim \text{Gamma}(a_0, b_0) \\ \psi_d|b &\overset{iid}{\sim} \text{Inv-Gamma}(a,b) \\ y_n|G, x_n, \Psi &\overset{ind}{\sim} \mathcal{N}(Gx_n, \Psi) \end{aligned} $$
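To make the generative process concrete, here is a minimal sketch in base R that simulates data from the model. It uses a finite beta-Bernoulli truncation of the IBP (a fixed number of candidate features $K$) and arbitrary hyperparameter values; it is only an illustration of the equations above, not code from the package.

```r
set.seed(1)
D <- 20; N <- 100; K <- 5          # rows of Y, columns of Y, truncated number of features
alpha <- rgamma(1, shape = 2, rate = 1)

# Finite beta-Bernoulli approximation to the IBP: column k of Z is Bernoulli(pi_k)
pi_k <- rbeta(K, alpha / K, 1)
Z <- matrix(rbinom(D * K, 1, rep(pi_k, each = D)), D, K)

# Spike-and-slab loadings: exactly zero where z_dk = 0, Gaussian slab otherwise
lambda <- rgamma(K, shape = 2, rate = 1)                            # per-feature precisions
G <- Z * matrix(rnorm(D * K, 0, rep(1 / sqrt(lambda), each = D)), D, K)

X   <- matrix(rnorm(K * N), K, N)                                   # latent factors
psi <- 1 / rgamma(D, shape = 2, rate = 1)                           # per-row noise variances
E   <- matrix(rnorm(D * N, 0, sqrt(psi)), D, N)                     # psi_d recycles down each column
Y   <- G %*% X + E
```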

By letting the model infer an appropriate $K$, the user needn’t worry about tuning this otherwise key parameter. Moreover, thanks to the hierarchical priors, the default hyperparameter values tend to work well in a variety of situations.

In spite of the above advantage, in experiments with the MNIST dataset (LeCun, Cortes, and Burges 1999) I found that the model tends to overfit the training data. Although the reconstruction of the training set becomes almost perfect (the obligatory pretty pictures can be found in the report), the test-set log-likelihood only decreases as the iterations go by. I offer some solutions in the report, but they are not pretty and basically require starting from scratch, including throwing away the IBP prior.
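For reference, one natural way to evaluate a test-set log-likelihood under this model is to score held-out columns of $Y$ for a single posterior draw of $(G, \Psi)$, with the standard-normal factors integrated out, so that marginally $y_n \sim \mathcal{N}(0, GG^\top + \Psi)$. The sketch below assumes that choice; the function name is illustrative and not part of the package.

```r
# Marginal Gaussian log-likelihood of held-out columns of Y under one
# posterior draw (G, psi), with the factors x_n integrated out.
test_loglik <- function(Y_test, G, psi) {
  D      <- nrow(Y_test)
  Sigma  <- G %*% t(G) + diag(psi, nrow = D)   # marginal covariance G G' + Psi
  R      <- chol(Sigma)                        # Sigma = R'R, R upper triangular
  logdet <- 2 * sum(log(diag(R)))
  quad   <- colSums(backsolve(R, Y_test, transpose = TRUE)^2)  # y' Sigma^{-1} y per column
  sum(-0.5 * (D * log(2 * pi) + logdet + quad))
}
```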

Nevertheless, the implementation in SBGLM works pretty well. It is written entirely in base R, so as to minimize dependency issues in the future. However, the way it handles the variability of $K$ is very naive: instead of pre-allocating sufficiently large matrices and vectors, it dynamically grows and shrinks them at each iteration of the sampler. This is of course very inefficient, but it was cleaner to write and works well enough for $Y$ sizes on the order of $400 \times 784$, which is the matrix I subsampled from the full MNIST training set.
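To give an idea of what I mean by growing and shrinking, the sketch below shows the kind of base-R bookkeeping involved: new columns of $G$ and rows of $X$ are appended when features are born, and dropped when a feature has no active loadings left. The function names and the placeholder draws are illustrative, not the package’s actual code.

```r
# Grow: append k_new features born at row d (placeholder draws, not the real update)
grow_features <- function(G, X, d, k_new, lambda_new = 1) {
  if (k_new == 0) return(list(G = G, X = X))
  N <- ncol(X)
  G_new <- matrix(0, nrow(G), k_new)
  G_new[d, ] <- rnorm(k_new, 0, 1 / sqrt(lambda_new))    # loadings only in row d
  X_new <- matrix(rnorm(k_new * N), k_new, N)            # new latent factors
  list(G = cbind(G, G_new), X = rbind(X, X_new))
}

# Shrink: drop features whose loading column is entirely zero
shrink_features <- function(G, X) {
  keep <- colSums(G != 0) > 0
  list(G = G[, keep, drop = FALSE], X = X[keep, , drop = FALSE])
}
```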

In the future I aim to fix the above issue and to add more models to SBGLM. In particular, I have a sparse linear regression model that is almost ready for inclusion but needs a little more polishing. After that, I would like to implement sparse logistic regression.

So, if you’re interested, give the package a try and please raise any issues you might encounter.

Happy holidays!

References

Knowles, David, and Zoubin Ghahramani. 2011. “Nonparametric Bayesian Sparse Factor Models with Application to Gene Expression Modeling.” The Annals of Applied Statistics 5 (2B): 1534–52.

LeCun, Yann, Corinna Cortes, and Christopher J. C. Burges. 1999. “The MNIST Dataset of Handwritten Digits (Images).” http://yann.lecun.com/exdb/mnist.

Mitchell, Toby J., and John J. Beauchamp. 1988. “Bayesian Variable Selection in Linear Regression.” Journal of the American Statistical Association 83 (404): 1023–32.
