PhD Qualifying Course Reports: Term 1

It’s crazy how fast this first term went by! But I loved it. The Department of Statistics at UBC is insanely awesome. The people are the best, from students to faculty and staff. The academic environment is very stimulating, especially in my area of interest, with lots of active reading groups and frequent seminars related to Bayesian Statistics, Probabilistic Models and Statistical Learning. We even had the privilege of attending a talk by Michael Betancourt, the expert in Hamiltonian Monte Carlo. We also have lots of interaction with people from Frank Wood’s group in the CS department. And finally, the UBC Vancouver campus is so beautiful! Definitely worth a visit if you happen to be in the city (which also deserves a visit in its own right!).

In this post I wanted to give an overview of the work I did for the Qualifying Course (STAT548) in this first term. The idea of this course is that the student works on 5 papers with 5 different faculty members, between terms 1 and 2. Every professor posts a list of the papers he or she is interested in, and the students choose among them or propose one of their own.

For each paper, the student has to write a report that gives, first, a thorough description of the method, including complete proofs of results whose derivations are usually only partially shown in the original manuscripts. Secondly, one has to implement the method in R or another programming language. Thirdly, the student must evaluate the method on both simulated and real datasets. Finally, one gives a critique of the method based on the findings of the previous sections, suggesting avenues for improvement.

Paper 1: Black box variational inference

Download report 1.

In my first report, I worked with Prof. Trevor Campbell, who had just arrived at UBC after finishing his post-doc with Prof. Tamara Broderick at MIT. Interesting fact: before arriving at UBC, I didn’t know that Trevor had joined the department, but I knew about his research because Prof. Broderick gave a talk in Chile at the Statistics Department at PUC in 2017 (in Spanish). She gave an excellent introduction to the topic of nonparametric Bayes, focusing on the case of the Dirichlet process for mixture models. It was an amazing talk overall. After that, I googled her and found out about Trevor’s research, which can be grossly oversimplified as focusing on scalable Bayesian inference with theoretical guarantees. I found this very interesting, so it was exciting to find out he had moved to UBC during the summer. In fact, I joined his reading group right away and had the honor of giving the first talk, which was also about this work.

Back to the paper. After a very motivating talk with Trevor, I decided to work on the paper “Black-box Variational Inference” (BBVI) (Ranganath, Gerrish, and Blei 2014). I chose it because I had used ADVI in Stan before, which I found to be very useful, and BBVI was an earlier, slightly different attempt by David Blei’s group at tackling the problem of automatic variational inference (VI). The rest of the papers also sounded very interesting, but more mathematically intensive. Given that neither my undergrad nor my master’s degree was in math, I have some gaps in Analysis and Measure Theory that I will be trying to fill during my stay here at UBC (for example, this term I attended the lectures of MATH 320 Real Variables I given by Joshua Zahl, which were amazing, and I plan to do the same with MATH 321).

I won’t go into the details of the report. The main findings can be summarized as follows:

  1. It is very difficult to actually get BBVI to yield meaningful results, even after using all the improvements proposed by the authors, like Rao-Blackwellization and Control Variates for reducing the variance of the estimator of the gradient of the ELBO (I actually discovered these variance-reducing techniques because of this paper), and AdaGrad for adaptively setting the learning rate.
  2. One practical reason for this is that the algorithm doesn’t deal well with bounded variational parameters, so you have to manually define hard thresholds for the updates. This leads to a trade-off: on the one hand, the gradients explode at the boundary of the parameter space, so you want to stay away from it; on the other hand, the optimal value may lie arbitrarily close to that boundary, so you don’t want to impose a threshold that is too far away. Striking this balance for all models and datasets is not trivial!
  3. Even for unbounded parameters, the gradients become very large in magnitude when the current parameter is too far from the optimum. It is therefore useful to set reasonable bounds for all parameters, which runs into the same trade-off described above.
  4. Even though it doesn’t fully solve these problems, AdaGrad is extremely useful, as it dramatically reduces the step size when gradients explode. However, this means one also needs to tune its parameter, which can greatly affect the speed at which BBVI converges (see the toy sketch after this list).
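
To make the first and last points concrete, here is a tiny, made-up example (mine, not the authors’ code) of a BBVI-style run: a Gaussian variational family fit to a standard normal target with the score-function gradient estimator, AdaGrad, and a hard threshold on the scale parameter. The function name `bbvi_toy` and all settings are arbitrary choices for illustration.

```r
set.seed(1)
log_p <- function(z) dnorm(z, 0, 1, log = TRUE)            # toy target: standard normal
log_q <- function(z, m, s) dnorm(z, m, s, log = TRUE)      # Gaussian variational family

score <- function(z, m, s) {                               # gradient of log q w.r.t. (m, s)
  c(dm = (z - m) / s^2, ds = ((z - m)^2 - s^2) / s^3)
}

bbvi_toy <- function(iters = 2000, S = 50, eta = 0.5, s_floor = 1e-3) {
  m <- 3; s <- 5                        # deliberately poor initialization
  G <- c(dm = 0, ds = 0)                # AdaGrad accumulators
  for (t in seq_len(iters)) {
    z <- rnorm(S, m, s)                 # samples from q
    w <- log_p(z) - log_q(z, m, s)      # noisy instantaneous ELBO terms
    g <- rowMeans(sapply(z, score, m = m, s = s) * rep(w, each = 2))
    G <- G + g^2
    step <- eta * g / sqrt(G + 1e-8)    # AdaGrad: exploding gradients get damped
    m <- m + step[["dm"]]
    s <- max(s + step[["ds"]], s_floor) # hard threshold keeps the scale in its domain
  }
  c(m = m, s = s)
}
bbvi_toy()   # moves towards m = 0, s = 1 for this easy target; real models need far more tuning
```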

I suggested some trivial improvements, which I later found out were already implemented in ADVI. However, it is important to note a crucial distinction between ADVI and BBVI: the former uses the so-called “reparameterization gradient trick”, which I won’t describe here but which implies that the posterior density needs to be differentiable with respect to the latent variables. Thus, no discrete latent variables allowed! In contrast, BBVI uses the “REINFORCE” trick, which only requires the ELBO to be differentiable with respect to the variational parameters, so discrete latent variables can be used. Hence, it would still be meaningful to ask for an implementation of BBVI in a probabilistic programming language like Stan, because it can accommodate a broader class of models than ADVI.
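
Here is an equally small, hypothetical illustration of why the score-function (“REINFORCE”) estimator is happy with discrete latent variables: a Bernoulli latent $z$ with a Bernoulli variational approximation, where a reparameterization gradient simply does not exist. The model, the data point and the step sizes are all made up.

```r
set.seed(2)
x <- 1.5                                           # one observed data point
# toy model: z ~ Bern(0.3), x | z ~ N(2z, 1); variational family q(z) = Bern(phi)
log_joint <- function(z) dbinom(z, 1, 0.3, log = TRUE) + dnorm(x, 2 * z, 1, log = TRUE)

theta <- 0; G <- 0; eta <- 1                       # logit of phi, AdaGrad accumulator, step size
for (t in 1:3000) {
  phi <- plogis(theta)
  z   <- rbinom(100, 1, phi)                       # S = 100 draws from q
  w   <- log_joint(z) - dbinom(z, 1, phi, log = TRUE)
  g   <- mean((z - phi) * w)                       # d log q / d theta for Bern(phi) is z - phi
  G   <- G + g^2
  theta <- theta + eta * g / sqrt(G + 1e-8)
}
plogis(theta)   # compare with the exact posterior P(z = 1 | x), about 0.54 here
```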

Now that I think about it, it is possible that using natural gradients would yield an improvement. I would like to check whether they could be implemented in a “black box” way, such that the adjustment depends only on the variational family used.

Paper 2: The elastic net

Download report 2.

For my second paper I had to switch to the dark (i.e. frequentist) side of stats. That said, I have to say I was very lucky to be able to work with Prof. Gabriela Cohen. She holds one of the Canada Research Chair (CRC) Tier 2 positions that our department is proud to have, so her CV was already impressive. Like me, she also comes from the south, though she’s from our neighboring country, Argentina. So one day, probably during the first week of class, I showed up at her office; she invited me in and we talked for hours about what it was like for her to come here to work and live with her family. I thought it was really nice of her to lend me her time for a conversation like this, so I immediately thought about working with her for the second paper.

So I emailed her right after finishing my first paper and she was happy to advise me. We agreed that I was gonna work on the classic Elastic Net paper by Zou and Hastie (2005). I was thrilled, because I had used this model and the Lasso countless times while working at SBIF back in Chile, but never had the time to actually go over the theory.

Working with Gabriela was very different from working with Trevor. She gave me a very specific list of questions I had to answer, as opposed to the more open-ended way in which the first report went. In any case, having a list of questions reduces anxiety about what is expected of one’s work, so I was fine with it.

In my opinion, the most fun part was proving that, in the $p>n$ situation (i.e., more covariates than observations), the Lasso can select at most $n$ covariates, while showing that the Elastic Net does not suffer from this issue. It was interesting because it forced me to learn about KKT conditions for subgradients (I had never taken a course on convex optimization), while at the same time refreshing my linear algebra. I also remember being greatly surprised when I found out that the Elastic Net has an analytic solution under an orthonormal design matrix. So, even though this was not part of the questions, I gave a proof of the solution, which is in fact univariate soft-thresholding.
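
The $p>n$ selection limit is also easy to see numerically. Here is a quick check, assuming the glmnet package and a completely arbitrary simulated design:

```r
## Lasso vs Elastic Net when p > n: the Lasso path keeps at most n covariates,
## while the Elastic Net is not limited by n.
library(glmnet)
set.seed(3)
n <- 30; p <- 100
X <- matrix(rnorm(n * p), n, p)
beta <- c(rep(2, 40), rep(0, p - 40))        # 40 truly active covariates, more than n
y <- drop(X %*% beta + rnorm(n))

lasso <- glmnet(X, y, alpha = 1)             # pure L1 penalty
enet  <- glmnet(X, y, alpha = 0.5)           # mixed L1 + L2 penalty

max(lasso$df)   # number of nonzero coefficients along the path: stays at or below n = 30
max(enet$df)    # not limited by n, so it can recover more of the 40 active covariates
```

glmnet reports the number of nonzero coefficients at each value of the regularization parameter in the `df` component, so comparing the two maxima is enough to see the contrast.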

On the other hand, I’d say that the most difficult part of this report was finding a way to give a meaningful critique of a 13-year-old, seminal work. Nevertheless, as a Bayesian, the most troubling aspect of this paper for me is that it does not give any advice on how to obtain uncertainty measures for the estimated parameters, so the user is forced to resort to a bootstrap strategy. This can be prohibitive for large datasets, since for each bootstrap run one has to perform cross-validation to find the optimal tuning parameters. In contrast, the “spike-and-slab” prior, the gold standard of Bayesian variable selection, which dates back to Mitchell and Beauchamp (1988), can yield exact post-selection inference effortlessly.

In fact, after handing in the report, I went and implemented a spike-and-slab model for a linear regression. I used the more modern implementation, which uses an auxiliary binary vector $z$ with entries in $\{0, 1\}$, so that the model becomes: \[ \begin{aligned} \sigma_b^2 &\sim \text{Inv-Gamma}(e, f) \\ \sigma_e^2 &\sim \text{Inv-Gamma}(g, h) \\ \beta|\sigma_b^2 &\sim \mathcal{N}(0_p, \sigma_b^2 I_p) \\ \pi_k &\overset{iid}{\sim} \text{Beta}(a,b) \\ z_k|\pi_k &\overset{ind}{\sim} \text{Bernoulli}(\pi_k) \\ y_i|z, \beta, \sigma_e &\overset{ind}{\sim} \mathcal{N}(x_i^T(\beta \odot z), \sigma_e^2) \end{aligned} \]

where $\odot$ is the Hadamard or element-wise product. Because of the conjugacies present, a Gibbs sampler is straightforward to derive (a sketch is included below). In this setup, $\tilde{\beta} \triangleq \beta \odot z$ becomes the object of post-selection inference. In terms of performance and number of variables selected, I obtained results very similar to the ones the Elastic Net produces, which was satisfactory. I may upload this implementation to a Github repository with an accompanying blog post when I have time, since I couldn’t find an R implementation of it. However, in the next paper you can see how the Beta-Bernoulli process, the infinite limit of the finite Beta-Bernoulli construction used above, can be used to construct sparse binary matrices with an unbounded number of columns.
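
Before moving on, here is a minimal sketch of the kind of Gibbs sampler these conjugacies allow. It is not the exact code behind the report: the function name `spike_slab_gibbs`, the default hyperparameters and the random-scan update of $z$ are my own illustrative choices.

```r
spike_slab_gibbs <- function(y, X, iters = 2000,
                             a = 1, b = 1, e = 2, f = 1, g = 2, h = 1) {
  n <- nrow(X); p <- ncol(X)
  beta <- rep(0, p); z <- rep(1L, p); pi_k <- rep(0.5, p)
  sig2_b <- 1; sig2_e <- 1
  draws <- matrix(NA_real_, iters, p)              # stores beta * z, the post-selection target

  for (t in seq_len(iters)) {
    ## beta | z, sigma_b^2, sigma_e^2, y: conjugate multivariate normal
    Xz <- sweep(X, 2, z, `*`)                      # X %*% diag(z)
    Q  <- crossprod(Xz) / sig2_e + diag(p) / sig2_b    # posterior precision
    R  <- chol(Q)
    mu <- backsolve(R, forwardsolve(t(R), crossprod(Xz, y) / sig2_e))
    beta <- drop(mu) + backsolve(R, rnorm(p))

    ## z_k | rest: compare the Gaussian likelihood with z_k = 1 vs z_k = 0
    for (k in sample(p)) {
      z1 <- z; z1[k] <- 1L
      z0 <- z; z0[k] <- 0L
      l1 <- log(pi_k[k])     - sum((y - X %*% (beta * z1))^2) / (2 * sig2_e)
      l0 <- log(1 - pi_k[k]) - sum((y - X %*% (beta * z0))^2) / (2 * sig2_e)
      z[k] <- rbinom(1, 1, 1 / (1 + exp(l0 - l1)))
    }

    ## remaining conjugate updates
    pi_k   <- rbeta(p, a + z, b + 1L - z)
    sig2_b <- 1 / rgamma(1, e + p / 2, f + sum(beta^2) / 2)
    resid  <- y - X %*% (beta * z)
    sig2_e <- 1 / rgamma(1, g + n / 2, h + sum(resid^2) / 2)

    draws[t, ] <- beta * z
  }
  draws
}
```

Posterior inclusion probabilities then come from `colMeans(draws != 0)`, and column-wise quantiles of `draws` give credible intervals for $\tilde{\beta}$, with no bootstrap involved.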

Paper 3: Nonparametric Bayesian sparse factor models

Download report 3.

The fact that I chose not to TA this term, and that I had only one course, let me devote 80% of my time to the Qualifying Course. Because of this, I was able to get 3 papers done in the first term, something that was surprising, according to the graduate advisor and staff.

Luckily, right after I finished my second report, Prof. Alex Bouchard-Côté returned from a sabbatical. I have to say that Alex was a crucial factor in my decision to come to UBC: he was one of the few Bayesians here, his CV is impressive, and his research on MCMC methods is very motivating. So I went to talk to him and offered to work on the paper by Knowles and Ghahramani (2011), which he accepted. I also took the liberty of asking his permission to join his reading group, and he was very welcoming about it. I have thoroughly enjoyed every meeting (in fact, I will be presenting on perfect sampling next week).

Back to the paper. This work proposes the Nonparametric Sparse Factor Analysis (NSFA) model. Recall the standard construction of Factor Analysis: \[ Y = GX + E \]

where $Y$ is the data (a $D \times N$ matrix), $X$ is a $K \times N$ matrix containing the $K$ latent factors, $G$ is the $D \times K$ matrix of loadings of the $D$ objects onto the $K$ factors, and $E$ is a $D \times N$ matrix of noise terms. In short, the NSFA uses an Indian Buffet Process (IBP) in the prior of $G$ to allow for an unbounded number of latent factors, while at the same time imposing sparsity on the weights.
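
As a quick illustration of what the IBP does, here is my own toy snippet (not from the paper) of the “restaurant” construction of a draw: object $i$ reuses an existing factor $k$ with probability $m_k/i$, where $m_k$ counts how many earlier objects use it, and then creates $\text{Poisson}(\alpha/i)$ brand-new factors.

```r
rIBP <- function(D, alpha) {
  Z <- matrix(0L, nrow = D, ncol = 0)        # rows = objects, columns = latent factors
  for (i in seq_len(D)) {
    if (ncol(Z) > 0) {
      m_k <- colSums(Z[seq_len(i - 1), , drop = FALSE])  # how many earlier objects use factor k
      Z[i, ] <- rbinom(ncol(Z), 1, m_k / i)              # reuse factor k with prob m_k / i
    }
    k_new <- rpois(1, alpha / i)                         # brand-new factors for object i
    if (k_new > 0) {
      Z <- cbind(Z, matrix(0L, D, k_new))
      Z[i, (ncol(Z) - k_new + 1):ncol(Z)] <- 1L
    }
  }
  Z
}

set.seed(4)
ncol(rIBP(100, alpha = 2))   # the number of active factors grows roughly like alpha * log(D)
```

The resulting binary matrix plays the role of the sparsity pattern of $G$ in the NSFA.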

This report was definitely the one on which I spent the most time. I had to quickly learn about the IBP (my only exposure to nonparametric Bayes had been the lecture by Prof. Broderick I mentioned above), as it was a key aspect of the paper. Thus, I spent a lot of time deriving its properties, some of which are missing from the papers in which it is presented. So if you are also looking for those derivations, take a look at the report! There was one particular equation, though, that I could never derive. I even asked a question about it on StackExchange, which still hasn’t been answered and earned me the “tumbleweed badge”!

After spending 10 pages solely on the IBP, I went on to describe the NSFA. It took me another 13 pages to derive all the Gibbs updates! It was actually very hard, because most of them were simply stated in the original paper. Also, the paper contained many typos, so I spent a lot of time just making sure that my derivations were correct. The worst part was deriving the most complex update, which relates to growing the active set of latent factors. Here I found errors in the derivation of the acceptance ratio of the Metropolis-Hastings-within-Gibbs step used. Because I couldn’t believe that a published paper had an error in it, I freaked out and basically derived everything from scratch many times, until I convinced myself that my result was the correct one.

From this, I drew a clear lesson for my future as an academic: the attitude towards every paper should be one of close scrutiny, of doubting everything, as cliché as it may sound. Review processes are less than ideal, even in the most respected journals. That is why I like both Trevor’s and Alex’s reading groups so much: we go over every detail (at least of the main results) of the papers we select. In the end, this critical approach is beneficial for everyone.

Besides the typos and errors I found in the derivations, my main critique of the paper was that the method performed very poorly in terms of out-of-sample evaluation. I applied it to a subset of the MNIST dataset, basically because I had always wanted an excuse to analyze that famous dataset. The NSFA fitted the training set very well (check out the pictures in the report!), but the predictive log-likelihood on the test set only decreased as iterations went by. I gave reasons for this behavior, emphasizing the need to give more structure to the model.

Final remarks

Overall, I have to say that this term exceeded any expectations I had about my first year as a PhD student. I absolutely loved it, in spite of all the work it involved. I just hope the following terms are as much fun as this one was. Happy holidays!

References

Knowles, David, and Zoubin Ghahramani. 2011. “Nonparametric Bayesian Sparse Factor Models with Application to Gene Expression Modeling.” The Annals of Applied Statistics, 1534–52.

Mitchell, Toby J., and John J. Beauchamp. 1988. “Bayesian Variable Selection in Linear Regression.” Journal of the American Statistical Association 83 (404): 1023–32.

Ranganath, Rajesh, Sean Gerrish, and David Blei. 2014. “Black Box Variational Inference.” In Artificial Intelligence and Statistics, 814–22.

Zou, Hui, and Trevor Hastie. 2005. “Regularization and Variable Selection via the Elastic Net.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (2): 301–20.
