Bayesian Model Selection for Logistic Regression via Approximate Bayesian Inference

\(\newcommand{\Bf}[1]{\mathbf{#1}}\) \(\newcommand{\Bs}[1]{\boldsymbol{#1}}\)

Supervised by A/Prof John Ormerod ~ Dec 2023

This is a short summary and introduction to the work. For the full project, see here.

For the complementary R package, see here.

Table of Contents

  1. Background
  2. Questions
  3. Literature Review
  4. Our Extension

Background

Our main goal with this project is to study the problem of modelling the relationship between some numerical predictors $(\mathbf{X})$ and a binary response variable $(\mathbf y)$.

For motivation, let's say that we are a car company, and we are testing the temperature at which a component will fail. Suppose we have collected some data and plotted it below in Figure 1.

fig1
Figure 1: Relationship between working temperature and failure

One method of performing binary classification is called Logistic Regression. The idea is to fit a curve to the data

\[f(x) = \frac{1}{1+e^{-(\beta_0 + \beta_1 x)}}\]

by calculating coefficients $\beta_0$ and $\beta_1$. Here, we used $(\beta_0,\beta_1) = (15.0429, -0.2321627)$. Where the curve is above one half (the red region), we predict that the component will fail; where the curve is below one half (the green region), we predict that the component will be alright.
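As a quick illustration, here is a minimal R sketch of this curve using the coefficients above (the temperature values are made up for the example):

```r
# Logistic curve from above: f(x) = 1 / (1 + exp(-(b0 + b1 * x)))
b0 <- 15.0429
b1 <- -0.2321627
f  <- function(x) 1 / (1 + exp(-(b0 + b1 * x)))

# Predict failure wherever the curve is above one half
temps <- c(40, 55, 60, 65, 70, 85)   # hypothetical working temperatures
data.frame(temperature  = temps,
           prob_fail    = round(f(temps), 3),
           predict_fail = f(temps) > 0.5)
```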

fig2
Figure 2: Areas where we predict failure and no failure

The example we gave has one dimension (temperature), but we can generalise this concept to many dimensions. For a data matrix $\Bf X_{n\times (p+1)}$, we can instead fit a coefficient vector $\Bs{\beta} = (\beta_0,\dotsb, \beta_p)$. Then we obtain an estimator

\[f(\mathbf{X}) = \frac{1}{1+e^{-\Bf X \Bs \beta}}\,.\]

For example, consider the Pima Indians Diabetes dataset, which records several predictors of diabetes $(\mathbf X)$ and whether the test subject has diabetes as the response $(\mathbf y)$.

tab1
Table 1: Pima Indians Diabetes dataset

However, in the case of many dimensions, we should be careful, because it is likely that not all of the predictors are relevant. Hence, we need to select the most relevant features. To do this, let $\mathbf{w} \in \{0,1\}^{p+1}$ be a vector and $\Bf W = \mathrm{diag}(\Bf w)$. We call the matrix $\Bf W$ a binary mask, and consider instead

\[f(\mathbf{X}) = \frac{1}{1+e^{-\Bf X \Bf W \Bs \beta}}\,.\]

Now, if the $i$-th entry $w_i$ of the binary mask $\mathbf w$ is $0$, then the corresponding coefficient $\beta_i$ in the regression coefficient vector $\Bs \beta$ is multiplied by $0$ and has no effect on $f(\mathbf X)$. This way, the $i$-th feature is dropped from the estimator.

\[\Bf X \Bf W \Bs \beta = \underbrace{\begin{bmatrix} 1 & \mathrm{preg}_1 & \cdots & \mathrm{age}_1 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & \mathrm{preg}_n & \cdots & \mathrm{age}_n \\ \end{bmatrix} }_{\Bf X} \underbrace{\begin{bmatrix} w_0 & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & w_p \\ \end{bmatrix} }_{\Bf W \text{ where } w_i \in \{0,1\}} \underbrace{\begin{bmatrix} \beta_0 \\ \vdots \\ \beta_p \\ \end{bmatrix} }_{\Bs \beta}\,.\]

The process of selecting the relevant predictors for the model is also called model selection.
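To make the masking concrete, here is a small R sketch (with made-up numbers) showing that multiplying by $\Bf W = \mathrm{diag}(\Bf w)$ is the same as zeroing out the dropped coefficients:

```r
set.seed(1)
n <- 5; p <- 3
X    <- cbind(1, matrix(rnorm(n * p), n, p))  # design matrix with an intercept column
beta <- c(0.5, 1.2, -0.8, 2.0)                # made-up regression coefficients
w    <- c(1, 1, 0, 1)                         # binary mask: drop the second predictor
W    <- diag(w)

eta <- X %*% W %*% beta                       # masked linear predictor X W beta
f   <- 1 / (1 + exp(-eta))                    # fitted probabilities

# Masking the design is equivalent to zeroing out the corresponding coefficients
all.equal(c(eta), c(X %*% (w * beta)))
```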

Fitting the Model

In order to perform inference with our estimator, we need two sets of parameters: the regression coefficients $\Bs \beta$, and a binary mask $\mathbf w$ for selecting our features. We are particularly interested in computing these parameters via Bayesian inference.

The main idea is to use Bayes’ Rule.

Definition (Bayes' Rule)

Suppose we want to estimate a parameter $\Bs \theta$, having observed data $\Bf x$. If we assume a prior distribution $p(\Bs \theta)$, then we have $$ p(\Bs \theta \mid \Bf x) =\frac{p( \Bf x \mid \Bs \theta ) p(\Bs \theta)}{ \int p(\Bf x \mid \Bs \theta ) p(\Bs \theta) \, d\Bs \theta }\,. $$ We call $p(\Bs\theta \mid \mathbf x)$ the posterior density.

The idea is that if we have some prior belief or knowledge about where our parameter $\Bs \theta$ lies within the parameter space, then once we observe some data, we can use Bayes' Rule to update our best guess of where $\Bs \theta$ lies.

picture of Bayes rule
Figure 3: Effect of observed data on the posterior

As shown in the figure, we might take the mode of the posterior density as our estimate for the parameter $\Bs \theta$. This estimate is called the maximum a posteriori (MAP) estimate.
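As a toy illustration of these ideas (a standalone example, separate from the project), the sketch below approximates the posterior of a single success probability on a grid, so the normalising integral becomes a sum, and then reads off the posterior mode:

```r
# Toy Bayes' Rule example: posterior for a success probability theta after
# observing 7 successes in 10 trials, under a uniform prior, computed on a grid.
theta <- seq(0.001, 0.999, length.out = 999)
step  <- theta[2] - theta[1]

prior      <- dunif(theta)                        # p(theta)
likelihood <- dbinom(7, size = 10, prob = theta)  # p(x | theta)

posterior <- prior * likelihood
posterior <- posterior / sum(posterior * step)    # normalise (grid version of the integral)

theta[which.max(posterior)]   # posterior mode, approximately 0.7
```

This grid trick only works because $\Bs \theta$ is one-dimensional here.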

However, there is a big challenge: the integrals involved are often extremely difficult to compute. For this reason, real-world implementations of Bayesian inference revolve around methods for approximating the posterior. One such method is called variational Bayes.

For an arbitrary density $q(\Bs \theta)$, define the Kullback-Leibler ($\mathrm{KL}$) divergence to be

\[\mathrm{KL}(q(\Bs \theta),p(\Bs \theta \mid \Bf x)) = \int q(\Bs \theta) \log\left(\frac{q(\Bs \theta)}{p(\Bs \theta \mid \Bf x)}\right) d\Bs \theta \,.\]
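As a sanity check on this definition, the KL divergence can be evaluated numerically. The sketch below (a standalone example) compares two univariate normal densities against the known closed form $\log(\sigma_2/\sigma_1) + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2} - \frac12$:

```r
# KL(q, p) = integral of q(theta) * log(q(theta) / p(theta)) d theta
q <- function(th) dnorm(th, mean = 0, sd = 1)
p <- function(th) dnorm(th, mean = 1, sd = 2)

kl_numeric <- integrate(function(th) q(th) * log(q(th) / p(th)),
                        lower = -20, upper = 20)$value   # limits wide enough here

kl_closed_form <- log(2 / 1) + (1^2 + (0 - 1)^2) / (2 * 2^2) - 1 / 2
c(numeric = kl_numeric, closed_form = kl_closed_form)    # both about 0.443
```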

The idea of variational Bayes is to transform the integration problem into an optimisation problem, and approximate the posterior by finding densities $q(\Bs \theta)$ that minimise the $\mathrm{KL}$ divergence instead.

In particular, if we let $(\Bs \theta_1,\dotsb,\Bs \theta_M)$ be a partition of our parameter $\Bs \theta$ into blocks, and assume that these blocks are independent under $q$ (the mean-field approximation), we would have

\[q(\Bs \theta) = \prod_{i=1}^M q_i(\Bs \theta_i)\,,\]

and the optimal factors $q_i^*(\Bs \theta_i)$ that minimise the $\mathrm{KL}$ divergence are then given by the following, where $\mathbb E_{-\Bs \theta_i}$ denotes expectation over all blocks other than $\Bs \theta_i$, taken with respect to their current variational densities:

\[\begin{equation} \label{eq:vb} q_i^*(\Bs \theta_i) \propto \exp\left[\mathbb E_{-\Bs \theta_i}\log p(\Bf x, \Bs \theta)\right]\,. \end{equation}\]

First Approach: Use Variational Bayes

Now, we can define our problem. We want to create a predictive model for a logistic regression experiment with a Gaussian prior $\Bs \beta \sim N(\Bf 0, \sigma^2 \Bf I)$ on the regression coefficients (this is the source of the $\frac{1}{2\sigma^2}\|\Bs \beta\|^2$ penalty below) and independent $\mathrm{Bernoulli}(\rho)$ priors on the selection indicators $\gamma_0,\dotsb,\gamma_p$, collected into $\Bf \Gamma = \mathrm{diag}(\gamma_0,\dotsb,\gamma_p)$, which plays the role of the binary mask $\Bf W$ above.

From the parameter space, we partition $(\Bs \beta,\Bf \Gamma)$ into the blocks $(\Bs \beta, \gamma_0,\dotsb,\gamma_p)$. Using the mean-field update formula above, we derive the following variational Bayes estimates for the parameter densities

\[\begin{aligned} q(\Bs \beta) &\propto \exp\left[ \mathcal{L}(\Bs \beta, \Bf X \Bf W , \Bf y) \right] \\ q(\gamma_j) &= \frac{\exp(\tau(\gamma_j))}{\exp(\tau(0))+\exp(\tau(1))} \end{aligned}\]

where, writing $\Bs \mu$ for the variational mean of $\Bs \beta$,

\[\tau(\gamma_j) := (\Bf X^T \Bf y)_j \mu_j\gamma_j - \Bf 1^T \log\left(\Bf 1 + \exp(\Bf X_j \gamma_j \mu_j + \Bf X_{-j} \Bf W_{-j} \Bs\mu_{-j})\right) + \gamma_j \log{\left(\frac{\rho}{1-\rho}\right)}\]

and

\[\mathcal{L}(\Bs \beta, \Bf X \Bf W , \Bf y) := \Bf y^T \Bf X \Bf W \Bs \beta - \Bf 1^T \log(\Bf 1 +\exp(\Bf X\Bf W \Bs \beta)) -\frac{1}{2\sigma^2}\| \Bs \beta \|^2\]
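To make the update cycle concrete, here is a rough R sketch of how these formulas might be iterated on simulated data. It is only a sketch under two assumptions (that $\Bs \mu$ is taken as the mode of $q(\Bs \beta)$, found by a ridge-penalised logistic fit on the masked design, and that $\Bf W$ holds the current inclusion probabilities $q(\gamma_j = 1)$); it is not the project's R package implementation:

```r
set.seed(42)
n <- 200; p <- 4
X <- cbind(1, matrix(rnorm(n * p), n, p))               # design with intercept
y <- rbinom(n, 1, plogis(X %*% c(-0.5, 2, 0, 0, 1.5)))  # only some predictors matter

sigma2 <- 10        # prior variance of beta
rho    <- 0.5       # prior inclusion probability
w      <- rep(1, p + 1)                                 # start with everything included

for (iter in 1:20) {
  Xw <- sweep(X, 2, w, `*`)                             # X %*% diag(w)

  # q(beta): maximise L(beta, XW, y), i.e. a ridge-penalised logistic fit
  neg_L <- function(b) -(sum(y * (Xw %*% b)) - sum(log1p(exp(Xw %*% b))) -
                           sum(b^2) / (2 * sigma2))
  mu <- optim(rep(0, p + 1), neg_L, method = "BFGS")$par

  # q(gamma_j): compare tau(1) against tau(0) for each indicator
  for (j in 1:(p + 1)) {
    tau <- function(g) {
      eta <- X[, j] * g * mu[j] + X[, -j] %*% (w[-j] * mu[-j])
      sum(X[, j] * y) * g * mu[j] - sum(log1p(exp(eta))) + g * log(rho / (1 - rho))
    }
    # equivalent to exp(tau(1)) / (exp(tau(0)) + exp(tau(1)))
    w[j] <- plogis(tau(1) - tau(0))
  }
}
round(w, 3)   # inclusion probabilities after 20 sweeps
```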

However, when we tried to perform parameter estimation on the Pima dataset with this scheme, we found that the model-selection coefficients failed to converge.

model coefficients can't converge
Figure 4: Model coefficients do not converge

Second Approach: Use Reverse Collapsed Variational Bayes (RCVB)

In response, we alter our approach slightly: we compute variational Bayes estimates at both $\gamma_j = 0$ and $\gamma_j = 1$, and then combine the results by integrating out $\Bs \beta$. Writing $\Bf Z := \Bf X \Bf W$ for the masked design matrix, we obtain

\[q(\gamma_j = k) \propto \int_{\mathbb{R}^{p+1}} \exp{ \left[ \mathcal{L}(\Bs \beta, \Bf Z , \Bf y) + k\log{\frac{\rho}{1-\rho}} \right]} d\Bs \beta \,.\]

Again, this integral is difficult, so we approximate it using a first-order Laplace approximation, $\Bs \beta \sim N( \widetilde{\Bs \beta} , \widetilde{\Bs \Sigma})$.

We get

\[q(\gamma_j = k) = \frac{|\Bs \Sigma_k|^{\frac12} \exp{\nu{(k)}} }{|\Bs\Sigma_1|^{\frac12}\exp{\nu{(1)}} + |\Bs\Sigma_0|^{\frac12}\exp{\nu{(0)}}}\]

where $\widetilde{\Bs \beta}_k$ and $\Bs \Sigma_k$ are the mean and covariance matrix of the normal approximation to $\Bs \beta$ at $\gamma_j = k$, and

\[\nu(k) := \mathcal{L}(\widetilde{\Bs \beta}_k, \Bf Z , \Bf y) - \frac12 \log{\det(-\Bs \Sigma_k)} + k\log\left(\frac{\rho}{ 1-\rho}\right) \,.\]
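As a rough illustration, the sketch below computes this collapsed update for a single indicator $\gamma_j$ on simulated data. It follows one reading of the formulas above (the Laplace mode and covariance are obtained numerically via optim, and the determinant term supplies the $|\Bs \Sigma_k|^{\frac12}$ factor); it is a sketch, not the accompanying R package's implementation:

```r
set.seed(7)
n <- 200; p <- 3
X <- cbind(1, matrix(rnorm(n * p), n, p))
y <- rbinom(n, 1, plogis(X %*% c(-0.5, 1.5, 0, 0)))

sigma2 <- 10; rho <- 0.5
j <- 3                                   # the indicator being updated
w <- rep(1, p + 1)                       # current values of the other indicators

# Penalised log-likelihood L(beta, Z, y) with Z = X diag(w) and gamma_j fixed at k
log_L <- function(b, k) {
  wk <- w; wk[j] <- k
  Z <- sweep(X, 2, wk, `*`)
  sum(y * (Z %*% b)) - sum(log1p(exp(Z %*% b))) - sum(b^2) / (2 * sigma2)
}

# log of |Sigma_k|^(1/2) * exp(nu(k)), with L evaluated at the Laplace mode
log_weight <- function(k) {
  fit   <- optim(rep(0, p + 1), function(b) -log_L(b, k),
                 method = "BFGS", hessian = TRUE)
  Sigma <- solve(fit$hessian)            # Laplace covariance of beta at gamma_j = k
  -fit$value + 0.5 * as.numeric(determinant(Sigma)$modulus) + k * log(rho / (1 - rho))
}

plogis(log_weight(1) - log_weight(0))    # q(gamma_j = 1)
```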

This new method did indeed converge, quickly settling on the significant predictors.

model coefficients now converge
Figure 5: Model coefficients converge under RCVB scheme

Now, we can perform simultaneous model fitting and selection. For analysis of our results, see the full report here.