When fitting discrete response variables, probit regression has a nice advantage over logistic regression: you can very easily use Gibbs sampling to fit your unknown parameters.
In probit regression your likelihood model is:
$$P(Y = 1 \mid X, \beta) = \Phi(X'\beta)$$

where $\Phi$ is the CDF of the standard normal distribution.
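For concreteness, evaluating this likelihood in Matlab is nearly a one-liner (a sketch, assuming `X` is an n-by-d design matrix, `beta` a d-by-1 weight vector, and `y` the observed labels in {0, 1}):

```matlab
% probit likelihood: P(y_i = 1 | x_i, beta) for each row of X
p = normcdf(X * beta);
% Bernoulli log-likelihood of the observed labels y
loglik = sum(y .* log(p) + (1 - y) .* log(1 - p));
```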
Probit regression with automatic smoothness determination
You’d like to make a Bayesian estimate of $\beta$, with a zero-mean Gaussian prior on $\beta$ of the form $P(\beta) = N(0, C)$.
In automatic smoothness determination (ASD), the prior on your weights $\beta$ is set by two hyperparameters, $\rho$ and $\delta$, determining the shrinkage and smoothness of the weights. In particular, the covariance between any two weights is:
$$C_{i,j} = \exp\left(-\rho - \frac{\Delta_{i,j}}{2\delta^2}\right)$$
where $\Delta_{i,j}$ is the squared distance between the weights $\beta_i$ and $\beta_j$ in some space.
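The Matlab code below calls a helper `asd.prior` to build this covariance, so here’s a minimal sketch of what that helper might look like (as `prior.m` inside a `+asd` package directory; the jitter term is my addition, to keep $C$ numerically positive definite):

```matlab
function C = prior(rho, Delta, delta)
% ASD prior covariance: C(i,j) = exp(-rho - Delta(i,j) / (2*delta^2)),
% where Delta is the matrix of squared distances between weight locations
    C = exp(-rho - Delta ./ (2 * delta^2));
    C = C + 1e-6 * eye(size(C, 1)); % small jitter for numerical stability
end
```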
Here’s the full model:
$$y^*_i \mid x_i, \beta \sim N(x_i'\beta, 1)$$
$$y_i = [y^*_i \ge 0]$$
$$\beta \sim N(0, C)$$
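To make the model concrete, here’s a sketch of simulating data from it, assuming the weights are indexed on a 1D lattice so that $\Delta_{i,j} = (i - j)^2$; the sizes and hyperparameter values are arbitrary:

```matlab
n = 500; d = 20;                      % number of observations and weights
[ii, jj] = meshgrid(1:d, 1:d);
Delta = (ii - jj).^2;                 % squared distances on a 1D lattice
rho = 0; delta = 2;                   % arbitrary hyperparameter values
C = asd.prior(rho, Delta, delta);     % ASD prior covariance (sketch above)
beta = mvnrnd(zeros(1, d), C)';       % true weights drawn from the prior
X = randn(n, d);                      % design matrix
ystar = X * beta + randn(n, 1);       % latent responses
y = double(ystar >= 0);               % observed binary responses
```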
Gibbs sampling

Now, to do Gibbs sampling we need to know how to sample each unknown conditional on every other unknown. Here’s the easy part:
$$y^*_i \mid x_i, \beta, y_i = 0 \sim N(x_i'\beta, 1)\,[y^*_i < 0]$$
$$y^*_i \mid x_i, \beta, y_i = 1 \sim N(x_i'\beta, 1)\,[y^*_i \ge 0]$$
$$\beta \mid X, y^* \sim N(\Sigma X'y^*, \Sigma), \qquad \Sigma = (C^{-1} + X'X)^{-1}$$

where $[y^*_i < 0]$ and $[y^*_i \ge 0]$ are indicator variables specifying where to truncate the respective normal distributions.
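In code, one pass of these two updates might look like this (a sketch, reusing `n`, `X`, `y`, `C`, and `beta` from the simulation sketch above; the truncated normal draws use the inverse-CDF method):

```matlab
% sample the latent y* from truncated normals, given beta and y
mu = X * beta;
u = rand(n, 1);
plo = normcdf(0, mu, 1); % P(y*_i < 0) under N(mu_i, 1)
ystar = zeros(n, 1);
ystar(y == 0) = norminv(u(y == 0) .* plo(y == 0), mu(y == 0), 1);
ystar(y == 1) = norminv(plo(y == 1) + u(y == 1) .* (1 - plo(y == 1)), mu(y == 1), 1);

% sample beta from its multivariate normal conditional
Sigma = inv(inv(C) + X' * X);
Sigma = (Sigma + Sigma') / 2; % enforce symmetry for mvnrnd
beta = mvnrnd((Sigma * X' * ystar)', Sigma)';
```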
But there are two parts missing! We need to know how to sample $\rho$ and $\delta$, and this is where things get tricky. So this might not necessarily work in practice…
We know that the distribution of each of $\rho$ and $\delta$ depends only on the weights $\beta$ and the other hyperparameter. This distribution resembles a Gamma with unknown shape and scale parameters, $k$ and $\theta$[^1]. (Well, actually, for $\rho$ you need to exponentiate it first to have it look like a Gamma.)
$$\exp(\rho) \mid \beta, \delta \sim \text{Gamma}(k_\rho, \theta_\rho)$$

We can fit $(k, \theta)$ for one of these distributions using what we know about a Gamma’s mode and mean. Supposing $X \sim \text{Gamma}(k, \theta)$, and letting $A$ be its mode and $B$ its mean, then we know:
$$A = \theta(k - 1) = \theta k - \theta$$
$$B = \theta k$$

Solving both of these equations for $k$ and setting them equal to one another gives us:
$$\frac{A + \theta}{\theta} = \frac{B}{\theta}$$

And so:
$$\theta = B - A, \qquad k = \frac{B}{B - A}$$

So let’s say we’re calculating $k_\rho, \theta_\rho$. Then we can find the mode by minimizing the negative log probability of the weights $\beta$ coming from a multivariate normal with mean zero and covariance given by the ASD prior, with respect to $\rho$. This mode is $A$ above. The mean, $B$, can similarly be found by evaluating the probability at a range of valid $\rho$ (around $A$) and calculating the resulting expectation.
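As a quick sanity check of the moment-matching formulas: a Gamma with shape $k = 3$ and scale $\theta = 2$ has mode $A = \theta(k - 1) = 4$ and mean $B = \theta k = 6$, so $\theta = B - A = 2$ and $k = B/(B - A) = 3$, recovering the original parameters.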
In Matlab this might look like:
```matlab
% given: weights (wts), smoothness hyperparameter (delta),
% and squared distance matrix (Delta);
% estimate the shape and scale parameters of a Gamma distribution for exp(rho)
P = @(rho) mvnpdf(wts(:)', zeros(1, numel(wts)), asd.prior(rho, Delta, delta));

% calculate mode of gamma
obj = @(x) -log(P(x)); rho0 = 0; lb = -20; ub = 20;
A = fmincon(obj, rho0, [], [], [], [], lb, ub);
A = exp(A); % exp because exp(rho) is gamma

% calculate mean of gamma
rhos = linspace(lb, ub, 1e3);
ps = arrayfun(P, rhos);
ps = ps ./ sum(ps); % normalize so the grid values sum to one
B = exp(rhos) * ps'; % exp because exp(rho) is gamma

% draw a sample for rho
theta_rho = B - A;
k_rho = B / (B - A);
rho = log(gamrnd(k_rho, theta_rho));
```
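Putting the pieces together, a full Gibbs sweep might look like the sketch below. The helpers `sample_latents`, `sample_weights`, `sample_rho`, and `sample_delta` are hypothetical wrappers around the updates above (the last two using the Gamma moment-matching trick):

```matlab
nsamples = 1000;
betas = zeros(nsamples, d);
for t = 1:nsamples
    ystar = sample_latents(X, beta, y);       % truncated-normal draws for y* (hypothetical wrapper)
    C = asd.prior(rho, Delta, delta);         % rebuild the prior covariance
    beta = sample_weights(X, ystar, C);       % multivariate normal draw for beta (hypothetical wrapper)
    rho = sample_rho(beta, delta, Delta);     % approximate Gamma draw, as in the code above
    delta = sample_delta(beta, rho, Delta);   % same moment-matching trick for delta
    betas(t, :) = beta';                      % store the sample
end
```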
Now whether this sampling method will actually converge or not…that’s another matter.