This one is about a great new package just released for R called RStan. The package provides a simple inline interface to Stan, which takes BUGS-like code, translates it into C++, compiles it, loads the dynamic library into R, and runs your MCMC for you (phew!). (BTW: the guts are built on the inline, Rcpp, and RcppEigen packages, which are fantastic as well!) What's more relevant for applied researchers like me is that the algorithms used are cutting edge: a modified Hamiltonian Monte Carlo (HMC) coupled with automatic differentiation that achieves rather quick mixing. The particular implementation is based on Hoffman and Gelman's No-U-Turn Sampler (NUTS).

To test things out I decided to implement a simple hierarchical linear model (or linear mixed model) with a single outcome Y and a single covariate X. To make things interesting I chose an unbalanced panel setup, so different "households" are observed for different lengths of time. The model is simply

$$ y_{it} = \beta_{0i} + \beta_{1i} x_{it} + \varepsilon_{it}.$$

I use **H** to represent the number of households and assume (as is typical) that the $\beta$'s are drawn from Normal distributions (independent for now; in a later post I'll discuss relaxing this assumption).
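In symbols (notation mine), the hierarchy being simulated is

$$ \beta_{0i} \sim \mathcal{N}(\mu_0, \sigma_0^2), \qquad \beta_{1i} \sim \mathcal{N}(\mu_1, \sigma_1^2), \qquad \varepsilon_{it} \sim \mathcal{N}(0, \sigma_y^2), $$

with true values $(\mu_0, \mu_1) = (10, -5)$, $(\sigma_0, \sigma_1) = (3, 2)$, and $\sigma_y = 3$. The R code to generate data is below…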

```r
# Set seed
set.seed(1234)

# Number of households
H = 100

# True parameters
true.b.mu  = c(10, -5)
true.b.sig = c(3, 2)
true.y.sig = 3

# Storage
b   = matrix(0, H, 2)
ctr = 0

# Each "household" can have between 5 and 10 reps
Ti = sample(5:10, H, replace = TRUE)
id = x = y = matrix(0, sum(Ti), 1)

# Simulate data
for (i in 1:H) {
  b[i, 1] = rnorm(1, true.b.mu[1], true.b.sig[1])
  b[i, 2] = rnorm(1, true.b.mu[2], true.b.sig[2])
  for (j in 1:Ti[i]) {
    ctr     = ctr + 1
    x[ctr]  = runif(1) * 3 + 1
    y[ctr]  = b[i, 1] + b[i, 2] * x[ctr] + rnorm(1, 0, true.y.sig)
    id[ctr] = i
  }
}
```

So, there we have it. The data is generated and we can move on to some analysis.

First, let's just make sure that a simple pooled linear regression provides results that are in the ballpark (it should, since OLS is consistent for the mean coefficients).

```r
> lm(y ~ x)

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
      9.364       -4.710
```

Both estimates are close to the true means of 10 and -5, so far so good. Now for the Stan code.

hcode = " data { int<lower=0> N; // Observations int<lower=0> H; // persons int<lower=0> id[N]; // ID variable real y[N]; // Y Vector real x[N]; // X Vector } parameters { vector[2] beta[H]; // [H,2] dim matrix for b real beta_mu[2]; // means for b real<lower=0> beta_sig[2]; // var parameters for b real<lower=0> y_sig; // regression variance } model { //Priors y_sig~uniform(0,1000); for(j in 1:2) { beta_mu[j] ~ normal(0,100); beta_sig[j] ~ uniform(0,1000); for(h in 1:H) beta[h,j] ~ normal(beta_mu[j], beta_sig[j]); } // Likelihood for(n in 1:N) y[n] ~ normal(beta[id[n],1]+beta[id[n],2] .* x[n], y_sig); } " |

OK, let's go over this piece by piece. First, notice that **hcode** is simply a string. That's because I will be using the model_code argument to pass my code to Stan. An alternative would be to place this code in a file and use the model_file option. Stan code is broken up into blocks; in the code above we have three such blocks: *data*, *parameters*, and of course the *model* block. Let's look at each.

```stan
data {
  int<lower=0> N;      // Observations
  int<lower=0> H;      // persons
  int<lower=0> id[N];  // ID variable
  real y[N];           // Y Vector
  real x[N];           // X Vector
}
```

The data block informs Stan of the SEXP objects that we will be passing to it. In this case we have a couple of integer constants and a few vectors. That's it, simple. Next we tell Stan the parameters that we will be "estimating." All unknown quantities go here, including the individual household-level regression parameters. The lower=0 constraint keeps the scale parameters positive; without it, Stan would initialize them anywhere on its default (-2, 2) range, including negative values.

```stan
parameters {
  vector[2] beta[H];          // [H,2] dim matrix for b
  real beta_mu[2];            // means for b
  real<lower=0> beta_sig[2];  // sd parameters for b
  real<lower=0> y_sig;        // regression sd
}
```

Lastly, we have the model block. The first part specifies the priors we would like to use, and the latter the likelihood. The code below should look familiar to anyone who has used BUGS; the syntax is a bit different but overall quite straightforward. The manual has the details (although I would treat it as a beta version too).

```stan
model {
  // Priors
  y_sig ~ uniform(0, 1000);
  for (j in 1:2) {
    beta_mu[j]  ~ normal(0, 100);
    beta_sig[j] ~ uniform(0, 1000);
    for (h in 1:H)
      beta[h, j] ~ normal(beta_mu[j], beta_sig[j]);
  }
  // Likelihood
  for (n in 1:N)
    y[n] ~ normal(beta[id[n], 1] + beta[id[n], 2] * x[n], y_sig);
}
```

We can now use the `stan` function in RStan to sample from the specified model.

```r
dat = list(N = length(y), H = dim(b)[1], y = y, x = x, id = id)

hlm = stan(model_name = "Hierarchical Linear Model", model_code = hcode,
           data = dat, iter = 2000, n_chains = 2, verbose = FALSE)
```

And we are off. A note: depending on the version you have installed, the argument names in the `stan` function may need to be tweaked. For example, model_code would be model.code, iter would be n.iter, and so on. I am told that the next release of Stan/RStan (v1.0.1) will correct these and other issues. On my Mac Pro this code takes about a minute to compile and sample the two chains. The chains mix quickly, as we can see from the traceplot, and it seems like all is well!
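For completeness, here is one way to pull up posterior summaries and the traceplot. These are the rstan package's `print` and `traceplot` methods for the fitted object; as with the `stan` call itself, exact argument names may vary across early releases:

```r
# Posterior summaries for the population-level parameters
# (parameter names come from the Stan program above)
print(hlm, pars = c("beta_mu", "beta_sig", "y_sig"))

# Trace plots to eyeball mixing across the two chains
traceplot(hlm, pars = c("beta_mu", "beta_sig", "y_sig"))
```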

In the next post I will compare Stan to alternative approaches to estimating a Multinomial Conditional Logit (à la McFadden) with unobserved heterogeneity. Stay tuned.
