What’s this about?

Multilevel models are fit to data that can be divided into groups. These may be patients treated at the same hospital, cars manufactured at the same plant, students attending the same school, and so on.

 

As a more concrete example, suppose an educational researcher has given a test to a sample of students in Texas and wants to analyze the results. The students can be grouped into schools, and the schools can be grouped into school districts. If we believe unobserved characteristics of the individual schools as well as characteristics of the school districts are likely to impact the test results, we can fit a multilevel model with school-level and district-level random effects.

 

What if we want to fit a multilevel model to data collected using a complex survey design rather than a simple random sample? We need to take into account characteristics of the survey design (clustering, stratification, sampling weights, and finite-population corrections) to obtain appropriate point estimates and standard errors. Adjusting for the survey design in a multilevel model differs from the single-level case in that we need weights for each level of the model, assuming those levels correspond to stages of the sampling design.

 

Continuing with our testing example, we will suppose that the researcher first took a sample of school districts. Then, schools were sampled from within each selected school district. Finally, students were selected from within each selected school. This is a three-stage sampling design. We also have sampling weights for each stage of the design, reflecting the probabilities that school districts, individual schools, and students were included in the sample.
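As a rough sketch of how such a three-stage design might be declared, the commands below add one stage to the svyset syntax. The variable names district_id, wt_district, wt_school, wt_student, test_score, and the covariates are hypothetical and stand in for whatever the researcher's dataset actually contains.

```stata
* Hypothetical variable names; adjust to your own data.
* Declare districts, schools, and students as the three sampling stages,
* each with its own stage-level sampling weight.
svyset district_id, weight(wt_district)   ///
    || school_id, weight(wt_school)       ///
    || _n, weight(wt_student)

* Three-level model with district- and school-level random intercepts.
* meglm defaults to a Gaussian family with an identity link.
svy: meglm test_score female sei || district_id: || school_id:
```

The levels of the model (district, then school) are listed in the same order as the sampling stages, which is what lets each level pick up its own weight.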

 

Throughout Stata, analyzing complex survey data is as simple as using svyset to declare aspects of the survey design and then adding the svy: prefix to the estimation command for the model you want to fit. We can now use svyset and the svy: prefix when fitting multilevel models to survey data.

 

Let’s see it work

To demonstrate, we use a dataset arising from a two-stage sampling design. Here, schools are selected at the first stage. Then, students are sampled from within the selected schools. Our data contain sampling weights for both schools and students. We can type


. svyset school_id, weight(wt_school) || _n, weight(wt_student)

 

to specify that school_id and _n (the observation number) identify schools and students, the first- and second-stage sampling units. The school-stage sampling weight, wt_school, records the inverse of the probability that the school was included in the sample. The wt_student variable records the inverse of the probability that the student was included, conditional on the student’s school having already been selected.
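If the dataset recorded selection probabilities rather than weights, the weights could be built as their inverses before calling svyset. The sketch below assumes two hypothetical variables, p_school (probability the school was selected) and p_student (probability the student was selected, given the school was selected); neither appears in the dataset used here.

```stata
* Hypothetical inputs: p_school and p_student are inclusion probabilities.
* Each stage weight is the inverse of that stage's (conditional) probability.
generate double wt_school  = 1/p_school
generate double wt_student = 1/p_student

svyset school_id, weight(wt_school) || _n, weight(wt_student)

* Summarize the declared design: units per stage and observations per unit.
svydescribe
```

Keeping wt_student conditional on the school's selection, rather than multiplying the two weights together beforehand, matches the stage-by-stage definition given above.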

 

We are interested in the effects of sex, socioeconomic status, and speaking English at home on reading. We fit a two-level logit model for pass_read, which is coded as one if a student passes a reading proficiency threshold and zero otherwise. We allow for school-level random intercepts. To fit this model, we type


. svy: melogit pass_read female sei home_eng || school_id:

 

Because we specified the svy: prefix, the results from melogit are automatically adjusted for our survey design.


. svy: melogit pass_read female sei home_eng || school_id:
(running melogit on estimation sample)


Survey: Mixed-effects logistic regression

Number of strata   =         1                  Number of obs     =      2,069
Number of PSUs     =       148                  Population size   = 346,373.74
                                                Design df         =        147
                                                F(   3,    145)   =      21.03
                                                Prob > F          =     0.0000


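------------------------------------------------------------------------------
             |             Linearized
   pass_read |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   .6008465   .1536047     3.91   0.000     .2972878    .9044052
         sei |   .0311463   .0047519     6.55   0.000     .0217554    .0405373
    home_eng |   1.005684   .3315877     3.03   0.003     .3503888    1.660978
       _cons |  -3.517315   .4169515    -8.44   0.000    -4.341308   -2.693321
-------------+----------------------------------------------------------------
school_id    |
   var(_cons)|   .5348872   .2409983                      .2195645    1.303054
------------------------------------------------------------------------------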

 

We find that being female, higher socioeconomic status, and speaking English at home are all associated with a higher probability of passing the reading proficiency threshold. We also find a moderate amount of variation across schools: the estimated variance of the school-level random intercepts is 0.535.
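The coefficients above are on the log-odds scale, so it can help to look at them as odds ratios and to put the school-level variance on a more interpretable footing. The sketch below is one way to do that, assuming the data are still svyset as before. The intraclass correlation uses the common latent-response formula for logit models, var/(var + pi^2/3), which is our addition and not something reported in the output above; with the estimated variance it comes to roughly 0.14.

```stata
* Refit with the -or- option to report odds ratios instead of coefficients.
svy: melogit pass_read female sei home_eng || school_id:, or

* Odds ratio for female, computed by hand from the stored coefficient.
display exp(_b[female])

* Latent-scale intraclass correlation, using the variance estimate
* copied from the output (.5348872) and a level-1 variance of pi^2/3.
display .5348872 / (.5348872 + _pi^2/3)
```

Read this way, roughly 14% of the variation in the latent propensity to pass is attributable to differences between schools, under the assumptions of the logit model.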

 

We demonstrated how to analyze survey data with a multilevel logit model. Stata’s commands for fitting multilevel probit, complementary log-log, ordered logit, ordered probit, Poisson, negative binomial, parametric survival, and generalized linear models also support complex survey data.
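Switching to one of these other models only means changing the estimation command; the survey declaration stays the same. As an illustration, and assuming the same svyset as above, the same outcome could be modeled with a probit or complementary log-log link:

```stata
* Same design declaration as before.
svyset school_id, weight(wt_school) || _n, weight(wt_student)

* Two-level probit with school-level random intercepts.
svy: meprobit pass_read female sei home_eng || school_id:

* Two-level complementary log-log model with the same structure.
svy: mecloglog pass_read female sei home_eng || school_id:
```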

 

gsem can also fit multilevel models, and it extends the type of models that can be fit in many ways. For instance, gsem can fit multilevel multinomial logit models, multivariate multilevel models, and multilevel structural equation models. gsem also supports estimation with complex survey data.
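For comparison, here is a sketch of how the two-level logit model from this example might be written in gsem syntax, assuming the data are svyset as above. The latent variable name M1 is an arbitrary label we chose for the school-level random intercept.

```stata
* M1[school_id] is a latent variable that varies at the school level,
* which plays the role of the random intercept in melogit.
svy: gsem (pass_read <- female sei home_eng M1[school_id]), logit
```

The path notation is more verbose for a simple random-intercept model, but the same notation is what allows gsem to express multivariate and structural equation extensions.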