What’s this about?

Neighboring towns have more influence on each other than on towns far away. The same is true of countries that are close to each other and of closely connected friends on social media.

Spatial autoregressive models are fit using datasets that contain observations on geographical areas. Observations are called spatial units and might be countries, states, counties, postal codes, or city blocks. Alternatively, they might not be geographically based at all; they could be nodes of a social network.

Datasets contain a continuous outcome variable—such as incidence of disease, output of farms, or crime rate—along with other variables to predict the outcome. For cross-sectional data, each variable has one value per spatial unit. For panel data, there are typically multiple values for different time points.

There is a new manual entirely devoted to fitting SAR models, working with spatial data, and creating and managing spatial weighting matrices. The new commands are called the Sp commands. See the Spatial Autoregressive Models Reference Manual.

Let’s see it work

There are three steps to fitting SAR models:

Getting your data ready for analysis.

Creating the spatial weighting matrices your model needs.

Running your SAR model.

 

Stata’s Sp commands will work with or without shapefiles, files commonly used to define maps. They will work with other location data or even work with data without locations at all, such as social network data.

Here’s an analysis from start to finish.

 

Step 1. Find and translate shapefile

You do not have to use shapefiles, but we will. We downloaded a standard-format shapefile, tl_2016_us_county.zip, that we found at https://www.census.gov/geo/maps-data/data/tiger.html.

We copied the file to our current directory, then typed in Stata:

 

. unzipfile tl_2016_us_county.zip
  (output omitted)


. spshape2dta tl_2016_us_county
  (importing .shp file)
  (importing .dbf file)
  (creating _ID spatial-unit id)
  (creating _CX coordinate)
  (creating _CY coordinate)

  file tl_2016_us_county_shp.dta created
  file tl_2016_us_county.dta     created

spshape2dta did its magic and created two Stata datasets for us. One is a Stata-format shapefile:

 

tl_2016_us_county_shp.dta

 

The other is a Stata dataset containing the other data that were in the shapefile bundle:

 

tl_2016_us_county.dta

 

spshape2dta also linked the two files.

 

Step 2: Prepare the data for analysis

We have our own analysis data for these counties in texas_ue.dta. We could just use them and skip the shapefile, but our data do not have the coordinates of the counties. We could not calculate distances or find neighbors. We could not do an SAR analysis. That’s why we got the shapefile from the U.S. Census; it will provide all of this information.

We will merge our data with tl_2016_us_county.dta. We will first create an ID variable for merging the files. We also tell Sp that the Census provided the coordinates in latitude and longitude and that we want distances reported in miles.

 

. use tl_2016_us_county


. generate long fips = real(STATEFP + COUNTYFP)


. spset fips, modify replace
(output omitted)

. spset, modify coordsys(latlong, miles)
  Sp dataset tl_2016_us_county.dta
                data:  cross sectional
     spatial-unit id:  _ID (equal to fips)
         coordinates:  _CY, _CX (latitude-and-longitude, miles)
    linked shapefile:  tl_2016_us_county_shp.dta

. save, replace
file tl_2016_us_county.dta saved

. use texas_ue, clear


. merge 1:1 fips using tl_2016_us_county, keep(match)
(output omitted)

Our data are ready for analysis.

Step 3: Creating the spatial weighting matrices

We plan on fitting a model with spatial lags of the dependent variable, spatial lags of a covariate, and spatial autoregressive errors. Spatial lags are defined by spatial weighting matrices.

We will use one matrix for the variables and another for the errors.

Sp provides many ways to create spatial weighting matrices. We will use just two of its predefined formulations:

 

. spmatrix create contiguity W

. spmatrix create idistance M

. spmatrix dir

Weighting matrix name N x N Type Normalization
M 254 x 254 idistance spectral
W 254 x 254 contiguity spectral

 

We created W to be a contiguity matrix based on nearest neighbors.

We created M to be the inverse of the distance between counties.

We let Sp perform its default normalization, which is spectral (largest eigenvalue). We could have chosen row or min–max normalization.

 

Sp also provides commands that let you create custom weighting matrices. You can create them from Stata data by writing Mata code or by importing them from a text file. The Mata capability is of special interest because it is so easy to use.

 

Step 4: Fitting the model

We have prepared our data.

We have defined the spatial weighting matrices we need.

We can now fit our model.

 

. spregress unemployment college, gs2sls dvarlag(W) ivarlag(W:college) errorlag(M)


Spatial autoregressive model                    Number of obs     =        254
GS2SLS estimates                                Wald chi2(3)      =      43.52
                                                Prob > chi2       =     0.0000
                                                Pseudo R2         =     0.2081

unemployment Coef. Std. Err. z P>|z| [95% Conf. Interval]
unemployment
college -.067419 .0122785 -5.49 0.000 -.0914843 -.0433536
_cons 5.715733 .4246597 13.46 0.000 4.883415 6.54805
W
college -.0424388 .0213227 -1.99 0.047 -.0842306 -.000647
unemployment .2058481 .0961201 2.14 0.032 .0174562 .39424
W
e.unemploy~t 3.247298 1.369204 2.37 0.018 .5637078 5.930888
Wald test of spatial terms: chi2(3) = 9.77 Prob > chi2 = 0.0206

 

We used the generalized spatial two-stage least-squares (GS2SLS) estimator. The GS2SLS estimator lets us fit multiple spatial lags, potentially allowing us to better approximate the true spatial dependence. Alternatively, we could have used a maximum likelihood estimator to fit the model.

Our dependent variable is the unemployment rate (unemployment) and we think that unemployment is affected by the proportion of the adult population who hold college degrees (college).

 

Step 5: Interpreting the model

If you are familiar with SAR models, you know they are difficult to interpret because the coefficients are a combination of direct and indirect effects. Direct effects are the effects of the spatial unit on itself. Indirect effects are the effects spatial units have on other spatial units, also known as spillover effects.

 

New command estat impact splits out those effects.

 

. estat impact

Average impacts                                 Number of obs     =        254

Delta-Method
dy/dx Std. Err. z P>|z| [95% Conf. Interval]
direct
college -.0690788 .0121434 -5.69 0.000 -.0928795 -.0452781
indirect
college -.0586373 .0299383 -1.96 0.050 -.1173152 .0000407
total
college -.1277161 .0320147 -3.99 0.000 -.1904638 -.0649684

 

estat impact is essential for interpreting the results of SAR models and works after all Sp estimators. We see that a 1-percentage point increase in those holding college degrees in a county reduces unemployment by 0.07 percentage points in that same county, the direct effect. The spillover effect to neighboring counties is almost as large—a 0.06 percentage point expected reduction in unemployment.

 

Tell me more

Learn more about Stata’s spatial autogressive models features.

There is an entire manual dedicated to SAR, and it has friendly introductions to the subject. See the new Spatial Autoregressive Models Reference Manual.