Required current contributed CRAN packages:

spatstat sf gstat spdep tmap spatialreg igraph hglm metafor sp spData RANN

and spDataLarge, installed with install.packages("spDataLarge", repos = "https://nowosad.github.io/drat/", type = "source").

Script

Script at https://raw.githubusercontent.com/edzer/UseR2019/master/part2.R. Download it to a suitable location and use it as a basis.

Is spatial autocorrelation just poor data collection design and/or model mis-specification?

Spatial modelling: why?

  • If we want to detect and classify patterns (ML), infer about covariates, or interpolate from a fitted model, we need models that take account of the possible interdependency of spatial, time series and spatio-temporal observations

  • Here we are focussing on the spatial case; time differs in having a direction of flow, and spatio-temporal processes are necessarily more complex, especially when non-separable

  • We will not look at machine learning issues, although the partition into training and test sets raises important spatial questions for tuning (Schratz et al. 2019)

Spatial modelling: which kinds?

  • Spatial point processes are most closely tied to the relative positions of the observations, but may accommodate inhomogeneities

  • Geostatistics is concerned with interpolating values of a variable of interest at unobserved locations; the relative positions of the observations contribute through the modelling of a function of the differences in observed values of that variable between observations at increasing distances from each other

  • Disease mapping, spatial econometrics/regression and the application of generalized linear mixed models (GLMM, GAMM) in, for example, ecology are more interested in inference about the spatial processes in play and the included covariates; some approaches may use distance-based spatial correlation functions, others graph or lattice neighbourhoods

When is a “spatial” process actually induced by the analyst?

  • Sometimes when we use spatial data, we jump to spatial statistical methods because of the widely cited first law of geography, that nearby observations are more likely to be similar than those further away

  • But maybe what we see as clustering, patchiness or pattern that looks spatial is actually mis-specification, such as a missing covariate, and/or an inappropriate functional form, and/or variables acting at different scales, and/or consequences of suboptimal bounding of tessellated observations …

  • Here, we’ll first look at the consequences of treating inhomogeneous points as homogeneous (the intensity trends upwards with x)

  • Then we’ll use those points to see what happens when we add a similar trend to a random variable; empirical variograms are constructed ignoring and including the trend, and tests for global and local spatial autocorrelation are calculated

Spatial point processes

We can start by using spatstat (Baddeley, Rubak, and Turner 2015) to generate completely spatially random (CSR) points with intensity increasing with x, introducing trend inhomogeneity in a unit square (note the use of set.seed()):

suppressPackageStartupMessages(library(spatstat))
intenfun <- function(x, y) 200 * x
set.seed(1)
(ihpp <- rpoispp(intenfun, lmax = 200))
## Planar point pattern: 95 points
## window: rectangle = [0, 1] x [0, 1] units
plot(density(ihpp), axes=TRUE)
points(ihpp, col="green2", pch=19)

We can use \(\hat{K}\) tests ignoring and including inhomogeneity to adapt the test to the underlying data generation process. The homogeneous test examines the divergence from theoretical CSR in distance bins, while the inhomogeneous test tries to take account of the kind of patterning in the relative positions of the point observations. If we ignore the inhomogeneity, we find significant clustering at most distances, with the opposite finding when using the test attempting to accommodate inhomogeneity:

opar <- par(mfrow=c(1,2))
plot(envelope(ihpp, Kest, verbose=FALSE), main="Homogeneous")
plot(envelope(ihpp, Kinhom, verbose=FALSE), main="Inhomogeneous")

par(opar)

Adding a y trend variable

Coercing the spatstat "ppp" object to an sf "sf" object is trivial, but the coercion also carries over the window of the point process, which we need to drop by retaining only the features labelled "point". Then we take the x coordinate and use it to create y, which trends with x; the DGP is not very noisy.

library(sf)
## Linking to GEOS 3.7.2, GDAL 3.0.0, PROJ 6.1.1
#st_as_sf(ihpp) %>% dplyr::filter(label == "point") -> sf_ihpp
sf_ihpp <- st_as_sf(ihpp)                        # coercion includes the window as a feature
sf_ihpp <- sf_ihpp[sf_ihpp$label == "point", ]   # retain only the points
crds <- st_coordinates(sf_ihpp)
sf_ihpp$x <- crds[,1]
sf_ihpp$y <- 100 + 50 * sf_ihpp$x + 20 * rnorm(nrow(sf_ihpp))
plot(sf_ihpp[,"y"], pch=19)

Variogram model

Variograms for models ignoring (y ~ 1) and including (y ~ x) the trend; if we ignore the trend, we find spurious relationships, shown by the loess() fits:

suppressPackageStartupMessages(library(gstat))
vg0 <- variogram(y ~ 1, sf_ihpp)
vg1 <- variogram(y ~ x, sf_ihpp)
opar <- par(mfrow=c(1,2))
plot(gamma ~ dist, vg0, ylim=c(0,550), main="Trend ignored")
s <- seq(0, 0.5, 0.01)
p0 <- predict(loess(gamma ~ dist, vg0, weights=np), newdata=data.frame(dist=s), se=TRUE)
lines(s, p0$fit)
lines(s, p0$fit-2*p0$se.fit, lty=2)
lines(s, p0$fit+2*p0$se.fit, lty=2)
plot(gamma ~ dist, vg1, ylim=c(0,550), main="Trend included")
p1 <- predict(loess(gamma ~ dist, vg1, weights=np), newdata=data.frame(dist=s), se=TRUE)
lines(s, p1$fit)
lines(s, p1$fit-2*p1$se.fit, lty=2)
lines(s, p1$fit+2*p1$se.fit, lty=2)

par(opar)

Spatial autocorrelation

Going on from a continuous to a discrete treatment of space, we can use functions in spdep to define neighbours and then test for global and local spatial autocorrelation. First making a triangulated neighbour object, we can thin out improbable neighbours lying too far from one another, using a sphere of influence graph to make a symmetric neighbour object:

suppressPackageStartupMessages(library(spdep))
nb_tri <- tri2nb(crds)
(nb_soi <- graph2nb(soi.graph(nb_tri, crds), sym=TRUE))
## Neighbour list object:
## Number of regions: 95 
## Number of nonzero links: 302 
## Percentage nonzero weights: 3.34626 
## Average number of links: 3.178947

The sphere of influence graph generates three subgraphs; a bit unfortunate, but we can live with that (for now):

plot(nb_soi, crds)

comps <- n.comp.nb(nb_soi)
sf_ihpp$comps <- comps$comp.id
comps$nc
## [1] 3
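
Since the component ids were just added to sf_ihpp, a quick sketch (using the base plot method for sf objects) colours the points by subgraph membership:

plot(sf_ihpp[, "comps"], pch=19)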

Using binary weights, and looking at the variable while ignoring the trend, we find strong global spatial autocorrelation using global Moran’s \(I\) (tests return "htest" objects, tidied here using broom):

lwB <- nb2listw(nb_soi, style="B")
out <- broom::tidy(moran.test(sf_ihpp$y, listw=lwB, randomisation=FALSE, alternative="two.sided"))[1:5]
names(out)[1:3] <- c("Moran's I", "Expectation", "Variance"); out
## # A tibble: 1 x 5
##   `Moran's I` Expectation Variance statistic    p.value
##         <dbl>       <dbl>    <dbl>     <dbl>      <dbl>
## 1       0.342     -0.0106  0.00633      4.44 0.00000917

If, however, we take the residuals after including the trend using a linear model, the apparent spatial autocorrelation evaporates

lm_obj <- lm(y ~ x, data=sf_ihpp)
out <- broom::tidy(lm.morantest(lm_obj, listw=lwB, alternative="two.sided"))[1:5]
names(out)[1:3] <- c("Moran's I", "Expectation", "Variance"); out
## # A tibble: 1 x 5
##   `Moran's I` Expectation Variance statistic[,1] p.value[,1]
##         <dbl>       <dbl>    <dbl>         <dbl>       <dbl>
## 1      0.0579     -0.0199  0.00623         0.986       0.324

The same happens if we calculate local Moran’s \(I\) ignoring the trend (randomisation assumption), and including the trend (normality assumption and saddlepoint approximation)

lmor0 <- localmoran(sf_ihpp$y, listw=lwB, alternative="two.sided")
lmor1 <- as.data.frame(localmoran.sad(lm_obj, nb=nb_soi, style="B", alternative="two.sided"))
sf_ihpp$z_value <- lmor0[,4]
sf_ihpp$z_lmor1_N <- lmor1[,2]
sf_ihpp$z_lmor1_SAD <- lmor1[,4]
suppressPackageStartupMessages(library(tmap))
tm_shape(sf_ihpp) + tm_symbols(col=c("z_value", "z_lmor1_N", "z_lmor1_SAD"), midpoint=0) + tm_facets(free.scales=FALSE, nrow=1) + tm_layout(panel.labels=c("No trend", "Trend, normal", "Trend, SAD"))

Exercise + review

Try doing this in pairs or small groups and discuss what is going on; the underlying ideas are more important than the code.

Look at the North Carolina SIDS data vignette for background:

library(sf)
nc <- st_read(system.file("shapes/sids.shp", package="spData")[1], quiet=TRUE)
st_crs(nc) <- "+proj=longlat +datum=NAD27"
row.names(nc) <- as.character(nc$FIPSNO)
head(nc)
## Simple feature collection with 6 features and 22 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -81.74107 ymin: 36.07282 xmax: -75.77316 ymax: 36.58965
## epsg (SRID):    4267
## proj4string:    +proj=longlat +datum=NAD27 +no_defs
##       CNTY_ID  AREA PERIMETER CNTY_        NAME  FIPS FIPSNO CRESS_ID
## 37009    1825 0.114     1.442  1825        Ashe 37009  37009        5
## 37005    1827 0.061     1.231  1827   Alleghany 37005  37005        3
## 37171    1828 0.143     1.630  1828       Surry 37171  37171       86
## 37053    1831 0.070     2.968  1831   Currituck 37053  37053       27
## 37131    1832 0.153     2.206  1832 Northampton 37131  37131       66
## 37091    1833 0.097     1.670  1833    Hertford 37091  37091       46
##       BIR74 SID74 NWBIR74 BIR79 SID79 NWBIR79 east north      x       y
## 37009  1091     1      10  1364     0      19  164   176 -81.67 4052.29
## 37005   487     0      10   542     3      12  183   182 -50.06 4059.70
## 37171  3188     5     208  3616     6     260  204   174 -16.14 4043.76
## 37053   508     1     123   830     2     145  461   182 406.01 4035.10
## 37131  1421     9    1066  1606     3    1197  385   176 281.10 4029.75
## 37091  1452     7     954  1838     5    1237  411   176 323.77 4028.10
##             lon      lat L_id M_id                       geometry
## 37009 -81.48594 36.43940    1    2 MULTIPOLYGON (((-81.47276 3...
## 37005 -81.14061 36.52443    1    2 MULTIPOLYGON (((-81.23989 3...
## 37171 -80.75312 36.40033    1    2 MULTIPOLYGON (((-80.45634 3...
## 37053 -76.04892 36.45655    1    4 MULTIPOLYGON (((-76.00897 3...
## 37131 -77.44057 36.38799    1    4 MULTIPOLYGON (((-77.21767 3...
## 37091 -76.96474 36.38189    1    4 MULTIPOLYGON (((-76.74506 3...

The variables are largely count data; L_id and M_id are grouping variables. We can also read the original neighbour object:

library(spdep)
gal_file <- system.file("weights/ncCR85.gal", package="spData")[1]
ncCR85 <- read.gal(gal_file, region.id=nc$FIPSNO)
ncCR85
## Neighbour list object:
## Number of regions: 100 
## Number of nonzero links: 492 
## Percentage nonzero weights: 4.92 
## Average number of links: 4.92
plot(st_geometry(nc), border="grey")
plot(ncCR85, st_centroid(st_geometry(nc), of_largest_polygon), add=TRUE, col="blue")
## Warning in st_centroid.sfc(st_geometry(nc), of_largest_polygon):
## st_centroid does not give correct centroids for longitude/latitude data

Now generate a random variable. Here I’ve set the seed; maybe choose your own, and compare outcomes with the people around you. With many people in the room, about 5 in 100 may get a draw that is autocorrelated when tested with Moran’s \(I\) (why? a sketch follows the output below).

set.seed(1)
nc$rand <- rnorm(nrow(nc))
lw <- nb2listw(ncCR85, style="B")
moran.test(nc$rand, listw=lw, alternative="two.sided")
## 
##  Moran I test under randomisation
## 
## data:  nc$rand  
## weights: lw    
## 
## Moran I statistic standard deviate = -0.82038, p-value = 0.412
## alternative hypothesis: two.sided
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      -0.060754158      -0.010101010       0.003812269
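
Why about 5 in 100? Under the null hypothesis, a two-sided test at the conventional 0.05 level rejects for roughly 5% of random draws by construction; a small simulation sketch (reusing the lw object above) confirms the test size:

set.seed(1)
p_vals <- replicate(500, moran.test(rnorm(nrow(nc)), listw=lw,
    alternative="two.sided")$p.value)
mean(p_vals < 0.05)   # close to the nominal 0.05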

Now we’ll create a trend (maybe try plotting LM to see the trend pattern). Do we get different test outcomes by varying beta and sigma (alpha is constant)?

nc$LM <- as.numeric(interaction(nc$L_id, nc$M_id))
alpha <- 1
beta <- 0.5
sigma <- 2
nc$trend <- alpha + beta*nc$LM + sigma*nc$rand
moran.test(nc$trend, listw=lw, alternative="two.sided")
## 
##  Moran I test under randomisation
## 
## data:  nc$trend  
## weights: lw    
## 
## Moran I statistic standard deviate = 6.1163, p-value = 9.58e-10
## alternative hypothesis: two.sided
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.368410659      -0.010101010       0.003829901

To get back to reality, include the trend in a linear model, and test again.

lm.morantest(lm(trend ~ LM, nc), listw=lw, alternative="two.sided")
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = trend ~ LM, data = nc)
## weights: lw
## 
## Moran I statistic standard deviate = -0.50788, p-value = 0.6115
## alternative hypothesis: two.sided
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     -0.049567147     -0.018718176      0.003689356

So we can manipulate a missing variable mis-specification to look like spatial autocorrelation. Is this informative?

Extra problem (own time after we’re done if you like):

Sometimes we only have data on a covariate for aggregates of our units of observation. What happens when we “copy out” these aggregate values to the less aggregated observations? First we’ll aggregate nc by LM, then make a neighbour object for the aggregate units

aggLM <- aggregate(nc[,"LM"], list(nc$LM), head, n=1)
(aggnb <- poly2nb(aggLM))
## Neighbour list object:
## Number of regions: 12 
## Number of nonzero links: 46 
## Percentage nonzero weights: 31.94444 
## Average number of links: 3.833333

Next, draw a random sample for the aggregated units:

set.seed(1)
LMrand <- rnorm(nrow(aggLM))

Check that it does not show any spatial autocorrelation:

moran.test(LMrand, nb2listw(aggnb, style="B"))
## 
##  Moran I test under randomisation
## 
## data:  LMrand  
## weights: nb2listw(aggnb, style = "B")    
## 
## Moran I statistic standard deviate = -0.24911, p-value = 0.5984
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       -0.13101880       -0.09090909        0.02592483

Copy it out to the full data set, indexing on values of LM; the pattern now looks pretty autocorrelated

nc$LMrand <- LMrand[match(nc$LM, aggLM$LM)]
plot(nc[,"LMrand"])

which it is:

moran.test(nc$LMrand, listw=lw, alternative="two.sided")
## 
##  Moran I test under randomisation
## 
## data:  nc$LMrand  
## weights: lw    
## 
## Moran I statistic standard deviate = 9.4301, p-value < 2.2e-16
## alternative hypothesis: two.sided
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.575091242      -0.010101010       0.003850884

Again, we’ve manipulated ourselves into a situation with abundant spatial autocorrelation at the level of the counties, but only by copying out from a more aggregated level. What is going on?

Graph/weight-based neighbours; spatially structured random effects

Graph/weight-based neighbours

  • Some spatial processes are best represented by decreasing functions of distance

  • Other spatial processes best represent proximity, contiguity, neighbourhood; the functions then relate to steps along edges on graphs of neighbours

  • Under certain conditions, the terms \(\rho^i {\mathbf W}^i\) of the power series decline in intensity as \(i\) increases from \(0\) (see the sketch after this list)

  • Here we will use areal support, corresponding to a proximity view of spatial processes (Wall 2004)
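
A minimal sketch of the power-series point, using a hypothetical 5-node ring graph: when \(\rho\) is smaller in absolute value than the reciprocal of the spectral radius of \({\mathbf W}\), the series \(\sum_{i=0}^{\infty} \rho^i {\mathbf W}^i\) converges to \(({\mathbf I} - \rho {\mathbf W})^{-1}\), and the terms fade as \(i\) grows:

W_ring <- matrix(0, 5, 5)
for (i in 1:5) {j <- i %% 5 + 1; W_ring[i, j] <- W_ring[j, i] <- 1}  # ring contiguity
rho <- 0.5 / max(abs(eigen(W_ring)$values))   # safely inside the convergence domain
Wi <- diag(5); sizes <- numeric(8)
for (i in 1:8) {Wi <- Wi %*% W_ring; sizes[i] <- max(abs(rho^i * Wi))}
round(sizes, 4)   # declining towards zero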

Polish presidential election 2015 data set

We’ll use a typical moderate-sized data set of Polish municipalities (including boroughs in the capital, Warsaw), included in spDataLarge and used in Geocomputation with R (Lovelace, Nowosad, and Muenchow 2019):

library(sf)
# if(!require("spData", quietly=TRUE)) install.packages("spData")
# if(!require("spDataLarge", quietly=TRUE)) install.packages("spDataLarge",
#   repos = "https://nowosad.github.io/drat/", type = "source")
data(pol_pres15, package="spDataLarge")
head(pol_pres15[, c(1, 4, 6)])
## Simple feature collection with 6 features and 3 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 235157.1 ymin: 366913.3 xmax: 281431.7 ymax: 413016.6
## epsg (SRID):    NA
## proj4string:    +proj=tmerc +lat_0=0 +lon_0=18.99999999999998 +k=0.9993 +x_0=500000 +y_0=-5300000 +ellps=GRS80 +towgs84=0,0,0 +units=m +no_defs
##    TERYT                name       types                       geometry
## 1 020101         BOLESŁAWIEC       Urban MULTIPOLYGON (((261089.5 38...
## 2 020102         BOLESŁAWIEC       Rural MULTIPOLYGON (((254150 3837...
## 3 020103            GROMADKA       Rural MULTIPOLYGON (((275346 3846...
## 4 020104        NOWOGRODZIEC Urban/rural MULTIPOLYGON (((251769.8 37...
## 5 020105          OSIECZNICA       Rural MULTIPOLYGON (((263423.9 40...
## 6 020106 WARTA BOLESŁAWIECKA       Rural MULTIPOLYGON (((267030.7 38...

The dataset has 2495 observations with areal support on 65 variables, most of them counts of election results aggregated from the election returns by polling station. See:

?spDataLarge::pol_pres15

for details.

Contiguous, proximate neighbours

The spdep function poly2nb() finds proximate, contiguous neighbours by finding at least one (queen) or at least two (rook) boundary points within a snap distance of each other, for polygons in a candidate list based on intersecting bounding boxes:

suppressPackageStartupMessages(library(spdep))
system.time(nb_q <- poly2nb(pol_pres15, queen=TRUE))
##    user  system elapsed 
##   1.520   0.016   1.542

The object returned is very sparse; the print method reports asymmetric objects and any observations with no neighbours.

nb_q
## Neighbour list object:
## Number of regions: 2495 
## Number of nonzero links: 14242 
## Percentage nonzero weights: 0.2287862 
## Average number of links: 5.708216
opar <- par(mar=c(0,0,0,0)+0.5)
plot(st_geometry(pol_pres15), border="grey", lwd=0.5)
coords <- st_centroid(st_geometry(pol_pres15),
  of_largest_polygon=TRUE)
plot(nb_q, coords=st_coordinates(coords), add=TRUE, points=FALSE, lwd=0.5)

par(opar)

We can use GEOS through sf to improve the speed of detection of candidate neighbours, by building a geometry column of bounding boxes of the underlying polygons and finding which intersect. We pass this object to poly2nb() through the foundInBox= argument. This may be useful for larger data sets.

system.time({
  fB1 <- st_intersects(st_as_sfc(lapply(st_geometry(pol_pres15), function(x) {
    st_as_sfc(st_bbox(x))[[1]]
  })))
  fB1a <- lapply(seq_along(fB1), function(i) fB1[[i]][fB1[[i]] > i])
  fB1a <- fB1a[-length(fB1a)]
  nb_sf_q1 <- poly2nb(pol_pres15, queen=TRUE, foundInBox=fB1a)
})
##    user  system elapsed 
##   0.962   0.022   0.988

The two neighbour objects are the same:

all.equal(nb_q, nb_sf_q1, check.attributes=FALSE)
## [1] TRUE

We can further check the object for the number of distinct graph components; there is only one, so each observation can be reached from every other observation by traversing the graph edges:

n.comp.nb(nb_q)$nc
## [1] 1

The "listw" spatial weights object needed for testing or modelling can be created using nb2listw(), here with binary weights, \(1\) for first-order contiguity, \(0\) for not a neighbour:

lw <- nb2listw(nb_q, style="B")
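
Row standardisation is the frequently used alternative; a one-line sketch (style="W" divides each neighbour's weight by the observation's neighbour count; we continue with the binary lw below):

lw_W <- nb2listw(nb_q, style="W")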

We can coerce this further to a sparse matrix as defined in the Matrix package:

library(Matrix)
W <- as(lw, "CsparseMatrix")

and check symmetry directly (t() is the transpose method):

isTRUE(all.equal(W, t(W)))
## [1] TRUE

Taking powers of the sparse matrix fills in the higher-order neighbours:

image(W)

WW <- W %*% W
image(WW)

W3 <- WW %*% W
image(W3)

W4 <- W3 %*% W
image(W4)
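
The fill-in visible in the image plots can be quantified; a short sketch using nnzero() from Matrix to report the proportion of nonzero entries at each power:

sapply(list(W1=W, W2=WW, W3=W3, W4=W4),
       function(M) nnzero(M) / prod(dim(M)))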

There is an spdep vignette considering the use of the igraph adjacency representation through sparse matrices as intermediate objects:

library(igraph)
g1 <- graph.adjacency(W, mode="undirected")
class(g1)
## [1] "igraph"

We can get an "nb" neighbour object back from the adjacency representation:

B1 <- get.adjacency(g1)
mat2listw(B1)$neighbours
## Neighbour list object:
## Number of regions: 2495 
## Number of nonzero links: 14242 
## Percentage nonzero weights: 0.2287862 
## Average number of links: 5.708216

There is a single graph component:

c1 <- clusters(g1)
c1$no
## [1] 1

The graph is connected:

is.connected(g1)
## [1] TRUE

with diameter:

dg1 <- diameter(g1)
dg1
## [1] 52

The diameter is the longest of the shortest paths across the graph, where shortest.paths() returns a matrix of the edge counts on shortest paths between all pairs of observations:

sp_mat <- shortest.paths(g1)
max(sp_mat)
## [1] 52
str(sp_mat)
##  num [1:2495, 1:2495] 0 1 2 2 2 2 10 10 9 10 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:2495] "1" "2" "3" "4" ...
##   ..$ : chr [1:2495] "1" "2" "3" "4" ...

So graphs really do contain a lot of structural information, rendering most inverse distance weighting (IDW) cases superfluous.

Spatially structured random effects

  • The hglm hierarchical generalized linear model package uses the "CsparseMatrix" version of spatial weights for fitting "SAR" or "CAR" spatially structured random effects

  • The CARBayes package appears to take dense matrices using listw2mat() or nb2mat(), preferring binary matrices

  • The "mrf" random effect in the mgcv package for use with gam() takes an "nb" object directly

  • The R2BayesX::nb2gra() function converts an "nb" neighbour object into a graph for use with BayesX through R2BayesX, but creates a dense matrix; the graph is passed as the map= argument to the "mrf" structured random effect

gra <- R2BayesX::nb2gra(ncCR85)
str(gra)
##  'gra' num [1:100, 1:100] 3 -1 0 0 0 0 0 0 0 0 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:100] "37009" "37005" "37171" "37053" ...
##   ..$ : chr [1:100] "37009" "37005" "37171" "37053" ...

The nb2INLA() function in spdep writes a file that INLA can use through the INLA package, used with the "besag" model, for example:

tf <- tempfile()
nb2INLA(tf, ncCR85)
file.size(tf)
## [1] 1945
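
The file is plain text; a quick peek with base readLines() shows the layout INLA expects (as far as I understand it, the number of regions on the first line, then for each region its id, neighbour count and neighbour ids):

readLines(tf, n=3)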

Exercise + review

What does it mean that this neighbour object is not symmetric?

coords <- st_centroid(st_geometry(nc), of_largest_polygon)
(knn_5_nb <- knn2nb(knearneigh(st_coordinates(coords), k=5)))
## Neighbour list object:
## Number of regions: 100 
## Number of nonzero links: 500 
## Percentage nonzero weights: 5 
## Average number of links: 5 
## Non-symmetric neighbours list
klw <- nb2listw(knn_5_nb, style="B")
kW <- as(klw, "CsparseMatrix")
isTRUE(all.equal(kW, t(kW)))
## [1] FALSE
image(kW)

We need to recall that non-symmetric neighbours give directed graphs; if we forget and treat the graph as undirected, the extra edges get added (try it; a sketch follows the output below):

library(igraph)
g1 <- graph.adjacency(kW, mode="directed")
B1 <- get.adjacency(g1)
mat2listw(B1)$neighbours
## Neighbour list object:
## Number of regions: 100 
## Number of nonzero links: 500 
## Percentage nonzero weights: 5 
## Average number of links: 5 
## Non-symmetric neighbours list
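
As a sketch of “try it”: spdep::make.sym.nb() adds the missing reverse links, after which the symmetry check from above passes:

knn_5_sym <- make.sym.nb(knn_5_nb)
knn_5_sym                 # typically more than the 500 links above
kW_sym <- as(nb2listw(knn_5_sym, style="B"), "CsparseMatrix")
isTRUE(all.equal(kW_sym, t(kW_sym)))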

Use igraph functions to explore such graphs, here continuing with g1 (try the same with the ncCR85 "nb" object):

diameter(g1)
## [1] 20
nc_sps <- shortest.paths(g1)
mr <- which.max(apply(nc_sps, 2, max))
nc$sps1 <- nc_sps[,mr]
plot(nc[,"sps1"], breaks=0:21)

plot(nc$sps1, c(st_distance(coords[mr], coords))/1000, xlab="shortest path count", ylab="km distance")

Discuss whether measured distance is really needed to express proximity; the graph shortest paths look fine.

Make objects from the imported neighbours for BayesX and INLA, and as a sparse and dense matrix:

ncCR85a <- ncCR85
attr(ncCR85a, "region.id") <- as.character(nc$CRESS_ID)
nc_gra <- R2BayesX::nb2gra(ncCR85a)
nc_tf <- tempfile()
nb2INLA(nc_tf, ncCR85)
nc_lw <- nb2listw(ncCR85, style="B")
nc_W <- as(nc_lw, "CsparseMatrix")
nc_mat <- listw2mat(nc_lw)

Simultaneous and conditional autoregressive approaches

Let’s start by seeing whether there is any spatial autocorrelation in the SIDS rate (modified Freeman-Tukey square root transformation):
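
Written out, the transformation applied in the code below to a count \(c\) out of \(n\) births is \(\sqrt{1000}\,(\sqrt{c/n} + \sqrt{(c+1)/n})\), used here for both SID74 and NWBIR74: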

nc$ft.SID74 <- sqrt(1000)*(sqrt(nc$SID74/nc$BIR74) + sqrt((nc$SID74+1)/nc$BIR74))
nc$ft.NWBIR74 <- sqrt(1000)*(sqrt(nc$NWBIR74/nc$BIR74) + sqrt((nc$NWBIR74+1)/nc$BIR74))
tm_shape(nc) + tm_fill(c("ft.SID74", "ft.NWBIR74"))

plot(ft.SID74 ~ ft.NWBIR74, nc)

First, Moran’s \(I\) with just the intercept as the mean model:

moran.test(nc$ft.SID74, nc_lw, alternative="two.sided", randomisation=FALSE)
## 
##  Moran I test under normality
## 
## data:  nc$ft.SID74  
## weights: nc_lw    
## 
## Moran I statistic standard deviate = 3.5842, p-value = 0.0003381
## alternative hypothesis: two.sided
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.211279611      -0.010101010       0.003814926

Now for a mean model accommodating case weights:

lm.morantest(lm(ft.SID74 ~ 1, weights=BIR74, data=nc), nc_lw, alternative="two.sided")
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = ft.SID74 ~ 1, data = nc, weights = BIR74)
## weights: nc_lw
## 
## Moran I statistic standard deviate = 4.5965, p-value = 4.297e-06
## alternative hypothesis: two.sided
## sample estimates:
## Observed Moran I      Expectation         Variance 
##      0.273332891     -0.009631207      0.003789801

Next just with the original covariate (transformed non-white births as a proportion of all births):

lm.morantest(lm(ft.SID74 ~ ft.NWBIR74, data=nc), nc_lw, alternative="two.sided")
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = ft.SID74 ~ ft.NWBIR74, data = nc)
## weights: nc_lw
## 
## Moran I statistic standard deviate = 1.2148, p-value = 0.2245
## alternative hypothesis: two.sided
## sample estimates:
## Observed Moran I      Expectation         Variance 
##      0.057042251     -0.016844710      0.003699627

Finally with the covariate and case weights:

lm.morantest(lm(ft.SID74 ~ ft.NWBIR74, weights=BIR74, data=nc), nc_lw, alternative="two.sided")
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = ft.SID74 ~ ft.NWBIR74, data = nc, weights =
## BIR74)
## weights: nc_lw
## 
## Moran I statistic standard deviate = 1.7839, p-value = 0.07443
## alternative hypothesis: two.sided
## sample estimates:
## Observed Moran I      Expectation         Variance 
##      0.091350263     -0.016506328      0.003655419

So something spatial appears to be present, but how to model it?

Simultaneous autoregressive approaches

The approach taken by social scientists, including economists, and some others, has been to work through simultaneous autoregressive specifications, where the response is modelled using fixed covariates, and the residual process is modelled by optimising a log likelihood function. The spatialreg package provides spautolm() and errorsarlm():
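
As a reminder of what spautolm() with family="SAR" is fitting here, the simultaneous autoregressive error model can be written as

\[ y = X\beta + u, \qquad u = \lambda {\mathbf W} u + \varepsilon, \qquad \varepsilon \sim N({\mathbf 0}, \sigma^2 {\mathbf I}), \]

with \(\lambda\) the spatial coefficient reported as Lambda in the summaries below.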

library(spatialreg)
m1 <- spautolm(ft.SID74 ~ ft.NWBIR74, weights=BIR74, data=nc, listw=nc_lw, family="SAR")

When we include the covariate, the spatial error coefficient contributes little:

summary(m1)
## 
## Call: spautolm(formula = ft.SID74 ~ ft.NWBIR74, data = nc, listw = nc_lw, 
##     weights = BIR74, family = "SAR")
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.83149 -0.44934  0.09456  0.56278  2.90208 
## 
## Coefficients: 
##              Estimate Std. Error z value  Pr(>|z|)
## (Intercept) 1.5816687  0.2507427  6.3079 2.828e-10
## ft.NWBIR74  0.0366864  0.0069608  5.2704 1.361e-07
## 
## Lambda: 0.035437 LR test value: 1.787 p-value: 0.18129 
## Numerical Hessian standard error of lambda: 0.026096 
## 
## Log likelihood: -119.6704 
## ML residual variance (sigma squared): 1304.4, (sigma: 36.117)
## Number of observations: 100 
## Number of parameters estimated: 4 
## AIC: 247.34

It is unusual to present maps of the spatially structured random effect in cases where the simultaneous autoregressive approach is used, but it is fully possible using components of the returned model object (the vector is doubled for comparison with the CAR version):

nc$SAR_ssre <- 2*as.vector((m1$lambda * nc_W) %*% m1$Y - (m1$lambda * nc_W) %*% (m1$X %*% m1$fit$coefficients))
tm_shape(nc) + tm_fill(c("ft.SID74", "SAR_ssre"), midpoint=c(NA, 0))

The other maximum likelihood implementation gives the same results, but provides a Hausman test for shifts in the covariate coefficients between the aspatial and spatial estimates (Pace and LeSage 2008); there is none:

m1a <- errorsarlm(ft.SID74 ~ ft.NWBIR74, weights=BIR74, data=nc, listw=nc_lw)
summary(m1a, Hausman=TRUE)
## 
## Call:
## errorsarlm(formula = ft.SID74 ~ ft.NWBIR74, data = nc, listw = nc_lw, 
##     weights = BIR74)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.83149 -0.44934  0.09456  0.56278  2.90208 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##              Estimate Std. Error z value  Pr(>|z|)
## (Intercept) 1.5816687  0.2507427  6.3079 2.828e-10
## ft.NWBIR74  0.0366864  0.0069608  5.2704 1.361e-07
## 
## Lambda: 0.035437, LR test value: 1.787, p-value: 0.18129
## Asymptotic standard error: 0.029167
##     z-value: 1.2149, p-value: 0.22439
## Wald statistic: 1.4761, p-value: 0.22439
## 
## Log likelihood: -119.6704 for error model
## ML residual variance (sigma squared): 1304.4, (sigma: 36.117)
## Number of observations: 100 
## Number of parameters estimated: 4 
## AIC: 247.34, (AIC for lm: 247.13)
## Hausman test: 1.565, df: 2, p-value: 0.45726

It also lets us add Durbin= terms, that is spatially lagged covariates:

m1b <- errorsarlm(ft.SID74 ~ ft.NWBIR74, weights=BIR74, data=nc, listw=nc_lw, Durbin=TRUE)
summary(m1b, Hausman=TRUE)
## 
## Call:
## errorsarlm(formula = ft.SID74 ~ ft.NWBIR74, data = nc, listw = nc_lw, 
##     weights = BIR74, Durbin = TRUE)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.8705347 -0.4488357  0.0094254  0.4864939  2.7406880 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##                    Estimate  Std. Error z value  Pr(>|z|)
## (Intercept)      2.28212888  0.33548214  6.8025 1.028e-11
## ft.NWBIR74       0.03532229  0.00811453  4.3530 1.343e-05
## lag.(Intercept) -0.15336745  0.06073121 -2.5253   0.01156
## lag.ft.NWBIR74   0.00098902  0.00159099  0.6216   0.53418
## 
## Lambda: 0.028053, LR test value: 1.036, p-value: 0.30874
## Asymptotic standard error: 0.029824
##     z-value: 0.94064, p-value: 0.34689
## Wald statistic: 0.8848, p-value: 0.34689
## 
## Log likelihood: -115.1974 for error model
## ML residual variance (sigma squared): 1195.8, (sigma: 34.58)
## Number of observations: 100 
## Number of parameters estimated: 6 
## AIC: 242.39, (AIC for lm: 241.43)
## Hausman test: 1.8315, df: 4, p-value: 0.76671

and to present the by-covariate effects taking into account the unlagged (direct) and lagged (indirect) covariates and the sum of the coefficients (total; here \(0.0353 + (-0.1534) \approx -0.1180\)):

summary(impacts(m1b))
## Impact measures (SDEM, estimable, n):
##                Direct   Indirect      Total
## ft.NWBIR74 0.03532229 -0.1533675 -0.1180452
## ========================================================
## Standard errors:
##                 Direct   Indirect      Total
## ft.NWBIR74 0.008114531 0.06073121 0.06438103
## ========================================================
## Z-values:
##              Direct  Indirect     Total
## ft.NWBIR74 4.352968 -2.525348 -1.833539
## 
## p-values:
##            Direct     Indirect Total   
## ft.NWBIR74 1.3431e-05 0.011558 0.066722

However, our model may suffer from not using a mixed model approach to a count response; the simultaneous autoregressive models are mostly used with Gaussian responses. One possibility is to employ the hierarchical generalized linear model approach from the hglm package. First we’ll fit an unstructured IID (independent and identically distributed) random effect (Alam, Rönnegård, and Shen 2015):

library(hglm)
E <- nc$BIR74 * sum(nc$SID74)/sum(nc$BIR74)
HGLM_iid <- hglm(fixed=SID74 ~ ft.NWBIR74, random= ~ 1|CRESS_ID, offset=log(E), weights=BIR74,
                 data=nc, family=poisson(link=log))

The random effects also have their own standard errors, so we can order and display them with error bars in a forest or caterpillar plot:

ranef_iid <- unname(summary(HGLM_iid, print.ranef=TRUE)$RandCoefMat)
metafor::forest(ranef_iid[,1], ranef_iid[,2], subset=order(ranef_iid[,1], decreasing=TRUE), 
        slab=NA, annotate=FALSE, lty=c("solid","blank"), pch=19, psize=2, cex.lab=1, cex.axis=1)

We can also fit a weighted Poisson simultaneous autoregressive model, and examine the random effects:

HGLM_sar <- hglm(fixed=SID74 ~ ft.NWBIR74, random= ~ 1|CRESS_ID, offset=log(E), weights=BIR74, 
                 data=nc, family=poisson(link=log), rand.family=SAR(D=nc_W))
ranef_sar <- unname(summary(HGLM_sar, print.ranef=TRUE)$RandCoefMat)
metafor::forest(ranef_sar[,1], ranef_sar[,2], subset=order(ranef_sar[,1], decreasing=TRUE), 
        slab=NA, annotate=FALSE, lty=c("solid","blank"), pch=19, psize=2, cex.lab=1, cex.axis=1)

There is not much difference between the IID and the SAR spatially structured random effects:

nc$HGLM_re <- ranef_iid[,1]
nc$HGLM_ss_SAR <- ranef_sar[,1]
tm_shape(nc) + tm_fill(c("HGLM_re", "HGLM_ss_SAR"), midpoint=c(0), title="Poisson HGLM RE") +
  tm_facets(free.scales=FALSE) + tm_layout(panel.labels=c("IID", "SAR SSRE"))

Conditional and MRF autoregressive approaches

Most epidemiological applications use conditional autoregressive approaches, where some (like spautolm() and the hglm() implementations) fit a spatial coefficient, but many fit an intrinsic CAR. First the spautolm() and the hglm() implementations:

m1c <- spautolm(ft.SID74 ~ ft.NWBIR74, weights=BIR74, data=nc, listw=nc_lw, family="CAR")
summary(m1c)
## 
## Call: spautolm(formula = ft.SID74 ~ ft.NWBIR74, data = nc, listw = nc_lw, 
##     weights = BIR74, family = "CAR")
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -1.890805 -0.483471  0.070376  0.575906  2.824220 
## 
## Coefficients: 
##              Estimate Std. Error z value  Pr(>|z|)
## (Intercept) 1.4561763  0.2499558  5.8257 5.686e-09
## ft.NWBIR74  0.0410281  0.0069417  5.9103 3.414e-09
## 
## Lambda: 0.06028 LR test value: 1.6364 p-value: 0.20082 
## Numerical Hessian standard error of lambda: 0.053566 
## 
## Log likelihood: -119.7458 
## ML residual variance (sigma squared): 1302, (sigma: 36.083)
## Number of observations: 100 
## Number of parameters estimated: 4 
## AIC: 247.49

Again, we can calculate something that represents the spatial patterning of the spatial process (termed “signal” in the documentation of spatialreg::predict.sarlm), but it is not scaled in the same way as the hglm() random effects (and importantly we do not have standard errors):

nc$CAR_ssre <- as.vector((m1c$lambda * nc_W) %*% m1c$Y - 
                           (m1c$lambda * nc_W) %*% (m1c$X %*% m1c$fit$coefficients))
tm_shape(nc) + tm_fill(c("SAR_ssre", "CAR_ssre"), midpoint=c(0), title="Gauss ML RE") +
  tm_facets(free.scales=FALSE) + tm_layout(panel.labels=c("SAR SSRE", "CAR SSRE"))

Fitting the HGLM CAR model is just like the SAR model, and the forest plot of the spatially structured random effect is similar. Recall that spautolm() is fitting a Gaussian model, but hglm() is fitting a Poisson model, arguably better suited to count data. This means that the random effects will be on different scales:

HGLM_car <- hglm(fixed=SID74 ~ ft.NWBIR74, random= ~ 1|CRESS_ID, offset=log(E), weights=BIR74, 
                 data=nc, family=poisson(link=log), rand.family=CAR(D=nc_W))
ranef_car <- unname(summary(HGLM_car, print.ranef=TRUE)$RandCoefMat)
metafor::forest(ranef_car[,1], ranef_car[,2], subset=order(ranef_car[,1], decreasing=TRUE), 
        slab=NA, annotate=FALSE, lty=c("solid","blank"), pch=19, psize=2, cex.lab=1, cex.axis=1)

nc$HGLM_ss_CAR <- ranef_car[,1]
tm_shape(nc) + tm_fill(c("HGLM_ss_CAR", "HGLM_ss_SAR"), midpoint=c(0), title="Poisson HGLM RE") +
  tm_facets(free.scales=FALSE) + tm_layout(panel.labels=c("CAR SSRE", "SAR SSRE"))

To use a generalized additive mixed model (mgcv::gam() with an "mrf" random effect), and some other mixed models, the areal entities need to be grouped (done in the first exercise), and we can try a flexible fit on the covariate:

library(mgcv)
names(aggnb) <- as.character(aggLM$Group.1)
nc$LM <- as.factor(nc$LM)
GAM_mrf <- gam(SID74 ~ s(ft.NWBIR74) + s(LM, bs="mrf", xt=list(nb=aggnb)), offset=log(E), weights=BIR74, data=nc, family=poisson(link=log))
summary(GAM_mrf)

And plot the covariate smooth term:

plot(GAM_mrf)

The forest plot is obviously grouped too:

GAM_mrf_re <- predict(GAM_mrf, type="terms", se=TRUE)
metafor::forest(GAM_mrf_re$fit[,2], GAM_mrf_re$se.fit[,2], subset=order(GAM_mrf_re$fit[,1], decreasing=TRUE), 
        slab=NA, annotate=FALSE, lty=c("solid","blank"), pch=19, psize=2, cex.lab=1, cex.axis=1)

as is the RE map:

nc$GAM_mrf_re <- GAM_mrf_re$fit[,2]
tm_shape(nc) + tm_fill(c("GAM_mrf_re"), midpoint=c(0), title="Poisson GAM MRF RE")

Exercise + review

The New York 8 county data set contains population standardized leukemia cases, with Z as a transformed rate:

NY8 <- st_read(system.file("shapes/NY8_utm18.shp", package="spData"))
## Reading layer `NY8_utm18' from data source `/home/rsb/lib/r_libs/spData/shapes/NY8_utm18.shp' using driver `ESRI Shapefile'
## Simple feature collection with 281 features and 17 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 358241.9 ymin: 4649755 xmax: 480393.1 ymax: 4808545
## epsg (SRID):    32618
## proj4string:    +proj=utm +zone=18 +ellps=WGS84 +units=m +no_defs
tm_shape(NY8) + tm_fill("Z")
## Variable "Z" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

Create a neighbour object:

NY_nb <- poly2nb(NY8)
NY_lw <- nb2listw(NY_nb, style="B")

Check how the SAR and CAR models behave, with and without case weights (a sketch covering all four combinations follows the summary below):

mod1 <- spautolm(Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data=NY8, family="CAR", listw=NY_lw, weights=POP8)
summary(mod1)
## 
## Call: 
## spautolm(formula = Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data = NY8, 
##     listw = NY_lw, weights = POP8, family = "CAR")
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -1.500566 -0.274566  0.087494  0.454262  4.257081 
## 
## Coefficients: 
##              Estimate Std. Error z value  Pr(>|z|)
## (Intercept) -0.783775   0.142645 -5.4946 3.916e-08
## PEXPOSURE    0.078887   0.027922  2.8253  0.004724
## PCTAGE65P    3.841221   0.573083  6.7027 2.046e-11
## PCTOWNHOME  -0.392319   0.154995 -2.5312  0.011368
## 
## Lambda: 0.011751 LR test value: 0.11668 p-value: 0.73266 
## Numerical Hessian standard error of lambda: 0.039471 
## 
## Log likelihood: -251.7067 
## ML residual variance (sigma squared): 1105.1, (sigma: 33.243)
## Number of observations: 281 
## Number of parameters estimated: 6 
## AIC: 515.41
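
A sketch for comparing all four combinations, reusing the fitted objects' lambda component as was done earlier for the SSRE maps:

mods <- list(
  CAR_w = spautolm(Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data=NY8,
                   family="CAR", listw=NY_lw, weights=POP8),
  CAR   = spautolm(Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data=NY8,
                   family="CAR", listw=NY_lw),
  SAR_w = spautolm(Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data=NY8,
                   family="SAR", listw=NY_lw, weights=POP8),
  SAR   = spautolm(Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data=NY8,
                   family="SAR", listw=NY_lw))
sapply(mods, function(m) m$lambda)   # spatial coefficients to compare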

This data set is used in Waller and Gotway (2004), and in both editions of ASDAR. It is harder to deploy Poisson models because the cases are not integer counts.

R’s sessionInfo()

sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Fedora 29 (Workstation Edition)
## 
## Matrix products: default
## BLAS:   /home/rsb/topics/R/R360-share/lib64/R/lib/libRblas.so
## LAPACK: /home/rsb/topics/R/R360-share/lib64/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] mgcv_1.8-28         hglm_2.2-1          hglm.data_1.0-1    
##  [4] MASS_7.3-51.4       spatialreg_1.1-4    igraph_1.2.4.1     
##  [7] Matrix_1.2-17       tmap_2.2            spdep_1.1-2        
## [10] spData_0.3.0        sp_1.3-1            gstat_2.0-2        
## [13] sf_0.7-5            spatstat_1.60-1     rpart_4.1-15       
## [16] nlme_3.1-140        spatstat.data_1.4-0
## 
## loaded via a namespace (and not attached):
##  [1] satellite_1.0.1       xts_0.11-2            webshot_0.5.1        
##  [4] gmodels_2.18.1        RColorBrewer_1.1-2    mapview_2.7.0        
##  [7] tools_3.6.0           backports_1.1.4       metafor_2.1-0        
## [10] utf8_1.1.4            rgdal_1.4-5           R6_2.4.0             
## [13] KernSmooth_2.23-15    colorspace_1.4-1      rgeos_0.4-3          
## [16] DBI_1.0.0             raster_2.9-5          tidyselect_0.2.5     
## [19] leaflet_2.0.2         compiler_3.6.0        cli_1.1.0            
## [22] expm_0.999-4          scales_1.0.0          classInt_0.3-4       
## [25] goftest_1.1-1         stringr_1.4.0         digest_0.6.19        
## [28] spatstat.utils_1.13-0 rmarkdown_1.13        base64enc_0.1-3      
## [31] dichromat_2.0-0       pkgconfig_2.0.2       htmltools_0.3.6      
## [34] htmlwidgets_1.3       rlang_0.4.0           FNN_1.1.3            
## [37] shiny_1.3.2           BayesXsrc_3.0-1       generics_0.0.2       
## [40] zoo_1.8-6             crosstalk_1.0.0       gtools_3.8.1         
## [43] dplyr_0.8.2           magrittr_1.5          munsell_0.5.0        
## [46] Rcpp_1.0.1            fansi_0.4.0           abind_1.4-5          
## [49] stringi_1.4.3         yaml_2.2.0            tmaptools_2.0-1      
## [52] grid_3.6.0            gdata_2.18.0          promises_1.0.1       
## [55] crayon_1.3.4          deldir_0.1-21         lattice_0.20-38      
## [58] splines_3.6.0         tensor_1.5            zeallot_0.1.0        
## [61] knitr_1.23            pillar_1.4.2          boot_1.3-22          
## [64] spacetime_1.2-2       stats4_3.6.0          codetools_0.2-16     
## [67] LearnBayes_2.15.1     XML_3.98-1.20         glue_1.3.1           
## [70] evaluate_0.14         png_0.1-7             vctrs_0.1.0          
## [73] httpuv_1.5.1          RANN_2.6.1            purrr_0.3.2          
## [76] polyclip_1.10-0       tidyr_0.8.3           assertthat_0.2.1     
## [79] R2BayesX_1.1-1        xfun_0.8              mime_0.7             
## [82] lwgeom_0.1-7          xtable_1.8-4          broom_0.5.2          
## [85] e1071_1.7-2           coda_0.19-2           later_0.8.0          
## [88] viridisLite_0.3.0     class_7.3-15          tibble_2.1.3         
## [91] intervals_0.15.1      units_0.6-3           spDataLarge_0.3.1

References

Alam, Moudud, Lars Rönnegård, and Xia Shen. 2015. “Fitting Conditional and Simultaneous Autoregressive Spatial Models in Hglm.” The R Journal 7 (2): 5–18. https://doi.org/10.32614/RJ-2015-017.

Baddeley, Adrian, Ege Rubak, and Rolf Turner. 2015. Spatial Point Patterns: Methodology and Applications with R. Boca Raton, FL: Chapman and Hall/CRC.

Lovelace, Robin, Jakub Nowosad, and Jannes Muenchow. 2019. Geocomputation with R. Boca Raton, FL: Chapman and Hall/CRC.

Müller, Werner G. 2007. Collecting Spatial Data: Optimum Design of Experiments for Random Fields. Berlin: Springer-Verlag.

Pace, RK, and JP LeSage. 2008. “A Spatial Hausman Test.” Economics Letters 101: 282–84.

Ripley, B. D. 1981. Spatial Statistics. New York: Wiley.

Schratz, Patrick, Jannes Muenchow, Eugenia Iturritxa, Jakob Richter, and Alexander Brenning. 2019. “Hyperparameter Tuning and Performance Assessment of Statistical and Machine-Learning Algorithms Using Spatial Data.” Ecological Modelling 406: 109–20. https://doi.org/10.1016/j.ecolmodel.2019.06.002.

Wall, M. M. 2004. “A Close Look at the Spatial Structure Implied by the CAR and SAR Models.” Journal of Statistical Planning and Inference 121: 311–24.

Waller, Lance A., and Carol A. Gotway. 2004. Applied Spatial Statistics for Public Health Data. Hoboken, NJ: John Wiley & Sons.