Packages used in this part (from CRAN): spatstat, sf, gstat, spdep, tmap, spatialreg, igraph, hglm, metafor, sp, spData and RANN, plus spDataLarge, installed with install.packages("spDataLarge", repos = "https://nowosad.github.io/drat/", type = "source").
Script at https://raw.githubusercontent.com/edzer/UseR2019/master/part2.R. Download to a suitable location and use as a basis.
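A minimal setup sketch, assuming none of the packages are installed yet; the destination file name part2.R is just an example:
pkgs <- c("spatstat", "sf", "gstat", "spdep", "tmap", "spatialreg",
          "igraph", "hglm", "metafor", "sp", "spData", "RANN")
install.packages(pkgs)  # CRAN packages listed above
install.packages("spDataLarge", repos = "https://nowosad.github.io/drat/",
                 type = "source")
# fetch the workshop script into the working directory
download.file("https://raw.githubusercontent.com/edzer/UseR2019/master/part2.R",
              destfile = "part2.R")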
Spatial, time series, and spatio-temporal data break the fundamental rule of independence of observations (social networks do too)
Often, we are not sampling from a known population to get observations, so caution is needed for inference
Often again, the observation units or entities are not chosen by us, but by others
Designing spatial samples has become a less important part of the field than previously (Ripley 1981; Müller 2007)
If we want to detect and classify patterns (ML), infer about covariates, or interpolate from a fitted model, we need models that take account of the possible interdependency of spatial, time series and spatio-temporal observations
Here we are focussing on the spatial case; time differs in having a direction of flow, and spatio-temporal processes are necessarily more complex, especially when non-separable
We will not look at machine learning issues, although the partition into training and test sets raises important spatial questions for tuning (Schratz et al. 2019)
Spatial point processes are most closely tied to the relative positions of the observations, but may accommodate inhomogeneities
Geostatistics is concerned with interpolating values of a variable of interest at unobserved locations; the relative positions of the observations contribute through modelling a function of the differences in observed values of that variable between pairs of observations at increasing distances from each other (a variogram formula is sketched below)
Disease mapping, spatial econometrics/regression and the application of generalized linear mixed models (GLMM, GAMM) in for example ecology are more interested in inferences about the spatial processes in play and the included covariates; some approaches may use distance based spatial correlation functions, others use graph or lattice neighbourhoods
Sometimes when we use spatial data, we jump to spatial statistical methods because of the widely cited first law of geography, that nearby observations are more likely to be similar than those further away
But maybe what we see as clustering, patchiness, pattern, that looks spatial is actually mis-specification, such as a missing covariate, and/or inappropriate functional form, and/or including variables acting at different scales, and/or consequences of suboptimal bounding of tessellated observations …
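In sketch form, the empirical variogram mentioned above is

\[\hat{\gamma}(h) = \frac{1}{2|N(h)|} \sum_{(i,j) \in N(h)} (z_i - z_j)^2,\]

where \(N(h)\) is the set of observation pairs whose separation falls in the distance bin around \(h\).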
Here, we’ll first look at the consequences of treating inhomogeneous points as homogeneous (the intensity trends upwards with x)
Then we’ll use those points to see what happens in adding a similar trend to a random variable; empirical variograms are constructed ignoring and including the trend, and tests for global and local spatial autocorrelation are calculated
We can start by using spatstat (Baddeley, Rubak, and Turner 2015) to generate completely spatially random (CSR) points with intensity increasing with x to introduce trend inhomogeneity in a unit square (note the use of set.seed()):
suppressPackageStartupMessages(library(spatstat))
intenfun <- function(x, y) 200 * x
set.seed(1)
(ihpp <- rpoispp(intenfun, lmax = 200))
## Planar point pattern: 95 points
## window: rectangle = [0, 1] x [0, 1] units
plot(density(ihpp), axes=TRUE)
points(ihpp, col="green2", pch=19)
We can use \(\hat{K}\) tests ignoring and including inhomogeneity to adapt the test to the underlying data generation process. The homogeneous test examines the divergence from a theoretical CSR at distance bins, the inhomogeneous tries to guess the kind of patterning in the relative positions of the point observations. If we ignore the inhomogeneity, we find significant clustering at most distances, with the opposite finding when using the test attempting to accommodate inhomogeneity:
opar <- par(mfrow=c(1,2))
plot(envelope(ihpp, Kest, verbose=FALSE), main="Homogeneous")
plot(envelope(ihpp, Kinhom, verbose=FALSE), main="Inhomogeneous")
par(opar)
y trend variable
Coercing the spatstat "ppp" object to an sf "sf" object is trivial, but first adds the window of the point process, which we need to drop by retaining only those features labelled "point". Then we take the x coordinate and use it to create y, which trends with x; the DGP is not very noisy.
library(sf)
## Linking to GEOS 3.7.2, GDAL 3.0.0, PROJ 6.1.1
# tidyverse alternative: st_as_sf(ihpp) %>% dplyr::filter(label == "point") -> sf_ihpp
sf_ihpp <- st_as_sf(ihpp)
sf_ihpp <- sf_ihpp[sf_ihpp$label == "point", ]
crds <- st_coordinates(sf_ihpp)
sf_ihpp$x <- crds[,1]
sf_ihpp$y <- 100 + 50 * sf_ihpp$x + 20 * rnorm(nrow(sf_ihpp))
plot(sf_ihpp[,"y"], pch=19)
Variograms for models ignoring (y ~ 1) and including (y ~ x) the trend; if we ignore the trend, we find spurious relationships, shown by loess() fits:
suppressPackageStartupMessages(library(gstat))
vg0 <- variogram(y ~ 1, sf_ihpp)
vg1 <- variogram(y ~ x, sf_ihpp)
opar <- par(mfrow=c(1,2))
plot(gamma ~ dist, vg0, ylim=c(0,550), main="Trend ignored")
s <- seq(0, 0.5, 0.01)
p0 <- predict(loess(gamma ~ dist, vg0, weights=np), newdata=data.frame(dist=s), se=TRUE)
lines(s, p0$fit)
lines(s, p0$fit-2*p0$se.fit, lty=2)
lines(s, p0$fit+2*p0$se.fit, lty=2)
plot(gamma ~ dist, vg1, ylim=c(0,550), main="Trend included")
p1 <- predict(loess(gamma ~ dist, vg1, weights=np), newdata=data.frame(dist=s), se=TRUE)
lines(s, p1$fit)
lines(s, p1$fit-2*p1$se.fit, lty=2)
lines(s, p1$fit+2*p1$se.fit, lty=2)
par(opar)
Going on from a continuous to a discrete treatment of space, we can use functions in spdep to define neighbours and then test for global and local spatial autocorrelation. First making a triangulated neighbour object, we can thin out improbable neighbours lying too far from one another, using a sphere of influence graph to make a symmetric neighbour object:
suppressPackageStartupMessages(library(spdep))
nb_tri <- tri2nb(crds)
(nb_soi <- graph2nb(soi.graph(nb_tri, crds), sym=TRUE))
## Neighbour list object:
## Number of regions: 95
## Number of nonzero links: 302
## Percentage nonzero weights: 3.34626
## Average number of links: 3.178947
The sphere of influence graph generates three subgraphs; a bit unfortunate, but we can live with that (for now):
plot(nb_soi, crds)
comps <- n.comp.nb(nb_soi)
sf_ihpp$comps <- comps$comp.id
comps$nc
## [1] 3
Using binary weights, looking at the variable but ignoring the trend, we find strong global spatial autocorrelation using global Moran’s \(I\) (tests return "htest" objects, tidied here using broom).
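For reference, global Moran’s \(I\) for a variable \(y\) with spatial weights \(w_{ij}\) is

\[I = \frac{n}{\sum_i \sum_j w_{ij}} \, \frac{\sum_i \sum_j w_{ij} (y_i - \bar{y})(y_j - \bar{y})}{\sum_i (y_i - \bar{y})^2},\]

where \(n\) is the number of observations.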
lwB <- nb2listw(nb_soi, style="B")
out <- broom::tidy(moran.test(sf_ihpp$y, listw=lwB, randomisation=FALSE, alternative="two.sided"))[1:5]
names(out)[1:3] <- c("Moran's I", "Expectation", "Variance"); out
## # A tibble: 1 x 5
## `Moran's I` Expectation Variance statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.342 -0.0106 0.00633 4.44 0.00000917
If, however, we take the residuals after including the trend using a linear model, the apparent spatial autocorrelation evaporates:
lm_obj <- lm(y ~ x, data=sf_ihpp)
out <- broom::tidy(lm.morantest(lm_obj, listw=lwB, alternative="two.sided"))[1:5]
names(out)[1:3] <- c("Moran's I", "Expectation", "Variance"); out
## # A tibble: 1 x 5
## `Moran's I` Expectation Variance statistic[,1] p.value[,1]
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0579 -0.0199 0.00623 0.986 0.324
The same happens if we calculate local Moran’s \(I\) ignoring the trend (randomisation assumption) and including the trend (normality assumption and saddlepoint approximation).
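In sketch form (implementations differ in using \(n\) or \(n-1\) in the variance denominator), the local components are

\[I_i = \frac{(y_i - \bar{y})}{\sum_k (y_k - \bar{y})^2 / n} \sum_j w_{ij} (y_j - \bar{y}),\]

and their sum is proportional to the global statistic: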
lmor0 <- localmoran(sf_ihpp$y, listw=lwB, alternative="two.sided")
lmor1 <- as.data.frame(localmoran.sad(lm_obj, nb=nb_soi, style="B", alternative="two.sided"))
sf_ihpp$z_value <- lmor0[,4]
sf_ihpp$z_lmor1_N <- lmor1[,2]
sf_ihpp$z_lmor1_SAD <- lmor1[,4]
suppressPackageStartupMessages(library(tmap))
tm_shape(sf_ihpp) + tm_symbols(col=c("z_value", "z_lmor1_N", "z_lmor1_SAD"), midpoint=0) + tm_facets(free.scales=FALSE, nrow=1) + tm_layout(panel.labels=c("No trend", "Trend, normal", "Trend, SAD"))
Try doing this in pairs or small groups and discuss what is going on; the underlying ideas are more important than the code.
Look at the North Carolina SIDS data vignette for background:
library(sf)
nc <- st_read(system.file("shapes/sids.shp", package="spData")[1], quiet=TRUE)
st_crs(nc) <- "+proj=longlat +datum=NAD27"
row.names(nc) <- as.character(nc$FIPSNO)
head(nc)
## Simple feature collection with 6 features and 22 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -81.74107 ymin: 36.07282 xmax: -75.77316 ymax: 36.58965
## epsg (SRID): 4267
## proj4string: +proj=longlat +datum=NAD27 +no_defs
## CNTY_ID AREA PERIMETER CNTY_ NAME FIPS FIPSNO CRESS_ID
## 37009 1825 0.114 1.442 1825 Ashe 37009 37009 5
## 37005 1827 0.061 1.231 1827 Alleghany 37005 37005 3
## 37171 1828 0.143 1.630 1828 Surry 37171 37171 86
## 37053 1831 0.070 2.968 1831 Currituck 37053 37053 27
## 37131 1832 0.153 2.206 1832 Northampton 37131 37131 66
## 37091 1833 0.097 1.670 1833 Hertford 37091 37091 46
## BIR74 SID74 NWBIR74 BIR79 SID79 NWBIR79 east north x y
## 37009 1091 1 10 1364 0 19 164 176 -81.67 4052.29
## 37005 487 0 10 542 3 12 183 182 -50.06 4059.70
## 37171 3188 5 208 3616 6 260 204 174 -16.14 4043.76
## 37053 508 1 123 830 2 145 461 182 406.01 4035.10
## 37131 1421 9 1066 1606 3 1197 385 176 281.10 4029.75
## 37091 1452 7 954 1838 5 1237 411 176 323.77 4028.10
## lon lat L_id M_id geometry
## 37009 -81.48594 36.43940 1 2 MULTIPOLYGON (((-81.47276 3...
## 37005 -81.14061 36.52443 1 2 MULTIPOLYGON (((-81.23989 3...
## 37171 -80.75312 36.40033 1 2 MULTIPOLYGON (((-80.45634 3...
## 37053 -76.04892 36.45655 1 4 MULTIPOLYGON (((-76.00897 3...
## 37131 -77.44057 36.38799 1 4 MULTIPOLYGON (((-77.21767 3...
## 37091 -76.96474 36.38189 1 4 MULTIPOLYGON (((-76.74506 3...
The variables are largely count data; L_id and M_id are grouping variables. We can also read the original neighbour object:
library(spdep)
gal_file <- system.file("weights/ncCR85.gal", package="spData")[1]
ncCR85 <- read.gal(gal_file, region.id=nc$FIPSNO)
ncCR85
## Neighbour list object:
## Number of regions: 100
## Number of nonzero links: 492
## Percentage nonzero weights: 4.92
## Average number of links: 4.92
plot(st_geometry(nc), border="grey")
plot(ncCR85, st_centroid(st_geometry(nc), of_largest_polygon), add=TRUE, col="blue")
## Warning in st_centroid.sfc(st_geometry(nc), of_largest_polygon):
## st_centroid does not give correct centroids for longitude/latitude data
Now generate a random variable. Here I’ve set the seed - maybe choose your own, and compare outcomes with the people around you. With many people in the room, about 5 in 100 may get a draw that is autocorrelated when tested with Moran’s \(I\) (why?).
set.seed(1)
nc$rand <- rnorm(nrow(nc))
lw <- nb2listw(ncCR85, style="B")
moran.test(nc$rand, listw=lw, alternative="two.sided")
##
## Moran I test under randomisation
##
## data: nc$rand
## weights: lw
##
## Moran I statistic standard deviate = -0.82038, p-value = 0.412
## alternative hypothesis: two.sided
## sample estimates:
## Moran I statistic Expectation Variance
## -0.060754158 -0.010101010 0.003812269
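A quick sketch checking the “about 5 in 100” claim by simulation; the seed, the number of draws, and the 0.05 cutoff are choices made here, not part of the original exercise:
set.seed(42)
p_vals <- sapply(1:100, function(i)
  moran.test(rnorm(nrow(nc)), listw=lw, alternative="two.sided")$p.value)
mean(p_vals < 0.05)  # should be roughly 0.05 under the null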
Now we’ll create a trend (maybe try plotting LM to see the trend pattern). Do we get different test outcomes by varying beta and sigma (alpha is constant)?
nc$LM <- as.numeric(interaction(nc$L_id, nc$M_id))
alpha <- 1
beta <- 0.5
sigma <- 2
nc$trend <- alpha + beta*nc$LM + sigma*nc$rand
moran.test(nc$trend, listw=lw, alternative="two.sided")
##
## Moran I test under randomisation
##
## data: nc$trend
## weights: lw
##
## Moran I statistic standard deviate = 6.1163, p-value = 9.58e-10
## alternative hypothesis: two.sided
## sample estimates:
## Moran I statistic Expectation Variance
## 0.368410659 -0.010101010 0.003829901
To get back to reality, include the trend in a linear model, and test again.
lm.morantest(lm(trend ~ LM, nc), listw=lw, alternative="two.sided")
##
## Global Moran I for regression residuals
##
## data:
## model: lm(formula = trend ~ LM, data = nc)
## weights: lw
##
## Moran I statistic standard deviate = -0.50788, p-value = 0.6115
## alternative hypothesis: two.sided
## sample estimates:
## Observed Moran I Expectation Variance
## -0.049567147 -0.018718176 0.003689356
So we can manipulate a missing variable mis-specification to look like spatial autocorrelation. Is this informative?
Sometimes we only have data on a covariate for aggregates of our units of observation. What happens when we “copy out” these aggregate values to the less aggregated observations? First we’ll aggregate nc by LM, then make a neighbour object for the aggregate units:
aggLM <- aggregate(nc[,"LM"], list(nc$LM), head, n=1)
(aggnb <- poly2nb(aggLM))
## Neighbour list object:
## Number of regions: 12
## Number of nonzero links: 46
## Percentage nonzero weights: 31.94444
## Average number of links: 3.833333
Next, draw a random sample for the aggregated units:
set.seed(1)
LMrand <- rnorm(nrow(aggLM))
Check that it does not show any spatial autocorrelation:
moran.test(LMrand, nb2listw(aggnb, style="B"))
##
## Moran I test under randomisation
##
## data: LMrand
## weights: nb2listw(aggnb, style = "B")
##
## Moran I statistic standard deviate = -0.24911, p-value = 0.5984
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## -0.13101880 -0.09090909 0.02592483
Copy it out to the full data set, indexing on values of LM; the pattern now looks pretty autocorrelated
nc$LMrand <- LMrand[match(nc$LM, aggLM$LM)]
plot(nc[,"LMrand"])
which it is:
moran.test(nc$LMrand, listw=lw, alternative="two.sided")
##
## Moran I test under randomisation
##
## data: nc$LMrand
## weights: lw
##
## Moran I statistic standard deviate = 9.4301, p-value < 2.2e-16
## alternative hypothesis: two.sided
## sample estimates:
## Moran I statistic Expectation Variance
## 0.575091242 -0.010101010 0.003850884
Again, we’ve manipulated ourselves into a situation with abundant spatial autocorrelation at the level of the counties, but only by copying out from a more aggregated level. What is going on?
Some spatial processes are best represented by decreasing functions of distance
Other spatial processes best represent proximity, contiguity, neighbourhood; the functions then relate to steps along edges on graphs of neighbours
Under certain conditions, the terms \(\rho^i {\mathbf W}^i\) of a power series decline in intensity as \(i\) increases from \(0\) (a sketch follows below)
Here we will use areal support, corresponding to a proximity view of spatial processes (Wall 2004)
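A minimal sketch of that declining power series, using a hypothetical five-node path graph as a binary weights matrix:
W <- matrix(0, 5, 5)
W[cbind(1:4, 2:5)] <- 1
W <- W + t(W)                            # symmetric binary contiguity
rho <- 0.9 / max(abs(eigen(W)$values))   # inside the radius of convergence
Wi <- diag(5)
decay <- numeric(7)
for (i in 0:6) {                         # largest element of rho^i W^i
  decay[i + 1] <- max(abs(rho^i * Wi))
  Wi <- Wi %*% W
}
round(decay, 4)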
We’ll use a typical moderate sized data set of Polish municipalities (including boroughs in the capital, Warsaw), included in spDataLarge as used in Geocomputation with R (Lovelace, Nowosad, and Muenchow 2019)
library(sf)
# if(!require("spData", quietly=TRUE)) install.packages("spData")
# if(!require("spDataLarge", quietly=TRUE)) install.packages("spDataLarge",
# repos = "https://nowosad.github.io/drat/", type = "source")
data(pol_pres15, package="spDataLarge")
head(pol_pres15[, c(1, 4, 6)])
## Simple feature collection with 6 features and 3 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 235157.1 ymin: 366913.3 xmax: 281431.7 ymax: 413016.6
## epsg (SRID): NA
## proj4string: +proj=tmerc +lat_0=0 +lon_0=18.99999999999998 +k=0.9993 +x_0=500000 +y_0=-5300000 +ellps=GRS80 +towgs84=0,0,0 +units=m +no_defs
## TERYT name types geometry
## 1 020101 BOLESŁAWIEC Urban MULTIPOLYGON (((261089.5 38...
## 2 020102 BOLESŁAWIEC Rural MULTIPOLYGON (((254150 3837...
## 3 020103 GROMADKA Rural MULTIPOLYGON (((275346 3846...
## 4 020104 NOWOGRODZIEC Urban/rural MULTIPOLYGON (((251769.8 37...
## 5 020105 OSIECZNICA Rural MULTIPOLYGON (((263423.9 40...
## 6 020106 WARTA BOLESŁAWIECKA Rural MULTIPOLYGON (((267030.7 38...
The dataset has 2495 observations with areal support on 65 variables, most counts of election results aggregated from the election returns by polling station. See:
?spDataLarge::pol_pres15
for details.
The spdep function poly2nb() finds proximate, contiguous neighbours by finding one (queen) or two (rook) boundary points within a snap distance of each other for polygons in a candidate list based on intersecting bounding boxes:
suppressPackageStartupMessages(library(spdep))
system.time(nb_q <- poly2nb(pol_pres15, queen=TRUE))
## user system elapsed
## 1.520 0.016 1.542
The object returned is very sparse; the print method reports asymmetric objects and objects with observations with no neighbours.
nb_q
## Neighbour list object:
## Number of regions: 2495
## Number of nonzero links: 14242
## Percentage nonzero weights: 0.2287862
## Average number of links: 5.708216
opar <- par(mar=c(0,0,0,0)+0.5)
plot(st_geometry(pol_pres15), border="grey", lwd=0.5)
coords <- st_centroid(st_geometry(pol_pres15),
of_largest_polygon=TRUE)
plot(nb_q, coords=st_coordinates(coords), add=TRUE, points=FALSE, lwd=0.5)
par(opar)
We can use GEOS through sf to improve the speed of detection of candidate neighbours, by building a geometry column of bounding boxes of the underlying polygons and finding which intersect. We pass this object to poly2nb() through the foundInBox= argument. This may be useful for larger data sets.
system.time({
fB1 <- st_intersects(st_as_sfc(lapply(st_geometry(pol_pres15), function(x) {
st_as_sfc(st_bbox(x))[[1]]
})))
fB1a <- lapply(seq_along(fB1), function(i) fB1[[i]][fB1[[i]] > i])
fB1a <- fB1a[-length(fB1a)]
nb_sf_q1 <- poly2nb(pol_pres15, queen=TRUE, foundInBox=fB1a)
})
## user system elapsed
## 0.962 0.022 0.988
The two neighbour objects are the same:
all.equal(nb_q, nb_sf_q1, check.attributes=FALSE)
## [1] TRUE
We can further check the object for the number of distinct graph components; there is only one, so each observation can be reached from every other observation by traversing the graph edges:
n.comp.nb(nb_q)$nc
## [1] 1
The "listw"
spatial weights object needed for testing or modelling can be created using nb2listw()
, here with binary weights, \(1\) for first-order contiguity, \(0\) for not a neighbour:
lw <- nb2listw(nb_q, style="B")
We can coerce this further to a sparse matrix as defined in the Matrix package:
library(Matrix)
W <- as(lw, "CsparseMatrix")
and check symmetry directly (t() is the transpose method):
isTRUE(all.equal(W, t(W)))
## [1] TRUE
Powering up the sparse matrix fills in the higher order neighbours:
image(W)
WW <- W %*% W
image(WW)
W3 <- WW %*% W
image(W3)
W4 <- W3 %*% W
image(W4)
There is an spdep vignette considering the use of the igraph adjacency representation through sparse matrices as intermediate objects:
library(igraph)
g1 <- graph.adjacency(W, mode="undirected")
class(g1)
## [1] "igraph"
We can get an "nb" neighbour object back from the adjacency representation:
B1 <- get.adjacency(g1)
mat2listw(B1)$neighbours
## Neighbour list object:
## Number of regions: 2495
## Number of nonzero links: 14242
## Percentage nonzero weights: 0.2287862
## Average number of links: 5.708216
There is a single graph component:
c1 <- clusters(g1)
c1$no
## [1] 1
The graph is connected:
is.connected(g1)
## [1] TRUE
with diameter:
dg1 <- diameter(g1)
dg1
## [1] 52
The diameter is the longest of the shortest paths across the graph; shortest.paths() returns a matrix of the edge counts on the shortest paths between all pairs of observations:
sp_mat <- shortest.paths(g1)
max(sp_mat)
## [1] 52
str(sp_mat)
## num [1:2495, 1:2495] 0 1 2 2 2 2 10 10 9 10 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:2495] "1" "2" "3" "4" ...
## ..$ : chr [1:2495] "1" "2" "3" "4" ...
So graphs really do contain a lot of structural information, rendering most inverse distance weighting (IDW) cases superfluous.
The hglm hierarchical generalized linear model package uses the "CsparseMatrix" version of spatial weights for fitting "SAR" or "CAR" spatially structured random effects
The CARBayes package appears to take dense matrices using listw2mat() or nb2mat(), preferring binary matrices
The "mrf" random effect in the mgcv package for use with gam() takes an "nb" object directly
The R2BayesX::nb2gra() function converts an "nb" neighbour object into a graph for use with BayesX through R2BayesX, but creates a dense matrix; the graph is passed as the map= argument to the "mrf" structured random effect
gra <- R2BayesX::nb2gra(ncCR85)
str(gra)
## 'gra' num [1:100, 1:100] 3 -1 0 0 0 0 0 0 0 0 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:100] "37009" "37005" "37171" "37053" ...
## ..$ : chr [1:100] "37009" "37005" "37171" "37053" ...
The nb2INLA() function in spdep writes a file that INLA can use through the INLA package, used with the "besag" model, for example:
tf <- tempfile()
nb2INLA(tf, ncCR85)
file.size(tf)
## [1] 1945
What does it mean that this neighbour object is not symmetric?
coords <- st_centroid(st_geometry(nc), of_largest_polygon)
(knn_5_nb <- knn2nb(knearneigh(st_coordinates(coords), k=5)))
## Neighbour list object:
## Number of regions: 100
## Number of nonzero links: 500
## Percentage nonzero weights: 5
## Average number of links: 5
## Non-symmetric neighbours list
klw <- nb2listw(knn_5_nb, style="B")
kW <- as(klw, "CsparseMatrix")
isTRUE(all.equal(kW, t(kW)))
## [1] FALSE
image(kW)
We need to recall that non-symmetric neighbours give directed graphs; if we forget, and treat them as undirected, the extra edges get added (a sketch follows the output below):
library(igraph)
g1 <- graph.adjacency(kW, mode="directed")
B1 <- get.adjacency(g1)
mat2listw(B1)$neighbours
## Neighbour list object:
## Number of regions: 100
## Number of nonzero links: 500
## Percentage nonzero weights: 5
## Average number of links: 5
## Non-symmetric neighbours list
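A sketch of that “try it” suggestion; in graph.adjacency(), mode="undirected" is documented as taking the union of edges (the same as mode="max"), so the non-reciprocal kNN links gain their opposites and the neighbour object becomes symmetric:
g2 <- graph.adjacency(kW, mode="undirected")
mat2listw(get.adjacency(g2))$neighbours
# expect more than the 500 links above, and no non-symmetric note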
Use igraph functions to explore the ncCR85 "nb" object:
diameter(g1)
## [1] 20
nc_sps <- shortest.paths(g1)
mr <- which.max(apply(nc_sps, 2, max))
nc$sps1 <- nc_sps[,mr]
plot(nc[,"sps1"], breaks=0:21)
plot(nc$sps1, c(st_distance(coords[mr], coords))/1000, xlab="shortest path count", ylab="km distance")
Discuss whether measured distance is really needed to express proximity; the graph shortest paths look fine.
Make objects from the imported neighbours for BayesX and INLA, and as a sparse and dense matrix:
ncCR85a <- ncCR85
attr(ncCR85a, "region.id") <- as.character(nc$CRESS_ID)
nc_gra <- R2BayesX::nb2gra(ncCR85a)
nc_tf <- tempfile()
nb2INLA(nc_tf, ncCR85)
nc_lw <- nb2listw(ncCR85, style="B")
nc_W <- as(nc_lw, "CsparseMatrix")
nc_mat <- listw2mat(nc_lw)
Let’s start by seeing whether there is any spatial autocorrelation in the SIDS rate, using a modified Freeman-Tukey square root transformation.
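In sketch form, with \(S\) the SIDS count and \(B\) the births, the transformation computed below is \(\sqrt{1000}\left(\sqrt{S/B} + \sqrt{(S+1)/B}\right)\):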
nc$ft.SID74 <- sqrt(1000)*(sqrt(nc$SID74/nc$BIR74) + sqrt((nc$SID74+1)/nc$BIR74))
nc$ft.NWBIR74 <- sqrt(1000)*(sqrt(nc$NWBIR74/nc$BIR74) + sqrt((nc$NWBIR74+1)/nc$BIR74))
tm_shape(nc) + tm_fill(c("ft.SID74", "ft.NWBIR74"))
plot(ft.SID74 ~ ft.NWBIR74, nc)
First Moran’s \(I\) with only the intercept as the mean model:
moran.test(nc$ft.SID74, nc_lw, alternative="two.sided", randomisation=FALSE)
##
## Moran I test under normality
##
## data: nc$ft.SID74
## weights: nc_lw
##
## Moran I statistic standard deviate = 3.5842, p-value = 0.0003381
## alternative hypothesis: two.sided
## sample estimates:
## Moran I statistic Expectation Variance
## 0.211279611 -0.010101010 0.003814926
Now for a mean model accommodating case weights:
lm.morantest(lm(ft.SID74 ~ 1, weights=BIR74, data=nc), nc_lw, alternative="two.sided")
##
## Global Moran I for regression residuals
##
## data:
## model: lm(formula = ft.SID74 ~ 1, data = nc, weights = BIR74)
## weights: nc_lw
##
## Moran I statistic standard deviate = 4.5965, p-value = 4.297e-06
## alternative hypothesis: two.sided
## sample estimates:
## Observed Moran I Expectation Variance
## 0.273332891 -0.009631207 0.003789801
Next just with the original covariate (transformed non-white births as a proportion of all births):
lm.morantest(lm(ft.SID74 ~ ft.NWBIR74, data=nc), nc_lw, alternative="two.sided")
##
## Global Moran I for regression residuals
##
## data:
## model: lm(formula = ft.SID74 ~ ft.NWBIR74, data = nc)
## weights: nc_lw
##
## Moran I statistic standard deviate = 1.2148, p-value = 0.2245
## alternative hypothesis: two.sided
## sample estimates:
## Observed Moran I Expectation Variance
## 0.057042251 -0.016844710 0.003699627
Finally with the covariate and case weights:
lm.morantest(lm(ft.SID74 ~ ft.NWBIR74, weights=BIR74, data=nc), nc_lw, alternative="two.sided")
##
## Global Moran I for regression residuals
##
## data:
## model: lm(formula = ft.SID74 ~ ft.NWBIR74, data = nc, weights =
## BIR74)
## weights: nc_lw
##
## Moran I statistic standard deviate = 1.7839, p-value = 0.07443
## alternative hypothesis: two.sided
## sample estimates:
## Observed Moran I Expectation Variance
## 0.091350263 -0.016506328 0.003655419
So something spatial appears to be present, but how to model it?
The approach taken by social scientists, including economists and some others, has been to model this through simultaneous autoregressive approaches, where the response is modelled using fixed covariates and the residual process is modelled by optimising a log likelihood function. The spatialreg package provides spautolm() and errorsarlm().
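In sketch form, the simultaneous autoregressive error model being fitted is

\[y = X\beta + u, \qquad u = \lambda {\mathbf W} u + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I),\]

so that \(u = (I - \lambda {\mathbf W})^{-1} \varepsilon\):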
library(spatialreg)
m1 <- spautolm(ft.SID74 ~ ft.NWBIR74, weights=BIR74, data=nc, listw=nc_lw, family="SAR")
When we include the covariate, the spatial error coefficient contributes little:
summary(m1)
##
## Call: spautolm(formula = ft.SID74 ~ ft.NWBIR74, data = nc, listw = nc_lw,
## weights = BIR74, family = "SAR")
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.83149 -0.44934 0.09456 0.56278 2.90208
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.5816687 0.2507427 6.3079 2.828e-10
## ft.NWBIR74 0.0366864 0.0069608 5.2704 1.361e-07
##
## Lambda: 0.035437 LR test value: 1.787 p-value: 0.18129
## Numerical Hessian standard error of lambda: 0.026096
##
## Log likelihood: -119.6704
## ML residual variance (sigma squared): 1304.4, (sigma: 36.117)
## Number of observations: 100
## Number of parameters estimated: 4
## AIC: 247.34
It is unusual to present maps of the spatially structured random effect in cases where the simultaneous autoregressive approach is used, but it is fully possible using components of the returned model object (the vector is doubled for comparison with the CAR version):
nc$SAR_ssre <- 2*as.vector((m1$lambda * nc_W) %*% m1$Y - (m1$lambda * nc_W) %*% (m1$X %*% m1$fit$coefficients))
tm_shape(nc) + tm_fill(c("ft.SID74", "SAR_ssre"), midpoint=c(NA, 0))
The other maximum likelihood implementation gives the same results, but provides a Hausman test for shifts in the covariate coefficients between the aspatial and spatial estimates (Pace and LeSage 2008); there is none:
m1a <- errorsarlm(ft.SID74 ~ ft.NWBIR74, weights=BIR74, data=nc, listw=nc_lw)
summary(m1a, Hausman=TRUE)
##
## Call:
## errorsarlm(formula = ft.SID74 ~ ft.NWBIR74, data = nc, listw = nc_lw,
## weights = BIR74)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.83149 -0.44934 0.09456 0.56278 2.90208
##
## Type: error
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.5816687 0.2507427 6.3079 2.828e-10
## ft.NWBIR74 0.0366864 0.0069608 5.2704 1.361e-07
##
## Lambda: 0.035437, LR test value: 1.787, p-value: 0.18129
## Asymptotic standard error: 0.029167
## z-value: 1.2149, p-value: 0.22439
## Wald statistic: 1.4761, p-value: 0.22439
##
## Log likelihood: -119.6704 for error model
## ML residual variance (sigma squared): 1304.4, (sigma: 36.117)
## Number of observations: 100
## Number of parameters estimated: 4
## AIC: 247.34, (AIC for lm: 247.13)
## Hausman test: 1.565, df: 2, p-value: 0.45726
It also lets us add Durbin= terms, that is, spatially lagged covariates:
m1b <- errorsarlm(ft.SID74 ~ ft.NWBIR74, weights=BIR74, data=nc, listw=nc_lw, Durbin=TRUE)
summary(m1b, Hausman=TRUE)
##
## Call:
## errorsarlm(formula = ft.SID74 ~ ft.NWBIR74, data = nc, listw = nc_lw,
## weights = BIR74, Durbin = TRUE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8705347 -0.4488357 0.0094254 0.4864939 2.7406880
##
## Type: error
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.28212888 0.33548214 6.8025 1.028e-11
## ft.NWBIR74 0.03532229 0.00811453 4.3530 1.343e-05
## lag.(Intercept) -0.15336745 0.06073121 -2.5253 0.01156
## lag.ft.NWBIR74 0.00098902 0.00159099 0.6216 0.53418
##
## Lambda: 0.028053, LR test value: 1.036, p-value: 0.30874
## Asymptotic standard error: 0.029824
## z-value: 0.94064, p-value: 0.34689
## Wald statistic: 0.8848, p-value: 0.34689
##
## Log likelihood: -115.1974 for error model
## ML residual variance (sigma squared): 1195.8, (sigma: 34.58)
## Number of observations: 100
## Number of parameters estimated: 6
## AIC: 242.39, (AIC for lm: 241.43)
## Hausman test: 1.8315, df: 4, p-value: 0.76671
and to present the by-covariate effects taking into account the unlagged (direct) and lagged (indirect) covariates and the sum of the coefficients (total):
summary(impacts(m1b))
## Impact measures (SDEM, estimable, n):
## Direct Indirect Total
## ft.NWBIR74 0.03532229 -0.1533675 -0.1180452
## ========================================================
## Standard errors:
## Direct Indirect Total
## ft.NWBIR74 0.008114531 0.06073121 0.06438103
## ========================================================
## Z-values:
## Direct Indirect Total
## ft.NWBIR74 4.352968 -2.525348 -1.833539
##
## p-values:
## Direct Indirect Total
## ft.NWBIR74 1.3431e-05 0.011558 0.066722
However, our model may suffer from not using a mixed model approach to a count response; the simultaneous autoregressive models are mostly used with Gaussian responses. One possibility is to employ the hierarchical generalized linear model approach from the hglm package. First we’ll fit an unstructured IID (independent and identically distributed) random effect (Alam, Rönnegård, and Shen 2015).
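In sketch form, with \(E_i\) the expected count used as an offset, the model fitted below is

\[\mathrm{SID74}_i \sim \mathrm{Poisson}(\mu_i), \qquad \log \mu_i = \log E_i + \beta_0 + \beta_1\, \mathrm{ft.NWBIR74}_i + u_i,\]

with \(u_i\) here an IID Gaussian random effect: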
library(hglm)
E <- nc$BIR74 * sum(nc$SID74)/sum(nc$BIR74)
HGLM_iid <- hglm(fixed=SID74 ~ ft.NWBIR74, random= ~ 1|CRESS_ID, offset=log(E), weights=BIR74,
data=nc, family=poisson(link=log))
The random effects also have their own standard errors, so we can order and display them with error bars in a forest or caterpillar plot:
ranef_iid <- unname(summary(HGLM_iid, print.ranef=TRUE)$RandCoefMat)
metafor::forest(ranef_iid[,1], ranef_iid[,2], subset=order(ranef_iid[,1], decreasing=TRUE),
slab=NA, annotate=FALSE, lty=c("solid","blank"), pch=19, psize=2, cex.lab=1, cex.axis=1)
We can also fit a weighted Poisson simultaneous autoregressive model, and examine the random effects:
HGLM_sar <- hglm(fixed=SID74 ~ ft.NWBIR74, random= ~ 1|CRESS_ID, offset=log(E), weights=BIR74,
data=nc, family=poisson(link=log), rand.family=SAR(D=nc_W))
ranef_sar <- unname(summary(HGLM_sar, print.ranef=TRUE)$RandCoefMat)
metafor::forest(ranef_sar[,1], ranef_sar[,2], subset=order(ranef_sar[,1], decreasing=TRUE),
slab=NA, annotate=FALSE, lty=c("solid","blank"), pch=19, psize=2, cex.lab=1, cex.axis=1)
There is not much difference between the IID and the SAR spatially structured random effects:
nc$HGLM_re <- ranef_iid[,1]
nc$HGLM_ss_SAR <- ranef_sar[,1]
tm_shape(nc) + tm_fill(c("HGLM_re", "HGLM_ss_SAR"), midpoint=c(0), title="Poisson HGLM RE") +
tm_facets(free.scales=FALSE) + tm_layout(panel.labels=c("IID", "SAR SSRE"))
Most epidemiological applications use conditional autoregressive approaches, where some (like the spautolm() and hglm() implementations) fit a spatial coefficient, but many fit an intrinsic CAR. First the spautolm() and hglm() implementations follow.
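In sketch form, the difference lies in the implied covariance of the spatial random part (Wall 2004): the SAR form gives \(\Sigma = \sigma^2 \left[(I - \lambda {\mathbf W}^{\top})(I - \lambda {\mathbf W})\right]^{-1}\), while the CAR form gives \(\Sigma = \sigma^2 (I - \lambda {\mathbf W})^{-1}\), which requires \(I - \lambda {\mathbf W}\) to be symmetric positive definite: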
m1c <- spautolm(ft.SID74 ~ ft.NWBIR74, weights=BIR74, data=nc, listw=nc_lw, family="CAR")
summary(m1c)
##
## Call: spautolm(formula = ft.SID74 ~ ft.NWBIR74, data = nc, listw = nc_lw,
## weights = BIR74, family = "CAR")
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.890805 -0.483471 0.070376 0.575906 2.824220
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.4561763 0.2499558 5.8257 5.686e-09
## ft.NWBIR74 0.0410281 0.0069417 5.9103 3.414e-09
##
## Lambda: 0.06028 LR test value: 1.6364 p-value: 0.20082
## Numerical Hessian standard error of lambda: 0.053566
##
## Log likelihood: -119.7458
## ML residual variance (sigma squared): 1302, (sigma: 36.083)
## Number of observations: 100
## Number of parameters estimated: 4
## AIC: 247.49
Again, we can calculate something that represents the spatial patterning of the spatial process (termed “signal” in the documentation of spatialreg::predict.sarlm), but it is not scaled in the same way as the hglm() random effects (and importantly we do not have standard errors):
nc$CAR_ssre <- as.vector((m1c$lambda * nc_W) %*% m1c$Y -
(m1c$lambda * nc_W) %*% (m1c$X %*% m1c$fit$coefficients))
tm_shape(nc) + tm_fill(c("SAR_ssre", "CAR_ssre"), midpoint=c(0), title="Gauss ML RE") +
tm_facets(free.scales=FALSE) + tm_layout(panel.labels=c("SAR SSRE", "CAR SSRE"))
Fitting the HGLM CAR model is just like the SAR model, and the forest plot of the spatially structured random effect is similar. Recall that spautolm() is fitting a Gaussian model, but hglm() is fitting a Poisson model, arguably better suited to count data. This means that the scalings of the random effects will differ:
HGLM_car <- hglm(fixed=SID74 ~ ft.NWBIR74, random= ~ 1|CRESS_ID, offset=log(E), weights=BIR74,
data=nc, family=poisson(link=log), rand.family=CAR(D=nc_W))
ranef_car <- unname(summary(HGLM_car, print.ranef=TRUE)$RandCoefMat)
metafor::forest(ranef_car[,1], ranef_car[,2], subset=order(ranef_car[,1], decreasing=TRUE),
slab=NA, annotate=FALSE, lty=c("solid","blank"), pch=19, psize=2, cex.lab=1, cex.axis=1)
nc$HGLM_ss_CAR <- ranef_car[,1]
tm_shape(nc) + tm_fill(c("HGLM_ss_CAR", "HGLM_ss_SAR"), midpoint=c(0), title="Poisson HGLM RE") +
tm_facets(free.scales=FALSE) + tm_layout(panel.labels=c("CAR SSRE", "SAR SSRE"))
To use a generalized additive mixed model (mgcv::gam() with an "mrf" random effect), and some other mixed models, the areal entities need to be grouped (done in the first exercise), and we can try a flexible fit on the covariate:
library(mgcv)
names(aggnb) <- as.character(aggLM$Group.1)
nc$LM <- as.factor(nc$LM)
GAM_mrf <- gam(SID74 ~ s(ft.NWBIR74) + s(LM, bs="mrf", xt=list(nb=aggnb)), offset=log(E), weights=BIR74, data=nc, family=poisson(link=log))
summary(GAM_mrf)
And plot the covariate smooth term:
plot(GAM_mrf)
The forest plot is obviously grouped too:
GAM_mrf_re <- predict(GAM_mrf, type="terms", se=TRUE)
metafor::forest(GAM_mrf_re$fit[,2], GAM_mrf_re$se.fit[,2], subset=order(GAM_mrf_re$fit[,1], decreasing=TRUE),
slab=NA, annotate=FALSE, lty=c("solid","blank"), pch=19, psize=2, cex.lab=1, cex.axis=1)
as is the RE map:
nc$GAM_mrf_re <- GAM_mrf_re$fit[,2]
tm_shape(nc) + tm_fill(c("GAM_mrf_re"), midpoint=c(0), title="Poisson GAM MRF RE")
The New York 8 county data set contains population standardized leukemia cases, with Z as a transformed rate:
NY8 <- st_read(system.file("shapes/NY8_utm18.shp", package="spData"))
## Reading layer `NY8_utm18' from data source `/home/rsb/lib/r_libs/spData/shapes/NY8_utm18.shp' using driver `ESRI Shapefile'
## Simple feature collection with 281 features and 17 fields
## geometry type: POLYGON
## dimension: XY
## bbox: xmin: 358241.9 ymin: 4649755 xmax: 480393.1 ymax: 4808545
## epsg (SRID): 32618
## proj4string: +proj=utm +zone=18 +ellps=WGS84 +units=m +no_defs
tm_shape(NY8) + tm_fill("Z")
## Variable "Z" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
Create a neighbour object:
NY_nb <- poly2nb(NY8)
NY_lw <- nb2listw(NY_nb, style="B")
Check how the SAR and CAR models behave, with and without case weights:
mod1 <- spautolm(Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data=NY8, family="CAR", listw=NY_lw, weights=POP8)
summary(mod1)
##
## Call:
## spautolm(formula = Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data = NY8,
## listw = NY_lw, weights = POP8, family = "CAR")
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.500566 -0.274566 0.087494 0.454262 4.257081
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.783775 0.142645 -5.4946 3.916e-08
## PEXPOSURE 0.078887 0.027922 2.8253 0.004724
## PCTAGE65P 3.841221 0.573083 6.7027 2.046e-11
## PCTOWNHOME -0.392319 0.154995 -2.5312 0.011368
##
## Lambda: 0.011751 LR test value: 0.11668 p-value: 0.73266
## Numerical Hessian standard error of lambda: 0.039471
##
## Log likelihood: -251.7067
## ML residual variance (sigma squared): 1105.1, (sigma: 33.243)
## Number of observations: 281
## Number of parameters estimated: 6
## AIC: 515.41
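A sketch of the remaining combinations suggested above (mod2 and mod3 are names chosen here); compare the Lambda estimates and LR tests:
mod2 <- spautolm(Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data=NY8,
                 family="SAR", listw=NY_lw, weights=POP8)
mod3 <- spautolm(Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data=NY8,
                 family="CAR", listw=NY_lw)
summary(mod2)
summary(mod3)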
This data set is used in Waller and Gotway (2004), and in both editions of ASDAR. It is harder to deploy Poisson models here because the cases are not integer counts.
sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Fedora 29 (Workstation Edition)
##
## Matrix products: default
## BLAS: /home/rsb/topics/R/R360-share/lib64/R/lib/libRblas.so
## LAPACK: /home/rsb/topics/R/R360-share/lib64/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
## [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
## [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] mgcv_1.8-28 hglm_2.2-1 hglm.data_1.0-1
## [4] MASS_7.3-51.4 spatialreg_1.1-4 igraph_1.2.4.1
## [7] Matrix_1.2-17 tmap_2.2 spdep_1.1-2
## [10] spData_0.3.0 sp_1.3-1 gstat_2.0-2
## [13] sf_0.7-5 spatstat_1.60-1 rpart_4.1-15
## [16] nlme_3.1-140 spatstat.data_1.4-0
##
## loaded via a namespace (and not attached):
## [1] satellite_1.0.1 xts_0.11-2 webshot_0.5.1
## [4] gmodels_2.18.1 RColorBrewer_1.1-2 mapview_2.7.0
## [7] tools_3.6.0 backports_1.1.4 metafor_2.1-0
## [10] utf8_1.1.4 rgdal_1.4-5 R6_2.4.0
## [13] KernSmooth_2.23-15 colorspace_1.4-1 rgeos_0.4-3
## [16] DBI_1.0.0 raster_2.9-5 tidyselect_0.2.5
## [19] leaflet_2.0.2 compiler_3.6.0 cli_1.1.0
## [22] expm_0.999-4 scales_1.0.0 classInt_0.3-4
## [25] goftest_1.1-1 stringr_1.4.0 digest_0.6.19
## [28] spatstat.utils_1.13-0 rmarkdown_1.13 base64enc_0.1-3
## [31] dichromat_2.0-0 pkgconfig_2.0.2 htmltools_0.3.6
## [34] htmlwidgets_1.3 rlang_0.4.0 FNN_1.1.3
## [37] shiny_1.3.2 BayesXsrc_3.0-1 generics_0.0.2
## [40] zoo_1.8-6 crosstalk_1.0.0 gtools_3.8.1
## [43] dplyr_0.8.2 magrittr_1.5 munsell_0.5.0
## [46] Rcpp_1.0.1 fansi_0.4.0 abind_1.4-5
## [49] stringi_1.4.3 yaml_2.2.0 tmaptools_2.0-1
## [52] grid_3.6.0 gdata_2.18.0 promises_1.0.1
## [55] crayon_1.3.4 deldir_0.1-21 lattice_0.20-38
## [58] splines_3.6.0 tensor_1.5 zeallot_0.1.0
## [61] knitr_1.23 pillar_1.4.2 boot_1.3-22
## [64] spacetime_1.2-2 stats4_3.6.0 codetools_0.2-16
## [67] LearnBayes_2.15.1 XML_3.98-1.20 glue_1.3.1
## [70] evaluate_0.14 png_0.1-7 vctrs_0.1.0
## [73] httpuv_1.5.1 RANN_2.6.1 purrr_0.3.2
## [76] polyclip_1.10-0 tidyr_0.8.3 assertthat_0.2.1
## [79] R2BayesX_1.1-1 xfun_0.8 mime_0.7
## [82] lwgeom_0.1-7 xtable_1.8-4 broom_0.5.2
## [85] e1071_1.7-2 coda_0.19-2 later_0.8.0
## [88] viridisLite_0.3.0 class_7.3-15 tibble_2.1.3
## [91] intervals_0.15.1 units_0.6-3 spDataLarge_0.3.1
Alam, Moudud, Lars Rönnegård, and Xia Shen. 2015. “Fitting Conditional and Simultaneous Autoregressive Spatial Models in Hglm.” The R Journal 7 (2): 5–18. https://doi.org/10.32614/RJ-2015-017.
Baddeley, Adrian, Ege Rubak, and Rolf Turner. 2015. Spatial Point Patterns: Methodology and Applications with R. Boca Raton, FL: Chapman and Hall/CRC.
Lovelace, Robin, Jakub Nowosad, and Jannes Muenchow. 2019. Geocomputation with R. Boca Raton, FL: Chapman and Hall/CRC.
Müller, Werner G. 2007. Collecting Spatial Data: Optimum Design of Experiments for Random Fields. Berlin: Springer-Verlag.
Pace, RK, and JP LeSage. 2008. “A Spatial Hausman Test.” Economics Letters 101: 282–84.
Ripley, B. D. 1981. Spatial Statistics. New York: Wiley.
Schratz, Patrick, Jannes Muenchow, Eugenia Iturritxa, Jakob Richter, and Alexander Brenning. 2019. “Hyperparameter Tuning and Performance Assessment of Statistical and Machine-Learning Algorithms Using Spatial Data.” Ecological Modelling 406: 109–20. https://doi.org/https://doi.org/10.1016/j.ecolmodel.2019.06.002.
Wall, M. M. 2004. “A Close Look at the Spatial Structure Implied by the CAR and SAR Models.” Journal of Statistical Planning and Inference 121: 311–24.
Waller, Lance A., and Carol A. Gotway. 2004. Applied Spatial Statistics for Public Health Data. Hoboken, NJ: John Wiley & Sons.