3  inference: spatial correlation, fitting models

3.1 Spatial correlation for point patterns,

Intro to spatstat

Consider a point pattern that consist of

  • a set of known coordinates
  • an observation window

We can ask ourselves: our point pattern be a realisation of a completely spatially random (CSR) process? A CSR process has

  1. a spatially constant intensity (mean: first order property)
  2. completely independent locations (interactions: second order property)

e.g.

library(spatstat)
# Loading required package: spatstat.data
# Loading required package: spatstat.univar
# spatstat.univar 3.1-1
# Loading required package: spatstat.geom
# spatstat.geom 3.3-4
# Loading required package: spatstat.random
# spatstat.random 3.3-2
# Loading required package: spatstat.explore
# Loading required package: nlme
# spatstat.explore 3.3-4
# Loading required package: spatstat.model
# Loading required package: rpart
# spatstat.model 3.3-3
# Loading required package: spatstat.linnet
# spatstat.linnet 3.2-3
# 
# spatstat 3.3-0 
# For an introduction to spatstat, type 'beginner'
set.seed(13431)
CSR = rpoispp(100)
plot(CSR)

Or does it have a non-constant intensity, but otherwise independent points?

set.seed(1357)
ppi = rpoispp(function(x,y,...) 500 * x)
plot(ppi, main = "inhomogeneous")

Or does it have constant intensity, but dependent points:

cl <- rThomas(100, .02, 5)
plot(cl, main = "clustered")

hc <- rHardcore(0.05,1.5,square(50)) 
plot(hc, main = "inhibition")

or a combination:

#ff <- function(x,y) { 4 * exp(2 * abs(x) - 1) }
set.seed(1357)
ff <- function(x,y) 10 * x
Z <- as.im(ff, owin())
Y <- rMatClust(10, 0.05, Z)
plot(Y)

Checking homogeneity

(q = quadrat.test(CSR))
# Warning: Some expected counts are small; chi^2 approximation may be
# inaccurate
# 
#   Chi-squared test of CSR using quadrat counts
# 
# data:  CSR
# X2 = 25, df = 24, p-value = 0.9
# alternative hypothesis: two.sided
# 
# Quadrats: 5 by 5 grid of tiles
plot(q)

(q = quadrat.test(ppi))
# 
#   Chi-squared test of CSR using quadrat counts
# 
# data:  ppi
# X2 = 88, df = 24, p-value = 6e-09
# alternative hypothesis: two.sided
# 
# Quadrats: 5 by 5 grid of tiles
plot(q)

Estimating density

  • main parameter: bandwidth (sigma): determines the amound of smoothing.
  • if sigma is not specified: uses bw.diggle, an automatically tuned bandwidth

Correction for edge effect?

density(CSR) |> plot()
plot(CSR, add = TRUE, col = 'green')

density(ppi) |> plot()
plot(ppi, add = TRUE, col = 'green')

density(ppi, sigma = .05) |> plot()
plot(ppi, add = TRUE, col = 'green')

Assessing interactions: clustering/inhibition

The K-function (“Ripley’s K”) is the expected number of additional random (CSR) points within a distance r of a typical random point in the observation window.

The G-function (nearest neighbour distance distribution) is the cumulative distribution function G of the distance from a typical random point of X to the nearest other point of X.

envelope(CSR, Lest) |> plot()
# Generating 99 simulations of CSR  ...
# 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
# 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
# 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
# 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68,
# 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85,
# 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 
# 99.
# 
# Done.

envelope(cl, Lest) |> plot()
# Generating 99 simulations of CSR  ...
# 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
# 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
# 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
# 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68,
# 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85,
# 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 
# 99.
# 
# Done.

envelope(hc, Lest) |> plot()
# Generating 99 simulations of CSR  ...
# 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
# 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
# 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
# 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68,
# 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85,
# 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 
# 99.
# 
# Done.

envelope(ppi, Lest) |> plot()
# Generating 99 simulations of CSR  ...
# 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
# 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
# 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
# 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68,
# 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85,
# 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 
# 99.
# 
# Done.

envelope(ppi, Linhom) |> plot()
# Generating 99 simulations of CSR  ...
# 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
# 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
# 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
# 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68,
# 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85,
# 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 
# 99.
# 
# Done.

envelope(Y , Lest) |> plot()
# Generating 99 simulations of CSR  ...
# 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
# 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
# 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
# 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68,
# 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85,
# 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 
# 99.
# 
# Done.

envelope(Y , Linhom) |> plot()
# Generating 99 simulations of CSR  ...
# 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
# 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
# 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
# 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68,
# 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85,
# 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 
# 99.
# 
# Done.

3.2 Spatial correlation for geostatistical data

gstat

R package gstat was written in 2002/3, from a stand-alone C program that was released under the GPL in 1997. It implements “basic” geostatistical functions for modelling spatial dependence (variograms), kriging interpolation and conditional simulation. It can be used for multivariable kriging (cokriging), as well as for spatiotemporal variography and kriging. Recent updates included support for sf and stars objects.

What are geostatistical data?

Recall from day 1: locations + measured values

  • The value of interest is measured at a set of sample locations
  • At other location, this value exists but is missing
  • The interest is in estimating (predicting) this missing value (interpolation)
  • The actual sample locations are not of (primary) interest, the signal is in the measured values
library(sf)
# Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.4.0; sf_use_s2() is TRUE
no2 <- read.csv(system.file("external/no2.csv",
    package = "gstat"))
crs <- st_crs("EPSG:32632") # a csv doesn't carry a CRS!
st_as_sf(no2, crs = "OGC:CRS84", coords =
    c("station_longitude_deg", "station_latitude_deg")) |>
    st_transform(crs) -> no2.sf
library(ggplot2)
# plot(st_geometry(no2.sf))
"https://github.com/edzer/sdsr/raw/main/data/de_nuts1.gpkg" |>
  read_sf() |>
  st_transform(crs) -> de
ggplot() + geom_sf(data = de) +
    geom_sf(data = no2.sf, mapping = aes(col = NO2))

Spatial correlation

Lagged scatterplots

“by hand”, base R:

(w = st_is_within_distance(no2.sf, no2.sf, units::set_units(50, km), 
                          retain_unique = TRUE))
# Sparse geometry binary predicate list of length 74, where
# the predicate was `is_within_distance', with retain_unique =
# TRUE
# first 10 elements:
#  1: (empty)
#  2: (empty)
#  3: 4, 5, 26
#  4: 5, 26
#  5: (empty)
#  6: 30, 72
#  7: (empty)
#  8: (empty)
#  9: (empty)
#  10: (empty)
d = as.data.frame(w)
x = no2.sf$NO2[d$row.id]
y = no2.sf$NO2[d$col.id]
cor(x, y)
# [1] 0.296
plot(x, y, main = "lagged scatterplot")
abline(0, 1)

using gstat:

library(gstat)
# 
# Attaching package: 'gstat'
# The following object is masked from 'package:spatstat.explore':
# 
#     idw
hscat(NO2~1, no2.sf, breaks = c(0,50,100,150,200,250)*1000)

Variogram

When we assume \(Z(s)\) has a constant and unknown mean, the spatial dependence can be described by the variogram, defined as \(\gamma(h) = 0.5 E(Z(s)-Z(s+h))^2\). If the random process \(Z(s)\) has a finite variance, then the variogram is related to the covariance function \(C(h)\) by \(\gamma(h) = C(0)-C(h)\).

The variogram can be estimated from sample data by averaging squared differences: \[\hat{\gamma}(\tilde{h})=\frac{1}{2N_h}\sum_{i=1}^{N_h}(Z(s_i)-Z(s_i+h))^2 \ \ h \in \tilde{h}\]

  • divide by \(2N_h\):
    • if finite, \(\gamma(\infty)=\sigma^2=C(0)\)
    • semi variance
  • if data are not gridded, group \(N_h\) pairs \(s_i,s_i+h\) for which \(h \in \tilde{h}\), \(\tilde{h}=[h_1,h_2]\)
  • rule-of-thumb: choose about 10-25 distance intervals \(\tilde{h}\), from length 0 to about on third of the area size
  • plot \(\gamma\) against \(\tilde{h}\) taken as the average value of all \(h \in \tilde{h}\)

We can compute a variogram “by hand”, using base R:

z = no2.sf$NO2
z2 = 0.5 * outer(z, z, FUN = "-")^2 # (Z(s)-Z(s+h))^2
d = as.matrix(st_distance(no2.sf))  # h
vcloud = data.frame(dist = as.vector(d), gamma = as.vector(z2))
vcloud = vcloud[vcloud$dist != 0,]
vcloud$dclass = cut(vcloud$dist, c(0, 50, 100, 150, 200, 250, 300, 350) * 1000)
v = aggregate(gamma~dclass, vcloud, mean)
plot(gamma ~ dclass, v, ylim = c(0, 20))

using gstat:

vv = variogram(NO2~1, no2.sf, width = 50000, cutoff = 350000)
vv$gamma - v$gamma
# [1]  3.55e-15 -3.55e-15  5.33e-15 -5.33e-15  7.11e-15 -1.42e-14
# [7]  5.33e-15
plot(vv)

Fit a variogram model

# The sample variogram:
v = variogram(NO2~1, no2.sf)
plot(v)

fit a model, e.g. an exponential model:

v.fit = fit.variogram(v, vgm(1, "Exp", 50000))
plot(v, v.fit)

3.3 Spatial correlation in lattice data

Analysing lattice data: neighbours, weights, models

library(sf)
data(pol_pres15, package = "spDataLarge")
pol_pres15 |>
    subset(select = c(TERYT, name, types)) |>
    head()
# Simple feature collection with 6 features and 3 fields
# Geometry type: MULTIPOLYGON
# Dimension:     XY
# Bounding box:  xmin: 235000 ymin: 367000 xmax: 281000 ymax: 413000
# Projected CRS: ETRF2000-PL / CS92
#    TERYT                name       types
# 1 020101         BOLESŁAWIEC       Urban
# 2 020102         BOLESŁAWIEC       Rural
# 3 020103            GROMADKA       Rural
# 4 020104        NOWOGRODZIEC Urban/rural
# 5 020105          OSIECZNICA       Rural
# 6 020106 WARTA BOLESŁAWIECKA       Rural
#                         geometry
# 1 MULTIPOLYGON (((261089 3855...
# 2 MULTIPOLYGON (((254150 3837...
# 3 MULTIPOLYGON (((275346 3846...
# 4 MULTIPOLYGON (((251770 3770...
# 5 MULTIPOLYGON (((263424 4060...
# 6 MULTIPOLYGON (((267031 3870...
library(tmap, warn.conflicts = FALSE)
tm_shape(pol_pres15) + tm_fill("types")

We need to make the geometries valid first,

st_is_valid(pol_pres15) |> all()
# [1] FALSE
pol_pres15 <- st_make_valid(pol_pres15)
st_is_valid(pol_pres15) |> all()
# [1] TRUE

First, we will consider polygons in relationship to their direct neighbours

library(spdep)
# Loading required package: spData
pol_pres15 |> poly2nb(queen = TRUE) -> nb_q
nb_q
# Neighbour list object:
# Number of regions: 2495 
# Number of nonzero links: 14242 
# Percentage nonzero weights: 0.229 
# Average number of links: 5.71

Is the graph connected?

(nb_q |> n.comp.nb())$nc
# [1] 1
par(mar = rep(0, 4))
pol_pres15 |>
    st_geometry() |>
    st_centroid(of_largest_polygon = TRUE) -> coords
plot(st_geometry(pol_pres15), border = 'grey')
plot(nb_q, coords = coords, add = TRUE, points = FALSE)

Alternative approaches to form neighbourhood matrices:

  • based on distance, e.g. setting a distance threshold or selecting a fixed number of nearest neighbours
  • based on triangulating points, for instance polygon centroids
  • sphere of influence, a modification of triangulation
  • include neighbours from neighbours

Weights matrices

Weight matrices are needed in analysis, they determine how observations (or residuals) are weighted in a regression model.

(nb_q |> nb2listw(style = "B") -> lw_q_B)
# Characteristics of weights list object:
# Neighbour list object:
# Number of regions: 2495 
# Number of nonzero links: 14242 
# Percentage nonzero weights: 0.229 
# Average number of links: 5.71 
# 
# Weights style: B 
# Weights constants summary:
#      n      nn    S0    S1     S2
# B 2495 6225025 14242 28484 357280

Spatial correlation: Moran’s I

Moran’s I is defined as

\[ I = \frac{n \sum_{(2)} w_{ij} z_i z_j}{S_0 \sum_{i=1}^{n} z_i^2} \] where \(x_i, i=1, \ldots, n\) are \(n\) observations on the numeric variable of interest, \(z_i = x_i - \bar{x}\), \(\bar{x} = \sum_{i=1}^{n} x_i / n\), \(\sum_{(2)} = \stackrel{\sum_{i=1}^{n} \sum_{j=1}^{n}}{i \neq j}\), \(w_{ij}\) are the spatial weights, and \(S_0 = \sum_{(2)} w_{ij}\).

We can compute it as

pol_pres15$I_turnout |>
    moran.test(lw_q_B, randomisation = FALSE,
               alternative = "two.sided")
# 
#   Moran I test under normality
# 
# data:  pol_pres15$I_turnout  
# weights: lw_q_B    
# 
# Moran I statistic standard deviate = 58, p-value <2e-16
# alternative hypothesis: two.sided
# sample estimates:
# Moran I statistic       Expectation          Variance 
#          0.691434         -0.000401          0.000140
plot(pol_pres15["I_turnout"])

A simple linear regression model, assuming independent observations, can be carried out using lm:

summary(pol_pres15$I_entitled_to_vote)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#    1308    4026    6033   12221   10524  594643
(lm0 <- lm(I_turnout ~ I_entitled_to_vote, pol_pres15)) |> summary()
# 
# Call:
# lm(formula = I_turnout ~ I_entitled_to_vote, data = pol_pres15)
# 
# Residuals:
#      Min       1Q   Median       3Q      Max 
# -0.21352 -0.04387 -0.00092  0.04150  0.23611 
# 
# Coefficients:
#                    Estimate Std. Error t value Pr(>|t|)    
# (Intercept)        4.39e-01   1.34e-03   328.1   <2e-16 ***
# I_entitled_to_vote 5.26e-07   4.18e-08    12.6   <2e-16 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 0.0618 on 2493 degrees of freedom
# Multiple R-squared:  0.0598,  Adjusted R-squared:  0.0595 
# F-statistic:  159 on 1 and 2493 DF,  p-value: <2e-16
pol_pres15$res = residuals(lm0)
plot(pol_pres15["res"])

A spatial linear regression model (SEM: spatial error model), assuming independent observations, can be carried out using lm:

form = I_turnout ~ I_entitled_to_vote
library(spatialreg)
# Loading required package: Matrix
# 
# Attaching package: 'spatialreg'
# The following objects are masked from 'package:spdep':
# 
#     get.ClusterOption, get.coresOption, get.mcOption,
#     get.VerboseOption, get.ZeroPolicyOption,
#     set.ClusterOption, set.coresOption, set.mcOption,
#     set.VerboseOption, set.ZeroPolicyOption
SEM_pres <- errorsarlm(form, data = pol_pres15, Durbin = FALSE,
      listw = lw_q_B, zero.policy = TRUE) 
SEM_pres |> summary()
# 
# Call:errorsarlm(formula = form, data = pol_pres15, listw = lw_q_B, 
#     Durbin = FALSE, zero.policy = TRUE)
# 
# Residuals:
#        Min         1Q     Median         3Q        Max 
# -0.1483326 -0.0266702 -0.0025573  0.0217927  0.1659212 
# 
# Type: error 
# Coefficients: (asymptotic standard errors) 
#                      Estimate Std. Error  z value  Pr(>|z|)
# (Intercept)        4.5887e-01 2.3544e-03 194.8968 < 2.2e-16
# I_entitled_to_vote 6.8492e-08 2.5677e-08   2.6675  0.007642
# 
# Lambda: 0.138, LR test value: 1964, p-value: < 2.22e-16
# Asymptotic standard error: 0.00205
#     z-value: 67.3, p-value: < 2.22e-16
# Wald statistic: 4527, p-value: < 2.22e-16
# 
# Log likelihood: 4390 for error model
# ML residual variance (sigma squared): 0.00147, (sigma: 0.0383)
# Number of observations: 2495 
# Number of parameters estimated: 4 
# AIC: -8771, (AIC for lm: -6809)

3.4 Fitting regression models under spatial correlation

Exercises Point Patterns

  1. From the point pattern of wind turbines shown in section 1.4, download the data as GeoPackage, and read them into R
  2. Read the boundary of Germany using rnaturalearth::ne_countries(scale = "large", country = "Germany")
  3. Create a plot showing both the observation window and the point pattern
  4. Do all observations fall inside the observation window?
  5. Create a ppp object from the points and the window
  6. Test whether the point pattern is homogeneous
  7. Create a plot with the (estimated) density of the wind turbines, with the turbine points added
  8. Verify that the mean density multiplied by the area of the window approximates the number of turbines
  9. Test for interaction: create diagnostic plots to verify whether the point pattern is clustered, or exhibits repulsion

3.5 Exercises Geostat.

  1. Compute the variogram cloud of NO2 using variogram() and argument cloud = TRUE. (a) How does the resulting object differ from the “regular” variogram (use the head command on both objects); (b) what do the “left” and “right” fields refer to? (c) when we plot the resulting variogram cloud object, does it still indicate spatial correlation?
  2. Compute the variogram of NO2 as above, and change the arguments cutoff and width into very large or small values. What do they do?
  3. Fit a spherical model to the sample variogram of NO2, using fit.variogram() (follow the example below, replace “Exp” with “Sph”)
  4. Fit a Matern model (“Mat”) to the sample variogram using different values for kappa (e.g., 0.3 and 4), and plot the resulting models with the sample variogram.
  5. Which model do you like the best? Can the SSErr attribute of the fitted model be used to compare the models? How else can variogram model fits be compared?

3.6 Exercises Lattice data

  1. Compare the results of the simple linear regression with the spatial error model
  2. Compare the maps of residuals of both models
  3. Fit a spatial Durbin error model (SDEM), using Durbin = TRUE in the same call to errorsarlm; compare the output of the Spatial Durbin model with that of the error model.
  4. carry out a likelyhood ratio test to compare the SEM and SDEM models (lmtest::lrtest(), see the SDS book Ch 17)

3.7 Further reading

  • E. Pebesma, 2018. Simple Features for R: Standardized Support for Spatial Vector Data. The R Journal 10:1, 439-446.
  • A. Baddeley, E. Rubak and R Turner, 2016. Spatial Point Patterns: methodology and Applications in R; Chapman and Hall/CRC 810 pages.
  • J. Illian, A. Penttinen, H. Stoyan and D. Stoyan, 2008. Statistical Analysis and Modelling of Spatial Point Patterns; Wiley, 534 pages.