3 Geostatistical data

Learning goals

Get familiar with geostatistical data and spatial interpolation
Get familiar with concepts of geostatistics: stationarity, variogram, kriging, conditional simulation
Get an idea what spatiotemporal geostatistics is about

Reading materials

From Spatial Data Science: with applications in R:

Chapter 12: Spatial Interpolation
Chapter 13: Multivariate and Spatiotemporal Geostatistics

Summary

Intro to gstat
Geostatistical data
Spatial correlation, variograms, stationarity
Kriging
Simulating geostatistical data
Spatiotemporal geostatistics

3.1 `gstat`

R package gstat was written in 2002/3, from a stand-alone C program that was released under the GPL in 1997. It implements “basic” geostatistical functions for modelling spatial dependence (variograms), kriging interpolation and conditional simulation. It can be used for multivariable kriging (cokriging), as well as for spatiotemporal variography and kriging. Recent updates included support for sf and stars objects.

3.2 What are geostatistical data?

Recall from day 1: locations + measured values

The value of interest is measured at a set of sample locations
At other location, this value exists but is missing
The interest is in estimating (predicting) this missing value (interpolation)
The actual sample locations are not of (primary) interest, the signal is in the measured values

library(sf)
# Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.4.0; sf_use_s2() is TRUE
no2 <- read.csv(system.file("external/no2.csv",
    package = "gstat"))
crs <- st_crs("EPSG:32632") # a csv doesn't carry a CRS!
st_as_sf(no2, crs = "OGC:CRS84", coords =
    c("station_longitude_deg", "station_latitude_deg")) |>
    st_transform(crs) -> no2.sf
library(ggplot2)
# plot(st_geometry(no2.sf))
"https://github.com/edzer/sdsr/raw/main/data/de_nuts1.gpkg" |>
  read_sf() |>
  st_transform(crs) -> de
ggplot() + geom_sf(data = de) +
    geom_sf(data = no2.sf, mapping = aes(col = NO2))

3.3 Spatial correlation

Lagged scatterplots

“by hand”, base R:

(w = st_is_within_distance(no2.sf, no2.sf, units::set_units(50, km), 
                          retain_unique = TRUE))
# Sparse geometry binary predicate list of length 74, where
# the predicate was `is_within_distance', with retain_unique =
# TRUE
# first 10 elements:
#  1: (empty)
#  2: (empty)
#  3: 4, 5, 26
#  4: 5, 26
#  5: (empty)
#  6: 30, 72
#  7: (empty)
#  8: (empty)
#  9: (empty)
#  10: (empty)
d = as.data.frame(w)
x = no2.sf$NO2[d$row.id]
y = no2.sf$NO2[d$col.id]
cor(x, y)
# [1] 0.296
plot(x, y, main = "lagged scatterplot")
abline(0, 1)

using gstat:

library(gstat)
hscat(NO2~1, no2.sf, breaks = c(0,50,100,150,200,250)*1000)

Variogram

When we assume \(Z(s)\) has a constant and unknown mean, the spatial dependence can be described by the variogram, defined as \(\gamma(h) = 0.5 E(Z(s)-Z(s+h))^2\). If the random process \(Z(s)\) has a finite variance, then the variogram is related to the covariance function \(C(h)\) by \(\gamma(h) = C(0)-C(h)\).

The variogram can be estimated from sample data by averaging squared differences: \[\hat{\gamma}(\tilde{h})=\frac{1}{2N_h}\sum_{i=1}^{N_h}(Z(s_i)-Z(s_i+h))^2 \ \ h \in \tilde{h}\]

divide by \(2N_h\):
- if finite, \(\gamma(\infty)=\sigma^2=C(0)\)
- semi variance
if data are not gridded, group \(N_h\) pairs \(s_i,s_i+h\) for which \(h \in \tilde{h}\), \(\tilde{h}=[h_1,h_2]\)
rule-of-thumb: choose about 10-25 distance intervals \(\tilde{h}\), from length 0 to about on third of the area size
plot \(\gamma\) against \(\tilde{h}\) taken as the average value of all \(h \in \tilde{h}\)

We can compute a variogram “by hand”, using base R:

z = no2.sf$NO2
z2 = 0.5 * outer(z, z, FUN = "-")^2 # (Z(s)-Z(s+h))^2
d = as.matrix(st_distance(no2.sf))  # h
vcloud = data.frame(dist = as.vector(d), gamma = as.vector(z2))
vcloud = vcloud[vcloud$dist != 0,]
vcloud$dclass = cut(vcloud$dist, c(0, 50, 100, 150, 200, 250, 300, 350) * 1000)
v = aggregate(gamma~dclass, vcloud, mean)
plot(gamma ~ dclass, v, ylim = c(0, 20))

using gstat:

vv = variogram(NO2~1, no2.sf, width = 50000, cutoff = 350000)
vv$gamma - v$gamma
# [1]  3.55e-15 -3.55e-15  5.33e-15 -5.33e-15  7.11e-15 -1.42e-14
# [7]  5.33e-15
plot(vv)

3.4 Exercises

Compute the variogram cloud of NO2 using variogram() and argument cloud = TRUE. (a) How does the resulting object differ from the “regular” variogram (use the head command on both objects); (b) what do the “left” and “right” fields refer to? (c) when we plot the resulting variogram cloud object, does it still indicate spatial correlation?
Compute the variogram of NO2 as above, and change the arguments cutoff and width into very large or small values. What do they do?
Fit a spherical model to the sample variogram of NO2, using fit.variogram() (follow the example below, replace “Exp” with “Sph”)
Fit a Matern model (“Mat”) to the sample variogram using different values for kappa (e.g., 0.3 and 4), and plot the resulting models with the sample variogram.
Which model do you like the best? Can the SSErr attribute of the fitted model be used to compare the models? How else can variogram model fits be compared?

3.5 Interpolation

For interpolation, we first need a target grid (point patterns have an observation window, geostatistical data do not!)

A simple interpolator (that is hard to beat) is the inverse distance interpolator, \[\hat{Z}(s_0) = \sum_{j=1}^n \lambda_j Z(s_i)\] with \(\lambda_j\) proportional to \(||s_i - s_0||^{-p}\) and normalized to sum to one (weighted mean), and \(p\) tunable but defaulting to 2.

Using the data range:

library(stars)
# Loading required package: abind
g1 = st_as_stars(st_bbox(no2.sf))
library(gstat)
idw(NO2~1, no2.sf, g1) |> plot(reset = FALSE)
# [inverse distance weighted interpolation]
plot(st_geometry(no2.sf), add = TRUE, col = 'yellow')
plot(st_cast(st_geometry(de), "MULTILINESTRING"), add = TRUE, col = 'red')

Better to use the outer polygon:

g2 = st_as_stars(st_bbox(de))
idw(NO2~1, no2.sf, g2) |> plot(reset = FALSE)
# [inverse distance weighted interpolation]
plot(st_geometry(no2.sf), add = TRUE, col = 'yellow')
plot(st_cast(st_geometry(de), "MULTILINESTRING"), add = TRUE, col = 'red')

And crop to (mask out outside) the area of interest:

g3 = st_crop(g2, de)
i = idw(NO2~1, no2.sf, g3) 
# [inverse distance weighted interpolation]
plot(i, reset = FALSE, main = "yearly mean NO2, rural background")
plot(st_geometry(no2.sf), add = TRUE, col = 'yellow')
plot(st_cast(st_geometry(de), "MULTILINESTRING"), add = TRUE, col = 'red')

Geostatistical approaches compute weights based on covariances between observations \(Z(s_i)\), and between observations and the value at the interpolation location \(Z(s_0)\). These covariances are obtained from a model fitted to the sample variogram.

3.6 Fit a variogram model

# The sample variogram:
v = variogram(NO2~1, no2.sf)
plot(v)

fit a model, e.g. an exponential model:

v.fit = fit.variogram(v, vgm(1, "Exp", 50000))
plot(v, v.fit)

3.7 BLUP/Kriging

Given this model, we can interpolate using the best unbiased linear predictor (BLUP), also called kriging predictor. Under the model \(Z(s)=m+e(s)\) it estimates \(m\) using generalized least squares, and predicts \(e(s)\) using a weighted mean, where weights are chosen such that \(Var(Z(s_0)-\hat{Z}(s_0))\) is minimized.

k = krige(NO2~1, no2.sf, g3, v.fit)
# [using ordinary kriging]
k$idw = i$var1.pred
k$kriging = k$var1.pred
hook = function() {
  plot(st_geometry(no2.sf), add = TRUE, col = 'yellow')
  plot(st_cast(st_geometry(de), "MULTILINESTRING"), add = TRUE, col = 'red')
}
plot(merge(k[c("kriging", "idw")]), hook = hook, breaks = "equal")

Density or interpolation?

Both density maps shown in the Point Pattern section and interpolated maps shown in this section look very similar:

raster maps with continuous values
smooth spatial patterns

The differences could not be larger!

point density estimates estimate the number of points per unit area; the values are (normalized) counts
interpolated maps estimate an unmeasured continuous variable; the values are weighted averages of an attribute

To illustrate the difference between density and interpolated values:

# Loading required package: spatstat.data
# Loading required package: spatstat.univar
# spatstat.univar 3.0-1
# Loading required package: spatstat.geom
# spatstat.geom 3.3-3
# Loading required package: spatstat.random
# spatstat.random 3.3-2
# Loading required package: spatstat.explore
# Loading required package: nlme
# spatstat.explore 3.3-2
# 
# Attaching package: 'spatstat.explore'
# The following object is masked from 'package:gstat':
# 
#     idw
# Loading required package: spatstat.model
# Loading required package: rpart
# spatstat.model 3.3-2
# Loading required package: spatstat.linnet
# spatstat.linnet 3.2-2
# 
# spatstat 3.2-1 
# For an introduction to spatstat, type 'beginner'

3.8 Kriging with a non-constant mean

Under the model \(Z(s) = X(s)\beta + e(s)\), \(\beta\) is estimated using generalized least squares, and the variogram of regression residuals is needed; see Ch 12.

3.9 Conditional simulation

Simulating spatially correlated data

Using a coarse grid, with base R:

set.seed(13579)
g2c = st_as_stars(st_bbox(de), dx = 15000)
g3c = st_crop(g2c, de)
p = st_as_sf(g3c, as_points = TRUE)
d = st_distance(p)
Sigma = variogramLine(v.fit, covariance = TRUE, dist_vector = d)
n = 100
ch = chol(Sigma)
sim = matrix(rnorm(n * nrow(ch)), nrow = n) %*% ch + mean(no2.sf$NO2)
for (i in seq_len(n)) {
    m = g3c[[1]]
    m[!is.na(m)] = sim[i,]
    g3c[[ paste0("sim", i) ]] = m
}
plot(merge(g3c[2:11]), breaks = "equal")

As a check, we could compute the variogram of some of the realisations:

g3c["sim4"] |> 
  st_as_sf() |> 
  variogram(sim4~1, data = _) |> 
  plot(model = v.fit)

g3c["sim5"] |> 
  st_as_sf() |> 
  variogram(sim5~1, data = _) |> 
  plot(model = v.fit, ylim = c(0,17.5))

g3c["sim6"] |> 
  st_as_sf() |> 
  variogram(sim6~1, data = _) |> 
  plot(model = v.fit, ylim = c(0,17.5))

The mean of these simulations is constant, not related to measured values:

st_apply(merge(g3c[-1]), c("x", "y"), mean) |> plot()

mean(no2.sf$NO2)
# [1] 8.38

Conditioning simulations on measured values can be done with gstat, using conditional simulation

cs = krige(NO2~1, no2.sf, g3, v.fit, nsim = 50, nmax = 30)
# drawing 50 GLS realisations of beta...
# [using conditional Gaussian simulation]
plot(cs[,,,1:10])

We see that these simulations are much more alike; also their mean and variance resemble that of the kriging mean and variance:

csm = st_apply(cs, c("x", "y"), mean)
csm$kriging = krige(NO2~1, no2.sf, g3, v.fit)[1]
# [using ordinary kriging]
plot(merge(csm), breaks = "equal")

csv = st_apply(cs, c("x", "y"), var)
csv$kr_var = krige(NO2~1, no2.sf, g3, v.fit)[2]
# [using ordinary kriging]
plot(merge(csv), breaks = "equal")

3.10 Exercises

What causes the differences between the mean and the variance of the simulations (left) and the mean and variance obtained by kriging (right)?
When comparing (above) the sample variogram of simulated fields with the variogram model used to simulate them, where do you see differences? Can you explain why these are not identical, or hypothesize under which circumstances these would become (more) identical?
Under which practical data analysis problem would conditional simulations be more useful than the kriging prediction + kriging variance maps?

3.11 Further reading

Pebesma, E.J., 2004. Multivariable geostatistics in S: the gstat package. Computers & Geosciences, 30: 683-691.
Benedikt Gräler, Edzer Pebesma and Gerard Heuvelink, 2016. Spatio-Temporal Interpolation using gstat. The R Journal 8(1), 204-218

Learning goals

Reading materials

3.1 gstat

3.2 What are geostatistical data?

3.3 Spatial correlation

Lagged scatterplots

Variogram

3.4 Exercises

3.5 Interpolation

3.6 Fit a variogram model

3.7 BLUP/Kriging

3.8 Kriging with a non-constant mean

3.9 Conditional simulation

Simulating spatially correlated data

3.10 Exercises

3.11 Further reading

3.1 `gstat`