library(sf)
# Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.4.0; sf_use_s2() is TRUE
no2 <- read.csv(system.file("external/no2.csv",
package = "gstat"))
crs <- st_crs("EPSG:32632") # a csv doesn't carry a CRS!
st_as_sf(no2, crs = "OGC:CRS84", coords =
c("station_longitude_deg", "station_latitude_deg")) |>
st_transform(crs) -> no2.sf
library(ggplot2)
# plot(st_geometry(no2.sf))
"https://github.com/edzer/sdsr/raw/main/data/de_nuts1.gpkg" |>
read_sf() |>
st_transform(crs) -> de
ggplot() + geom_sf(data = de) +
geom_sf(data = no2.sf, mapping = aes(col = NO2))
3 Geostatistical data
Learning goals
- Get familiar with geostatistical data and spatial interpolation
- Get familiar with concepts of geostatistics: stationarity, variogram, kriging, conditional simulation
- Get an idea what spatiotemporal geostatistics is about
Reading materials
From Spatial Data Science: with applications in R:
- Chapter 12: Spatial Interpolation
- Chapter 13: Multivariate and Spatiotemporal Geostatistics
- Intro to
gstat
- Geostatistical data
- Spatial correlation, variograms, stationarity
- Kriging
- Simulating geostatistical data
- Spatiotemporal geostatistics
3.1 gstat
R package gstat
was written in 2002/3, from a stand-alone C program that was released under the GPL in 1997. It implements “basic” geostatistical functions for modelling spatial dependence (variograms), kriging interpolation and conditional simulation. It can be used for multivariable kriging (cokriging), as well as for spatiotemporal variography and kriging. Recent updates included support for sf
and stars
objects.
3.2 What are geostatistical data?
Recall from day 1: locations + measured values
- The value of interest is measured at a set of sample locations
- At other location, this value exists but is missing
- The interest is in estimating (predicting) this missing value (interpolation)
- The actual sample locations are not of (primary) interest, the signal is in the measured values
3.3 Spatial correlation
Lagged scatterplots
“by hand”, base R:
(w = st_is_within_distance(no2.sf, no2.sf, units::set_units(50, km),
retain_unique = TRUE))
# Sparse geometry binary predicate list of length 74, where
# the predicate was `is_within_distance', with retain_unique =
# TRUE
# first 10 elements:
# 1: (empty)
# 2: (empty)
# 3: 4, 5, 26
# 4: 5, 26
# 5: (empty)
# 6: 30, 72
# 7: (empty)
# 8: (empty)
# 9: (empty)
# 10: (empty)
d = as.data.frame(w)
x = no2.sf$NO2[d$row.id]
y = no2.sf$NO2[d$col.id]
cor(x, y)
# [1] 0.296
plot(x, y, main = "lagged scatterplot")
abline(0, 1)
using gstat:
Variogram
When we assume \(Z(s)\) has a constant and unknown mean, the spatial dependence can be described by the variogram, defined as \(\gamma(h) = 0.5 E(Z(s)-Z(s+h))^2\). If the random process \(Z(s)\) has a finite variance, then the variogram is related to the covariance function \(C(h)\) by \(\gamma(h) = C(0)-C(h)\).
The variogram can be estimated from sample data by averaging squared differences: \[\hat{\gamma}(\tilde{h})=\frac{1}{2N_h}\sum_{i=1}^{N_h}(Z(s_i)-Z(s_i+h))^2 \ \ h \in \tilde{h}\]
- divide by \(2N_h\):
- if finite, \(\gamma(\infty)=\sigma^2=C(0)\)
- semi variance
- if data are not gridded, group \(N_h\) pairs \(s_i,s_i+h\) for which \(h \in \tilde{h}\), \(\tilde{h}=[h_1,h_2]\)
- rule-of-thumb: choose about 10-25 distance intervals \(\tilde{h}\), from length 0 to about on third of the area size
- plot \(\gamma\) against \(\tilde{h}\) taken as the average value of all \(h \in \tilde{h}\)
We can compute a variogram “by hand”, using base R:
z = no2.sf$NO2
z2 = 0.5 * outer(z, z, FUN = "-")^2 # (Z(s)-Z(s+h))^2
d = as.matrix(st_distance(no2.sf)) # h
vcloud = data.frame(dist = as.vector(d), gamma = as.vector(z2))
vcloud = vcloud[vcloud$dist != 0,]
vcloud$dclass = cut(vcloud$dist, c(0, 50, 100, 150, 200, 250, 300, 350) * 1000)
v = aggregate(gamma~dclass, vcloud, mean)
plot(gamma ~ dclass, v, ylim = c(0, 20))
using gstat:
3.4 Exercises
- Compute the variogram cloud of NO2 using
variogram()
and argumentcloud = TRUE
. (a) How does the resulting object differ from the “regular” variogram (use thehead
command on both objects); (b) what do the “left” and “right” fields refer to? (c) when we plot the resulting variogram cloud object, does it still indicate spatial correlation? - Compute the variogram of NO2 as above, and change the arguments
cutoff
andwidth
into very large or small values. What do they do? - Fit a spherical model to the sample variogram of NO2, using
fit.variogram()
(follow the example below, replace “Exp” with “Sph”) - Fit a Matern model (“Mat”) to the sample variogram using different values for kappa (e.g., 0.3 and 4), and plot the resulting models with the sample variogram.
- Which model do you like the best? Can the SSErr attribute of the fitted model be used to compare the models? How else can variogram model fits be compared?
3.5 Interpolation
For interpolation, we first need a target grid (point patterns have an observation window, geostatistical data do not!)
A simple interpolator (that is hard to beat) is the inverse distance interpolator, \[\hat{Z}(s_0) = \sum_{j=1}^n \lambda_j Z(s_i)\] with \(\lambda_j\) proportional to \(||s_i - s_0||^{-p}\) and normalized to sum to one (weighted mean), and \(p\) tunable but defaulting to 2.
Using the data range:
library(stars)
# Loading required package: abind
g1 = st_as_stars(st_bbox(no2.sf))
library(gstat)
idw(NO2~1, no2.sf, g1) |> plot(reset = FALSE)
# [inverse distance weighted interpolation]
plot(st_geometry(no2.sf), add = TRUE, col = 'yellow')
plot(st_cast(st_geometry(de), "MULTILINESTRING"), add = TRUE, col = 'red')
Better to use the outer polygon:
g2 = st_as_stars(st_bbox(de))
idw(NO2~1, no2.sf, g2) |> plot(reset = FALSE)
# [inverse distance weighted interpolation]
plot(st_geometry(no2.sf), add = TRUE, col = 'yellow')
plot(st_cast(st_geometry(de), "MULTILINESTRING"), add = TRUE, col = 'red')
And crop to (mask out outside) the area of interest:
g3 = st_crop(g2, de)
i = idw(NO2~1, no2.sf, g3)
# [inverse distance weighted interpolation]
plot(i, reset = FALSE, main = "yearly mean NO2, rural background")
plot(st_geometry(no2.sf), add = TRUE, col = 'yellow')
plot(st_cast(st_geometry(de), "MULTILINESTRING"), add = TRUE, col = 'red')
Geostatistical approaches compute weights based on covariances between observations \(Z(s_i)\), and between observations and the value at the interpolation location \(Z(s_0)\). These covariances are obtained from a model fitted to the sample variogram.
3.6 Fit a variogram model
fit a model, e.g. an exponential model:
v.fit = fit.variogram(v, vgm(1, "Exp", 50000))
plot(v, v.fit)
3.7 BLUP/Kriging
Given this model, we can interpolate using the best unbiased linear predictor (BLUP), also called kriging predictor. Under the model \(Z(s)=m+e(s)\) it estimates \(m\) using generalized least squares, and predicts \(e(s)\) using a weighted mean, where weights are chosen such that \(Var(Z(s_0)-\hat{Z}(s_0))\) is minimized.
k = krige(NO2~1, no2.sf, g3, v.fit)
# [using ordinary kriging]
k$idw = i$var1.pred
k$kriging = k$var1.pred
hook = function() {
plot(st_geometry(no2.sf), add = TRUE, col = 'yellow')
plot(st_cast(st_geometry(de), "MULTILINESTRING"), add = TRUE, col = 'red')
}
plot(merge(k[c("kriging", "idw")]), hook = hook, breaks = "equal")
Both density maps shown in the Point Pattern section and interpolated maps shown in this section look very similar:
- raster maps with continuous values
- smooth spatial patterns
The differences could not be larger!
- point density estimates estimate the number of points per unit area; the values are (normalized) counts
- interpolated maps estimate an unmeasured continuous variable; the values are weighted averages of an attribute
To illustrate the difference between density and interpolated values:
# Loading required package: spatstat.data
# Loading required package: spatstat.univar
# spatstat.univar 3.0-1
# Loading required package: spatstat.geom
# spatstat.geom 3.3-3
# Loading required package: spatstat.random
# spatstat.random 3.3-2
# Loading required package: spatstat.explore
# Loading required package: nlme
# spatstat.explore 3.3-2
#
# Attaching package: 'spatstat.explore'
# The following object is masked from 'package:gstat':
#
# idw
# Loading required package: spatstat.model
# Loading required package: rpart
# spatstat.model 3.3-2
# Loading required package: spatstat.linnet
# spatstat.linnet 3.2-2
#
# spatstat 3.2-1
# For an introduction to spatstat, type 'beginner'
3.8 Kriging with a non-constant mean
Under the model \(Z(s) = X(s)\beta + e(s)\), \(\beta\) is estimated using generalized least squares, and the variogram of regression residuals is needed; see Ch 12.
3.9 Conditional simulation
3.10 Exercises
- What causes the differences between the mean and the variance of the simulations (left) and the mean and variance obtained by kriging (right)?
- When comparing (above) the sample variogram of simulated fields with the variogram model used to simulate them, where do you see differences? Can you explain why these are not identical, or hypothesize under which circumstances these would become (more) identical?
- Under which practical data analysis problem would conditional simulations be more useful than the kriging prediction + kriging variance maps?