Design-based and model-based inference and spatial sampling

EO4Statistics meeting, Nov 12 2020

References

Most of what follows is based on the following references:

De Gruijter, J. J., & Ter Braak, C. J. F. (1990). Model-free estimation from spatial samples: a reappraisal of classical sampling theory. Mathematical geology, 22(4), 407-415.
Brus, D. J., & De Gruijter, J. J. (1997). Random sampling or geostatistical modelling? Choosing between design-based and model-based sampling strategies for soil (with discussion). Geoderma, 80(1-2), 1-44.
De Gruijter, J., Brus, D. J., Bierkens, M. F., & Knotters, M. (2006). Sampling for natural resource monitoring. Springer Science & Business Media.

What is the proportion of \(p_F\) of forest in the area of interest (box)?

Given we can plan the sampling, how should we do that?
Given a sample has already been taken, what are our options to estimate \(p_F\)?

Sample mean (n = 50):

(m = mean(v$f == "Forest"))

## [1] 0.22

95% CI using normal approximation (CLT, n large, \(m\) not too close to 0 or 1):

se = sqrt(m * (1 - m) / n)
m + qnorm(c(.025,.975)) * se

## [1] 0.105 0.335

True mean (fraction over this whole single image):

## population_mean 
##           0.357

(we use small \(s\) for a fixed variable, capital \(S\) for a random variable)

it is a design-based estimate: observations on a variable \(z\) at locations \(s\) are considered as coming from a random sampling process, \(z(S)\), with \(S\) randomly sampled from the area of interest \(A\): \(S\) is a random variable, \(z\) is fixed (as is \(z(s)\)).
because of random sampling, any pair \(\{z(S_i),~z(S_j)\}\) are independent, and their separation distance \(|S_i-S_j|\) is again a random variable
the CI means that under repeated sampling (using the same procedure), on average 95 out of 100 cases this CI will include the population parameter \(p_F\).

…

(red line: population mean 0.357)

model-based, we assume \(z(s)\) to be a realisation (sample of size 1) of a superpopulation (random field) \(Z(s)\)
we postulate (assume) a model, e.g. \(Z(s) = m + e(s)\), with \(m\) a spatial constant unknown mean (fixed effect), and \(e(s)\) a spatially correlated random variable (random effect)
For stationary random fields, Correlation of \(Z(s_i)\) and \(Z(s_j)\) are assumed to be correlated as a function of distance: \(\mbox{Cov}(Z(s_i), Z(s_j)) = C(h)\), with \(h=|s_i-s_j|\)
sampling strategy does “not” play a role: can be anything, but ignoring biases leads to disaster, as anywhere in statistics

the boxes show independent realisations of \(Z(s)\), unconditional to any measurements
\(m\) is a property of the whole random field \(Z(s)\), not just the realisation at hand (unconditional to the observations)
it is also in no way spatially restricted to the area of interest
“true” \(m\) is 0.309 (parameter of generating process)

Model-based, \(s\) is no longer random so we can predict \(Z(s_0)\) by \(\hat{Z}(s_0)=\sum_{i=1}^n \lambda_i Z(s_i)\) (with \(\lambda_i\) the kriging weights) for any \(s_0\) (kriging interpolation): estimates probability of \(Z(s_0)\) being forest
This means that we can also predict the area mean \(Z(A)=\frac{1}{|A|}\int_A Z(u)du\), e.g. using the mean of predictions over the area: \[\frac{1}{K}\sum_{j=1}^K \hat{Z}(s_j)\] this is the block kriging prediction; it comes with a block kriging standard error that takes all the correlations between the \(\hat{Z}(s_j)\) into account: block kriging deals with change of support

Left: predictions of \(\hat{p}(Z(s) = \mbox{"Forest"})\), Right: population

Block kriging prediction:

## predicted_mean 
##          0.234

## p.025 p.975 
## 0.177 0.291

Compare to design-based mean estimate:

## estimated_mean 
##           0.22

## p.025 p.975 
## 0.105 0.335

CI’s reflect averaging over the conditional random fields \(Z(s)|z(s_i),i=1.,,,.n\) that:

have mean \(m\) and covariance \(C(h)\)
are conditioned to (“obey the”) sample data, i.e. \(Z(s_i)=z(s_i)\) for all observations \(i\), but are otherwise independent

Model accuracy: 0.94 (Really?)

How was the model validated?
Area average of \(\hat{p}_F\): 0.231; fraction of pixels with \(\hat{p}_F > .5\): 0.244
No sensible standard errors for area average estimates

Design-based methods do not need a model, and are more robust: model-based approaches lean on an assumed model.
Design-based models are robust only if random sampling was used, and inclusion probabilities (sample weights) are known
Model-based approaches make no assumption about sampling, but also have no guard against biased sampling
Model-based approaches can make predictions for any location, and predict means over regions of any shape; design-based only estimate averages over the sampled region(s).
The probability statements differ: design-based CI’s refer to repeating the sampling procedure, model-based CI’s refer to (conditional) super-populations under a fixed sample.
The real fun starts when a design-based strategy is taken using a model-based predicted layer for stratification, and prediction errors are ignored.

R is a language that was developed for data analysis
All numbers and figures in these slides were generated by R (where the commands were mostly suppressed, to avoid clutter):
- this makes them 100% reproducible, and
- a source for further experimentation
- literate programming helps explaining code, computation, and intention
- source of slides is found here
- copy-and-paste errors are ruled out
R has a number (> 16K) of extension packages, many of them for spatial statistics, well maintained, well tested; best seen as an ecosystem.
Installation of extension packages on a variety of hardware (Mac OSX, Windows, or Linux) is relatively painless.