EO4Statistics meeting, Nov 12 2020

References

Most of what follows is based on the following references:

  • De Gruijter, J. J., & Ter Braak, C. J. F. (1990). Model-free estimation from spatial samples: a reappraisal of classical sampling theory. Mathematical geology, 22(4), 407-415.
  • Brus, D. J., & De Gruijter, J. J. (1997). Random sampling or geostatistical modelling? Choosing between design-based and model-based sampling strategies for soil (with discussion). Geoderma, 80(1-2), 1-44.
  • De Gruijter, J., Brus, D. J., Bierkens, M. F., & Knotters, M. (2006). Sampling for natural resource monitoring. Springer Science & Business Media.

What is the proportion of \(p_F\) of forest in the area of interest (box)?

  • Given we can plan the sampling, how should we do that?
  • Given a sample has already been taken, what are our options to estimate \(p_F\)?

95% CI, design-based, simple random sampling

Sample mean (n = 50):

(m = mean(v$f == "Forest"))
## [1] 0.22

95% CI using normal approximation (CLT, n large, \(m\) not too close to 0 or 1):

se = sqrt(m * (1 - m) / n)
m + qnorm(c(.025,.975)) * se
## [1] 0.105 0.335

True mean (fraction over this whole single image):

## population_mean 
##           0.357

What does this confidence interval mean?

(we use small \(s\) for a fixed variable, capital \(S\) for a random variable)

  • it is a design-based estimate: observations on a variable \(z\) at locations \(s\) are considered as coming from a random sampling process, \(z(S)\), with \(S\) randomly sampled from the area of interest \(A\): \(S\) is a random variable, \(z\) is fixed (as is \(z(s)\)).
  • because of random sampling, any pair \(\{z(S_i),~z(S_j)\}\) are independent, and their separation distance \(|S_i-S_j|\) is again a random variable
  • the CI means that under repeated sampling (using the same procedure), on average 95 out of 100 cases this CI will include the population parameter \(p_F\).

Replicates: repeated sampling

CI coverage: repeated sampling

(red line: population mean 0.357)

model-based 95% CI

  • model-based, we assume \(z(s)\) to be a realisation (sample of size 1) of a superpopulation (random field) \(Z(s)\)
  • we postulate (assume) a model, e.g. \(Z(s) = m + e(s)\), with \(m\) a spatial constant unknown mean (fixed effect), and \(e(s)\) a spatially correlated random variable (random effect)
  • For stationary random fields, Correlation of \(Z(s_i)\) and \(Z(s_j)\) are assumed to be correlated as a function of distance: \(\mbox{Cov}(Z(s_i), Z(s_j)) = C(h)\), with \(h=|s_i-s_j|\)
  • sampling strategy does “not” play a role: can be anything, but ignoring biases leads to disaster, as anywhere in statistics

Estimating \(m\)?

  • the boxes show independent realisations of \(Z(s)\), unconditional to any measurements
  • \(m\) is a property of the whole random field \(Z(s)\), not just the realisation at hand (unconditional to the observations)
  • it is also in no way spatially restricted to the area of interest
  • “true” \(m\) is 0.309 (parameter of generating process)

Predicting \(Z(A)\)

  • Model-based, \(s\) is no longer random so we can predict \(Z(s_0)\) by \(\hat{Z}(s_0)=\sum_{i=1}^n \lambda_i Z(s_i)\) (with \(\lambda_i\) the kriging weights) for any \(s_0\) (kriging interpolation): estimates probability of \(Z(s_0)\) being forest
  • This means that we can also predict the area mean \(Z(A)=\frac{1}{|A|}\int_A Z(u)du\), e.g. using the mean of predictions over the area: \[\frac{1}{K}\sum_{j=1}^K \hat{Z}(s_j)\] this is the block kriging prediction; it comes with a block kriging standard error that takes all the correlations between the \(\hat{Z}(s_j)\) into account: block kriging deals with change of support

Required: variogram

Kriging predictions:

Left: predictions of \(\hat{p}(Z(s) = \mbox{"Forest"})\), Right: population

Block kriging for the whole of \(A\)

Block kriging prediction:

## predicted_mean 
##          0.234
## p.025 p.975 
## 0.177 0.291

Compare to design-based mean estimate:

## estimated_mean 
##           0.22
## p.025 p.975 
## 0.105 0.335

What does the block kriging CI refer to?

CI’s reflect averaging over the conditional random fields \(Z(s)|z(s_i),i=1.,,,.n\) that:

  • have mean \(m\) and covariance \(C(h)\)
  • are conditioned to (“obey the”) sample data, i.e. \(Z(s_i)=z(s_i)\) for all observations \(i\), but are otherwise independent

Machine learning example

Model accuracy: 0.94 (Really?)

  • How was the model validated?
  • Area average of \(\hat{p}_F\): 0.231; fraction of pixels with \(\hat{p}_F > .5\): 0.244
  • No sensible standard errors for area average estimates

Comparison design-based / model-based

  1. Design-based methods do not need a model, and are more robust: model-based approaches lean on an assumed model.
  2. Design-based models are robust only if random sampling was used, and inclusion probabilities (sample weights) are known
  3. Model-based approaches make no assumption about sampling, but also have no guard against biased sampling
  4. Model-based approaches can make predictions for any location, and predict means over regions of any shape; design-based only estimate averages over the sampled region(s).
  5. The probability statements differ: design-based CI’s refer to repeating the sampling procedure, model-based CI’s refer to (conditional) super-populations under a fixed sample.
  6. The real fun starts when a design-based strategy is taken using a model-based predicted layer for stratification, and prediction errors are ignored.

Why we use R

  • R is a language that was developed for data analysis
  • All numbers and figures in these slides were generated by R (where the commands were mostly suppressed, to avoid clutter):
    • this makes them 100% reproducible, and
    • a source for further experimentation
    • literate programming helps explaining code, computation, and intention
    • source of slides is found here
    • copy-and-paste errors are ruled out
  • R has a number (> 16K) of extension packages, many of them for spatial statistics, well maintained, well tested; best seen as an ecosystem.
  • Installation of extension packages on a variety of hardware (Mac OSX, Windows, or Linux) is relatively painless.