Slides prepared for the Data Science Summer School, Jul 30, 2021; see also the course page.

Program:

  • 10:00 AM GeoData and Spatial Data Analysis with R (Part I)
  • 12:00 PM Short Break
  • 12:15 PM GeoData and Spatial Data Analysis with R (Part II)
  • 02:15 PM end

Each full hour will have about 45 minutes of lecturing, followed by 15 minutes of Q&A.

Context

  • Why “GeoData and Spatial Data”? The organisation chose that title. I consider “GeoData” and “Spatial Data” equivalent. “Geo” refers to “Earth”, so some spatial data, e.g. astronomical images, images taken through microscopes, or MRI scans, are less Earth-bound and might be better described as “Spatial Data”.
  • Why “Data Analysis?” Because this is a summer school on Data Science.
  • Why with R? Because R is an open source environment made for data analysis and graphics that runs on all platforms. If we used a closed source environment, we would miss the opportunity to reproduce and scrutinize computations, and fail to meet the science goal of Data Science; if we used a general purpose programming language, we would (more) easily end up in package installation hell.
  • Why me? I have been involved in R, and development of R-Spatial packages, for around 20 years, teach this subject on a regular basis, and recently finished the draft of the book Spatial Data Science (with applications in R).

How to use these slides?

These slides were created with R and the R package rmarkdown. The source of these slides is an R-markdown document that can be loaded in R, and executed there (just like a Jupyter notebook, but simpler). To do this, take the following steps:

  1. Go to the GitHub page of this course
  2. Click on the file slides.Rmd
  3. Click on the “Raw” tab
  4. Right-click on that page, “save as”, and save it to a local copy of slides.Rmd
  5. Click (or double-click) on this file, and RStudio should open it, showing the file
  6. In RStudio, click “knit” to recreate the entire rendered document, which runs all the R chunks
  7. For running individual R chunks (notebook “cells”), use the arrows “Run all chunks above”, after which you can use “Run current chunk”
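
Alternatively, the whole document can be rendered from the R console; a minimal sketch, assuming slides.Rmd sits in the current working directory:

# render slides.Rmd to its default output format, running all R chunks:
rmarkdown::render("slides.Rmd")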

Packages needed

All packages needed can be installed with

install.packages(c("gapminder", "gstat", "maps", "mapview",
  "rnaturalearth", "sf", "spatstat", "stars", "stringr", "tidyverse",
  "units", "spdep", "plm", "splm"))

What to expect?

10-11: Spatial data and geometries

  • What is so special about spatial data?
  • How can we represent spatial data in R?
  • What does a coordinate mean?
  • What does coordinate reference system mean?
  • Geometries, measures, predicates, transformers
  • Q&A

11-12: Attributes

  • Support, aggregates
  • gapminder example

12:15-13:15: Data cubes, large datasets

  • gapminder data cube
  • raster data
  • raster time series
  • image data cubes, image collections

13:15-14:15: Models

  • point patterns
  • geostatistical data
  • lattice data

Spatial data and geometries

Special about spatial data

All data are spatial - data comes from observation, and observation needs to happen somewhere and at some time. This makes all data spatial. For a lot of data, however, the observation location expressed in spatial, Earth-bound coordinates is not of prime importance:

  • if a patient undergoes a brain scan, the location of the scanner is not important; the location of the person relative to the scanner is
  • if a person receives a positive COVID-19 test result, the location of testing may not be important for the person’s decision on whether to go into quarantine or not
  • for someone trying to do contact tracing, this person’s location history may be most relevant, however

By spatial data we mean data for which spatial locations (or relations) are known, and for which they play a role in the exploration, analysis or visualisation of the data.

Representing spatial data in R

Several of the simple R data structures can be used to represent spatial data; examples are columns in a data.frame:

head(quakes)
##      lat   long depth mag stations
## 1 -20.42 181.62   562 4.8       41
## 2 -20.62 181.03   650 4.2       15
## 3 -26.00 184.10    42 5.4       43
## 4 -17.97 181.66   626 4.1       19
## 5 -20.42 181.96   649 4.0       11
## 6 -19.68 184.31   195 4.0       12

a matrix

class(volcano)
## [1] "matrix" "array"
volcano[1:5,1:5]
##      [,1] [,2] [,3] [,4] [,5]
## [1,]  100  100  101  101  101
## [2,]  101  101  102  102  102
## [3,]  102  102  103  103  103
## [4,]  103  103  104  104  104
## [5,]  104  104  105  105  105
image(volcano)

numeric vectors for polygon coordinates with NA to separate individual rings:

library(maps)
m = map(regions="Australia", plot = FALSE, fill = TRUE)
pols = cbind(m$x, m$y)
dim(pols)
## [1] 1846    2
head(cbind(m$x, m$y), 10)
##           [,1]      [,2]
##  [1,] 123.5945 -12.42568
##  [2,] 123.5952 -12.43594
##  [3,] 123.5732 -12.43418
##  [4,] 123.5725 -12.42393
##  [5,] 123.5945 -12.42568
##  [6,]       NA        NA
##  [7,] 158.8788 -54.70976
##  [8,] 158.8452 -54.74922
##  [9,] 158.8359 -54.70400
## [10,] 158.8970 -54.50605
map(regions="Australia")

We will shortly see a few more structured approaches to represent spatial data, using simple features, coverages or rasters. First let us look at what coordinates mean.

Coordinates

With coordinates, we usually think of a number measured along a ruler, where the ruler might be an imaginary line: it has an offset (0), a unit (m), and a constant direction. For spatial data we could have two imaginary lines perpendicular to each other, and we call this Cartesian space. Distance between \((x_1,y_1)\) and \((x_2,y_2)\) in Cartesian space is computed by the Euclidean distance: \[\sqrt{(x_1-x_2)^2+(y_1-y_2)^2}\]

The spatial data we just saw above are not of this kind:

  • the raster data (volcano) did not have coordinates
  • the other examples had coordinates that are ANGLES, distances along a circular (or ellipsoidal) arc:

Left: geocentric coordinates (Cartesian, three-dimensional, units metres); Right: spherical/ellipsoidal coordinates (angles, units degrees)

Euclidean distances do not work for ellipsoidal coordinates: one degree longitude at the equator is about 111 km, at the poles it is 0 km.
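
We can verify this with a short sketch; it assumes a recent sf (1.0 or later), where distances between ellipsoidal coordinates are computed on the sphere (using s2):

library(sf)
# two pairs of points, each pair one degree of longitude apart:
eq = st_sfc(st_point(c(0, 0)), st_point(c(1, 0)), crs = "EPSG:4326")   # on the equator
np = st_sfc(st_point(c(0, 89)), st_point(c(1, 89)), crs = "EPSG:4326") # near the north pole
st_distance(eq[1], eq[2]) # roughly 111 km
st_distance(np[1], np[2]) # a small fraction of that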

What does coordinate reference system mean?

“Data are not just numbers, they are numbers with a context” (Cobb & Moore)

Coordinate reference systems provide the context of coordinates:

  • they tell whether the coordinates are ellipsoidal (angles), or derived, projected (Cartesian) coordinates
  • in case they are projected, they detail the kind of projection used, so that the underlying ellipsoidal coordinates can be recovered
  • in any case, they point out which ellipsoidal model (datum) was used.

Knowing this we can

  • convert between projected and unprojected, or to another projection (see the sketch after this list)
  • transform from one datum to another
  • combine the coordinates with any other coordinates that have a coordinate reference system
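
As a sketch of the first point, we can convert an (arbitrarily chosen) longitude/latitude location to projected UTM coordinates:

library(sf)
p = st_sfc(st_point(c(7.6, 51.96)), crs = "EPSG:4326") # ellipsoidal: longitude/latitude, WGS84
st_transform(p, "EPSG:32632")                          # projected: UTM zone 32N, units metres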

Geometries, measures, predicates, transformers

Geometries can be described in many ways. We use the simple feature access specification that focuses on points, lines and polygons; the main types are:

  • points, POINT(0 1)
  • sets of points, MULTIPOINT(0 1,10 8)
  • linestrings, LINESTRING(0 0,1 1)
  • sets of linestrings, MULTILINESTRING((0 0,1 1),(5 5,4 6))
  • polygons, POLYGON((0 0,1 0,1 1,0 0))
  • sets of polygons, MULTIPOLYGON(((0 0,1 0,1 1,0 0)), ((3 3,4 3,4 4,3 3)))
  • combinations (mixes) of these GEOMETRYCOLLECTION(POINT(0 1),LINESTRING(0 0,1 1))

Linestrings are formed by sequences of points, connected by straight lines. Polygons are formed of non-self-intersecting linestrings that form closed rings (first coordinate equals last coordinate).
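
These text encodings (well-known text, WKT) can be parsed directly into geometry objects; a small sketch with sf, also previewing measures, predicates and transformers:

library(sf)
(g = st_as_sfc(c("POINT(0 1)", "LINESTRING(0 0,1 1)", "POLYGON((0 0,1 0,1 1,0 0))")))
st_area(g)          # a measure: zero except for the polygon
st_intersects(g, g) # a predicate: which geometries intersect which?
st_centroid(g)      # a transformer: returns the centroid of each geometry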

An example:

library(sf)
## Linking to GEOS 3.9.0, GDAL 3.2.1, PROJ 7.2.1
(nc = read_sf(system.file("gpkg/nc.gpkg", package="sf")))
## Simple feature collection with 100 features and 14 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## Geodetic CRS:  NAD27
## # A tibble: 100 x 15
##     AREA PERIMETER CNTY_ CNTY_ID NAME  FIPS  FIPSNO CRESS_ID BIR74 SID74 NWBIR74
##    <dbl>     <dbl> <dbl>   <dbl> <chr> <chr>  <dbl>    <int> <dbl> <dbl>   <dbl>
##  1 0.114      1.44  1825    1825 Ashe  37009  37009        5  1091     1      10
##  2 0.061      1.23  1827    1827 Alle… 37005  37005        3   487     0      10
##  3 0.143      1.63  1828    1828 Surry 37171  37171       86  3188     5     208
##  4 0.07       2.97  1831    1831 Curr… 37053  37053       27   508     1     123
##  5 0.153      2.21  1832    1832 Nort… 37131  37131       66  1421     9    1066
##  6 0.097      1.67  1833    1833 Hert… 37091  37091       46  1452     7     954
##  7 0.062      1.55  1834    1834 Camd… 37029  37029       15   286     0     115
##  8 0.091      1.28  1835    1835 Gates 37073  37073       37   420     0     254
##  9 0.118      1.42  1836    1836 Warr… 37185  37185       93   968     4     748
## 10 0.124      1.43  1837    1837 Stok… 37169  37169       85  1612     1     160
## # … with 90 more rows, and 4 more variables: BIR79 <dbl>, SID79 <dbl>,
## #   NWBIR79 <dbl>, geom <MULTIPOLYGON [°]>
library(ggplot2)
ggplot() + geom_sf(data = nc, aes(fill = SID74))

Coverages, rasters

Coverages are tessellations (subdivisions) of space into regions, where every point can be uniquely assigned to a subregion.

With POLYGONS we cannot do that: for two polygons that share a boundary, we cannot tell to which of the two a point on the shared boundary belongs.

Rasters are regular tessellations, which uniquely subdivide space into square or rectangular areas (raster cells, or pixels).

An example of a raster:

library(stars)
## Loading required package: abind
(L7 = read_stars(system.file("tif/L7_ETMs.tif", package = "stars")))
## stars object with 3 dimensions and 1 attribute
## attribute(s):
##              Min. 1st Qu. Median     Mean 3rd Qu. Max.
## L7_ETMs.tif     1      54     69 68.91242      86  255
## dimension(s):
##      from  to  offset delta                       refsys point values x/y
## x       1 349  288776  28.5 UTM Zone 25, Southern Hem... FALSE   NULL [x]
## y       1 352 9120761 -28.5 UTM Zone 25, Southern Hem... FALSE   NULL [y]
## band    1   6      NA    NA                           NA    NA   NULL
ggplot() + geom_stars(data = L7) + facet_wrap(~band) + coord_equal()

or simply:

plot(L7)

Q&A

Attributes, support

Spatial data get more interesting when we consider not only the geometric properties, but also other properties, which we call (feature) attributes.

A crucial consideration here is support. Non-point geometries (lines, polygons) contain essentially an infinite number of points. If we have an attribute of such a geometry, the question is whether the attribute is valid for

  • each point in the geometry (“point support”)
  • the collection of points (“line/area/block support”)

An example of point support: if we have a polygon of a soil map delineating the area with sandy soils, it indicates (up to mapping errors) that every point in the polygon has a sandy soil.

An example of area support: if we have an administrative region, the population density of that region is not a density measure for every point in the area, but an aggregate value relevant for the polygon as a whole.

When we analyse area support data (aggregates), choosing a different set of polygons leads to different results, a phenomenon known as the modifiable areal unit problem (MAUP). Intentionally manipulating polygons such that a particular analysis result is obtained, e.g. in election outcomes, is called gerrymandering.

A common operation on areal data is to generate attributes for a new spatial subdivision, often indicated by

  • upscaling when the new units are larger
  • downscaling when the new units are smaller

When doing this, for spatially extensive variables (e.g. population counts) their sum needs to be preserved; for spatially intensive variables (e.g. population density) their mean needs to be preserved.
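
sf implements this as area-weighted interpolation; a minimal sketch, reusing the nc dataset read above and an arbitrary 10 x 5 target grid:

library(sf)
nc = read_sf(system.file("gpkg/nc.gpkg", package = "sf")) |>
    st_transform("EPSG:32119")                 # NC state plane, so areas are Cartesian
grd = st_make_grid(nc, n = c(10, 5)) |> st_as_sf()
# spatially extensive (counts): sums are preserved
st_interpolate_aw(nc["BIR74"], grd, extensive = TRUE)
# spatially intensive (densities): area-weighted means are computed
st_interpolate_aw(nc["BIR74"], grd, extensive = FALSE)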

gapminder example

From the help of gapminder in package gapminder (by Jenny Bryan) “by continent, which country experienced the sharpest 5-year drop in life expectancy and what was the drop?”

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(gapminder)
gapminder |>
   group_by(continent, country) |>
   select(country, year, continent, lifeExp) |>
   mutate(le_delta = lifeExp - lag(lifeExp)) |>
   summarize(worst_le_delta = min(le_delta, na.rm = TRUE)) |>
   filter(min_rank(worst_le_delta) < 2) |>
   arrange(worst_le_delta)
## `summarise()` has grouped output by 'continent'. You can override using the `.groups` argument.
## # A tibble: 5 x 3
## # Groups:   continent [5]
##   continent country     worst_le_delta
##   <fct>     <fct>                <dbl>
## 1 Africa    Rwanda             -20.4  
## 2 Asia      Cambodia            -9.10 
## 3 Americas  El Salvador         -1.51 
## 4 Europe    Montenegro          -1.46 
## 5 Oceania   Australia            0.170

Now, can we show a map of yearly changes in life expectancy?

Step 1: match the table to country geometries:

library(rnaturalearth)
library(sf)
ne = ne_countries(returnclass = "sf")
# Try to match country names:
m = match(gapminder$country, ne$admin)
(u = unique(gapminder$country[is.na(m)]))
##  [1] Bahrain               Comoros               Congo, Dem. Rep.     
##  [4] Congo, Rep.           Cote d'Ivoire         Guinea-Bissau        
##  [7] Hong Kong, China      Korea, Dem. Rep.      Korea, Rep.          
## [10] Mauritius             Reunion               Sao Tome and Principe
## [13] Serbia                Singapore             Slovak Republic      
## [16] Tanzania              United States         West Bank and Gaza   
## [19] Yemen, Rep.          
## 142 Levels: Afghanistan Albania Algeria Angola Argentina Australia ... Zimbabwe

So this is pretty ugly. We can solve it the hard way, by manually sorting out the mismatches:

n = as.character(ne$admin)
(conv = tribble(
        ~old, ~new, 
        u[3], n[34],
        u[4], n[35],
        u[5], n[32],
        u[6], n[63],
        u[8], n[130],
        u[9], n[88],
        u[13], n[148],
        u[15], n[150],
        u[16], n[165],
        u[17], n[169],
        u[19], n[174]
))
## # A tibble: 11 x 2
##    old              new                             
##    <fct>            <chr>                           
##  1 Congo, Dem. Rep. Democratic Republic of the Congo
##  2 Congo, Rep.      Republic of Congo               
##  3 Cote d'Ivoire    Ivory Coast                     
##  4 Guinea-Bissau    Guinea Bissau                   
##  5 Korea, Dem. Rep. North Korea                     
##  6 Korea, Rep.      South Korea                     
##  7 Serbia           Republic of Serbia              
##  8 Slovak Republic  Slovakia                        
##  9 Tanzania         United Republic of Tanzania     
## 10 United States    United States of America        
## 11 Yemen, Rep.      Yemen
library(stringr)
repl = conv$new
names(repl) = conv$old
gapminder$ncountry =
  str_replace_all(gapminder$country, repl)
m = match(gapminder$ncountry, ne$admin)
(un = unique(gapminder$country[is.na(m)]))
## [1] Bahrain               Comoros               Hong Kong, China     
## [4] Mauritius             Reunion               Sao Tome and Principe
## [7] Singapore             West Bank and Gaza   
## 142 Levels: Afghanistan Albania Algeria Angola Argentina Australia ... Zimbabwe

reducing the number of unmatched countries from 19 to 8.

Now we can do the big trick, and add a geometry column to gapminder:

gapminder_sf = st_sf(gapminder, geometry = ne$geometry[m])
gapminder_sf$geometry
## Geometry set for 1704 features  (with 96 geometries empty)
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -171.7911 ymin: -55.61183 xmax: 178.5171 ymax: 83.23324
## CRS:           +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## First 5 geometries:
## MULTIPOLYGON (((61.21082 35.65007, 62.23065 35....
## MULTIPOLYGON (((61.21082 35.65007, 62.23065 35....
## MULTIPOLYGON (((61.21082 35.65007, 62.23065 35....
## MULTIPOLYGON (((61.21082 35.65007, 62.23065 35....
## MULTIPOLYGON (((61.21082 35.65007, 62.23065 35....
sum(is.na(m))
## [1] 96

We see that the NA values in m have been filled with empty geometries. Let’s look at the geometries:

st_geometry(gapminder_sf) |> plot()

Oops! We didn’t see that from the table above! Also the question didn’t mention “of all countries with complete time series”…

The help from gapminder (can you find that?) says: “The supplemental data frame ‘gapminder_unfiltered’ was not filtered on ‘year’ or for complete data and has 3313 rows.” Maybe we should use that?

gapminder_unfiltered$ncountry =
  str_replace_all(gapminder_unfiltered$country, repl)
m = match(gapminder_unfiltered$ncountry, ne$admin)
(un = unique(gapminder_unfiltered$country[is.na(m)]))
##  [1] Aruba                 Bahamas               Bahrain              
##  [4] Barbados              Cape Verde            Comoros              
##  [7] French Guiana         French Polynesia      Grenada              
## [10] Guadeloupe            Hong Kong, China      Macao, China         
## [13] Maldives              Malta                 Martinique           
## [16] Mauritius             Micronesia, Fed. Sts. Netherlands Antilles 
## [19] Reunion               Samoa                 Sao Tome and Principe
## [22] Singapore             Timor-Leste           Tonga                
## [25] West Bank and Gaza   
## 187 Levels: Afghanistan Albania Algeria Angola Argentina Armenia ... Zimbabwe

so there are now 25 rather than 8 countries we can’t match to natural earth polygons, and

gapminder_u_sf = st_sf(gapminder_unfiltered, geometry = ne$geometry[m])
gapminder_u_sf$geometry
## Geometry set for 3313 features  (with 215 geometries empty)
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -180 ymin: -55.61183 xmax: 180 ymax: 83.23324
## CRS:           +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## First 5 geometries:
## MULTIPOLYGON (((61.21082 35.65007, 62.23065 35....
## MULTIPOLYGON (((61.21082 35.65007, 62.23065 35....
## MULTIPOLYGON (((61.21082 35.65007, 62.23065 35....
## MULTIPOLYGON (((61.21082 35.65007, 62.23065 35....
## MULTIPOLYGON (((61.21082 35.65007, 62.23065 35....
sum(is.na(m))
## [1] 215
st_geometry(gapminder_u_sf) |> plot()

looks much better. Which other countries are now included, but absent from gapminder?

setdiff(unique(gapminder_unfiltered$ncountry), unique(gapminder$ncountry))
##  [1] "Armenia"               "Aruba"                 "Azerbaijan"           
##  [4] "Bahamas"               "Barbados"              "Belarus"              
##  [7] "Belize"                "Bhutan"                "Brunei"               
## [10] "Cape Verde"            "Cyprus"                "Estonia"              
## [13] "Fiji"                  "French Guiana"         "French Polynesia"     
## [16] "Georgia"               "Grenada"               "Guadeloupe"           
## [19] "Guyana"                "Kazakhstan"            "Latvia"               
## [22] "Lithuania"             "Luxembourg"            "Macao, China"         
## [25] "Maldives"              "Malta"                 "Martinique"           
## [28] "Micronesia, Fed. Sts." "Moldova"               "Netherlands Antilles" 
## [31] "New Caledonia"         "Papua New Guinea"      "Qatar"                
## [34] "Russia"                "Samoa"                 "Solomon Islands"      
## [37] "Suriname"              "Tajikistan"            "Timor-Leste"          
## [40] "Tonga"                 "Turkmenistan"          "Ukraine"              
## [43] "United Arab Emirates"  "Uzbekistan"            "Vanuatu"

We have 45 countries more than in gapminder, meaning that we now have 45 - 25 + 8 = 28 countries more matched with polygons.

We could now, for instance, create a map for the last measured year:

gapminder_u_sf |> 
        filter(year == 2007) |> 
        select(lifeExp) |> 
        plot()

where South Sudan appears to be missing (it gained independence in 2011).

Assessing missingness from plotting is a good check to find big ones (Russia), but not at all sufficient, since many countries are small. How many countries go invisible on a plot like the one above? Note that we plot country borders with thin lines (one pixel). Suppose that for a country to be visible with color it must span at least, say, 3 x 3 pixels. We can obtain the device size in pixels by

dev.size("px")

and let’s say it is 1000 by 1000. Then, the size of a single pixel (when we do not have margins around the plot) at the equator is roughly

library(units)
## udunits database from /usr/share/xml/udunits/udunits2.xml
(pix_size = 2 * pi * set_units(s2::s2_earth_radius_meters(), m) / 1000)
## 40030.24 [m]

and the number of Natural Earth countries with an area smaller than 9 equator pixels is

a = st_make_valid(ne) |> # the Natural Earth dataset is invalid on the sphere
    st_area() |> 
    na.omit()
sum(a < 9 * pix_size^2)
## [1] 15

If we show the matrix of lifeExp by country and year,

co = as.factor(gapminder_u_sf$ncountry)
minyear = min(gapminder_u_sf$year)
ye = gapminder_u_sf$year - minyear + 1
m = matrix(NA, length(levels(co)), length(unique(ye)))
m[cbind(co,ye)] = gapminder_u_sf$lifeExp
image(x = sort(unique(ye)) + minyear - 1, z = t(m))

where we see a clear pattern of (more) valid values for every year ending in 2 or 7. We can show maps for each of those years by:

sel = gapminder_u_sf$year %% 5 == 2 # 2002, 2007 etc
ggplot() + geom_sf(data = gapminder_u_sf[sel,], aes(fill = lifeExp)) +
        facet_wrap(~year)

ggplot() + geom_sf(data = gapminder_sf, aes(fill = lifeExp)) +
        facet_wrap(~year)

raster data

Raster data are used for various phenomena, but are particularly suitable for phenomena that vary continuously over space, such as elevation, temperature, air quality, and so on. Imagery, when georeferenced, is also provided as raster data.

Raster data are usually regular, but may come in other forms. Converting (projecting, reprojecting, unprojecting) may cause curvilinear grids; regridding these is called warping, which may cause some data loss and is not invertible.
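
A minimal sketch of warping with stars, reusing the Landsat image read above: we regrid it from its native UTM grid to a regular grid in ellipsoidal coordinates.

library(stars)
L7 = read_stars(system.file("tif/L7_ETMs.tif", package = "stars"))
L7_ll = st_warp(L7, crs = st_crs("EPSG:4326")) # resamples cell values; lossy, not invertible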

Color imagery comes with three bands; satellite imagery, as shown above, typically has more than three bands.

raster time series

Video is a time sequence of RGB images; satellite images also come as repeated time series, but at much lower temporal resolution. Weather data (“nowcasts”), weather forecasts, climate predictions, and land or sea surface temperatures are all created using models, resulting in raster time series. Here is an example:

w = system.file("nc/bcsd_obs_1999.nc", package = "stars") |>
    read_stars()
## pr, tas,
w
## stars object with 3 dimensions and 2 attributes
## attribute(s):
##                 Min.   1st Qu.   Median      Mean   3rd Qu.      Max. NA's
## pr [mm/m]  0.5900000 56.139999 81.88000 101.26433 121.07250 848.54999 7116
## tas [C]   -0.4209678  8.898887 15.65763  15.48932  21.77979  29.38581 7116
## dimension(s):
##      from to offset  delta  refsys point                    values x/y
## x       1 81    -85  0.125      NA    NA                      NULL [x]
## y       1 33 37.125 -0.125      NA    NA                      NULL [y]
## time    1 12     NA     NA POSIXct    NA 1999-01-31,...,1999-12-31

where we see monthly data of precipitation (and temperature). We also see that the x and y dimensions have regular intervals, but time does not. From the data set alone it is also not clear whether monthly total precipitation (and mean temperature) are given, or values for the date shown (most likely the former; this is the temporal support).

We can plot these data, and add the state boundary for reference, by:

nc = read_sf(system.file("gpkg/nc.gpkg", package="sf")) |>
    st_geometry()
nc.u = st_union(nc)
hook = function() plot(st_geometry(nc.u), add = TRUE, border = 'orange')
plot(w, hook = hook)

or by using ggplot2:

ggplot() + geom_stars(data = w) + 
        geom_sf(data = st_cast(nc.u, "MULTILINESTRING"), col = 'orange') + 
        facet_wrap(~as.Date(time))

image data cubes, image collections

Higher-dimensional data cubes are for instance obtained from remote sensing imagery, where dimensions include x, y (raster), band (color / wavelength) and time. Satellite imagery for larger areas is often distributed as images (tiles) in different coordinate reference systems, and with the times of the satellite overpass. This means that warping and aggregation are required to create a regular data cube out of such image collections. A recent blog entry on how to do this (reproducibly) is found here.

Although much of this data is free-for-download, analysing it over a large area (even at low resolution) and/or over a longer time span quickly makes the download option unfeasible. Using cloud platforms that already serve (fast) access to the complete data archives is then the only alternative.

Q&A

Models

The goal of data analysis often goes beyond descriptive statistics and exploratory graphs, and may involve

  • inference (estimation) of e.g. relationships between variables, or properties of the process that most likely generated the data
  • prediction of unobserved values (e.g. missing values in gapminder, spatial and/or temporal interpolation) from observations
  • predicting properties that were not directly, or fully, observed (e.g. predicting land cover or land use from satellite images and ground truth observations)

Problems special to spatial data often feed back to the sampling process and the goals:

  1. Can analysis proceed design-based? Yes, if some form of random sampling was used to collect the data, inclusion probabilities are known, and statistics for the sampling units are required (and not e.g. spatial interpolation between samples)
  2. If the answer is “no”, analysis proceeds model-based, and if the data are spatially correlated, this correlation needs to be addressed in the analysis.

“Classical” modelling approaches (e.g. linear regression, but also most machine learning approaches) assume IID data: Independent, Identically Distributed. If analysis proceeds “model-based”, the data are typically not independent. If sampling is biased, observations are also not identically distributed.

Point patterns

Wind turbines in Germany: e.g. from here we can download the wind turbines as a GeoJSON file, which we will assume resides in the directory ~/Downloads

wt_sf = read_sf("~/Downloads/_Onshore_Windkraftanlagen_in_Deutschland.geojson")
library(rnaturalearth)
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ tibble  3.1.2     ✔ purrr   0.3.4
## ✔ tidyr   1.1.3     ✔ forcats 0.5.1
## ✔ readr   1.4.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ purrr::map()    masks maps::map()
de = ne_countries(scale = 50, returnclass = "sf") |>
    filter(admin == "Germany") 
plot(st_geometry(de), border = 'red')
plot(st_geometry(wt_sf), add = TRUE, cex = .5)

If we create an interactive plot for this dataset with mapview, using

library(mapview)
mapviewOptions(fgb = FALSE) # not needed locally
mapview(wt_sf)
## Warning in validateCoords(lng, lat, funcName): Data contains 14 rows with either
## missing or invalid lat/lon values and will be ignored

we can zoom and click on points to see their attributes. We also see that different color tones indicate overplotted, identical points; there is no way of telling how many points coincide at a single location.

We can create a ppp object from this by

library(spatstat)
## Loading required package: spatstat.data
## Loading required package: spatstat.geom
## spatstat.geom 2.2-0
## Loading required package: spatstat.core
## Loading required package: nlme
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
## 
##     collapse
## Loading required package: rpart
## spatstat.core 2.2-0
## Loading required package: spatstat.linnet
## spatstat.linnet 2.2-1
## 
## spatstat 2.2-0       (nickname: 'That's not important right now') 
## For an introduction to spatstat, type 'beginner'
pp = st_geometry(wt_sf)
window = st_geometry(de) 
wt = as.ppp(c(window, pp))
## Error: Only projected coordinates may be converted to spatstat class objects

This won’t work; we’ll have to project these data first, e.g. to UTM zone 32N. We also need to remove points with missing (empty) geometries.

crs = st_crs("EPSG:32632")
pp = st_transform(pp, crs)[!st_is_empty(pp)]
window = st_transform(window, crs)
wt = as.ppp(c(window, pp))
## Warning: 70 points were rejected as lying outside the specified window
## Warning: data contain duplicated points
plot(density(wt, bw = "SJ"))
plot(window, add = TRUE)

Of course, this density depends on the bandwidth, and we can create rougher or smoother versions by setting it:

plot(density(wt, sigma = 1e4))
plot(window, add = TRUE)

plot(density(wt, sigma = 1e6))
plot(window, add = TRUE)

Second order moments

The K function counts for a varying window size \(r\) how many additional points are found within a window of radius \(r\) centered at each point, and compares that to a completely spatially random (CSR, Poisson) process.
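
For reference: under CSR with intensity \(\lambda\), the expected number of additional points within distance \(r\) of a typical point is \(\lambda K(r)\) with \[K(r) = \pi r^2,\] so estimates above this curve indicate clustering (attraction), estimates below it regularity (repulsion).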

plot(Kest(wt))
## number of data points exceeds 3000 - computing border correction estimate only

One can compute an envelope around the CSR model (the evaluation of which is omitted here because of excessive runtime)

plot(envelope(wt, Kest))

Another measure is the G-function, which estimates the nearest neighbour distance distribution function from a point pattern.
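
For reference: under CSR with intensity \(\lambda\), the nearest neighbour distance distribution function is \[G(r) = 1 - e^{-\lambda \pi r^2},\] so values above this curve at small distances indicate clustering.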

plot(Gest(wt))

We see that there is a clear point mass at distance zero, indicating many duplicates. We can remove duplicates using unique(), and see a strong difference:

plot(Gest(unique(wt)))

For this function, the envelope is quickly computed:

plot(envelope(unique(wt), Gest))
## Generating 99 simulations of CSR  ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
## 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80,
## 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98,  99.
## 
## Done.

and indicates we have a significant deviation from CSR (Poisson, theoretical), with nearest neighbour distances much more abundant than expected (clustering, attraction).

Geostatistical data

Geostatistical analysis is often focused on the interpolation of a spatially continuous variable (a field) from scattered point observations. Consider the wind turbine data from the previous section,

wt_sf$Power = as.numeric(gsub(",", ".", gsub("\\.", "", wt_sf$Gesamtleistung__MW_)))
## Warning: NAs introduced by coercion
plot(wt_sf["Power"], pch = 16, logz = TRUE, reset = FALSE)
plot(st_geometry(de), add = TRUE)

then we see scattered values, which we could theoretically interpolate, but the outcome would be meaningless: at unobserved locations there are no wind turbines, and the power of a turbine is not an indication of e.g. wind potential (as the turbines all differ in size).

An example of a data set we could meaningfully interpolate is about air quality, with annual mean NO2 concentrations, measured at rural background stations. The data are from package gstat, but from a not-yet-released version, so we read them directly from GitHub:

url = "https://raw.githubusercontent.com/r-spatial/gstat/master/inst/external/no2.csv"
no2.sf = read.csv(url) |> 
    st_as_sf(coords = c("station_longitude_deg", "station_latitude_deg"),
             crs = st_crs(4326))
plot(no2.sf["NO2"], pch = 16, reset = FALSE)
plot(st_geometry(de), add = TRUE)

library(gstat)
## 
## Attaching package: 'gstat'
## The following object is masked from 'package:spatstat.core':
## 
##     idw
v = variogram(NO2~1, no2.sf)
v.fit = fit.variogram(v, vgm(1, "Exp", 100, 1))
plot(v, v.fit, plot.numbers = TRUE, xlab = "distance (km)")
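
For reference: the sample variogram plotted here estimates, for separation distances \(h\), the semivariance \[\gamma(h) = \frac{1}{2} E\left[(Z(s) - Z(s+h))^2\right],\] to which we fitted a parametric model (here: exponential, with a nugget).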

library(stars)
grd = st_as_stars(st_bbox(de))
st_crs(grd) = st_crs(no2.sf)
k = krige(NO2~1, no2.sf, grd, v.fit)
## [using ordinary kriging]
plot(k["var1.pred"], breaks = "equal", reset = FALSE, axes = TRUE)
plot(st_geometry(de), add = TRUE, border = 'orange')
plot(st_geometry(no2.sf), col = 'green', add = TRUE, pch = 3)

Note that this procedure used ellipsoidal coordinates all the way, and computed distances correctly nevertheless. In many cases, it is more convenient (or appropriate) to carry out geostatistical analyses using projected coordinates:

no2.sf_utm = st_transform(no2.sf, st_crs("EPSG:32632"))
de_utm = st_transform(de, st_crs("EPSG:32632"))
v = variogram(NO2~1, no2.sf_utm)
v.fit = fit.variogram(v, vgm(1, "Exp", 10000, 1))
plot(v, v.fit, plot.numbers = TRUE, xlab = "distance (m)")

grd_utm = st_as_stars(st_bbox(de_utm))
k = krige(NO2~1, no2.sf_utm, grd_utm, v.fit)
## [using ordinary kriging]
plot(k["var1.pred"], breaks = "equal", reset = FALSE, axes = TRUE)
plot(st_geometry(de_utm), add = TRUE, border = 'orange')
plot(st_geometry(no2.sf_utm), col = 'green', add = TRUE, pch = 3)

We can of course crop the map to the area of Germany

plot(st_crop(k,de_utm)["var1.pred"], reset = FALSE)
plot(st_geometry(de_utm), add = TRUE, border = 'orange')
plot(st_geometry(no2.sf_utm), col = 'green', add = TRUE, pch = 3)

but in general field variables like air quality don’t “stop” at administrative boundaries.

Lattice data

The following example was taken from Chapters 14-17 of SDS.

The analysed variable is the first-round turnout (the proportion of registered voters who voted) in municipalities and Warsaw boroughs in the 2015 Polish presidential election.

# install.packages("spDataLarge", repos = "https://nowosad.github.io/drat/", type = "source")
data(pol_pres15, package = "spDataLarge")
pol_pres15 |>
    subset(select = c(TERYT, name, types)) |>
    head()
## Simple feature collection with 6 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 235157.1 ymin: 366913.3 xmax: 281431.7 ymax: 413016.6
## Projected CRS: ETRS89 / Poland CS92
##    TERYT                name       types                       geometry
## 1 020101         BOLESŁAWIEC       Urban MULTIPOLYGON (((261089.5 38...
## 2 020102         BOLESŁAWIEC       Rural MULTIPOLYGON (((254150 3837...
## 3 020103            GROMADKA       Rural MULTIPOLYGON (((275346 3846...
## 4 020104        NOWOGRODZIEC Urban/rural MULTIPOLYGON (((251769.8 37...
## 5 020105          OSIECZNICA       Rural MULTIPOLYGON (((263423.9 40...
## 6 020106 WARTA BOLESŁAWIECKA       Rural MULTIPOLYGON (((267030.7 38...
library(tmap)
tm_shape(pol_pres15) + tm_fill("types")

We can compute Moran’s I, a measure of spatial autocorrelation, once we have established neighbourhood relationships:

library(spdep)
## Loading required package: sp
## Loading required package: spData
pol_pres15 |> poly2nb(queen = TRUE) -> nb_q
nb_q |> nb2listw(style = "B") -> lw_q_B
glance_htest <- function(ht) c(ht$estimate,
    "Std deviate" = unname(ht$statistic),
    "p.value" = unname(ht$p.value))
(pol_pres15 |>
        st_drop_geometry() |>
        subset(select = I_turnout, drop = TRUE) -> z) |>
    moran.test(listw = lw_q_B, randomisation = FALSE) |>
    glance_htest()
## Moran I statistic       Expectation          Variance       Std deviate 
##      0.6914339743     -0.0004009623      0.0001400449     58.4613487704 
##           p.value 
##      0.0000000000
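
For reference: with spatial weights \(w_{ij}\) and \(z_i = x_i - \bar{x}\), Moran’s I is \[I = \frac{n}{\sum_{i,j} w_{ij}} \cdot \frac{\sum_{i,j} w_{ij} z_i z_j}{\sum_i z_i^2},\] and values above its expectation \(-1/(n-1)\) indicate positive spatial autocorrelation.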

The book chapter (15) then further details this into local Moran’s I tests.

Spatial regression models are classical regression models extended with effects that take care of the spatial nature of the data, using either

  • a spatially correlated residual (e.g. characterised by a single correlation coefficient for first-order neighbours)
  • a spatial autoregression model, where the (mean of the) responses in neighbouring polygons is added as a regressor (see the formulas after this list)
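
In formulas, with a spatial weight matrix \(W\) and following the parameter naming of the splm output below: the spatially correlated residual (error) model is \[y = X\beta + u, \quad u = \rho W u + \varepsilon,\] and the spatial autoregression (lag) model adds \(\lambda W y\) as a regressor: \[y = \lambda W y + X\beta + \varepsilon.\]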

Chapters 16 and 17 of SDS give examples of several of the usual approaches for this.

With spatial time series, such as in gapminder, panel linear models can be considered,

library(plm)
## 
## Attaching package: 'plm'
## The following objects are masked from 'package:dplyr':
## 
##     between, lag, lead
plm(log(lifeExp) ~ log(gdpPercap), gapminder, index = c("country", "year")) |>
    summary()
## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = log(lifeExp) ~ log(gdpPercap), data = gapminder, 
##     index = c("country", "year"))
## 
## Balanced Panel: n = 142, T = 12, N = 1704
## 
## Residuals:
##        Min.     1st Qu.      Median     3rd Qu.        Max. 
## -5.6776e-01 -4.6444e-02 -9.7224e-05  5.9601e-02  2.9937e-01 
## 
## Coefficients:
##                 Estimate Std. Error t-value  Pr(>|t|)    
## log(gdpPercap) 0.1640736  0.0058029  28.274 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    25.268
## Residual Sum of Squares: 16.71
## R-Squared:      0.33868
## Adj. R-Squared: 0.27852
## F-statistic: 799.442 on 1 and 1561 DF, p-value: < 2.22e-16

but these entirely ignore spatial (neighbourhood) effects. For spatial panel linear models we follow Millo & Piras (2012); we must first create a spatial weight matrix, which is row-standardised:

gm = gapminder_sf |> filter(year == 2002) # or any other
wm = st_intersects(gm, gm) |> 
    as.matrix()
diag(wm) = 0 # remove self-intersections
dimnames(wm) = list(gm$country, gm$country)
# row-standardise:
wm_2 = apply(wm, 1, function(x) { s = sum(x); if (s == 0) 0*x else x/(sum(x)) }) |>
    t()

We need to remove countries with zero neighbours:

wm_nn = which(apply(wm_2, 1, sum) == 0)
gmlistw = mat2listw(wm_2[-wm_nn, -wm_nn])
gapminder_nn = gapminder |> 
    filter(!country %in% names(wm_nn)) |>
    select(-continent) # so that column 1 and 2 are individual, time

and can then fit the simplest model by

library(splm)
spml(log(lifeExp) ~ log(gdpPercap), gapminder_nn, listw = gmlistw) |>
    summary()
## Warning: Function eigen_setup moved to the spatialreg package
## Warning: Function do_ldet moved to the spatialreg package
## (… last warning repeated 15 more times)
## Spatial panel fixed effects error model
##  
## 
## Call:
## spml(formula = log(lifeExp) ~ log(gdpPercap), data = gapminder_nn, 
##     listw = gmlistw)
## 
## Residuals:
##       Min.    1st Qu.     Median    3rd Qu.       Max. 
## -0.5538648 -0.0612555  0.0087088  0.0754276  0.3143790 
## 
## Spatial error parameter:
##      Estimate Std. Error t-value  Pr(>|t|)    
## rho 0.8420572  0.0099066      85 < 2.2e-16 ***
## 
## Coefficients:
##                 Estimate Std. Error t-value  Pr(>|t|)    
## log(gdpPercap) 0.0344993  0.0042252   8.165 3.213e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

and a model that also has a spatial autoregressive component by

spml(log(lifeExp) ~ log(gdpPercap), gapminder_nn, listw = gmlistw,
        model = "random", lag = TRUE) |>
    summary()
## Warning in if (class(covTheta) == "try-error") {: the condition has length > 1
## and only the first element will be used
## ML panel with spatial lag, random effects, spatial error correlation 
## 
## Call:
## spreml(formula = formula, data = data, index = index, w = listw2mat(listw), 
##     w2 = listw2mat(listw2), lag = lag, errors = errors, cl = cl)
## 
## Residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.55    3.21    3.40    3.36    3.53    3.64 
## 
## Error variance parameters:
##      Estimate Std. Error t-value  Pr(>|t|)    
## phi  2.996150   0.431952  6.9363 4.025e-12 ***
## rho -0.217109   0.052232 -4.1566 3.230e-05 ***
## 
## Spatial autoregressive coefficient:
##        Estimate Std. Error t-value  Pr(>|t|)    
## lambda 0.831166   0.014322  58.033 < 2.2e-16 ***
## 
## Coefficients:
##                Estimate Std. Error t-value  Pr(>|t|)    
## (Intercept)    0.337909   0.021871   15.45 < 2.2e-16 ***
## log(gdpPercap) 0.041998   0.002530   16.60 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note that we did not further examine whether this model, or the transformations used, make any sense; this is purely meant as an illustration of how to get spml to work.

Q&A

save.image()
sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] splm_1.4-11           plm_2.4-1             spdep_1.1-8          
##  [4] spData_0.3.10         sp_1.4-5              tmap_3.3-2           
##  [7] gstat_2.0-8           spatstat_2.2-0        spatstat.linnet_2.2-1
## [10] spatstat.core_2.2-0   rpart_4.1-15          nlme_3.1-152         
## [13] spatstat.geom_2.2-0   spatstat.data_2.1-0   mapview_2.10.0       
## [16] forcats_0.5.1         purrr_0.3.4           readr_1.4.0          
## [19] tidyr_1.1.3           tibble_3.1.2          tidyverse_1.3.1      
## [22] units_0.7-2           stringr_1.4.0         rnaturalearth_0.1.0  
## [25] gapminder_0.3.0       dplyr_1.0.7           stars_0.5-4          
## [28] abind_1.4-5           ggplot2_3.3.5         sf_1.0-3             
## [31] maps_3.3.0           
## 
## loaded via a namespace (and not attached):
##   [1] spam_2.6-0              readxl_1.3.1            uuid_0.1-4             
##   [4] backports_1.2.1         miscTools_0.6-26        systemfonts_1.0.2      
##   [7] lwgeom_0.2-7            splines_4.1.0           crosstalk_1.1.1        
##  [10] leaflet_2.0.4.1         digest_0.6.27           leafpop_0.1.0          
##  [13] htmltools_0.5.1.1       ibdreg_0.3.1            gdata_2.18.0           
##  [16] leaflet.providers_1.9.0 fansi_0.5.0             magrittr_2.0.1         
##  [19] tensor_1.5              modelr_0.1.8            gmodels_2.18.1         
##  [22] sandwich_3.0-1          xts_0.12.1              svglite_2.0.0          
##  [25] bdsmatrix_1.3-4         spatstat.sparse_2.0-0   colorspace_2.0-2       
##  [28] rvest_1.0.0             rbibutils_2.2.1         haven_2.4.1            
##  [31] xfun_0.24               rgdal_1.5-23            leafem_0.1.6           
##  [34] crayon_1.4.1            jsonlite_1.7.2          brew_1.0-6             
##  [37] zoo_1.8-9               glue_1.4.2              polyclip_1.10-0        
##  [40] gtable_0.3.0            webshot_0.5.2           maxLik_1.4-8           
##  [43] scales_1.1.1            DBI_1.1.1               Rcpp_1.0.7             
##  [46] viridisLite_0.4.0       proxy_0.4-26            dotCall64_1.0-1        
##  [49] Formula_1.2-4           stats4_4.1.0            intervals_0.15.2       
##  [52] htmlwidgets_1.5.3       httr_1.4.2              FNN_1.1.3              
##  [55] RColorBrewer_1.1-2      wk_0.4.1                ellipsis_0.3.2         
##  [58] pkgconfig_2.0.3         XML_3.99-0.6            farver_2.1.0           
##  [61] sass_0.4.0              dbplyr_2.1.1            deldir_0.2-10          
##  [64] utf8_1.2.1              tidyselect_1.1.1        labeling_0.4.2         
##  [67] rlang_0.4.11            tmaptools_3.1-1         spatialreg_1.1-8       
##  [70] munsell_0.5.0           cellranger_1.1.0        tools_4.1.0            
##  [73] cli_2.5.0               generics_0.1.0          broom_0.7.8            
##  [76] evaluate_0.14           fftwtools_0.9-11        yaml_2.2.1             
##  [79] goftest_1.2-2           leafsync_0.1.0          knitr_1.33             
##  [82] fs_1.5.0                s2_1.0.6                satellite_1.0.2        
##  [85] xml2_1.3.2              compiler_4.1.0          rstudioapi_0.13        
##  [88] png_0.1-7               e1071_1.7-7             spatstat.utils_2.2-0   
##  [91] reprex_2.0.0            spacetime_1.2-5         bslib_0.2.5.1          
##  [94] stringi_1.6.2           highr_0.9               ps_1.6.0               
##  [97] rgeos_0.5-5             lattice_0.20-44         Matrix_1.3-4           
## [100] classInt_0.4-3          vctrs_0.3.8             LearnBayes_2.15.1      
## [103] pillar_1.6.1            lifecycle_1.0.0         lmtest_0.9-38          
## [106] Rdpack_2.1.2            jquerylib_0.1.4         raster_3.4-13          
## [109] R6_2.5.0                KernSmooth_2.23-20      spDataLarge_0.5.4      
## [112] codetools_0.2-18        dichromat_2.0-0         gtools_3.9.2           
## [115] MASS_7.3-54             boot_1.3-28             assertthat_0.2.1       
## [118] withr_2.4.2             expm_0.999-6            mgcv_1.8-36            
## [121] parallel_4.1.0          hms_1.1.0               grid_4.1.0             
## [124] coda_0.19-4             rnaturalearthdata_0.1.0 class_7.3-19           
## [127] rmarkdown_2.9           lubridate_1.7.10        base64enc_0.1-3