Slides prepared for the Data Science Summer School, Jul 30, 2021; see also the course page.
Program:
Each full hour will have about 45 minutes lecturing, followed by 15 mins Q&A.
These slides were created with R and the R package rmarkdown
. The source of these slides is an R-markdown document that can be loaded in R, and executed there (just like a Jupyter notebook, but simpler). To do this, take the following steps:
slides.Rmd
slides.Rmd
All packages needed can be installed with
install.packages(c("gapminder", "gstat", "maps", "mapview",
"rnaturalearth", "sf", "spatstat", "stars", "stringr", "tidyverse",
"units", "spdep", "plm", "splm"))
10-11: Spatial data and geometries
11-12: Attributes
gapminder
example12:15-13:15: Data cubes, large datasets
gapminder
data cube13:15-14:15: Models
All data are spatial - data comes from observation, and observation needs to happen somewhere and at some time. This makes all data spatial. For a lot of data, the location expressed in spatial, Earth-bound coordinates of observation is not of prime importance:
With spatial data we mean data for which spatial locations (or relations) are known, and for which they play a role in the exploration, analysis or visualisation of the data.
Several of the simple R data structures can be used to represent spatial data, examples are columns in a data.frame:
head(quakes)
## lat long depth mag stations
## 1 -20.42 181.62 562 4.8 41
## 2 -20.62 181.03 650 4.2 15
## 3 -26.00 184.10 42 5.4 43
## 4 -17.97 181.66 626 4.1 19
## 5 -20.42 181.96 649 4.0 11
## 6 -19.68 184.31 195 4.0 12
a matrix
class(volcano)
## [1] "matrix" "array"
volcano[1:5,1:5]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 100 100 101 101 101
## [2,] 101 101 102 102 102
## [3,] 102 102 103 103 103
## [4,] 103 103 104 104 104
## [5,] 104 104 105 105 105
image(volcano)
numeric vectors for polygon coordinates with NA
to separate individual rings:
library(maps)
m = map(regions="Australia", plot = FALSE, fill = TRUE)
pols = cbind(m$x, m$y)
dim(pols)
## [1] 1846 2
head(cbind(m$x, m$y), 10)
## [,1] [,2]
## [1,] 123.5945 -12.42568
## [2,] 123.5952 -12.43594
## [3,] 123.5732 -12.43418
## [4,] 123.5725 -12.42393
## [5,] 123.5945 -12.42568
## [6,] NA NA
## [7,] 158.8788 -54.70976
## [8,] 158.8452 -54.74922
## [9,] 158.8359 -54.70400
## [10,] 158.8970 -54.50605
map(regions="Australia")
We will shortly see a few more structured approaches to represent spatial data, using simple features, coverages or rasters. First let us look at what coordinates mean.
With coordinates, we usually think a numbered measured along a ruler, where the ruler might be an imaginary line: it has an offset (0), a unit (m), and a constant direction. For spatial data we could have two imageginary lines perpendicular to each other, and we call this Cartesian space. Distance between \((x_1,y_1)\) and \((x_2,y_2)\) in Cartesian space is computed by Euclidean distance: \[\sqrt{(x_1-x_2)^2+(y_1-y_2)^2}\]
The spatial data we just saw, above are not of this kind:
volcano
) did not have coordinatesLeft: geocentric coordinates (Cartesian, three-dimensional, units metres); Right: spherical/ellipsoidal coordinates (angles, units degrees)
Euclidean distances do not work for ellipsoidal coordinates: one degree longitude at the equator is about 111 km, at the poles it is 0 km.
“Data are not just numbers, they are numbers with a context” (Cobb & Moore)
Coordinate reference systems provide the context of coordinates:
Knowing this we can
Geometries can be described in many ways. We use the simple feature access specification that focuses on points, lines and polygons; the main types are:
POINT(0 1)
MULTIPOINT(0 1,10 8)
LINESTRING(0 0,1 1)
MULTILINESTRING((0 0,1 1),(5 5,4 6))
POLYGON((0 0,1 0,1 1,0 0))
MULTIPOLYGON(((0 0,1 0,1 1,0 0)), ((3 3,4 3,4 4,3 3)))
GEOMETRYCOLLECTION(POINT(0 1),LINESTRING(0 0,1 1))
Linestrings are formed by sequences of points, where straight lines are thought of as connecting them. Polygons are formed of non self-intersecting linestrings that form closed rings (first coordinate equals last coordinate).
An example:
library(sf)
## Linking to GEOS 3.9.0, GDAL 3.2.1, PROJ 7.2.1
(nc = read_sf(system.file("gpkg/nc.gpkg", package="sf")))
## Simple feature collection with 100 features and 14 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## Geodetic CRS: NAD27
## # A tibble: 100 x 15
## AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74 NWBIR74
## <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.114 1.44 1825 1825 Ashe 37009 37009 5 1091 1 10
## 2 0.061 1.23 1827 1827 Alle… 37005 37005 3 487 0 10
## 3 0.143 1.63 1828 1828 Surry 37171 37171 86 3188 5 208
## 4 0.07 2.97 1831 1831 Curr… 37053 37053 27 508 1 123
## 5 0.153 2.21 1832 1832 Nort… 37131 37131 66 1421 9 1066
## 6 0.097 1.67 1833 1833 Hert… 37091 37091 46 1452 7 954
## 7 0.062 1.55 1834 1834 Camd… 37029 37029 15 286 0 115
## 8 0.091 1.28 1835 1835 Gates 37073 37073 37 420 0 254
## 9 0.118 1.42 1836 1836 Warr… 37185 37185 93 968 4 748
## 10 0.124 1.43 1837 1837 Stok… 37169 37169 85 1612 1 160
## # … with 90 more rows, and 4 more variables: BIR79 <dbl>, SID79 <dbl>,
## # NWBIR79 <dbl>, geom <MULTIPOLYGON [°]>
library(ggplot2)
ggplot() + geom_sf(data = nc, aes(fill = SID74))
Coverages are tesselations (subdivisions) of space into regions, where every point can be uniquely assigned to a subregion.
With POLYGONS
we cannot do that, as two polygons that share a boundary cannot tell to which polygon a point on the shared boundary belongs.
Rasters are regular tesselations, which uniquely subdivide space into square or rectangular areas (raster cells, or pixels).
An example of a raster:
library(stars)
## Loading required package: abind
(L7 = read_stars(system.file("tif/L7_ETMs.tif", package = "stars")))
## stars object with 3 dimensions and 1 attribute
## attribute(s):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## L7_ETMs.tif 1 54 69 68.91242 86 255
## dimension(s):
## from to offset delta refsys point values x/y
## x 1 349 288776 28.5 UTM Zone 25, Southern Hem... FALSE NULL [x]
## y 1 352 9120761 -28.5 UTM Zone 25, Southern Hem... FALSE NULL [y]
## band 1 6 NA NA NA NA NULL
ggplot() + geom_stars(data = L7) + facet_wrap(~band) + coord_equal()
or simply:
plot(L7)
Spatial data gets more interesting when we not only consider the geometric properties, but also other properties, which we call (feature) attributes.
A crucial consideration here is support. Non-point geometries (lines, polygons) contain essentially an infinite number of points. If we have an attribute of such a geometry, the question is whether the attribute is valid for
An example of point support: if we have a polygon of a soil map delineating the area with sandy soils, it indicates (up to mapping errors) that every point in the polygon has a sandy soil.
An example of area support: if we have an administrative region, the population density of that region is not a density measure for every point in the area, but an aggregate value relevant for the polygon as a whole.
When we analyse area support data (aggregates), choosing a different set of polygons leads to different results, a phenomenon known as the modifiable areal unit problem (MAUP). Intentially manipulating polygons such that a particular analysis result is obtained, e.g. in election outcomes, is called gerrymandering.
A common operation on areal data is to generate attributes for a new spatial subdivision, often indicated by
When doing this, for spatially extensive variables (e.g. population counts) their sum needs to be preserved, for spatially intensive variables (e.g. population density) their mean needs to be preserved.
gapminder
exampleFrom the help of gapminder
in package gapminder
(by Jenny Bryan) “by continent, which country experienced the sharpest 5-year drop in life expectancy and what was the drop?”
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(gapminder)
gapminder |>
group_by(continent, country) |>
select(country, year, continent, lifeExp) |>
mutate(le_delta = lifeExp - lag(lifeExp)) |>
summarize(worst_le_delta = min(le_delta, na.rm = TRUE)) |>
filter(min_rank(worst_le_delta) < 2) |>
arrange(worst_le_delta)
## `summarise()` has grouped output by 'continent'. You can override using the `.groups` argument.
## # A tibble: 5 x 3
## # Groups: continent [5]
## continent country worst_le_delta
## <fct> <fct> <dbl>
## 1 Africa Rwanda -20.4
## 2 Asia Cambodia -9.10
## 3 Americas El Salvador -1.51
## 4 Europe Montenegro -1.46
## 5 Oceania Australia 0.170
Now, can we show a map of yearly changes in life expectancy?
Step 1: match the table to country geometries:
library(rnaturalearth)
library(sf)
ne = ne_countries(returnclass = "sf")
# Try to match country names:
m = match(gapminder$country, ne$admin)
(u = unique(gapminder$country[is.na(m)]))
## [1] Bahrain Comoros Congo, Dem. Rep.
## [4] Congo, Rep. Cote d'Ivoire Guinea-Bissau
## [7] Hong Kong, China Korea, Dem. Rep. Korea, Rep.
## [10] Mauritius Reunion Sao Tome and Principe
## [13] Serbia Singapore Slovak Republic
## [16] Tanzania United States West Bank and Gaza
## [19] Yemen, Rep.
## 142 Levels: Afghanistan Albania Algeria Angola Argentina Australia ... Zimbabwe
So this is pretty ugly. We can solve it the hard way, manually sorting out:
n = as.character(ne$admin)
(conv = tribble(
~old, ~new,
u[3], n[34],
u[4], n[35],
u[5], n[32],
u[6], n[63],
u[8], n[130],
u[9], n[88],
u[13], n[148],
u[15], n[150],
u[16], n[165],
u[17], n[169],
u[19], n[174]
))
## # A tibble: 11 x 2
## old new
## <fct> <chr>
## 1 Congo, Dem. Rep. Democratic Republic of the Congo
## 2 Congo, Rep. Republic of Congo
## 3 Cote d'Ivoire Ivory Coast
## 4 Guinea-Bissau Guinea Bissau
## 5 Korea, Dem. Rep. North Korea
## 6 Korea, Rep. South Korea
## 7 Serbia Republic of Serbia
## 8 Slovak Republic Slovakia
## 9 Tanzania United Republic of Tanzania
## 10 United States United States of America
## 11 Yemen, Rep. Yemen
library(stringr)
repl = conv$new
names(repl) = conv$old
gapminder$ncountry =
str_replace_all(gapminder$country, repl)
m = match(gapminder$ncountry, ne$admin)
(un = unique(gapminder$country[is.na(m)]))
## [1] Bahrain Comoros Hong Kong, China
## [4] Mauritius Reunion Sao Tome and Principe
## [7] Singapore West Bank and Gaza
## 142 Levels: Afghanistan Albania Algeria Angola Argentina Australia ... Zimbabwe
reducing the number of unmatched countries from 19 to 8.
Now we can do the big trick, and add a geometry column to gapminder
:
gapminder_sf = st_sf(gapminder, geometry = ne$geometry[m])
gapminder_sf$geometry
## Geometry set for 1704 features (with 96 geometries empty)
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -171.7911 ymin: -55.61183 xmax: 178.5171 ymax: 83.23324
## CRS: +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## First 5 geometries:
## MULTIPOLYGON (((61.21082 35.65007, 62.23065 35....
## MULTIPOLYGON (((61.21082 35.65007, 62.23065 35....
## MULTIPOLYGON (((61.21082 35.65007, 62.23065 35....
## MULTIPOLYGON (((61.21082 35.65007, 62.23065 35....
## MULTIPOLYGON (((61.21082 35.65007, 62.23065 35....
sum(is.na(m))
## [1] 96
We see that the NA
values in m
have been filled with empty geometries. Let’s look at the geometries:
st_geometry(gapminder_sf) |> plot()
Oops! We didn’t see that from the table above! Also the question didn’t mention “of all countries with complete time series”…
The help from gapminder
(can you find that?) says: “The supplemental data frame ‘gapminder_unfiltered’ was not filtered on ‘year’ or for complete data and has 3313 rows.” Maybe we should use that?
gapminder_unfiltered$ncountry =
str_replace_all(gapminder_unfiltered$country, repl)
m = match(gapminder_unfiltered$ncountry, ne$admin)
(un = unique(gapminder_unfiltered$country[is.na(m)]))
## [1] Aruba Bahamas Bahrain
## [4] Barbados Cape Verde Comoros
## [7] French Guiana French Polynesia Grenada
## [10] Guadeloupe Hong Kong, China Macao, China
## [13] Maldives Malta Martinique
## [16] Mauritius Micronesia, Fed. Sts. Netherlands Antilles
## [19] Reunion Samoa Sao Tome and Principe
## [22] Singapore Timor-Leste Tonga
## [25] West Bank and Gaza
## 187 Levels: Afghanistan Albania Algeria Angola Argentina Armenia ... Zimbabwe
so there are now 25 rather than 9 we can’t match to natural earth polygons, and
gapminder_u_sf = st_sf(gapminder_unfiltered, geometry = ne$geometry[m])
gapminder_u_sf$geometry
## Geometry set for 3313 features (with 215 geometries empty)
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -180 ymin: -55.61183 xmax: 180 ymax: 83.23324
## CRS: +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## First 5 geometries:
## MULTIPOLYGON (((61.21082 35.65007, 62.23065 35....
## MULTIPOLYGON (((61.21082 35.65007, 62.23065 35....
## MULTIPOLYGON (((61.21082 35.65007, 62.23065 35....
## MULTIPOLYGON (((61.21082 35.65007, 62.23065 35....
## MULTIPOLYGON (((61.21082 35.65007, 62.23065 35....
sum(is.na(m))
## [1] 215
st_geometry(gapminder_u_sf) |> plot()
looks much better. Which other countries are now included, but absent from gapminder
?
setdiff(unique(gapminder_unfiltered$ncountry), unique(gapminder$ncountry))
## [1] "Armenia" "Aruba" "Azerbaijan"
## [4] "Bahamas" "Barbados" "Belarus"
## [7] "Belize" "Bhutan" "Brunei"
## [10] "Cape Verde" "Cyprus" "Estonia"
## [13] "Fiji" "French Guiana" "French Polynesia"
## [16] "Georgia" "Grenada" "Guadeloupe"
## [19] "Guyana" "Kazakhstan" "Latvia"
## [22] "Lithuania" "Luxembourg" "Macao, China"
## [25] "Maldives" "Malta" "Martinique"
## [28] "Micronesia, Fed. Sts." "Moldova" "Netherlands Antilles"
## [31] "New Caledonia" "Papua New Guinea" "Qatar"
## [34] "Russia" "Samoa" "Solomon Islands"
## [37] "Suriname" "Tajikistan" "Timor-Leste"
## [40] "Tonga" "Turkmenistan" "Ukraine"
## [43] "United Arab Emirates" "Uzbekistan" "Vanuatu"
We have 45 countries more than in gapminder
, meaning that we now have 45 - 25 + 9 = 29 countries more matched with polygons.
We could now, for instance, create a map for the last measured year:
gapminder_u_sf |>
filter(year == 2007) |>
select(lifeExp) |>
plot()
where South Sudan appears to be missing (it gained independence in 2011).
Assessing missing-ness from plotting is a good check to find big ones (Russia), but not at all sufficient, since many countries are small. How many countries go invisible on a plot like the one above? Note that we plot country borders with thin lines (one pixel). Suppose that for a country to be visible with color it must span at least, say 3 x 3 pixels. We can obtain the device size in pixels by
dev.size("px")
and let’s say it is 1000 by 1000. Then, the size of a single pixel (when we do not have margins around the plot) at the equator is roughly
library(units)
## udunits database from /usr/share/xml/udunits/udunits2.xml
(pix_size = 2 * pi * set_units(s2::s2_earth_radius_meters(), m) / 1000)
## 40030.24 [m]
and the number of the Natural Earth countries with an area smaller than 9 equator-pixels is
a = st_make_valid(ne) |> # the Natural Earth dataset is invalid on the sphere
st_area() |>
na.omit()
sum(a < 9 * pix_size^2)
## [1] 15
If we show the matrix of lifeExp
by country and year,
co = as.factor(gapminder_u_sf$ncountry)
minyear = min(gapminder_u_sf$year)
ye = gapminder_u_sf$year - minyear + 1
m = matrix(NA, length(levels(co)), length(unique(ye)))
m[cbind(co,ye)] = gapminder_u_sf$lifeExp
image(x = sort(unique(ye)) + minyear - 1, z = t(m))
where we see a clear pattern of (more) valid values for every year ending in 2 or 7. We can show maps for each of those years by:
sel = gapminder_u_sf$year %% 5 == 2 # 2002, 2007 etc
ggplot() + geom_sf(data = gapminder_u_sf[sel,], aes(fill = lifeExp)) +
facet_wrap(~year)
ggplot() + geom_sf(data = gapminder_sf, aes(fill = lifeExp)) +
facet_wrap(~year)
Raster data data are used for various phenomena, but are in particular suitable for phenomena that vary continuously over space, such as elevation, temperature, air quality, and so on. Also imagery, when georeferenced, is provided as raster data.
Raster data is usually regular, but may come in other forms. Converting (projecting, reprojecting, unprojecting) may cause curvilinear grids; regridding these is called warping, may cause some data loss and is not invertible.
Color imagery comes with three bands, satellite imagery, as shown above, typically has more than three bands.
Video is a time sequence of RGB images; satellite images also come as repeated time series but at much lower temporal resolution. Weather data (“nowcasts”), weather forecasts, climate predictions, land or sea surface temperatures are all created, using models, to create raster time series. Here is an example:
w = system.file("nc/bcsd_obs_1999.nc", package = "stars") |>
read_stars()
## pr, tas,
w
## stars object with 3 dimensions and 2 attributes
## attribute(s):
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## pr [mm/m] 0.5900000 56.139999 81.88000 101.26433 121.07250 848.54999 7116
## tas [C] -0.4209678 8.898887 15.65763 15.48932 21.77979 29.38581 7116
## dimension(s):
## from to offset delta refsys point values x/y
## x 1 81 -85 0.125 NA NA NULL [x]
## y 1 33 37.125 -0.125 NA NA NULL [y]
## time 1 12 NA NA POSIXct NA 1999-01-31,...,1999-12-31
Where we see monthly data of precipitation (and temperature). We also see that x
and y
do, but time
does not have regular intervals. From the data set alone, it is also not clear whether monthly total precipitation (and mean temperature) are given, or values for the date shown (most likely: the former; this is temporal support).
We can plot these data, and add some state boundary for reference by:
nc = read_sf(system.file("gpkg/nc.gpkg", package="sf")) |>
st_geometry()
nc.u = st_union(nc)
hook = function() plot(st_geometry(nc.u), add = TRUE, border = 'orange')
plot(w, hook = hook)
or by using ggplot2
:
ggplot() + geom_stars(data = w) +
geom_sf(data = st_cast(nc.u, "MULTILINESTRING"), col = 'orange') +
facet_wrap(~as.Date(time))
Higher-dimensional data cubes are for instance obtained by remote sensing imagery, where dimensions include x
, y
(raster), band
(color / wavelength) and time
. Satellite imagery for larger areas is often distributed as images (tiles) for different coordinate reference systems, and with times of the overpass of satellite. This means that warping and aggregation is required to create a regular data cube out of such image collections. A recent blog entry on how to do this (reproducibly) is found here.
Although much of this data is free-for-download, analysing this data over a large area (even at low resolution) and/or over a longer time span quickly makes the download option unfeasible. Using cloud platforms that already serve (fast) access to complete data archive are the only alternative.
The goal of data analysis is often beyond descriptive statistics and exploratory graphs, but may involve
gapminder
, spatial and/or temporal interpolation) from observationsProblems special to spatial data often feed back to the sampling process and the goals:
“Classical” modelling approaches (e.g. linear regression, but also most machine learning approaches) assume IID data: Independent, Identically Distributed. If analysis proceeds “model-based”, data are not independent. If sampling is biased, observations are also not identically distributed.
Wind turbine in Germany, e.g. from here we can download the wind turbines as a GeoJSON file, which we will assume resides in directory ~/Downoads
wt_sf = read_sf("~/Downloads/_Onshore_Windkraftanlagen_in_Deutschland.geojson")
library(rnaturalearth)
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ tibble 3.1.2 ✔ purrr 0.3.4
## ✔ tidyr 1.1.3 ✔ forcats 0.5.1
## ✔ readr 1.4.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::map() masks maps::map()
de = ne_countries(scale = 50, returnclass = "sf") |>
filter(admin == "Germany")
plot(st_geometry(de), border = 'red')
plot(st_geometry(wt_sf), add = TRUE, cex = .5)
If we create an interactive plot for this dataset with mapview
, using
library(mapview)
mapviewOptions(fgb = FALSE) # not needed locally
mapview(wt_sf)
## Warning in validateCoords(lng, lat, funcName): Data contains 14 rows with either
## missing or invalid lat/lon values and will be ignored
we can zoom and click on points to see their attributes. We also see that different color tones indicate overplotted, identical points; there is no way of
We can create a ppp
object from this by
library(spatstat)
## Loading required package: spatstat.data
## Loading required package: spatstat.geom
## spatstat.geom 2.2-0
## Loading required package: spatstat.core
## Loading required package: nlme
##
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
##
## collapse
## Loading required package: rpart
## spatstat.core 2.2-0
## Loading required package: spatstat.linnet
## spatstat.linnet 2.2-1
##
## spatstat 2.2-0 (nickname: 'That's not important right now')
## For an introduction to spatstat, type 'beginner'
pp = st_geometry(wt_sf)
window = st_geometry(de)
wt = as.ppp(c(window, pp))
## Error: Only projected coordinates may be converted to spatstat class objects
Which won’t work; we’ll have to project these data first, e.g. to UTM zone 32N. We also need to remove points with missing (empty) geometries.
crs = st_crs("EPSG:32632")
pp = st_transform(pp, crs)[!st_is_empty(pp)]
window = st_transform(window, crs)
wt = as.ppp(c(window, pp))
## Warning: 70 points were rejected as lying outside the specified window
## Warning: data contain duplicated points
plot(density(wt, bw = "SJ"))
plot(window, add = TRUE)
Of course, this density depends on the bandwidth, and we can create more rough or smooth versions by setting it:
plot(density(wt, sigma = 1e4))
plot(window, add = TRUE)
plot(density(wt, sigma = 1e6))
plot(window, add = TRUE)
The K function counts for a varying window size \(r\) how many additional points are found within a window of radius \(r\) centered at each point, and compares that to a completely spatially random (CSR, Poisson) process.
plot(Kest(wt))
## number of data points exceeds 3000 - computing border correction estimate only
One can compute an envelope around the CSR model (which is omitted here because of excessive runtime)
plot(envelope(wt, Kest))
Another measure is the G-function, which estimates the nearest neighbour distribution function from a point pattern.
plot(Gest(wt))
We see that ther is a clear point mass at distance zero, indicating many duplicates. We can remore duplicates using unique()
, and see a strong difference:
plot(Gest(unique(wt)))
For this function, the envelope is quickly computed:
plot(envelope(unique(wt), Gest))
## Generating 99 simulations of CSR ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
## 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80,
## 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99.
##
## Done.
and indicates we have a significant deviation from CSR (Poisson, theoretical), with nearest neighbour distances much more abundant then expected (clustering, attraction).
Geostatistical analysis is often focused on the interpolation of a spatially continuous variable (a field), from scattered point observations. Consider the wind turbine data from the previous section,
wt_sf$Power = as.numeric(gsub(",", ".", gsub("\\.", "", wt_sf$Gesamtleistung__MW_)))
## Warning: NAs introduced by coercion
plot(wt_sf["Power"], pch = 16, logz = TRUE, reset = FALSE)
plot(st_geometry(de), add = TRUE)
then we see scattered values, which we could theoretically interpolate but the outcome would be meaningless: at unobserved locations there are no wind turbines, and the power of a turbine is not an indication of e.g. wind potential (as they are all of different size).
An example of a data set we could meaningfully interpolate is about air quality, with annual mean NO2 concentrations, measured at rural background stations. The data are from package gstat
, but from a not-yet-released version, so we read them directly from GitHub:
url = "https://raw.githubusercontent.com/r-spatial/gstat/master/inst/external/no2.csv"
no2.sf = read.csv(url) |>
st_as_sf(coords = c("station_longitude_deg", "station_latitude_deg"),
crs = st_crs(4326))
plot(no2.sf["NO2"], pch = 16, reset = FALSE)
plot(st_geometry(de), add = TRUE)
library(gstat)
##
## Attaching package: 'gstat'
## The following object is masked from 'package:spatstat.core':
##
## idw
v = variogram(NO2~1, no2.sf)
v.fit = fit.variogram(v, vgm(1, "Exp", 100, 1))
plot(v, v.fit, plot.numbers = TRUE, xlab = "distance (km)")
library(stars)
grd = st_as_stars(st_bbox(de))
st_crs(grd) = st_crs(no2.sf)
k = krige(NO2~1, no2.sf, grd, v.fit)
## [using ordinary kriging]
plot(k["var1.pred"], breaks = "equal", reset = FALSE, axes = TRUE)
plot(st_geometry(de), add = TRUE, border = 'orange')
plot(st_geometry(no2.sf), col = 'green', add = TRUE, pch = 3)
Note that this procedure used ellipsoidal coordinates all the way, and computed distances correctly nevertheless. In many cases, it is more convenient (or appropriate) to carry out geostatistical analyses using projected coordinates:
no2.sf_utm = st_transform(no2.sf, st_crs("EPSG:32632"))
de_utm = st_transform(de, st_crs("EPSG:32632"))
v = variogram(NO2~1, no2.sf_utm)
v.fit = fit.variogram(v, vgm(1, "Exp", 10000, 1))
plot(v, v.fit, plot.numbers = TRUE, xlab = "distance (m)")
grd_utm = st_as_stars(st_bbox(de_utm))
k = krige(NO2~1, no2.sf_utm, grd_utm, v.fit)
## [using ordinary kriging]
plot(k["var1.pred"], breaks = "equal", reset = FALSE, axes = TRUE)
plot(st_geometry(de_utm), add = TRUE, border = 'orange')
plot(st_geometry(no2.sf_utm), col = 'green', add = TRUE, pch = 3)
We can of course crop the map to the area of Germany
plot(st_crop(k,de_utm)["var1.pred"], reset = FALSE)
plot(st_geometry(de_utm), add = TRUE, border = 'orange')
plot(st_geometry(no2.sf_utm), col = 'green', add = TRUE, pch = 3)
but in general field variables like air quality don’t “stop” at administrative boundaries.
The following example was taken from Chapters 14-17 of SDS.
The analysed variable is first round turnout proportion of registered voters in municipalities and Warsaw burroughs in the 2015 Polish presidential election.
# install.packages("spDataLarge", repos = "https://nowosad.github.io/drat/", type = "source")
data(pol_pres15, package = "spDataLarge")
pol_pres15 |>
subset(select = c(TERYT, name, types)) |>
head()
## Simple feature collection with 6 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 235157.1 ymin: 366913.3 xmax: 281431.7 ymax: 413016.6
## Projected CRS: ETRS89 / Poland CS92
## TERYT name types geometry
## 1 020101 BOLESŁAWIEC Urban MULTIPOLYGON (((261089.5 38...
## 2 020102 BOLESŁAWIEC Rural MULTIPOLYGON (((254150 3837...
## 3 020103 GROMADKA Rural MULTIPOLYGON (((275346 3846...
## 4 020104 NOWOGRODZIEC Urban/rural MULTIPOLYGON (((251769.8 37...
## 5 020105 OSIECZNICA Rural MULTIPOLYGON (((263423.9 40...
## 6 020106 WARTA BOLESŁAWIECKA Rural MULTIPOLYGON (((267030.7 38...
library(tmap)
tm_shape(pol_pres15) + tm_fill("types")
We can compute Moran’s I, a measure for spatial autocorrelation, when we established neighbourhood relationships:
library(spdep)
## Loading required package: sp
## Loading required package: spData
pol_pres15 |> poly2nb(queen = TRUE) -> nb_q
nb_q |> nb2listw(style = "B") -> lw_q_B
glance_htest <- function(ht) c(ht$estimate,
"Std deviate" = unname(ht$statistic),
"p.value" = unname(ht$p.value))
(pol_pres15 |>
st_drop_geometry() |>
subset(select = I_turnout, drop = TRUE) -> z) |>
moran.test(listw = lw_q_B, randomisation = FALSE) |>
glance_htest()
## Moran I statistic Expectation Variance Std deviate
## 0.6914339743 -0.0004009623 0.0001400449 58.4613487704
## p.value
## 0.0000000000
The book chapter (15) then further details this into local Moran’s I tests.
Spatial regression models are classical regression models extended with effects that take care of spatial nature of the data, using either
Chapter 16 and 17 of SDS gives examples of several of the usual approaches, for this.
With spatial time series, such as in gapminder
, panel linear models can be considered,
library(plm)
##
## Attaching package: 'plm'
## The following objects are masked from 'package:dplyr':
##
## between, lag, lead
plm(log(lifeExp) ~ log(gdpPercap), gapminder, index = c("country", "year")) |>
summary()
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = log(lifeExp) ~ log(gdpPercap), data = gapminder,
## index = c("country", "year"))
##
## Balanced Panel: n = 142, T = 12, N = 1704
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -5.6776e-01 -4.6444e-02 -9.7224e-05 5.9601e-02 2.9937e-01
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## log(gdpPercap) 0.1640736 0.0058029 28.274 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 25.268
## Residual Sum of Squares: 16.71
## R-Squared: 0.33868
## Adj. R-Squared: 0.27852
## F-statistic: 799.442 on 1 and 1561 DF, p-value: < 2.22e-16
but these entirely ignore spatial (neighbourhood) effects. For spatial panel linear models we follow Milo & Piras, 2012; we must first create a spatial weight matrix, which is row-standardised:
gm = gapminder_sf |> filter(year == 2002) # or any other
wm = st_intersects(gm, gm) |>
as.matrix()
diag(wm) = 0 # remove self-intersections
dimnames(wm) = list(gm$country, gm$country)
# row-standardise:
wm_2 = apply(wm, 1, function(x) { s = sum(x); if (s == 0) 0*x else x/(sum(x)) }) |>
t()
We need to remove countries with zero neighbours:
wm_nn = which(apply(wm_2, 1, sum) == 0)
gmlistw = mat2listw(wm_2[-wm_nn, -wm_nn])
gapminder_nn = gapminder |>
filter(!country %in% names(wm_nn)) |>
select(-continent) # so that column 1 and 2 are individual, time
and can then fit the most simple model by
library(splm)
spml(log(lifeExp) ~ log(gdpPercap), gapminder_nn, listw = gmlistw) |>
summary()
## Warning: Function eigen_setup moved to the spatialreg package
## Warning: Function do_ldet moved to the spatialreg package
## Warning: Function do_ldet moved to the spatialreg package
## Warning: Function do_ldet moved to the spatialreg package
## Warning: Function do_ldet moved to the spatialreg package
## Warning: Function do_ldet moved to the spatialreg package
## Warning: Function do_ldet moved to the spatialreg package
## Warning: Function do_ldet moved to the spatialreg package
## Warning: Function do_ldet moved to the spatialreg package
## Warning: Function do_ldet moved to the spatialreg package
## Warning: Function do_ldet moved to the spatialreg package
## Warning: Function do_ldet moved to the spatialreg package
## Warning: Function do_ldet moved to the spatialreg package
## Warning: Function do_ldet moved to the spatialreg package
## Warning: Function do_ldet moved to the spatialreg package
## Warning: Function do_ldet moved to the spatialreg package
## Warning: Function do_ldet moved to the spatialreg package
## Spatial panel fixed effects error model
##
##
## Call:
## spml(formula = log(lifeExp) ~ log(gdpPercap), data = gapminder_nn,
## listw = gmlistw)
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -0.5538648 -0.0612555 0.0087088 0.0754276 0.3143790
##
## Spatial error parameter:
## Estimate Std. Error t-value Pr(>|t|)
## rho 0.8420572 0.0099066 85 < 2.2e-16 ***
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## log(gdpPercap) 0.0344993 0.0042252 8.165 3.213e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
and a model that also has a spatial autoregressive component by
spml(log(lifeExp) ~ log(gdpPercap), gapminder_nn, listw = gmlistw,
model = "random", lag = TRUE) |>
summary()
## Warning in if (class(covTheta) == "try-error") {: the condition has length > 1
## and only the first element will be used
## ML panel with spatial lag, random effects, spatial error correlation
##
## Call:
## spreml(formula = formula, data = data, index = index, w = listw2mat(listw),
## w2 = listw2mat(listw2), lag = lag, errors = errors, cl = cl)
##
## Residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.55 3.21 3.40 3.36 3.53 3.64
##
## Error variance parameters:
## Estimate Std. Error t-value Pr(>|t|)
## phi 2.996150 0.431952 6.9363 4.025e-12 ***
## rho -0.217109 0.052232 -4.1566 3.230e-05 ***
##
## Spatial autoregressive coefficient:
## Estimate Std. Error t-value Pr(>|t|)
## lambda 0.831166 0.014322 58.033 < 2.2e-16 ***
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## (Intercept) 0.337909 0.021871 15.45 < 2.2e-16 ***
## log(gdpPercap) 0.041998 0.002530 16.60 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note that it was not further examined whether this model makes any sense, or the transformations; this is purely meant as an illustration how to get spml
to work.
save.image()
sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] splm_1.4-11 plm_2.4-1 spdep_1.1-8
## [4] spData_0.3.10 sp_1.4-5 tmap_3.3-2
## [7] gstat_2.0-8 spatstat_2.2-0 spatstat.linnet_2.2-1
## [10] spatstat.core_2.2-0 rpart_4.1-15 nlme_3.1-152
## [13] spatstat.geom_2.2-0 spatstat.data_2.1-0 mapview_2.10.0
## [16] forcats_0.5.1 purrr_0.3.4 readr_1.4.0
## [19] tidyr_1.1.3 tibble_3.1.2 tidyverse_1.3.1
## [22] units_0.7-2 stringr_1.4.0 rnaturalearth_0.1.0
## [25] gapminder_0.3.0 dplyr_1.0.7 stars_0.5-4
## [28] abind_1.4-5 ggplot2_3.3.5 sf_1.0-3
## [31] maps_3.3.0
##
## loaded via a namespace (and not attached):
## [1] spam_2.6-0 readxl_1.3.1 uuid_0.1-4
## [4] backports_1.2.1 miscTools_0.6-26 systemfonts_1.0.2
## [7] lwgeom_0.2-7 splines_4.1.0 crosstalk_1.1.1
## [10] leaflet_2.0.4.1 digest_0.6.27 leafpop_0.1.0
## [13] htmltools_0.5.1.1 ibdreg_0.3.1 gdata_2.18.0
## [16] leaflet.providers_1.9.0 fansi_0.5.0 magrittr_2.0.1
## [19] tensor_1.5 modelr_0.1.8 gmodels_2.18.1
## [22] sandwich_3.0-1 xts_0.12.1 svglite_2.0.0
## [25] bdsmatrix_1.3-4 spatstat.sparse_2.0-0 colorspace_2.0-2
## [28] rvest_1.0.0 rbibutils_2.2.1 haven_2.4.1
## [31] xfun_0.24 rgdal_1.5-23 leafem_0.1.6
## [34] crayon_1.4.1 jsonlite_1.7.2 brew_1.0-6
## [37] zoo_1.8-9 glue_1.4.2 polyclip_1.10-0
## [40] gtable_0.3.0 webshot_0.5.2 maxLik_1.4-8
## [43] scales_1.1.1 DBI_1.1.1 Rcpp_1.0.7
## [46] viridisLite_0.4.0 proxy_0.4-26 dotCall64_1.0-1
## [49] Formula_1.2-4 stats4_4.1.0 intervals_0.15.2
## [52] htmlwidgets_1.5.3 httr_1.4.2 FNN_1.1.3
## [55] RColorBrewer_1.1-2 wk_0.4.1 ellipsis_0.3.2
## [58] pkgconfig_2.0.3 XML_3.99-0.6 farver_2.1.0
## [61] sass_0.4.0 dbplyr_2.1.1 deldir_0.2-10
## [64] utf8_1.2.1 tidyselect_1.1.1 labeling_0.4.2
## [67] rlang_0.4.11 tmaptools_3.1-1 spatialreg_1.1-8
## [70] munsell_0.5.0 cellranger_1.1.0 tools_4.1.0
## [73] cli_2.5.0 generics_0.1.0 broom_0.7.8
## [76] evaluate_0.14 fftwtools_0.9-11 yaml_2.2.1
## [79] goftest_1.2-2 leafsync_0.1.0 knitr_1.33
## [82] fs_1.5.0 s2_1.0.6 satellite_1.0.2
## [85] xml2_1.3.2 compiler_4.1.0 rstudioapi_0.13
## [88] png_0.1-7 e1071_1.7-7 spatstat.utils_2.2-0
## [91] reprex_2.0.0 spacetime_1.2-5 bslib_0.2.5.1
## [94] stringi_1.6.2 highr_0.9 ps_1.6.0
## [97] rgeos_0.5-5 lattice_0.20-44 Matrix_1.3-4
## [100] classInt_0.4-3 vctrs_0.3.8 LearnBayes_2.15.1
## [103] pillar_1.6.1 lifecycle_1.0.0 lmtest_0.9-38
## [106] Rdpack_2.1.2 jquerylib_0.1.4 raster_3.4-13
## [109] R6_2.5.0 KernSmooth_2.23-20 spDataLarge_0.5.4
## [112] codetools_0.2-18 dichromat_2.0-0 gtools_3.9.2
## [115] MASS_7.3-54 boot_1.3-28 assertthat_0.2.1
## [118] withr_2.4.2 expm_0.999-6 mgcv_1.8-36
## [121] parallel_4.1.0 hms_1.1.0 grid_4.1.0
## [124] coda_0.19-4 rnaturalearthdata_0.1.0 class_7.3-19
## [127] rmarkdown_2.9 lubridate_1.7.10 base64enc_0.1-3