Before we start

What are spatial data?

  • One could argue that all data are spatial: how can one observe something that is not somewhere?
  • However, only data for which the location of observation is registered, referenced, and of (some) relevance to the information contained are considered spatial data

Overview

  • data structures vs. data concepts
  • the software landscape
  • workflows; file formats
  • handling attributes (properties) of geometries
  • plotting maps
  • measurement units and coordinate reference systems
  • spatiotemporal data, raster and vector data cubes

data structures vs. data concepts

  • data structures: how we represent data in the computer (or on paper)
  • data concepts: what we mean by these representations

Example data structure: a vector, array, or list

Example data concept: an object, an event, a field or coverage, a trajectory, an administrative division, the outcome of an election, a hurricane. We will come back to this.

Simple features

See https://journal.r-project.org/archive/2018/RJ-2018-009/

Simple features are (a data model for) things that have a geometry and properties.

Simple feature geometry refers to the geometry of a simple feature.

The simple refers to lines (and polygon boundaries) being represented as sequences of points connected by straight lines.

For some phenomena, this "thing data model" works

  • badly: for continuously varying spatial properties, like air temperature
  • only moderately well: for categorical/discrete properties varying continuously over space, such as land use

The common geometries

  • POINT (0 1)
  • MULTIPOINT (0 1,1 1,-1 2)
  • LINESTRING (0 0,1 1)
  • MULTILINESTRING ((0 0,1 1),(2 2,0 2))
  • POLYGON ((0 0,3 0,3 3,0 3,0 0),(1 1,2 1,2 2,1 2,1 1))
  • MULTIPOLYGON (((0 0,1 0,1 1,0 1,0 0)),((2 2,3 2,3 3,2 3,2 2)))
  • GEOMETRYCOLLECTION(POINT(0 1), LINESTRING(0 0,1 1))
  • POINT EMPTY

These encodings are called Well-known text (WKT)
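
In R, the sf package parses WKT with st_as_sfc and writes it back with st_as_text; a minimal sketch:

library(sf)
g = st_as_sfc(c("POINT (0 1)", "LINESTRING (0 0,1 1)")) # parse WKT into geometry
st_as_text(g)                                           # back to WKT strings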

Binary encodings for foreign I/O

Well-known binary (WKB):

library(sf)
## Linking to GEOS 3.7.0, GDAL 2.4.0, PROJ 5.2.0
st_as_binary(st_as_sfc("POINT (0 1)"))[[1]]
##  [1] 01 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f0 3f
st_as_binary(st_as_sfc("LINESTRING (0 0,1 1)"))[[1]]
##  [1] 01 02 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
## [24] 00 00 00 00 00 00 00 00 f0 3f 00 00 00 00 00 00 f0 3f

WKB:

  • used for I/O to and from GDAL, GEOS and liblwgeom
  • used for (direct) I/O to databases
  • not human readable, but
  • fast and lossless
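
A round-trip sketch, assuming st_as_sfc accepts the WKB list returned by st_as_binary (st_as_text is used only to print the result as WKT):

g = st_as_sfc("LINESTRING (0 0,1 1)") # from WKT
wkb = st_as_binary(g)                 # encode as WKB: a list of raw vectors
st_as_text(st_as_sfc(wkb))            # decode again, shown as WKT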

Representation in R (by sf):

str(st_as_sfc("POINT (0 1)")[[1]])
##  'XY' num [1:2] 0 1
str(st_as_sfc("LINESTRING (0 0,1 1,2 2)")[[1]])
##  'XY' num [1:3, 1:2] 0 1 2 0 1 2
  • POINT is a numeric vector
  • a point set is a matrix (1 point per row)
  • a set of anything else is a list
  • classes disambiguate identical storage form (e.g. LINESTRING and MULTIPOINT, or POLYGON and MULTILINESTRING)
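
These building blocks can also be created directly from vectors, matrices and lists; a small sketch:

st_point(c(0, 1))                                # numeric vector -> POINT
st_multipoint(rbind(c(0, 1), c(1, 1), c(-1, 2))) # matrix -> MULTIPOINT
st_linestring(rbind(c(0, 0), c(1, 1)))           # same storage form, different class
st_polygon(list(rbind(c(0, 0), c(3, 0), c(3, 3), c(0, 3), c(0, 0)))) # list of rings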

sf class system

  • single feature geometry: sfg
class(st_linestring(rbind(c(0,0), c(1,1))))
## [1] "XY"         "LINESTRING" "sfg"
  • set of single feature geometries: sfc (keeps bbox, CRS)
class(st_sfc(st_point(c(0,1)), st_point(c(5,5)), crs = 4326))
## [1] "sfc_POINT" "sfc"
  • set of single features (geom + attributes): sf
class(st_sf(attr = "a", geom = st_sfc(st_point(c(0,1)))))
## [1] "sf"         "data.frame"

Empty geometries

Empty geometries reflect a missing geometry. E.g.

st_intersection(st_point(c(0,1)), st_point(c(2,2)))
## GEOMETRYCOLLECTION EMPTY
  • "there is no geometry for which …"
  • these are important for type safety
  • unclear why they are typed: what makes an empty point different from an empty linestring?
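
Empty geometries can be detected with st_is_empty; a minimal sketch:

e = st_intersection(st_point(c(0, 1)), st_point(c(2, 2)))
st_is_empty(st_sfc(e)) # TRUE: the intersection is empty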

Mixes: 1. at the single feature level:

gc = st_union(st_point(c(0,1)), st_linestring(rbind(c(0,0), c(1,1))))
gc
## GEOMETRYCOLLECTION (POINT (0 1), LINESTRING (0 0, 1 1))
st_sf(attr = "a", geom = st_sfc(gc))
## Simple feature collection with 1 feature and 1 field
## geometry type:  GEOMETRYCOLLECTION
## dimension:      XY
## bbox:           xmin: 0 ymin: 0 xmax: 1 ymax: 1
## epsg (SRID):    NA
## proj4string:    NA
##   attr                           geom
## 1    a GEOMETRYCOLLECTION (POINT (...

Mixes: 2. at the feature set level:

g = st_sfc(st_point(c(0,1)), st_linestring(rbind(c(0,0), c(1,1))))
st_sf(attr = c("a", "b"), geom = g)
## Simple feature collection with 2 features and 1 field
## geometry type:  GEOMETRY
## dimension:      XY
## bbox:           xmin: 0 ymin: 0 xmax: 1 ymax: 1
## epsg (SRID):    NA
## proj4string:    NA
##   attr                  geom
## 1    a           POINT (0 1)
## 2    b LINESTRING (0 0, 1 1)

Operations on geometries

Geometry operations can be:

  • predicates: return TRUE or FALSE, e.g. st_intersects
  • measures: return a measure (value, possibly with unit), e.g. st_area
  • geometry generating operations, e.g. st_intersection

They have an arity:

  • unary: operates on a single geometry, e.g. st_area
  • binary: operates on a pair, e.g. st_overlaps
  • n-ary: operates on a set, e.g. st_intersection

Chapter 5 of Spatial Data Science details all the combinations.

If this all looks complicated, do compare it to how a GIS works.
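
A small sketch of the three kinds of operations, using two overlapping squares:

a = st_as_sfc("POLYGON ((0 0,2 0,2 2,0 2,0 0))")
b = st_as_sfc("POLYGON ((1 1,3 1,3 3,1 3,1 1))")
st_intersects(a, b, sparse = FALSE) # predicate: TRUE
st_area(a)                          # measure: 4 (no CRS, hence no unit)
st_intersection(a, b)               # geometry generating: the overlapping square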

the R spatial package landscape: from sp to sf

The "sp stack" (2005-…) : sp, rgdal, rgeos, raster, … (sp: 550 revdeps)

The "sf stack" (2016-…): sf, stars, mapview, tmap, gstat, spdep, … (sf: 150 revdeps)

sp vs. sf

comparison:

  • sf can, sp cannot combine geometry types in a GEOMETRY
  • sf can, sp cannot return GEOMETRYCOLLECTION
  • sf can, sp cannot handle empty geometries
  • sf can, sp cannot distinguish between POLYGON and MULTIPOLYGON etc.
  • sp has an ambiguous data model for holes in polygons; sf does not
  • sp's SpatialXxxDataFrame objects are not data.frames, and as a consequence sp objects do not work well with tidyverse commands
  • sf has full ggplot2 support (geom_sf)

moving between sp and sf

geom = st_sfc(st_point(0:1), st_point(1:2), crs = 4326)
x = st_sf(attr = c("a", "b"), geom = geom)
library(sp)
x.sp = as(x, "Spatial") # sf -> sp
x2 = st_as_sf(x.sp) # and back: sp -> sf
all.equal(x, x2, check.attributes = FALSE)
## [1] TRUE
  • This works for objects that can be converted, and gives an error otherwise.
  • LINESTRING and POLYGON geometries come back as MULTILINESTRING and MULTIPOLYGON, respectively.

the sf dependency landscape

Exercise (1)

  • use the dataset found in system.file("shape/nc.shp", package="sf")
  • Read this dataset with sf::st_read (or sf::read_sf)
  • plot the dataset, then plot the second attribute
  • try to manipulate the color scheme and color breaks
  • plot the first 20 features of the dataset
  • create a plot with axes
  • what is the coordinate reference system of the dataset?
  • For the nc dataset, compute the area of the polygons with st_area
  • compare the areas with the AREA field in the dataset; is there a discrepancy?
  • select the polygon that intersects with POINT (-81.49826 36.4314)
  • Repeat the above steps (or similar ones) with a dataset of your own

workflows

Typical workflow:

  • read: load data from external files or database
  • compute: carry out computations
  • write: export results (new data "layers", summary statistics, figures)

Recommended:

  • put everything in an R script
  • even better: also put comments in the R script
  • much better: work from R-Markdown files, write executable documents

workflows without side effects

Avoid side effects (anything not in the script/R Markdown file):

  • NEVER read an .RData file at startup (RStudio option; R --vanilla)
  • NEVER write an .RData file on exit (RStudio option; R --vanilla)
  • do not put options with side effects in .Renviron / .Rprofile (I only use these for graphics options or user credentials)

Reproducible workflows

  • avoid using setwd() at all times
  • avoid using rm(list = ls()) at the start of the script; instead, assume a clean workspace
  • put all library(xxx) statements at the start
  • work in a single directory (+subdirectories if needed)
  • put a .Rproj rstudio project file in the directory
  • zip the directory to share/archive/backup, including the result (html/pdf)

So your colleague (or later self) can:

  • open the zip file
  • open the .Rproj file
  • open the .Rmd file, and click knit

sessionInfo()

R and R package versions matter. Use

sessionInfo()

or

devtools::session_info()

at the end of every .Rmd file, so that it is documented which versions of everything were used. This may help diagnose issues later on.

attributes of geometries

  • what is the bedrock type along the grey road?
  • what is the population density along the grey road?

support of data

  • a lot of data is not registered for time instances or spatial points, but rather for time periods and/or spatial areas; this is referred to as the support of the data
  • one can argue that all observations must refer to periods/areas, because observation takes time and space, but that does not help much
  • making the support explicit may help avoid making meaningless analyses
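
A sketch of changing support: county-level counts aggregated to a coarser, arbitrary 5 x 3 grid of areas:

library(sf)
nc = read_sf(system.file("gpkg/nc.gpkg", package = "sf"))
grd = st_make_grid(nc, n = c(5, 3)) # coarser target areas
aggregate(nc["BIR74"], grd, sum)    # sum the counts over the new support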

attribute-geometry relationships

suppressPackageStartupMessages(library(tidyverse))
nc = read_sf(system.file("gpkg/nc.gpkg", package="sf")) %>%
        select(CNTY_, BIR74)
st_agr(nc) = c(CNTY_ = "identity", BIR74 = "aggregate")
pt = st_as_sfc("POINT (-81.49826 36.4314)", crs = st_crs(nc))
i = st_intersection(st_sf(geom = pt), nc)
## although coordinates are longitude/latitude, st_intersection assumes that they are planar
## Warning: attribute variables are assumed to be spatially constant
## throughout all geometries

We don't get this warning if none of the attributes is an aggregate; we're faking this by

st_agr(nc) = c(CNTY_ = "identity", BIR74 = "constant")
i = st_intersection(st_sf(geom = pt), nc)
## although coordinates are longitude/latitude, st_intersection assumes that they are planar

plotting maps

  • plotting spatial data: base, ggplot2::geom_sf, raster/stars, mapview, tmap
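
A minimal sketch using base plot and ggplot2 on the nc dataset that ships with sf:

library(sf)
nc = read_sf(system.file("gpkg/nc.gpkg", package = "sf"))
plot(nc["BIR74"])                       # base plot method for sf objects
library(ggplot2)
ggplot(nc) + geom_sf(aes(fill = BIR74)) # ggplot2 equivalent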

measurement units and coordinate reference systems

  • attributes: ordinal, nominal, interval, ratio
  • reference systems: units of measurement, datums, coordinate reference systems, calendars, ontologies, …
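
Measures come back with their measurement unit attached (handled by the units package); a small sketch:

library(sf)
nc = read_sf(system.file("gpkg/nc.gpkg", package = "sf"))
a = st_area(nc[1, ])      # area of the first county, in m^2
units::set_units(a, km^2) # convert the measurement unit to km^2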

coordinate transformations

  • st_crs() shows, or sets, the coordinate reference system
  • st_transform() carries out coordinate transformations, using PROJ strings or EPSG codes (numbers), e.g.
nc = read_sf(system.file("gpkg/nc.gpkg", package="sf"))
nc.tr = st_transform(nc, 2264)
st_crs(nc.tr)
## Coordinate Reference System:
##   EPSG: 2264 
##   proj4string: "+proj=lcc +lat_1=36.16666666666666 +lat_2=34.33333333333334 +lat_0=33.75 +lon_0=-79 +x_0=609601.2192024384 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=us-ft +no_defs"
st_crs(nc.tr)$units
## [1] "us-ft"

Raster data

  • sp has some elementary infrastructure for this (in-memory)
  • package raster is scalable, solid, and offers extensive functionality
  • raster data model: single raster layers, or a stack of raster layers, possibly time-referenced (max: 3-D).
  • package stars tries to break out of this limitation, and covers raster and vector data cubes
  • a recurring challenge is out-of-core computation (data on disk); a generic implementation that is still performant is difficult
  • e.g. doing time series analysis on a series of NetCDF files, with one time instance per NetCDF file
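
A small sketch of reading a raster as a stars data cube, using the example GeoTIFF shipped with stars:

library(stars)
L7 = read_stars(system.file("tif/L7_ETMs.tif", package = "stars"))
L7                # a data cube with dimensions x, y and band
st_dimensions(L7) # dimension table: offset, delta, reference system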

raster grid types

spatiotemporal (S/T) data

  • some S/T data have irregular space-time locations, typically in cases where time and location are of primary interest (events): earthquakes, floods, lightning strikes, disease cases, deaths
  • often, such data is aggregated to counts/frequencies/densities per time period and by area, either for privacy or for convenience
  • other data is collected at regular space/time combinations (e.g. sensor data, satellite imagery)
  • stars, a package for raster and vector data cubes, was developed to handle this second case (time instances/periods are identical for all spatial locations/areas)
  • a common action is changing the spatial and/or temporal resolution (interpolation, downsampling, aggregation); see the sketch below
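
A sketch of lowering the spatial resolution of a raster cube, assuming stars::st_downsample behaves as documented:

library(stars)
L7 = read_stars(system.file("tif/L7_ETMs.tif", package = "stars"))
st_downsample(L7, 2) # coarser cube: skip two of every three values along each dimension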

Data concepts

Data are composed of discrete entities ("numbers, bits")

"Real life" is

  • partly discrete (person, house)
  • partly less discrete (mountain, flood)
  • partly continuous (elevation, air temperature)

S. Scheider, B. Graeler, E. Pebesma, C. Stasch, 2016. Modelling spatio-temporal information generation. International Journal of Geographical Information Science 30(10), 1980-2008. https://dx.doi.org/10.1080/13658816.2016.1151520 (open access)

Data concepts (2)

Higher-level concepts:

  • Reference systems: Space \(S\), Time \(T\), Discrete \(D\), Quality \(Q\)
  • Region \(R\), Period \(P\)
  • Object, event \(D \Rightarrow (S,T) \Rightarrow Q\) (marked point pattern)
  • Field, Coverage: \(S \Rightarrow Q\), \((S,T) \Rightarrow Q\)
  • Trajectory: \(T \Rightarrow S \Rightarrow Q\)
  • Aggregation: \((R,P) \Rightarrow Q\)

The concept of a function (\(\Rightarrow\)) is largely absent from GIS and database representations, as is that of an aggregation as a type.

Exercises (2)

  • Read the file system.file("tif/L7_ETMs.tif", package = "stars") with stars::read_stars
  • print the dataset, and describe what it contains
  • plot the dataset; how are the color breaks chosen?
  • find the maximum pixel value for each of the bands (st_apply)
  • find and plot the maximum reflectance over all bands for all pixels
  • find and plot the mean reflectance over all bands for all pixels
  • plot the subset of these data with the first 10 rows, the second 10 columns, and the last 3 bands
  • use st_crop to crop the dataset to the area 300 m around POINT (293749.5 9115745)
  • is this cropped dataset still a (rectangular) raster?