Before we start

What are spatial data?

  • One could argue that all data are spatial: how can one observe something that is not somewhere?
  • However, only data for which the location of observation is registered, referenced, and of (some) relevance to the information contained are considered spatial data

Overview

  • data structures vs. data concepts
  • the software landscape
  • workflows; file formats
  • handling attributes (properties) of geometries
  • plotting maps
  • measurement units and coordinate reference systems
  • spatiotemporal data, raster and vector data cubes

data structures vs. data concepts

  • data structures: how we represent data in the computer (or on paper)
  • data concepts: what we mean by these representations

Example data structure: a vector, array, or list

Example data concept: an object, an event, a field or coverage, a trajectory, an administrative division, the outcome of an election, a hurricane. We will come back to this.

Simple features

See https://journal.r-project.org/archive/2018/RJ-2018-009/

Simple features are (a data model for) things that have a geometry and properties.

Simple feature geometry refers to the geometry of a simple feature.

The simple refers to lines (and polygon boundaries) being represented as sequences of points connected by straight lines.

For some phenomena, this "thing data model" works

  • badly: for continuously varying spatial properties, like air temperature
  • only moderately well: for categorical/discrete properties varying continuously over space, such as land use

The common geometries

  • POINT (0 1)
  • MULTIPOINT (0 1,1 1,-1 2)
  • LINESTRING (0 0,1 1)
  • MULTILINESTRING ((0 0,1 1),(2 2,0 2))
  • POLYGON ((0 0,3 0,3 3,0 3,0 0),(1 1,2 1,2 2,1 2,1 1))
  • MULTIPOLYGON (((0 0,1 0,1 1,0 1,0 0)),((2 2,3 2,3 3,2 3,2 2)))
  • GEOMETRYCOLLECTION(POINT(0 1), LINESTRING(0 0,1 1))
  • POINT EMPTY

These encodings are called Well-known text (WKT)
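
In R, the sf package parses WKT with st_as_sfc and writes it back with st_as_text; a minimal sketch:

library(sf)
g = st_as_sfc(c("POINT (0 1)", "LINESTRING (0 0,1 1)")) # parse WKT into geometry
st_as_text(g)                                           # back to WKT strings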

Binary encodings for foreign I/O

Well-known binary (WKB):

library(sf)
## Linking to GEOS 3.7.0, GDAL 2.4.0, PROJ 5.2.0
st_as_binary(st_as_sfc("POINT (0 1)"))[[1]]
##  [1] 01 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f0 3f
st_as_binary(st_as_sfc("LINESTRING (0 0,1 1)"))[[1]]
##  [1] 01 02 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
## [24] 00 00 00 00 00 00 00 00 f0 3f 00 00 00 00 00 00 f0 3f

WKB:

  • used for I/O to and from GDAL, GEOS and liblwgeom
  • used for (direct) I/O to databases
  • not human readable, but
  • fast and lossless
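
A round-trip sketch, assuming st_as_sfc accepts the WKB list returned by st_as_binary (st_as_text is used only to print the result as WKT):

g = st_as_sfc("LINESTRING (0 0,1 1)") # from WKT
wkb = st_as_binary(g)                 # encode as WKB: a list of raw vectors
st_as_text(st_as_sfc(wkb))            # decode again, shown as WKT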

Representation in R (by sf):

str(st_as_sfc("POINT (0 1)")[[1]])
##  'XY' num [1:2] 0 1
str(st_as_sfc("LINESTRING (0 0,1 1,2 2)")[[1]])
##  'XY' num [1:3, 1:2] 0 1 2 0 1 2
  • POINT is a numeric vector
  • a point set is a matrix (1 point per row)
  • a set of anything else is a list
  • classes disambiguate identical storage form (e.g. LINESTRING and MULTIPOINT, or POLYGON and MULTILINESTRING)
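
These building blocks can also be created directly from vectors, matrices and lists; a small sketch:

st_point(c(0, 1))                                # numeric vector -> POINT
st_multipoint(rbind(c(0, 1), c(1, 1), c(-1, 2))) # matrix -> MULTIPOINT
st_linestring(rbind(c(0, 0), c(1, 1)))           # same storage form, different class
st_polygon(list(rbind(c(0, 0), c(3, 0), c(3, 3), c(0, 3), c(0, 0)))) # list of rings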

sf class system

  • single feature geometry: sfg
class(st_linestring(rbind(c(0,0), c(1,1))))
## [1] "XY"         "LINESTRING" "sfg"
  • set of single feature geometries: sfc (keeps bbox, CRS)
class(st_sfc(st_point(c(0,1)), st_point(c(5,5)), crs = 4326))
## [1] "sfc_POINT" "sfc"
  • set of single features (geom + attributes): sf
class(st_sf(attr = "a", geom = st_sfc(st_point(c(0,1)))))
## [1] "sf"         "data.frame"

Empty geometries

Empty geometries reflect a missing geometry. E.g.

st_intersection(st_point(c(0,1)), st_point(c(2,2)))
## GEOMETRYCOLLECTION EMPTY
  • "there is no geometry for which …"
  • these are important for type safety
  • unclear why they are typed: what makes an empty point different from an empty linestring?
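
Empty geometries can be detected with st_is_empty; a minimal sketch:

e = st_intersection(st_point(c(0, 1)), st_point(c(2, 2)))
st_is_empty(st_sfc(e)) # TRUE: the intersection is empty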

Mixes: 1. at the single feature level:

gc = st_union(st_point(c(0,1)), st_linestring(rbind(c(0,0), c(1,1))))
gc
## GEOMETRYCOLLECTION (POINT (0 1), LINESTRING (0 0, 1 1))
st_sf(attr = "a", geom = st_sfc(gc))
## Simple feature collection with 1 feature and 1 field
## geometry type:  GEOMETRYCOLLECTION
## dimension:      XY
## bbox:           xmin: 0 ymin: 0 xmax: 1 ymax: 1
## epsg (SRID):    NA
## proj4string:    NA
##   attr                           geom
## 1    a GEOMETRYCOLLECTION (POINT (...

Mixes: 2. at the feature set level:

g = st_sfc(st_point(c(0,1)), st_linestring(rbind(c(0,0), c(1,1))))
st_sf(attr = c("a", "b"), geom = g)
## Simple feature collection with 2 features and 1 field
## geometry type:  GEOMETRY
## dimension:      XY
## bbox:           xmin: 0 ymin: 0 xmax: 1 ymax: 1
## epsg (SRID):    NA
## proj4string:    NA
##   attr                  geom
## 1    a           POINT (0 1)
## 2    b LINESTRING (0 0, 1 1)

Operations on geometries

Geometry operations can be:

  • predicates: return TRUE or FALSE, e.g. st_intersects
  • measures: return a measure (value, possibly with unit), e.g. st_area
  • geometry generating operations, e.g. st_intersection

They have an arity:

  • unary: operates on a single geometry, e.g. st_area
  • binary: operates on a pair, e.g. st_overlaps
  • n-ary: operates on a set, e.g. st_intersection

Chapter 5 of Spatial Data Science details all the combinations.

If this all looks complicated, do compare it to how a GIS works.
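
A small sketch of the three kinds of operations, using two overlapping squares:

a = st_as_sfc("POLYGON ((0 0,2 0,2 2,0 2,0 0))")
b = st_as_sfc("POLYGON ((1 1,3 1,3 3,1 3,1 1))")
st_intersects(a, b, sparse = FALSE) # predicate: TRUE
st_area(a)                          # measure: 4 (no CRS, hence no unit)
st_intersection(a, b)               # geometry generating: the overlapping square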

the R spatial package landscape: from sp to sf

The "sp stack" (2005-…) : sp, rgdal, rgeos, raster, … (sp: 550 revdeps)

The "sf stack" (2016-…): sf, stars, mapview, tmap, gstat, spdep, … (sf: 150 revdeps)

sp vs. sf

comparison:

  • sf can, sp cannot combine geometry types in a GEOMETRY
  • sf can, sp cannot return GEOMETRYCOLLECTION
  • sf can, sp cannot handle empty geometries
  • sf can, sp cannot distinguish between POLYGON and MULTIPOLYGON etc.
  • sp has an ambiguous data model for holes in polygons; sf does not
  • sp's SpatialXxxDataFrame objects are not data.frames, and as a consequence sp objects do not work well with tidyverse commands
  • sf has full ggplot2 support (geom_sf)

moving between sp and sf

geom = st_sfc(st_point(0:1), st_point(1:2), crs = 4326)
x = st_sf(attr = c("a", "b"), geom = geom)
library(sp)
x.sp = as(x, "Spatial") # sf -> sp
x2 = st_as_sf(x.sp) # and back: sp -> sf
all.equal(x, x2, check.attributes = FALSE)
## [1] TRUE
  • This works for objects that can be converted, and gives an error otherwise.
  • LINESTRING and POLYGON geometries come back as MULTILINESTRING and MULTIPOLYGON, respectively.

the sf dependency landscape

Exercise (1)

  • use the dataset found in system.file("shape/nc.shp", package="sf")
  • Read this dataset with sf::st_read (or sf::read_sf)
  • plot the dataset, then plot the second attribute
  • try to manipulate the color scheme and color breaks
  • plot the first 20 features of the dataset
  • create a plot with axes
  • what is the coordinate reference system of the dataset?
  • For the nc dataset, compute the area of the polygons with st_area
  • compare the areas with the AREA field in the dataset; is there a discrepancy?
  • select the polygon that intersects with POINT (-81.49826 36.4314)
  • Repeat the above steps (or similar ones) with a dataset of your own

workflows

Typical workflow:

  • read: load data from external files or database
  • compute: carry out computations
  • write: export results (new data "layers", summary statistics, figures)

Recommended:

  • put everything in an R script
  • even better: also put comments in the R script
  • much better: work from R-Markdown files, write executable documents

workflows without side effects

Avoid side effects (anything not in the script/R Markdown file):

  • NEVER read an .RData file at startup (RStudio option; R --vanilla)
  • NEVER write an .RData file on exit (RStudio option; R --vanilla)
  • do not put options with side effects in .Renviron / .Rprofile (I only use these for graphics options or user credentials)

Reproducible workflows

  • avoid using setwd() at all times
  • avoid using rm(list = ls()) at the start of the script; instead, assume a clean workspace
  • put all library(xxx) statements at the start
  • work in a single directory (+subdirectories if needed)
  • put a .Rproj rstudio project file in the directory
  • zip the directory to share/archive/backup, including the result (html/pdf)

So your colleague (or later self) can:

  • open the zip file
  • open the .Rproj file
  • open the .Rmd file, and click knit

sessionInfo()

R and R package versions matter. Use

sessionInfo()

or

devtools::session_info()

at the end of every .Rmd file, so that it is documented which versions of everything were used. This may help diagnose issues later on.

attributes of geometries

  • what is the bedrock type along the grey road?
  • what is the population density along the grey road?

support of data

  • a lot of data is not registered for time instances or spatial points, but rather for time periods and/or spatial areas; this is referred to as the support of the data
  • one can argue that all observations must refer to periods/areas, because observation takes time and space, but that does not help much
  • making the support explicit may help avoid making meaningless analyses
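
A sketch of changing support: county-level counts aggregated to a coarser, arbitrary 5 x 3 grid of areas:

library(sf)
nc = read_sf(system.file("gpkg/nc.gpkg", package = "sf"))
grd = st_make_grid(nc, n = c(5, 3)) # coarser target areas
aggregate(nc["BIR74"], grd, sum)    # sum the counts over the new support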

attribute-geometry relationships

suppressPackageStartupMessages(library(tidyverse))
nc = read_sf(system.file("gpkg/nc.gpkg", package="sf")) %>%
        select(CNTY_, BIR74)
st_agr(nc) = c(CNTY_ = "identity", BIR74 = "aggregate")
pt = st_as_sfc("POINT (-81.49826 36.4314)", crs = st_crs(nc))
i = st_intersection(st_sf(geom = pt), nc)
## although coordinates are longitude/latitude, st_intersection assumes that they are planar
## Warning: attribute variables are assumed to be spatially constant
## throughout all geometries

We don't get this warning if none of the attributes is an aggregate; we're faking this by

st_agr(nc) = c(CNTY_ = "identity", BIR74 = "constant")
i = st_intersection(st_sf(geom = pt), nc)
## although coordinates are longitude/latitude, st_intersection assumes that they are planar

plotting maps

  • plotting spatial data: base, ggplot2::geom_sf, raster/stars, mapview, tmap
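
A minimal sketch using base plot and ggplot2 on the nc dataset that ships with sf:

library(sf)
nc = read_sf(system.file("gpkg/nc.gpkg", package = "sf"))
plot(nc["BIR74"])                       # base plot method for sf objects
library(ggplot2)
ggplot(nc) + geom_sf(aes(fill = BIR74)) # ggplot2 equivalent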

measurement units and coordinate reference systems

  • attributes: ordinal, nominal, interval, ratio
  • reference systems: units of measurement, datums, coordinate reference systems, calendars, ontologies, …
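
Measures come back with their measurement unit attached (handled by the units package); a small sketch:

library(sf)
nc = read_sf(system.file("gpkg/nc.gpkg", package = "sf"))
a = st_area(nc[1, ])      # area of the first county, in m^2
units::set_units(a, km^2) # convert the measurement unit to km^2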

coordinate transformations

  • st_crs() shows, or sets, the coordinate reference system
  • st_transform() carries out coordinate transformations, using PROJ strings or EPSG codes (numbers), e.g.
nc = read_sf(system.file("gpkg/nc.gpkg", package="sf"))
nc.tr = st_transform(nc, 2264)
st_crs(nc.tr)
## Coordinate Reference System:
##   EPSG: 2264 
##   proj4string: "+proj=lcc +lat_1=36.16666666666666 +lat_2=34.33333333333334 +lat_0=33.75 +lon_0=-79 +x_0=609601.2192024384 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=us-ft +no_defs"
st_crs(nc.tr)$units
## [1] "us-ft"

Raster data

  • sp has some elementary infrastructure for this (in-memory)
  • package raster is scalable, solid, and offers extensive functionality
  • raster data model: single raster layers, or a stack of raster layers, possibly time-referenced (max: 3-D).
  • package stars tries to break out of this limitation, and covers raster and vector data cubes
  • a recurring challenge is out-of-core computation (data on disk); a generic implementation that is still performant is difficult
  • e.g. doing time series analysis on a series of NetCDF files, with one time instance per NetCDF file
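
A small sketch of reading a raster as a stars data cube, using the example GeoTIFF shipped with stars:

library(stars)
L7 = read_stars(system.file("tif/L7_ETMs.tif", package = "stars"))
L7                # a data cube with dimensions x, y and band
st_dimensions(L7) # dimension table: offset, delta, reference system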

raster grid types

spatiotemporal (S/T) data

  • some S/T data have irregular space-time locations, typically in cases where time and location are of primary interest (events): earthquakes, floods, lightning strikes, disease cases, deaths
  • often, such data is aggregated to counts/frequencies/densities per time period and by area, either for privacy or for convenience
  • other data is collected at regular space/time combinations (e.g. sensor data, satellite imagery)
  • stars, a package for raster and vector data cubes, was developed to handle this second case (time instances/periods are identical for all spatial locations/areas)
  • a common action is changing the spatial and/or temporal resolution (interpolation, downsampling, aggregation); see the sketch below
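
A sketch of lowering the spatial resolution of a raster cube, assuming stars::st_downsample behaves as documented:

library(stars)
L7 = read_stars(system.file("tif/L7_ETMs.tif", package = "stars"))
st_downsample(L7, 2) # coarser cube: skip two of every three values along each dimension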

Data concepts

Data are composed of discrete entities ("numbers, bits")

"Real life" is

  • partly discrete (person, house)
  • partly less discrete (mountain, flood)
  • partly continuous (elevation, air temperature)

S. Scheider, B. Graeler, E. Pebesma, C. Stasch, 2016. Modelling spatio-temporal information generation. International Journal of Geographical Information Science 30(10), 1980-2008. https://dx.doi.org/10.1080/13658816.2016.1151520 (open access)

Data concepts (2)

Higher-level concepts:

  • Reference systems: Space \(S\), Time \(T\), Discrete \(D\), Quality \(Q\)
  • Region \(R\), Period \(P\)
  • Object, event \(D \Rightarrow (S,T) \Rightarrow Q\) (marked point pattern)
  • Field, Coverage: \(S \Rightarrow Q\), \((S,T) \Rightarrow Q\)
  • Trajectory: \(T \Rightarrow S \Rightarrow Q\)
  • Aggregation: \((R,P) \Rightarrow Q\)

The concept of a function (\(\Rightarrow\)) is largely absent from GIS and database representations, as is that of an aggregation as a type.

Exercises (2)

  • Read the file system.file("tif/L7_ETMs.tif", package = "stars") with stars::read_stars
  • print the dataset, and describe what it contains
  • plot the dataset; how are the color breaks chosen?
  • find the maximum pixel value for each of the bands (st_apply)
  • find and plot the maximum reflectance over all bands for all pixels
  • find and plot the mean reflectance over all bands for all pixels
  • plot the subset of these data with the first 10 rows, the second 10 columns, and the last 3 bands
  • use st_crop to crop the dataset to the area 300 m around POINT (293749.5 9115745)
  • is this cropped dataset still a (rectangular) raster?