Spatial Data Science in the Tidyverse

Jan 17, 2019

The good news

we've seen a strong uptake of sf, by both users and developers
we improved interoperability with spatial databases, geojson etc., also of coordinate reference system handling (PROJ)
spatial indexes are computed on-the-fly e.g. by st_intersects, st_intersection etc.
geom_sf is now in ggplot2, and takes care of (re)projecting incompatible layers
attribute-geometry relationships:

suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(sf))
system.file("gpkg/nc.gpkg", package="sf") %>% read_sf -> nc
nc2 <- nc %>% select(SID74, SID79) %>% gather(VAR, SID, -geom)
ggplot() + geom_sf(data = nc2, aes(fill = SID)) + facet_wrap(~VAR, ncol = 1)

What about time, and raster?

if we have time-dependent data, do we put it over multiple columns? or recycle feature geometries in a long ("tidy") table?
(gg)plotting multiple attributes in facets requires recycling of geometries (tidyr::gather)
sf has no raster support, no solid vector \(\leftrightarrow\) raster integration
you could emulate them, but that won't bring you far

Why raster data?
variables that vary continuously over space (time) can be represented more naturally, by regular sampling (think: images)
Earth Observation data (visible: land, ocean; temperature; radar; atmospheric composition) come down, freely, in Copernicus now with 150 Tb/day
Climate model data, also free, is large; CMIP6 will generate 15-30 Pb
(numerous other sources)

link to tweet

Data cubes

Data cubes are array data where values (or records) are given for each combination of the dimension values. Examples:

sales by product, store, and week
population by gender, age class, region, and census round,
temperature by x-, y-, z-coordinate and time
screen pixel color as a function of row, column and time
forecast by time-of-forcast, time-to-forecast (and variable, x, y, and z)

Why not handle them entirely as tidy (long) tables?
storage, speed, indexes

support in `base` and `dplyr`

base::array has powerfull infrastructure for this:

dimnames to label dimension values (only character)
apply: apply functions to a dimension or sets of dimensions
[ subsetting (slice, crop)
abind:: abind, aperm, adrop
arrays are atomic

dplyr::tbl_cube has some further support

list of base::arrays with measurements
this mimics heterogeneous array records
list of vectors with dimension values
allows for e.g. Date or POSIXct dimension values
allows for filter on array dimensions, and group_by aggregations
no support for dimension list-vectors (e.g., sfc)
no support for regular dimensions

Raster- and vector data cubes: `stars`

stars data cubes have as special case

Raster data cubes, where x and y take a spatial dimension, and
Vector data cubes, where a set of feature geometries (points, lines, polygons) form the values of (at least) one dimension

and further

follow tbl_cube: a list of arrays and a set of dimensions
have a dimensions table with
- dimensions as regular intervals (offset, cellsize)
- dimensions as values (e.g. dates, also feature geometries)
- measurement units, coordinate reference systems (PROJ)
read and write any format supported by GDAL
can read (directly) netcdf
supports out-of-core (stars_proxy), and
will support cloud processing

Raster types

suppressPackageStartupMessages(library(stars))
library(viridis)
## Loading required package: viridisLite
system.file("tif/L7_ETMs.tif", package = "stars") %>% read_stars() -> x
ggplot() + geom_stars(data = x) + coord_equal() + 
    facet_wrap(~band) + theme_void() + scale_fill_viridis() +
    scale_x_discrete(expand=c(0,0)) + scale_y_discrete(expand=c(0,0))

r = read_stars(s2, proxy = TRUE)
dim(r)
##     x     y  band 
## 10980 10980     4
plot(r)

# S2 10m: band 4: near infrared, band 1: red.
ndvi = function(x) (x[4]-x[1])/(x[4] + x[1])
system.time(s2.ndvi <- st_apply(r, c("x", "y"), ndvi))
##    user  system elapsed 
##       0       0       0
plot(s2.ndvi) # modifies execution order: reads downsampled, computes ndvi, plots

Cloud proxies; data cube views

Cloud proxy:

the actual imagery is in the cloud
proxy objects describe it
only the data viewed, or downloaded, is computed/retrieved

Data cube view:

dimension settings are no longer derived from imagery, but set independently
imagery is resampled on-the-fly to match the cube dimensions
allows integrating non-aligned time series, geometries, and coordinate reference systems (different sensors, or e.g. UTM zones for Sentinel2)
largely inspired and pioneered by Google Earth Engine

None-data cube extensions

spatio-temporal events (point patterns)
spatial networks, routing (tidygraph)
movement data (trajectories)

Hurricane dataset

https://www.nhc.noaa.gov/data/#hurdat

rectilinear long/lat grid such that cells have aproximately constant area
constant time intervals
\(\Rightarrow\) counts are proportional to densities
st_intersects only counts hits, and ignores length/duration/strength

Concluding

Spatial data science is an exciting field, full with data, maps, challenges, and tidyverse extensions
With sf we extended tables to spatial tables
with stars we extend that to raster and vector data cubes, including spatial time series as special case
stars can do out-of-core raster, is lazy, will address cloud rasters processing
we can analyse big image collections interactively only when we downsample
read more about this in the Spatial Data Science upcoming book: https://www.r-spatial.org/book/
slides link: @edzerpebesma

Spatial Birds-of-Feather meeting: Tomorrow, 8 a.m., breakfast