Jan 17, 2019

The good news

  • we've seen a strong uptake of sf, by both users and developers
  • we improved interoperability with spatial databases, GeoJSON etc., and improved coordinate reference system handling (PROJ)
  • spatial indexes are computed on the fly, e.g. by st_intersects, st_intersection etc.
  • geom_sf is now in ggplot2, and takes care of (re)projecting incompatible layers
  • attribute-geometry relationships:

suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(sf))
system.file("gpkg/nc.gpkg", package="sf") %>% read_sf -> nc
nc2 <- nc %>% select(SID74, SID79) %>% gather(VAR, SID, -geom)
ggplot() + geom_sf(data = nc2, aes(fill = SID)) + facet_wrap(~VAR, ncol = 1)

What about time, and raster?

  • if we have time-dependent data, do we spread it over multiple columns, or recycle feature geometries in a long ("tidy") table?
  • (gg)plotting multiple attributes in facets requires recycling of geometries (tidyr::gather)
  • sf has no raster support, no solid vector \(\leftrightarrow\) raster integration
  • you could emulate rasters in sf, but that won't get you far
  • Why raster data?
  • variables that vary continuously over space (time) can be represented more naturally, by regular sampling (think: images)
  • Earth Observation data (visible: land, ocean; temperature; radar; atmospheric composition) come down freely; Copernicus now delivers 150 Tb/day
  • Climate model data, also free, is large; CMIP6 will generate 15-30 Pb
  • (numerous other sources)

Data cubes

Data cubes are array data where values (or records) are given for each combination of the dimension values. Examples:

  • sales by product, store, and week
  • population by gender, age class, region, and census round
  • temperature by x-, y-, z-coordinate and time
  • screen pixel color as a function of row, column and time
  • forecast by time-of-forecast, time-to-forecast (and variable, x, y, and z)
  • Why not handle them entirely as tidy (long) tables?
  • storage, speed, indexes (see the sketch after this list)
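
A minimal sketch, not from the slides, of the storage argument: an array stores only the values, while the equivalent long table repeats every dimension value for each record (exact sizes vary by platform).

a <- array(rnorm(100 * 100 * 365), dim = c(100, 100, 365)) # e.g. x, y, day-of-year
format(object.size(a), units = "Mb")                    # values only, dimensions implicit
long <- as.data.frame.table(a, responseName = "value")  # long table: one row per cell
format(object.size(long), units = "Mb")                 # several times larger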

Support in base and dplyr

base::array has powerful infrastructure for this (see the sketch after this list):

  • dimnames to label dimension values (only character)
  • apply: apply functions to a dimension or sets of dimensions
  • [ subsetting (slice, crop)
  • abind::abind, aperm, adrop
  • arrays are atomic
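
A small base-R sketch of these pieces; the array, its dimension names, and the operations are just illustrations:

a <- array(1:24, dim = c(4, 3, 2),
  dimnames = list(x = paste0("x", 1:4), y = paste0("y", 1:3), time = c("t1", "t2")))
a["x2", , "t1"]               # [ subsetting: slice by (character) dimension values
apply(a, c("x", "y"), sum)    # apply over x and y, aggregating out time
aperm(a, c("time", "x", "y")) # permute dimension order (base; abind adds abind, adrop)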

dplyr::tbl_cube has some further support (see the sketch after this list):

  • list of base::arrays with measurements
  • this mimics heterogeneous array records
  • list of vectors with dimension values
  • allows for e.g. Date or POSIXct dimension values
  • allows for filter on array dimensions, and group_by aggregations
  • no support for dimension list-vectors (e.g., sfc)
  • no support for regular dimensions
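
A brief sketch using the nasa dataset shipped with dplyr at the time, a tbl_cube with dimensions lat, long, month and year (nasa and tbl_cube later moved to the cubelyr package):

library(dplyr)
nasa                                      # tbl_cube: dims lat, long, month, year
nasa %>% filter(month == 1, year == 2000) # filter on dimension values
nasa %>% group_by(lat, long) %>%
  summarise(temp = mean(temperature))     # aggregate the measure over month and year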

Raster- and vector data cubes: stars

stars data cubes have as special cases

  • Raster data cubes, where x and y take a spatial dimension, and
  • Vector data cubes, where a set of feature geometries (points, lines, polygons) form the values of (at least) one dimension

and further

  • follow tbl_cube: a list of arrays and a set of dimensions
  • have a dimensions table (see the sketch after this list) with
    • dimensions as regular intervals (offset, cellsize)
    • dimensions as values (e.g. dates, also feature geometries)
    • measurement units, coordinate reference systems (PROJ)
  • read and write any format supported by GDAL
  • can read NetCDF directly
  • supports out-of-core (stars_proxy), and
  • will support cloud processing
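
A quick look at such a dimensions table, using the small Landsat example shipped with stars (the same dataset used in the next section):

library(stars)
x <- read_stars(system.file("tif/L7_ETMs.tif", package = "stars"))
st_dimensions(x) # per dimension: from/to, offset, delta (cellsize), refsys, values
st_crs(x)        # the coordinate reference system (PROJ)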

Raster types

suppressPackageStartupMessages(library(stars))
library(viridis)
## Loading required package: viridisLite
system.file("tif/L7_ETMs.tif", package = "stars") %>% read_stars() -> x
ggplot() + geom_stars(data = x) + coord_equal() + 
    facet_wrap(~band) + theme_void() + scale_fill_viridis() +
    scale_x_discrete(expand=c(0,0)) + scale_y_discrete(expand=c(0,0))

# s2: file name of a (10980 x 10980 pixel, 4-band, 10 m) Sentinel-2 image, defined elsewhere
r = read_stars(s2, proxy = TRUE) # proxy = TRUE: read metadata only, not the pixel values
dim(r)
##     x     y  band 
## 10980 10980     4
plot(r)

# S2 10m: band 4: near infrared, band 1: red.
ndvi = function(x) (x[4]-x[1])/(x[4] + x[1])
system.time(s2.ndvi <- st_apply(r, c("x", "y"), ndvi)) # lazy: nothing is computed yet
##    user  system elapsed 
##       0       0       0
plot(s2.ndvi) # modifies execution order: reads downsampled, computes ndvi, plots
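
Fetching the result explicitly works the same way: st_as_stars() on a stars_proxy evaluates the pending ndvi call, here at reduced resolution (the downsample value is just an example).

s2.small = st_as_stars(s2.ndvi, downsample = 10) # read downsampled, compute ndvi, keep in memory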

Cloud proxies; data cube views

Cloud proxy:

  • the actual imagery is in the cloud
  • proxy objects describe it
  • only the data viewed, or downloaded, is computed/retrieved

Data cube view:

  • dimension settings are no longer derived from imagery, but set independently
  • imagery is resampled on-the-fly to match the cube dimensions (see the sketch after this list)
  • allows integrating non-aligned time series, geometries, and coordinate reference systems (different sensors, or e.g. UTM zones for Sentinel2)
  • largely inspired and pioneered by Google Earth Engine
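
In stars terms, a minimal sketch of that idea: define the target cube geometry independently of the imagery, then resample onto it; the target grid below (0.005 degree cells in long/lat) is made up for illustration.

library(stars)
x = read_stars(system.file("tif/L7_ETMs.tif", package = "stars"))
# target grid chosen independently of the image: its bounding box in long/lat, 0.005 degree cells
target = st_as_stars(st_bbox(st_transform(st_as_sfc(st_bbox(x)), 4326)),
  dx = 0.005, dy = 0.005)
st_warp(x, target) # imagery is resampled on-the-fly to match the target dimensions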

Non-data cube extensions

  • spatio-temporal events (point patterns)
  • spatial networks, routing (tidygraph)
  • movement data (trajectories)

Hurricane dataset

  • rectilinear long/lat grid such that cells have approximately constant area
  • constant time intervals
  • \(\Rightarrow\) counts are proportional to densities
  • st_intersects only counts hits, and ignores length/duration/strength (see the sketch below)
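
A hedged sketch of that last point, with hypothetical objects: grd a polygon grid (an sfc) and tracks the hurricane track geometries; every intersecting track adds one to a cell's count, however short, brief or weak it was there.

counts = lengths(st_intersects(grd, tracks)) # hits per grid cell only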

Concluding

  • Spatial data science is an exciting field, full of data, maps, challenges, and tidyverse extensions
  • With sf we extended tables to spatial tables
  • with stars we extend that to raster and vector data cubes, including spatial time series as a special case
  • stars can do out-of-core raster processing, is lazy, and will address cloud raster processing
  • we can analyse big image collections interactively only when we downsample
  • read more about this in the Spatial Data Science upcoming book: https://www.r-spatial.org/book/
  • slides link: @edzerpebesma
  • Spatial Birds-of-Feather meeting: Tomorrow, 8 a.m., breakfast