May 22, 2019, Jena edzer.github.io/jena/ (rmd)

What are data cubes?

  • They are annotated array data.
  • Arrays map from \(n\) dimensions to \(p\) attributes: \[(D_1,D_2,...,D_n)\rightarrow (A_1,A_2,...,A_p)\]
  • array dimensions are enumerated sets, and may
    • represent a set of entities (cities, persons, species, spectral bands)
    • discretize a continuous variable, such as
      • spatial dimension,
      • time,
      • time-of-day,
      • frequency or wavelength,
      • duration (e.g., time-to-forecast)

“Cube”:

We use the word cube as shorthand for:

  • “real” cube: three dimensions
  • more than three dimensions: hypercube
  • two dimensions: matrix, raster

A special case is:

  • one dimensional cube: table with records (data.frame, tibble)
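In this sense, a data.frame is a one-dimensional cube: a single (record) dimension maps to \(p\) attributes. A minimal base-R illustration (the values are made up):

```r
# one dimension (records) mapping to p = 2 attributes
df = data.frame(id = c("a", "b", "c"), value = c(1.5, 2.0, 2.5))
nrow(df)  # n = 3: length of the single (record) dimension
ncol(df)  # p = 2: number of attributes
```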

“Spatial” data cube

Cube dimensions can refer to spatial dimensions in several ways:

  • each continuous spatial dimension (x, y, z) maps to a single cube dimension (raster data cubes), e.g. regular grids
    • \(c(i) = o + (i-1) \times \delta, \ \ i = 1,2,...,n\)
    • index space is continuous: integer \(i\) implies \([i,i+1)\)
    • this means that every coordinate value maps to a unique index (unlike polygons)
  • all spatial dimensions map to a single cube dimension (vector data cubes)
    • cube dimension is a set of points/lines/polygons
  • combinations of these: e.g. origin-destination matrices
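The regular-grid mapping \(c(i) = o + (i-1)\times\delta\) can be sketched in base R; the origin and cell size below are illustrative numbers, not from any particular dataset:

```r
# c(i) = o + (i - 1) * delta: coordinate of the start of cell i
cell_coord = function(i, o, delta) o + (i - 1) * delta

# inverse: integer i implies [i, i+1), so every coordinate
# maps to exactly one index (unlike overlapping polygons)
coord_index = function(x, o, delta) floor((x - o) / delta) + 1

cell_coord(1, o = 0, delta = 28.5)    # 0: the first cell starts at the origin
coord_index(30, o = 0, delta = 28.5)  # 2: 30 falls in the cell [28.5, 57)
```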

Raster - vector conversion

From raster to vector:

  • polygons or points are given:
    • sample (“extract”)
    • aggregate, e.g. mean or sum over polygon
  • polygons or lines are the result:
    • polygonize
    • contour

From vector to raster:

  • (points/polygons:) rasterize
  • (point sample:) interpolate
  • (point pattern:) density
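Some of these conversions can be sketched with stars and sf (a sketch, assuming both packages are installed; the nc dataset ships with sf):

```r
library(stars)
library(sf)
nc = read_sf(system.file("gpkg/nc.gpkg", package = "sf"))
r = st_rasterize(nc["BIR74"])             # vector -> raster: rasterize polygons
a = aggregate(r, nc, mean, na.rm = TRUE)  # raster -> vector: mean per polygon
```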

Raster types

R package raster

  • is legacy, well understood, scales (automatically), powerful
  • “limits” data size to data on (local) disk; will auto-chunk
  • only implements regular raster, or a stack of these
  • stack can be time-related
  • pixels are scalar, real-valued (factors are mimicked through a custom mechanism)

  • Is anything wrong with R’s native arrays? No!
  • dimensions can be named, dimension values labeled:
array(1:8/8, c(2,2,2), dimnames = list(dim_1=c("a","b"), dim_2=c("p", "q"), 
     dim_3=c("x", "y")))
## , , dim_3 = x
## 
##      dim_2
## dim_1     p     q
##     a 0.125 0.375
##     b 0.250 0.500
## 
## , , dim_3 = y
## 
##      dim_2
## dim_1     p     q
##     a 0.625 0.875
##     b 0.750 1.000

  • but labels can only be character
  • dplyr::tbl_cube improves on this, but still only accepts vectors, and does not handle regular or rectilinear boundaries, CRS, units, sheared/rotated grids, curvilinear grids, …
  • arrays can only (sensibly) be atomic: no records with mixed fields
  • arrays are in-memory (which may at some point change, e.g. using ALTREP)

R package stars

  • a stars object is a set (list) of arrays with possibly varying type (numeric, integer, factor, logical, character, list)
  • uses R’s native arrays for that: all array Ops work (pixel math)
  • supports arrays with measurement units and time stamps (through GDAL, libnetcdf)
  • has a dimensions attribute “table”
  • does everything in memory, unless you use stars_proxy, which does nothing in memory
  • lets you define arbitrary number of dimensions
  • slice & dice with [, or filter
  • map/reduce with st_apply: apply a function to a (set of) dimension(s)
  • aggregate for spatial / temporal aggregations
  • supports rectilinear (all dims), rotated, sheared, and curvilinear grids (raster), and simple features (vector)

R package stars (2)

  • implements raster (GDAL, netcdf) and vector (sfc) data cubes
  • full support for PROJ coordinate reference systems
  • time support: POSIXct, Date, as well as PCICt (360, 365, noleap)
  • integration with some tidyverse verbs (filter, select, mutate, geom_stars)
  • integrates with gstat (spatial, spatiotemporal geostatistics)

library(stars)
tif = system.file("tif/L7_ETMs.tif", package = "stars")
(x = read_stars(tif))
## stars object with 3 dimensions and 1 attribute
## attribute(s):
##   L7_ETMs.tif    
##  Min.   :  1.00  
##  1st Qu.: 54.00  
##  Median : 69.00  
##  Mean   : 68.91  
##  3rd Qu.: 86.00  
##  Max.   :255.00  
## dimension(s):
##      from  to  offset delta                       refsys point values    
## x       1 349  288776  28.5 +proj=utm +zone=25 +south... FALSE   NULL [x]
## y       1 352 9120761 -28.5 +proj=utm +zone=25 +south... FALSE   NULL [y]
## band    1   6      NA    NA                           NA    NA   NULL

plot(x)

library(stars)
library(ggplot2)
ggplot() + geom_stars(data = x) + coord_equal() + facet_wrap(~band) + 
  theme_void()  +  scale_x_discrete(expand=c(0,0)) + scale_y_discrete(expand=c(0,0))
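Slicing and map/reduce on the same object can be sketched as follows (a sketch; `[` and st_apply as provided by stars):

```r
x[, 1:100, 1:100, 1:3]         # dice: a 100 x 100 sub-raster of bands 1-3
st_apply(x, c("x", "y"), max)  # reduce the band dimension: max per pixel
```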

stars_proxy objects

  • contain pointers to the data (file, URL); such a pointer alone does not uniquely define an array (a single file may hold several subdatasets)
  • use a strategy to go through them (chunking: currently only spatial)
  • are lazy: only compute when data is requested
  • can read downsampled arrays (e.g. to plot): optimises execution order of call stack
  • can time-compose e.g. a time stack of NetCDF files, with a time slice in each file

granule = system.file("sentinel/S2A_MSIL1C_20180220T105051_N0206_R051_T32ULE_20180221T134037.zip", package = "starsdata")
base_name = strsplit(basename(granule), ".zip")[[1]]
s2 = paste0("SENTINEL2_L1C:/vsizip/", granule, "/", base_name, ".SAFE/MTD_MSIL1C.xml:10m:EPSG_32632")
(p = read_stars(s2, proxy = TRUE))
## stars_proxy object with 1 attribute in file:
## $`MTD_MSIL1C.xml:10m:EPSG_32632`
## [1] "SENTINEL2_L1C:/vsizip//home/edzer/R/x86_64-pc-linux-gnu-library/3.6/starsdata/sentinel/S2A_MSIL1C_20180220T105051_N0206_R051_T32ULE_20180221T134037.zip/S2A_MSIL1C_20180220T105051_N0206_R051_T32ULE_20180221T134037.SAFE/MTD_MSIL1C.xml:10m:EPSG_32632"
## 
## dimension(s):
##      from    to offset delta                       refsys point values    
## x       1 10980  3e+05    10 +proj=utm +zone=32 +datum...    NA   NULL [x]
## y       1 10980  6e+06   -10 +proj=utm +zone=32 +datum...    NA   NULL [y]
## band    1     4     NA    NA                           NA    NA   NULL
object.size(p)
## 7336 bytes

system.time(plot(p))

##    user  system elapsed 
##   0.758   0.149   0.411

ndvi = function(x) (x[4]-x[3])/(x[4]+x[3])
system.time(plot(st_apply(p, c("x", "y"), ndvi)))

##    user  system elapsed 
##   1.021   0.180   0.699

What if data are not nicely aligned in a cube?

  • subsequent images have different, consecutive time stamps
  • space is not rasterized (e.g., tiles in UTM zones)

gdalcubes:

  • create and process Earth observation data cubes from GDAL image collections
  • written by Marius Appel: C++, R interface
  • indexes a set of GDAL-readable files
  • “reads” them, by resampling, to a user-defined data cube
  • resampling options can be controlled (e.g. nearest, linear, bilinear) in both space and time
  • can process this cube, in parallel (e.g., reduce, aggregate)
  • writes the resulting data cube to a NetCDF file
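The workflow above might look as follows with gdalcubes (a sketch: the file paths, extent, and resolution parameters are made up for illustration):

```r
library(gdalcubes)
# index a set of GDAL-readable Sentinel-2 files into an image collection
col = create_image_collection(list.files("L2A", full.names = TRUE),
                              format = "Sentinel2_L2A")
# user-defined target cube: CRS, 100 m pixels, monthly time steps,
# with controllable spatial resampling and temporal aggregation
v = cube_view(extent = col, srs = "EPSG:32632", dx = 100, dy = 100, dt = "P1M",
              resampling = "bilinear", aggregation = "median")
# read into the cube, reduce over time, write the result to NetCDF
write_ncdf(reduce_time(raster_cube(col, v), "median(B04)"), "out.nc")
```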

openEO : an API for cube processing

  • is an H2020 project that develops a modern, open API to connect R, python, javascript and other clients to big Earth observation cloud back-ends in a simple and unified way.
  • has (scalar-valued) raster and vector data cubes as its data primitive
  • allows map/reduce, aggregation, resampling, user-defined data cubes, all math ops, …
  • allows for UDFs (user-defined functions, e.g. written in Python or R)
  • anticipates a federated, heterogeneous cloud big EO data computing landscape (EOSC, DIAS, TEPs/MEPs…)

stars is somewhat of a blueprint, and useful client-side, or for server-side UDFs

Wrapping up: