May 22, 2019, Jena edzer.github.io/jena/ (rmd)

What are data cubes?

  • They are annotated array data.
  • Arrays map from \(n\) dimensions to \(p\) attributes: \[(D_1,D_2,...,D_n)\rightarrow (A_1,A_2,...,A_p)\]
  • array dimensions are enumerated sets, and may
    • represent a set of entities (cities, persons, species, spectral bands)
    • discretize a continuous variable, such as
      • spatial dimension,
      • time,
      • time-of-day,
      • frequency or wavelength,
      • duration (e.g., time-to-forecast)

“Cube”:

We use the word cube as shorthand for:

  • “real” cube: three dimensions
  • more than three dimensions: hypercube
  • two dimensions: matrix, raster

A special case is:

  • one dimensional cube: table with records (data.frame, tibble)
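In this sense, a data.frame is a one-dimensional cube: a single (record) dimension maps to \(p\) attributes. A minimal base-R illustration (the values are made up):

```r
# one dimension (records) mapping to p = 2 attributes
df = data.frame(id = c("a", "b", "c"), value = c(1.5, 2.0, 2.5))
nrow(df)  # n = 3: length of the single (record) dimension
ncol(df)  # p = 2: number of attributes
```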

“Spatial” data cube

Cube dimensions can refer to spatial dimensions in several ways:

  • each continuous spatial dimension (x, y, z) maps to a single cube dimension (raster data cubes), e.g. regular grids
    • \(c(i) = o + (i-1) \times \delta, \ \ i = 1,2,...,n\)
    • index space is continuous: integer \(i\) implies \([i,i+1)\)
    • this means that every coordinate value maps to a unique index (unlike polygons)
  • all spatial dimensions map to a single cube dimension (vector data cubes)
    • cube dimension is a set of points/lines/polygons
  • combinations of these: e.g. origin-destination matrices
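The regular-grid mapping \(c(i) = o + (i-1)\times\delta\) can be sketched in base R; the origin and cell size below are illustrative numbers, not from any particular dataset:

```r
# c(i) = o + (i - 1) * delta: coordinate of the start of cell i
cell_coord = function(i, o, delta) o + (i - 1) * delta

# inverse: integer i implies [i, i+1), so every coordinate
# maps to exactly one index (unlike overlapping polygons)
coord_index = function(x, o, delta) floor((x - o) / delta) + 1

cell_coord(1, o = 0, delta = 28.5)    # 0: the first cell starts at the origin
coord_index(30, o = 0, delta = 28.5)  # 2: 30 falls in the cell [28.5, 57)
```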

Raster - vector conversion

From raster to vector:

  • polygons or points are given:
    • sample (“extract”)
    • aggregate, e.g. mean or sum over polygon
  • polygons or lines are the result:
    • polygonize
    • contour

From vector to raster:

  • (points/polygons:) rasterize
  • (point sample:) interpolate
  • (point pattern:) density
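Some of these conversions can be sketched with stars and sf (a sketch, assuming both packages are installed; the nc dataset ships with sf):

```r
library(stars)
library(sf)
nc = read_sf(system.file("gpkg/nc.gpkg", package = "sf"))
r = st_rasterize(nc["BIR74"])             # vector -> raster: rasterize polygons
a = aggregate(r, nc, mean, na.rm = TRUE)  # raster -> vector: mean per polygon
```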

Raster types

R package raster

  • is legacy, well understood, scales (automatically), powerful
  • “limits” data size to data on (local) disk; will auto-chunk
  • only implements regular raster, or a stack of these
  • stack can be time-related
  • pixels are scalar, real-valued (factors are mimicked through a custom mechanism)

  • Is anything wrong with R’s native arrays? No!
  • dimensions can be named, dimension values labeled:
array(1:8/8, c(2,2,2), dimnames = list(dim_1=c("a","b"), dim_2=c("p", "q"), 
     dim_3=c("x", "y")))
## , , dim_3 = x
## 
##      dim_2
## dim_1     p     q
##     a 0.125 0.375
##     b 0.250 0.500
## 
## , , dim_3 = y
## 
##      dim_2
## dim_1     p     q
##     a 0.625 0.875
##     b 0.750 1.000

  • but labels can only be character
  • dplyr::tbl_cube improves on this, but still only accepts vectors, and does not handle regular or rectilinear boundaries, CRS, units, sheared/rotated grids, curvilinear grids, …
  • arrays can only (sensibly) be atomic: no records with mixed fields
  • arrays are in-memory (which may at some point change, e.g. using ALTREP)

R package stars

  • a stars object is a set (list) of arrays with possibly varying type (numeric, integer, factor, logical, character, list)
  • uses R’s native arrays for that: all array Ops work (pixel math)
  • supports arrays with measurement units and time stamps (through GDAL, libnetcdf)
  • has a dimensions attribute “table”
  • does everything in memory, unless you use stars_proxy, which does nothing in memory
  • lets you define arbitrary number of dimensions
  • slice & dice with [, or filter
  • map/reduce with st_apply: apply a function to a (set of) dimension(s)
  • aggregate for spatial / temporal aggregations
  • supports rectilinear (all dims), rotated, sheared, and curvilinear grids (raster), and simple features (vector)

R package stars (2)

  • implements raster (GDAL, netcdf) and vector (sfc) data cubes
  • full support for PROJ coordinate reference systems
  • time support: POSIXct, Date, as well as PCICt (360, 365, noleap)
  • integration with some tidyverse verbs (filter, select, mutate, geom_stars)
  • integrates with gstat (spatial, spatiotemporal geostatistics)

library(stars)
tif = system.file("tif/L7_ETMs.tif", package = "stars")
(x = read_stars(tif))
## stars object with 3 dimensions and 1 attribute
## attribute(s):
##   L7_ETMs.tif    
##  Min.   :  1.00  
##  1st Qu.: 54.00  
##  Median : 69.00  
##  Mean   : 68.91  
##  3rd Qu.: 86.00  
##  Max.   :255.00  
## dimension(s):
##      from  to  offset delta                       refsys point values    
## x       1 349  288776  28.5 +proj=utm +zone=25 +south... FALSE   NULL [x]
## y       1 352 9120761 -28.5 +proj=utm +zone=25 +south... FALSE   NULL [y]
## band    1   6      NA    NA                           NA    NA   NULL

plot(x)

library(stars)
library(ggplot2)
ggplot() + geom_stars(data = x) + coord_equal() + facet_wrap(~band) + 
  theme_void()  +  scale_x_discrete(expand=c(0,0)) + scale_y_discrete(expand=c(0,0))
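Slicing and map/reduce on the same object can be sketched as follows (a sketch; `[` and st_apply as provided by stars):

```r
x[, 1:100, 1:100, 1:3]         # dice: a 100 x 100 sub-raster of bands 1-3
st_apply(x, c("x", "y"), max)  # reduce the band dimension: max per pixel
```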

stars_proxy objects

  • contain pointers to the data (file, URL); such a pointer alone does not uniquely define an array (a single file may hold several subdatasets)
  • use a strategy to go through them (chunking: currently only spatial)
  • are lazy: only compute when data is requested
  • can read downsampled arrays (e.g. to plot): optimises execution order of call stack
  • can time-compose e.g. a time stack of NetCDF files, with a time slice in each file

granule = system.file("sentinel/S2A_MSIL1C_20180220T105051_N0206_R051_T32ULE_20180221T134037.zip", package = "starsdata")
base_name = strsplit(basename(granule), ".zip")[[1]]
s2 = paste0("SENTINEL2_L1C:/vsizip/", granule, "/", base_name, ".SAFE/MTD_MSIL1C.xml:10m:EPSG_32632")
(p = read_stars(s2, proxy = TRUE))
## stars_proxy object with 1 attribute in file:
## $`MTD_MSIL1C.xml:10m:EPSG_32632`
## [1] "SENTINEL2_L1C:/vsizip//home/edzer/R/x86_64-pc-linux-gnu-library/3.6/starsdata/sentinel/S2A_MSIL1C_20180220T105051_N0206_R051_T32ULE_20180221T134037.zip/S2A_MSIL1C_20180220T105051_N0206_R051_T32ULE_20180221T134037.SAFE/MTD_MSIL1C.xml:10m:EPSG_32632"
## 
## dimension(s):
##      from    to offset delta                       refsys point values    
## x       1 10980  3e+05    10 +proj=utm +zone=32 +datum...    NA   NULL [x]
## y       1 10980  6e+06   -10 +proj=utm +zone=32 +datum...    NA   NULL [y]
## band    1     4     NA    NA                           NA    NA   NULL
object.size(p)
## 7336 bytes

system.time(plot(p))

##    user  system elapsed 
##   0.758   0.149   0.411

ndvi = function(x) (x[4]-x[3])/(x[4]+x[3])
system.time(plot(st_apply(p, c("x", "y"), ndvi)))

##    user  system elapsed 
##   1.021   0.180   0.699

What if data are not nicely aligned in a cube?

  • subsequent images have different, consecutive time stamps
  • space is not rasterized (e.g., tiles in UTM zones)

gdalcubes:

  • create and process Earth observation data cubes from GDAL image collections
  • written by Marius Appel: C++, R interface
  • indexes a set of GDAL-readable files
  • “reads” them, by resampling, to a user-defined data cube
  • resampling options can be controlled (e.g. nearest, linear, bilinear) in both space and time
  • can process this cube, in parallel (e.g., reduce, aggregate)
  • writes the resulting data cube to a NetCDF file
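The workflow above might look as follows with gdalcubes (a sketch: the file paths, extent, and resolution parameters are made up for illustration):

```r
library(gdalcubes)
# index a set of GDAL-readable Sentinel-2 files into an image collection
col = create_image_collection(list.files("L2A", full.names = TRUE),
                              format = "Sentinel2_L2A")
# user-defined target cube: CRS, 100 m pixels, monthly time steps,
# with controllable spatial resampling and temporal aggregation
v = cube_view(extent = col, srs = "EPSG:32632", dx = 100, dy = 100, dt = "P1M",
              resampling = "bilinear", aggregation = "median")
# read into the cube, reduce over time, write the result to NetCDF
write_ncdf(reduce_time(raster_cube(col, v), "median(B04)"), "out.nc")
```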

openEO : an API for cube processing

  • is an H2020 project that develops a modern, open API to connect R, python, javascript and other clients to big Earth observation cloud back-ends in a simple and unified way.
  • has (scalar-valued) raster and vector data cubes as its data primitive
  • allows map/reduce, aggregation, resampling, user-defined data cubes, all math ops, …
  • allows for UDFs (user-defined functions, e.g. written in Python or R)
  • anticipates a federated, heterogeneous cloud big EO data computing landscape (EOSC, DIAS, TEPs/MEPs…)

stars is somewhat of a blueprint, and useful client-side, or for server-side UDFs

Wrapping up: