UCSB Environmental Informatics, Feb 6, 2018

Overview

  • Spatial data in R
  • Simple features, the sf R package
  • Geometrical operations
  • Tidy: sf, dplyr & ggplot2
  • Raster, arrays, spatial data cubes

Some personal context, I

  • studied physical geography, PhD in spatial statistics
  • now teach Geoinformatics in Muenster, Germany
  • wrote lots of C code in the 90's
  • wrote R packages in the 00/10's: gstat, sp, spacetime, trajectories, …

  • recently developed an interest in working with measurement units (in R)

What makes spatial data challenging?

  • The Earth is a sphere/spheroid/potato*
  • coordinates consist of two or three numbers that loose most of their meaning when considered individually
  • the most common form is Longitude, Latitude (LL) pairs
  • from LL data, stats::dist will not give you distances
  • maps and screens are flat, and hence can only show projected data
  • projected distances are distorted, and possibly areas, shapes, directions and shortest paths too
  • the meaning of a LL coordinate depends on the geographic datum (e.g., WGS84, ETRS89, NAD27 etc)
  • a datum is unlikely important when mapping continents, but it is when drones try to deliver pizza's

Simple features

  • feature: abstraction of real world phenomena (type or instance); has a geometry and other attributes (properties)
  • simple feature: feature with all geometric attributes described piecewise by straight line or planar interpolation between sets of points (no curves)
  • represent geometry by points, lines or polygons, or collections thereof
  • a formal standard (ISO, OGC) since 2004
  • supported by OSGeo libraries GEOS and GDAL
  • adopted by GeoJSON, GeoSPARQL
  • has well-known text (WKT) and binary (WKB) encodings
  • WKB used by spatial databases (PostGIS, MariaDB, SQLite, …)
  • standard specifies a number of topological metrics, predicates and operations

Operations on geometries:

Single:

  • logical predicates: is_valid, is_simple, is_empty
  • quantities: length, area
  • dimension: 0 = point(s), 1 = linear, 2 = surface
  • derived geometries: buffer, centroid, boundary, convex_hull, simplify, linemerge, polygonize, node, point_on_surface, triangulate

Pairs/sets:

  • quantities: distance
  • predicates: intersects, within, contains, covers, covered_by, crosses, touches, overlaps, equals, disjoint, all other DE-9IM
  • new geometries: intersection, difference, union, sym_difference

Intersection special cases

  • (typed) EMPTY: think of as missing (NA) geometry
  • GEOMETRYCOLLECTION: single feature geometry with mix of anything

Package sf

  • sf stores simple feature geometries as a list-column
  • It does that in sf objects, extending data.frame or tibble
  • How does it work?

sfg : geometry for one feature

library(sf)
## Linking to GEOS 3.5.1, GDAL 2.2.1, proj.4 4.9.3

Package sf features

  • sf objects extend data.frame or tbl_df with a geometry list-column
  • fast (C++) WKB \(\Longleftrightarrow\) R conversion, used for I/O with libraries and databases
  • spatial indexes created on-the-fly to scale up geometrical predicates (intersects) and operations (intersection)
  • simple and small API
  • functions/methods start with st_, as in
st_is_simple(st_point(0:1))
## [1] TRUE

sf & tidyverse

  • sf spatial objects are data.frames (or tibbles)
  • you can always un-sf, and work with tbl_df or data.frame having an sfc list-column
  • sf methods for filter, arrange, distinct, group_by, ungroup, mutate, select have sticky geometry
  • st_join joins tables based on a spatial predicate
  • summarise unions geometry by group (or altogether)

```

suppressPackageStartupMessages(library(dplyr))
gpkg = system.file("gpkg/nc.gpkg", package="sf")
options(pillar.sigfig = 3)
read_sf(gpkg) %>% select(1:3, 12)
## Simple feature collection with 100 features and 4 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## epsg (SRID):    4267
## proj4string:    +proj=longlat +datum=NAD27 +no_defs
## # A tibble: 100 x 5
##      AREA PERIMETER CNTY_ BIR79                                       geom
##     <dbl>     <dbl> <dbl> <dbl>                         <MULTIPOLYGON [°]>
##  1 0.114       1.44  1825  1364 (((-81.5 36.2, -81.5 36.3, -81.6 36.3, -8…
##  2 0.0610      1.23  1827   542 (((-81.2 36.4, -81.2 36.4, -81.3 36.4, -8…
##  3 0.143       1.63  1828  3616 (((-80.5 36.2, -80.5 36.3, -80.5 36.3, -8…
##  4 0.0700      2.97  1831   830 (((-76 36.3, -76 36.3, -76 36.3, -76 36.4…
##  5 0.153       2.21  1832  1606 (((-77.2 36.2, -77.2 36.2, -77.3 36.2, -7…
##  6 0.0970      1.67  1833  1838 (((-76.7 36.2, -77 36.2, -77 36.2, -77.1 …
##  7 0.0620      1.55  1834   350 (((-76 36.3, -76 36.2, -76 36.2, -76.2 36…
##  8 0.0910      1.28  1835   594 (((-76.6 36.3, -76.6 36.3, -76.6 36.3, -7…
##  9 0.118       1.42  1836  1190 (((-78.3 36.3, -78.3 36.3, -78.3 36.5, -7…
## 10 0.124       1.43  1837  2038 (((-80 36.3, -80.5 36.3, -80.4 36.6, -80 …
## # ... with 90 more rows

read_sf(gpkg) %>% select(1:5,12) %>% st_transform(2264) # NC state plane, us foot
## Simple feature collection with 100 features and 6 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 406262.2 ymin: 48374.87 xmax: 3052887 ymax: 1044158
## epsg (SRID):    2264
## proj4string:    +proj=lcc +lat_1=36.16666666666666 +lat_2=34.33333333333334 +lat_0=33.75 +lon_0=-79 +x_0=609601.2192024384 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=us-ft +no_defs
## # A tibble: 100 x 7
##      AREA PERIMETER CNTY_ CNTY_ID NAME   BIR79                        geom
##     <dbl>     <dbl> <dbl>   <dbl> <chr>  <dbl> <MULTIPOLYGON [US_survey_f>
##  1 0.114       1.44  1825    1825 Ashe    1364 (((1270814 913342, 1251094…
##  2 0.0610      1.23  1827    1827 Alleg…   542 (((1340555 959396, 1340434…
##  3 0.143       1.63  1828    1828 Surry   3616 (((1570591 910399, 1564747…
##  4 0.0700      2.97  1831    1831 Curri…   830 (((2881208 948582, 2878542…
##  5 0.153       2.21  1832    1832 North…  1606 (((2525701 911403, 2520876…
##  6 0.0970      1.67  1833    1833 Hertf…  1838 (((2665113 911666, 2595658…
##  7 0.0620      1.55  1834    1834 Camden   350 (((2881208 948582, 2897865…
##  8 0.0910      1.28  1835    1835 Gates    594 (((2717989 951755, 2705925…
##  9 0.118       1.42  1836    1836 Warren  1190 (((2203890 914328, 2211423…
## 10 0.124       1.43  1837    1837 Stokes  2038 (((1697624 911605, 1571651…
## # ... with 90 more rows

geom_sf

library(ggplot2) # install_github("tidyverse/ggplot2")
nc <- read_sf(gpkg) %>% st_transform(2264)
ggplot() + geom_sf(data = nc) + aes(fill = BIR74) +
  theme(panel.grid.major = element_line(color = "white")) +
  scale_fill_gradientn(colors = sf.colors(20))

library(tidyr)
nc2 <- nc %>% select(SID74, SID79) %>% gather(VAR, SID, -geom)
ggplot() + geom_sf(data = nc2, aes(fill = SID)) + facet_wrap(~VAR, ncol = 1) +
  scale_y_continuous(breaks = 34:36) +
  scale_fill_gradientn(colors = sf.colors(20)) +
  theme(panel.grid.major = element_line(color = "white"))

suppressPackageStartupMessages(library(mapview))
nc %>% mapview(zcol = "BIR74", legend = TRUE, col.regions = sf.colors)

quantities

library(sf)
suppressPackageStartupMessages(library(units))

pts = rbind(c(0,80), c(120,80), c(240,80), c(0,80))
pol = st_sfc(st_polygon(list(pts)), crs = "+proj=longlat")

pol %>% st_area %>% set_units(km^2)
## 1634783 km^2

# Equidistant Cylindrical (Plate Carrée):
pol %>% st_transform("+proj=eqc") %>% st_area
## 0 m^2

pol %>% st_set_crs(NA) %>% st_area
## [1] 0

pts =  st_sfc(st_point(c(0,90)), st_point(c(0,70)), st_point(c(60,70)), 
              crs = "+proj=longlat")
st_distance(pol, pts) %>% set_units(km)
## Units: km
##      [,1]     [,2]     [,3]
## [1,]    0 1116.159 1670.244

What about raster data?

  • package raster is powerful, stable, and works well with pipes
  • simple features don't scale for raster / imagery data
  • (long) time series on features do neither

Package github.com/r-spatial/stars for:

  • raster and vector data cubes (arrays)
  • take on raster limitations, e.g. 4D+ rasters (band, time)
  • take on data sets larger than local disk
  • R Consortium project planned to finish 2018 Q2

library(stars)
## Loading required package: magrittr
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:tidyr':
## 
##     extract
## Linking to GDAL 2.2.1
# http://github.com/r-spatial/stars: not yet on CRAN
tif = system.file("tif/L7_ETMs.tif", package = "stars")
(img = st_stars(tif))
## stars object with 3 dimensions and 1 attribute
## attribute(s):
##  /home/edzer/R/x86_64-pc-linux-gnu-library/3.4/stars/tif/L7_ETMs.tif 
##  Min.   :  1.00                                                      
##  1st Qu.: 54.00                                                      
##  Median : 69.00                                                      
##  Mean   : 68.91                                                      
##  3rd Qu.: 86.00                                                      
##  Max.   :255.00                                                      
## dimension(s):
##      from  to  offset delta                       refsys point values
## x       1 349  288776  28.5 +proj=utm +zone=25 +south... FALSE   NULL
## y       1 352 9120761 -28.5 +proj=utm +zone=25 +south... FALSE   NULL
## band    1   6      NA    NA                           NA    NA   NULL

plot(img, main = paste("Band", 1:6))

setwd("/home/edzer/data/sentinel")
x = st_stars("SENTINEL2_L1C:/vsizip/S2A_MSIL1C_20170314T104011_N0204_R008_T32ULC_20170314T104411.SAFE.zip/S2A_MSIL1C_20170314T104011_N0204_R008_T32ULC_20170314T104411.SAFE/MTD_MSIL1C.xml:10m:EPSG_32632")
names(x) = "S2A_L1C"
x
## stars object with 3 dimensions and 1 attribute
## attribute(s), of first 1e+05 cells:
##     S2A_L1C    
##  Min.   : 601  
##  1st Qu.: 913  
##  Median :1008  
##  Mean   :1050  
##  3rd Qu.:1131  
##  Max.   :3142  
## dimension(s):
##      from    to  offset delta                       refsys point values
## x       1 10980   3e+05    10 +proj=utm +zone=32 +datum...    NA   NULL
## y       1 10980 5800020   -10 +proj=utm +zone=32 +datum...    NA   NULL
## band    1     4      NA    NA                           NA    NA   NULL

So far,

  • stars does everything locally, and in memory
  • raster scales up to local hard drive, by caching memory, and won't do band+time
  • stars will scale by having proxy objects in memory, pointing to an image collection in another R instance, interfaced through http/REST (httr, plumber)
  • this should scale to distributed architectures
  • what should proxy objects have? (pixels at lower res? only granule/tile outlines?)

stars status

Can currently:

  • read, subset attr, subset dimensions, plot (efficiently)
  • sort out time from NetCDF files (read through GDAL)
  • deal with units

To do soon:

  • implement web-service interface, define proxy objects
  • find useful interactions with the data proxied
  • write rasters/arrays back to disk
  • apply functions to array dimensions
  • implement generic filtering capability

Conclusions

  • sf provides tidy spatial analysis, for vector data
  • there are 6 package vignettes for further reading
  • tidy raster and vector data cubes (space/time array data) are WIP (stars - stay tuned!)
  • all involvement of any kind is always appreciated!

Thanks to:

  • the #rspatial community
  • R Consortium for supporting sf and stars development
  • rstudio for funding this trip!