rstudio::conf, Feb 2, 2018

Overview

  • Spatial data in R
  • Simple features, the sf R package
  • Geometrical operations
  • Tidy: sf, dplyr & ggplot2
  • Raster, arrays, spatial data cubes

Some personal context, I

  • studied physical geography, PhD in spatial statistics
  • now teach Geoinformatics in Muenster, Germany
  • wrote lots of C code in the 90's
  • wrote R packages in the 00/10's: gstat, sp, spacetime, trajectories, …

  • recently developed an interest in working with measurement units (in R)

What makes spatial data challenging?

  • The Earth is a sphere/spheroid/potato*
  • coordinates consist of two or three numbers that loose most of their meaning when considered individually
  • the most common form is Longitude, Latitude (LL) pairs
  • from LL data, stats::dist will not give you distances
  • maps and screens are flat, and hence can only show projected data
  • projected distances are distorted, and possibly areas, shapes, directions and shortest paths too
  • the meaning of a LL coordinate depends on the geographic datum (e.g., WGS84, ETRS89, NAD27 etc)
  • a datum is unlikely important when mapping continents, but it is when drones try to deliver pizza's

Simple features

  • feature: abstraction of real world phenomena (type or instance); has a geometry and other attributes (properties)
  • simple feature: feature with all geometric attributes described piecewise by straight line or planar interpolation between sets of points (no curves)
  • represent geometry by points, lines or polygons, or collections thereof
  • a formal standard (ISO, OGC) since 2004
  • supported by OSGeo libraries GEOS and GDAL
  • adopted by GeoJSON, GeoSPARQL
  • has well-known text (WKT) and binary (WKB) encodings
  • WKB used by spatial databases (PostGIS, MariaDB, SQLite, …)
  • standard specifies a number of topological metrics, predicates and operations

Operations on geometries:

Single:

  • logical predicates: is_valid, is_simple, is_empty
  • quantities: length, area
  • dimension: 0 = point(s), 1 = linear, 2 = surface
  • derived geometries: buffer, centroid, boundary, convex_hull, simplify, linemerge, polygonize, node, point_on_surface, triangulate

Pairs/sets:

  • quantities: distance
  • predicates: intersects, within, contains, covers, covered_by, crosses, touches, overlaps, equals, disjoint, all other DE-9IM
  • new geometries: intersection, difference, union, sym_difference

Intersection special cases

  • (typed) EMPTY: think of as missing (NA) geometry
  • GEOMETRYCOLLECTION: single feature geometry with mix of anything

Package sf

  • sf stores simple feature geometries as a list-column
  • It does that in sf objects, extending data.frame or tibble
  • How does it work?

sfg : geometry for one feature

library(sf)
## Linking to GEOS 3.5.1, GDAL 2.2.1, proj.4 4.9.3

Package sf features

  • sf objects extend data.frame or tbl_df with a geometry list-column
  • fast (C++) WKB \(\Longleftrightarrow\) R conversion, used for I/O with libraries and databases
  • spatial indexes created on-the-fly to scale up geometrical predicates (intersects) and operations (intersection)
  • simple and small API
  • functions/methods start with st_, as in
st_is_simple(st_point(0:1))
## [1] TRUE

sf & tidyverse

  • sf spatial objects are data.frames (or tibbles)
  • you can always un-sf, and work with tbl_df or data.frame having an sfc list-column
  • sf methods for filter, arrange, distinct, group_by, ungroup, mutate, select have sticky geometry
  • st_join joins tables based on a spatial predicate
  • summarise unions geometry by group (or altogether)

```

suppressPackageStartupMessages(library(dplyr))
gpkg = system.file("gpkg/nc.gpkg", package="sf")
options(pillar.sigfig = 3)
read_sf(gpkg) %>% select(1:3, 12)
## Simple feature collection with 100 features and 4 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## epsg (SRID):    4267
## proj4string:    +proj=longlat +datum=NAD27 +no_defs
## # A tibble: 100 x 5
##      AREA PERIMETER CNTY_ BIR79                                       geom
##     <dbl>     <dbl> <dbl> <dbl>                         <MULTIPOLYGON [°]>
##  1 0.114       1.44  1825  1364 (((-81.5 36.2, -81.5 36.3, -81.6 36.3, -8…
##  2 0.0610      1.23  1827   542 (((-81.2 36.4, -81.2 36.4, -81.3 36.4, -8…
##  3 0.143       1.63  1828  3616 (((-80.5 36.2, -80.5 36.3, -80.5 36.3, -8…
##  4 0.0700      2.97  1831   830 (((-76 36.3, -76 36.3, -76 36.3, -76 36.4…
##  5 0.153       2.21  1832  1606 (((-77.2 36.2, -77.2 36.2, -77.3 36.2, -7…
##  6 0.0970      1.67  1833  1838 (((-76.7 36.2, -77 36.2, -77 36.2, -77.1 …
##  7 0.0620      1.55  1834   350 (((-76 36.3, -76 36.2, -76 36.2, -76.2 36…
##  8 0.0910      1.28  1835   594 (((-76.6 36.3, -76.6 36.3, -76.6 36.3, -7…
##  9 0.118       1.42  1836  1190 (((-78.3 36.3, -78.3 36.3, -78.3 36.5, -7…
## 10 0.124       1.43  1837  2038 (((-80 36.3, -80.5 36.3, -80.4 36.6, -80 …
## # ... with 90 more rows

read_sf(gpkg) %>% select(1:5,12) %>% st_transform(2264) # NC state plane, us foot
## Simple feature collection with 100 features and 6 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 406262.2 ymin: 48374.87 xmax: 3052887 ymax: 1044158
## epsg (SRID):    2264
## proj4string:    +proj=lcc +lat_1=36.16666666666666 +lat_2=34.33333333333334 +lat_0=33.75 +lon_0=-79 +x_0=609601.2192024384 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=us-ft +no_defs
## # A tibble: 100 x 7
##      AREA PERIMETER CNTY_ CNTY_ID NAME   BIR79                        geom
##     <dbl>     <dbl> <dbl>   <dbl> <chr>  <dbl> <MULTIPOLYGON [US_survey_f>
##  1 0.114       1.44  1825    1825 Ashe    1364 (((1270814 913342, 1251094…
##  2 0.0610      1.23  1827    1827 Alleg…   542 (((1340555 959396, 1340434…
##  3 0.143       1.63  1828    1828 Surry   3616 (((1570591 910399, 1564747…
##  4 0.0700      2.97  1831    1831 Curri…   830 (((2881208 948582, 2878542…
##  5 0.153       2.21  1832    1832 North…  1606 (((2525701 911403, 2520876…
##  6 0.0970      1.67  1833    1833 Hertf…  1838 (((2665113 911666, 2595658…
##  7 0.0620      1.55  1834    1834 Camden   350 (((2881208 948582, 2897865…
##  8 0.0910      1.28  1835    1835 Gates    594 (((2717989 951755, 2705925…
##  9 0.118       1.42  1836    1836 Warren  1190 (((2203890 914328, 2211423…
## 10 0.124       1.43  1837    1837 Stokes  2038 (((1697624 911605, 1571651…
## # ... with 90 more rows

geom_sf

library(ggplot2) # install_github("tidyverse/ggplot2")
nc <- read_sf(gpkg) %>% st_transform(2264)
ggplot() + geom_sf(data = nc) + aes(fill = BIR74) +
  theme(panel.grid.major = element_line(color = "white")) +
  scale_fill_gradientn(colors = sf.colors(20))

library(tidyr)
nc2 <- nc %>% select(SID74, SID79) %>% gather(VAR, SID, -geom)
ggplot() + geom_sf(data = nc2, aes(fill = SID)) + facet_wrap(~VAR, ncol = 1) +
  scale_y_continuous(breaks = 34:36) +
  scale_fill_gradientn(colors = sf.colors(20)) +
  theme(panel.grid.major = element_line(color = "white"))

suppressPackageStartupMessages(library(mapview))
nc %>% mapview(zcol = "BIR74", legend = TRUE, col.regions = sf.colors)

quantities

library(sf)
suppressPackageStartupMessages(library(units))

pts = rbind(c(0,80), c(120,80), c(240,80), c(0,80))
pol = st_sfc(st_polygon(list(pts)), crs = "+proj=longlat")

pol %>% st_area %>% set_units(km^2)
## 1634783 km^2

# Equidistant Cylindrical (Plate Carrée):
pol %>% st_transform("+proj=eqc") %>% st_area
## 0 m^2

pol %>% st_set_crs(NA) %>% st_area
## [1] 0

pts =  st_sfc(st_point(c(0,90)), st_point(c(0,70)), st_point(c(60,70)), 
              crs = "+proj=longlat")
st_distance(pol, pts) %>% set_units(km)
## Units: km
##      [,1]     [,2]     [,3]
## [1,]    0 1116.159 1670.244

What about raster data?

  • package raster is powerful, and works well with pipes
  • simple features don't scale for raster / imagery data
  • (long) time series on features do neither

Package github.com/r-spatial/stars for:

  • raster and vector data cubes (arrays)
  • take on raster limitations, e.g. 4D+ rasters (band, time)
  • take on data sets larger than local disk
  • R Consortium project planned to finish 2018 Q2

suppressPackageStartupMessages(library(stars))
# http://github.com/r-spatial/stars: not yet on CRAN
tif = system.file("tif/L7_ETMs.tif", package = "stars")
plot(st_stars(tif), main = paste("Band", 1:6), col = grey(1:9/10))

Conclusions

  • sf provides tidy spatial analysis, for vector data
  • there are 6 package vignettes for further reading

Work in progress:

  • tidy raster data and space/time array data

Thanks to:

  • the #rspatial community
  • R Consortium for funding sf and stars

Join the BoF meeting: Today, 5:30-6:30, Pier room, 3rd floor