The R-markdown source of the tutorial is found here.

Required packages:

install.packages(c("sf", "tidyverse", "devtools"))
devtools::install_github("tidyverse/ggplot2")

A short history of handling spatial data in R

Simple feature access in R: package sf

Simple feature access is an ISO standard that is widely adopted. It is used in spatial databases, GIS, open source libraries, GeoJSON, GeoSPARQL, etc.

What is this about?

library(sf)
## Linking to GEOS 3.5.1, GDAL 2.1.3, proj.4 4.9.2, lwgeom 2.3.2 r15302

links to GEOS, GDAL, Proj.4, liblwgeom; see also the package vignettes

access refers to standardised encodings, such as well-known text (WKT):

(pt = st_point(c(2,4)))
## POINT(2 4)

and well-known binary,

(pt_bin = st_as_binary(pt))
##  [1] 01 01 00 00 00 00 00 00 00 00 00 00 40 00 00 00 00 00 00 10 40

the binary form in which spatial databases put geometries in BLOBs, binary large objects, converted back by

st_as_sfc(list(pt_bin))[[1]]
## POINT(2 4)

as well as to names of functions to manipulate objects, e.g.

st_dimension(pt)
## [1] 0
st_intersects(pt, pt, sparse = FALSE)
##      [,1]
## [1,] TRUE

package sf uses simple R structures to store geometries:

str(pt)
## Classes 'XY', 'POINT', 'sfg'  num [1:2] 2 4
str(st_linestring(rbind(c(0,0), c(0,1), c(1,1))))
##  XY [1:3, 1:2] 0 0 1 0 1 1
str(st_polygon(list(rbind(c(0,0), c(0,1), c(1,1), c(0,0)))))
## List of 1
##  $ : num [1:4, 1:2] 0 0 1 0 0 1 1 0
##  - attr(*, "class")= chr [1:3] "XY" "POLYGON" "sfg"

Tidyverse, list-columns

According to the “tidy data” paper (Wickham 2014), data is tidy when

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

but it is not directly clear how this maps to geometries: should a coordinate dimension (e.g. latitude) form a column, should each coordinate (x,y pair) form a colum, or should a whole geometry (e.g. polygon, with holes), form a column? Early attempts (and ggplot2 up to version 2.2.1) wanted simple columns, meaning each coordinate split over two columns. An approach called fortify would mold (tidy?) polygons into simple data.frames. It is well known that this approach has its limitations when polygons have holes.

Since the UseR! 2016 keynote of Hadley, list-columns have been declared tidy. One of the arguments for this was exactly this: polygons with holes are hard to represent in simple data.frames. Other cases are: nested data.frames, or columns that contain, for each record, a model object e.g. obtained from lm.

The tidy data rule for simple feature means: we have a data.frame where each feature forms a row. A single column (a list-column) contains the geometry for each observation. This resembles spatial databases, such as PostGIS.

Package sf puts features in sf tables deriving from data.frame or tbl_df, which have geometries in a list-column of class sfc, where each list element is a single feature’s geometry of class sfg. Feature geometries are represented in R by

(all other classes also fall in one of these categories)

Other tidy aspects of sf:

Reference systems

If one wants to know which position of the Earth we refer to, coordinates of geospatial data require a reference system:

To handle coordinate reference systems, to convert coordinates (projections) and do datum transformations, Proj.4 is the code base most widely adopted, and actively maintained, by the open source geospatial community.

Package sf has crs objects that register coordinate reference systems:

st_crs("+proj=longlat +datum=WGS84")  # "Proj.4 string"
## $epsg
## [1] 4326
## 
## $proj4string
## [1] "+proj=longlat +datum=WGS84 +no_defs"
## 
## attr(,"class")
## [1] "crs"
st_crs(3857)                          # EPSG code
## $epsg
## [1] 3857
## 
## $proj4string
## [1] "+proj=merc +a=6378137 +b=6378137 +lat_ts=0.0 +lon_0=0.0 +x_0=0.0 +y_0=0 +k=1.0 +units=m +nadgrids=@null +wktext +no_defs"
## 
## attr(,"class")
## [1] "crs"
st_crs(3857)$units                    # reveal units
## [1] "m"
st_crs(NA)                            # unknown (assumed planar/Cartesian)
## $epsg
## [1] NA
## 
## $proj4string
## [1] NA
## 
## attr(,"class")
## [1] "crs"

crs objects are registered as an attribute of geometry collections.

sf: handling real data

reading and writing spatial data

fname = system.file("shape/nc.shp", package = "sf")
nc = read_sf(fname)
print(nc, n = 3)
## Simple feature collection with 100 features and 14 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## epsg (SRID):    4267
## proj4string:    +proj=longlat +datum=NAD27 +no_defs
## First 3 features:
##    AREA PERIMETER CNTY_ CNTY_ID      NAME  FIPS FIPSNO CRESS_ID BIR74
## 1 0.114     1.442  1825    1825      Ashe 37009  37009        5  1091
## 2 0.061     1.231  1827    1827 Alleghany 37005  37005        3   487
## 3 0.143     1.630  1828    1828     Surry 37171  37171       86  3188
##   SID74 NWBIR74 BIR79 SID79 NWBIR79                       geometry
## 1     1      10  1364     0      19 MULTIPOLYGON(((-81.47275543...
## 2     0      10   542     3      12 MULTIPOLYGON(((-81.23989105...
## 3     5     208  3616     6     260 MULTIPOLYGON(((-80.45634460...
plot(nc)
## Warning: plotting the first 10 out of 14 attributes; use max.plot = 14 to
## plot all

coordinate transformation/conversion

  • st_transform: transforms or converts coordinates to new reference system
(a1 = st_area(nc[1,]))                      # area, using geosphere::areaPolygon
## 1137388604 m^2
(a2 = st_area(st_transform(nc[1,], 32119))) # NC state plane, m
## 1137598162 m^2
(a3 = st_area(st_transform(nc[1,], 2264)))  # NC state plane, US foot
## 12244955726 US_survey_foot^2
units::set_units(a1, km^2)
## 1137.389 km^2
units::set_units(a2, km^2)
## 1137.598 km^2
units::set_units(a3, km^2)
## 1137.598 km^2

From here on, we will work with the nc dataset in projected coordinates:

nc = st_transform(nc, 32119) # NC state plane, m

Methods for simple features

A complete list of methods for a particular class is obtained, after loading the class, by

methods(class = "sf")

book keeping, low-level I/O

  • st_as_text: convert to WKT
  • st_as_binary: convert to WKB
  • st_as_sfc: convert geometries to sfc (e.g., from WKT, WKB)
  • as(x, "Spatial"): convert to Spatial*
  • st_as_sf: convert to sf (e.g., convert from Spatial*)

logical binary geometry predicates

  • st_intersects: touch or overlap
  • st_disjoint: !intersects
  • st_touches: touch
  • st_crosses: cross (don’t touch)
  • st_within: within
  • st_contains: contains
  • st_overlaps: overlaps
  • st_covers: cover
  • st_covered_by: covered by
  • st_equals: equals
  • st_equals_exact: equals, with some fuzz

returns a sparse (default) or dense logical matrix:

nc1 = nc[1:5,]
st_intersects(nc1, nc1)
## [[1]]
## [1] 1 2
## 
## [[2]]
## [1] 1 2 3
## 
## [[3]]
## [1] 2 3
## 
## [[4]]
## [1] 4
## 
## [[5]]
## [1] 5
st_intersects(nc1, nc1, sparse = FALSE)
##       [,1]  [,2]  [,3]  [,4]  [,5]
## [1,]  TRUE  TRUE FALSE FALSE FALSE
## [2,]  TRUE  TRUE  TRUE FALSE FALSE
## [3,] FALSE  TRUE  TRUE FALSE FALSE
## [4,] FALSE FALSE FALSE  TRUE FALSE
## [5,] FALSE FALSE FALSE FALSE  TRUE

geometry generating logical operators

  • st_union: union of several geometries
  • st_intersection: intersection of pairs of geometries
  • st_difference: difference between pairs of geometries
  • st_sym_difference: symmetric difference (xor)
opar = par(mfrow = c(1,2))
ncg = st_geometry(nc[1:3,])
plot(ncg, col = sf.colors(3, categorical = TRUE))
u = st_union(ncg)
plot(u, lwd = 2)
plot(st_intersection(ncg[1], ncg[2]), col = 'red', add = TRUE)
plot(st_buffer(u, 10000), border = 'blue', add = TRUE)
# st_buffer(u, units::set_unit(10, km)) # with sf devel
plot(st_buffer(u, -5000), border = 'green', add = TRUE)

par(opar)

higher-level operations: summarise, interpolate, aggregate, st_join

  • aggregate and summarise use st_union (by default) to group feature geometries
  • st_interpolate_aw: area-weighted interpolation, uses st_intersection to interpolate or redistribute attribute values, based on area of overlap:
  • st_join uses one of the logical binary geometry predicates (default: st_intersects) to join records in table pairs
g = st_make_grid(nc, n = c(20,10))
a1 = st_interpolate_aw(nc["BIR74"], g, extensive = FALSE)
## Warning in st_interpolate_aw(nc["BIR74"], g, extensive = FALSE):
## st_interpolate_aw assumes attributes are constant over areas of x
sum(a1$BIR74) / sum(nc$BIR74) # not close to one: property is assumed spatially intensive
## [1] 1.191945
a2 = st_interpolate_aw(nc["BIR74"], g, extensive = TRUE)
## Warning in st_interpolate_aw(nc["BIR74"], g, extensive = TRUE):
## st_interpolate_aw assumes attributes are constant over areas of x
sum(a2$BIR74) / sum(nc$BIR74)
## [1] 1
#a1$intensive = a1$BIR74
#a1$extensive = a2$BIR74
#plot(a1[c("intensive", "extensive")])
a1$what = "intensive"
a2$what = "extensive"
# devtools::install_github("tidyverse/ggplot2")
library(ggplot2)
l = st_cast(nc, "LINESTRING")
ggplot() + geom_sf(data = rbind(a1,a2), aes(fill = BIR74)) + 
    geom_sf(data = l, col = 'lightgray') + facet_grid(what~.) +
    scale_fill_gradientn(colors = sf.colors(10))

Example of st_join:

# find county matching a coordinate:
pt = st_point(c(472038, 210385))
pt_sfc = st_sfc(pt, crs = st_crs(nc))
pt_sf = st_sf(a = 1, geom = pt_sfc)
st_join(pt_sf, nc)
## Simple feature collection with 1 feature and 15 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 472038 ymin: 210385 xmax: 472038 ymax: 210385
## epsg (SRID):    32119
## proj4string:    +proj=lcc +lat_1=36.16666666666666 +lat_2=34.33333333333334 +lat_0=33.75 +lon_0=-79 +x_0=609601.22 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
##   a  AREA PERIMETER CNTY_ CNTY_ID  NAME  FIPS FIPSNO CRESS_ID BIR74 SID74
## 1 1 0.134      1.59  1980    1980 Rowan 37159  37159       80  4606     3
##   NWBIR74 BIR79 SID79 NWBIR79                 geom
## 1    1057  6427     8    1504 POINT(472038 210385)
st_join(nc, pt_sf, left = FALSE) # full join
## Simple feature collection with 1 feature and 15 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 449767.8 ymin: 194348 xmax: 502410 ymax: 234481
## epsg (SRID):    32119
## proj4string:    +proj=lcc +lat_1=36.16666666666666 +lat_2=34.33333333333334 +lat_0=33.75 +lon_0=-79 +x_0=609601.22 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
##     AREA PERIMETER CNTY_ CNTY_ID  NAME  FIPS FIPSNO CRESS_ID BIR74 SID74
## 50 0.134      1.59  1980    1980 Rowan 37159  37159       80  4606     3
##    NWBIR74 BIR79 SID79 NWBIR79 a                       geometry
## 50    1057  6427     8    1504 1 MULTIPOLYGON(((491842.71265...
# join-with-self:
nrow(nc)
## [1] 100
x = st_join(nc, nc)
nrow(x) # neighbours AND each state with itself
## [1] 590
x = st_join(nc, nc, join = st_touches)
nrow(x) # neighbours, now excluding itself
## [1] 490
st_rook = function(a, b, prepared) st_relate(a, b, pattern = "F***1****")
x = st_join(nc, nc, join = st_rook)
nrow(x) # "rook" neighbours: touch over a line, not in a point
## [1] 462
# which states touch another state in a point only?
sel = unlist(st_relate(nc, nc, pattern = "F***0****"))
plot(st_geometry(nc))
plot(st_geometry(nc[sel,]), add = TRUE, col = 'grey')

see also this issue.

Which points touch four states?

pts = st_intersection(nc, nc)
## Warning: attribute variables are assumed to be spatially constant
## throughout all geometries
pts = pts[st_dimension(pts) == 0, ]
plot(st_geometry(nc))
plot(st_geometry(nc[sel,]), add = TRUE, col = 'grey')
plot(st_geometry(pts), add = TRUE, col = '#ff000044', pch = 16)

# how many neighbours does each of these points have?
unique(lengths(st_intersects(pts, nc)))
## [1] 4

manipulating geometries

  • st_line_merge: merges lines
  • st_segmentize: adds points to straight lines
  • st_voronoi: creates voronoi tesselation
  • st_centroid: gives centroid of geometry
  • st_convex_hull: creates convex hull of set of points
  • st_triangulate: triangulates set of points (not constrained)
  • st_polygonize: creates polygon from lines that form a closed ring
  • st_simplify: simplifies lines by removing vertices
  • st_split: split a polygon given line geometry
  • st_buffer: compute a buffer around this geometry/each geometry
  • st_make_valid: tries to make an invalid geometry valid (requires lwgeom)
  • st_boundary: return the boundary of a geometry

convenience functions

  • st_zm: sets or removes z and/or m geometry
  • st_coordinates: retrieve coordinates in a matrix or data.frame
  • st_geometry: set, or retrieve sfc from an sf object
  • st_is: check whether geometry is of a particular type

handling mixes: GEOMETRY, GEOMETRYCOLLECTION

We can have mixes of geometries with different types at two levels: at the feature level, and at the feature set level. At the feature level:

p = st_point(c(0,1))
mp = st_multipoint(rbind(c(1,1), c(3,3)))
ls = st_linestring(rbind(c(1,1), c(3,3)))
(gc = st_geometrycollection(list(p, mp)))
## GEOMETRYCOLLECTION(POINT(0 1), MULTIPOINT(1 1, 3 3))
st_sfc(gc)
## Geometry set for 1 feature 
## geometry type:  GEOMETRYCOLLECTION
## dimension:      XY
## bbox:           xmin: 0 ymin: 1 xmax: 3 ymax: 3
## epsg (SRID):    NA
## proj4string:    NA
## GEOMETRYCOLLECTION(POINT(0 1), MULTIPOINT(1 1, ...

Where do these come from? For instance from st_intersection:

opar <- par(mfrow = c(1, 2))
a <- st_polygon(list(cbind(c(0,0,7.5,7.5,0),c(0,-1,-1,0,0))))
b <- st_polygon(list(cbind(c(0,1,2,3,4,5,6,7,7,0),c(1,0,.5,0,0,0.5,-0.5,-0.5,1,1))))
plot(a, ylim = c(-1,1))
title("intersecting two polygons:")
plot(b, add = TRUE, border = 'red')
(i <- st_intersection(a,b))
## GEOMETRYCOLLECTION(POINT(1 0), LINESTRING(4 0, 3 0), POLYGON((5.5 0, 7 0, 7 -0.5, 6 -0.5, 5.5 0)))
plot(a, ylim = c(-1,1))
title("GEOMETRYCOLLECTION")
plot(b, add = TRUE, border = 'red')
plot(i, add = TRUE, col = 'green', lwd = 2)

par(opar)

At the feature set level we end up with GEOMETRY objects:

(gm = st_sfc(p, mp, ls))
## Geometry set for 3 features 
## geometry type:  GEOMETRY
## dimension:      XY
## bbox:           xmin: 0 ymin: 1 xmax: 3 ymax: 3
## epsg (SRID):    NA
## proj4string:    NA
## POINT(0 1)
## MULTIPOINT(1 1, 3 3)
## LINESTRING(1 1, 3 3)

How to deal with such messy data? One thing we could do is select:

sf = st_sf(a = 1:3, geom = gm)
st_dimension(sf)
## [1] 0 0 1
sf[st_dimension(sf) == 1,]
## Simple feature collection with 1 feature and 1 field
## geometry type:  GEOMETRY
## dimension:      XY
## bbox:           xmin: 1 ymin: 1 xmax: 3 ymax: 3
## epsg (SRID):    NA
## proj4string:    NA
##   a                 geom
## 3 3 LINESTRING(1 1, 3 3)

another is to cast geometries to a common class:

st_cast(sf, "MULTIPOINT")
## Simple feature collection with 3 features and 1 field
## geometry type:  MULTIPOINT
## dimension:      XY
## bbox:           xmin: 0 ymin: 1 xmax: 3 ymax: 3
## epsg (SRID):    NA
## proj4string:    NA
##   a             geometry
## 1 1      MULTIPOINT(0 1)
## 2 2 MULTIPOINT(1 1, 3 3)
## 3 3 MULTIPOINT(1 1, 3 3)

which converts the LINESTRING to a MULTIPOINT as well.

st_cast also tries to unravel GEOMETRYCOLLECTION geometries in their parts:

(parts = st_cast(st_sf(a = 1, geometry = st_sfc(gc))))
## Warning in st_cast.sf(st_sf(a = 1, geometry = st_sfc(gc))): repeating
## attributes for all sub-geometries for which they may not be constant
## Simple feature collection with 2 features and 1 field
## geometry type:  GEOMETRY
## dimension:      XY
## bbox:           xmin: 0 ymin: 1 xmax: 3 ymax: 3
## epsg (SRID):    NA
## proj4string:    NA
##   a             geometry
## 1 1           POINT(0 1)
## 2 1 MULTIPOINT(1 1, 3 3)

potentially repeating associated attributes, and tries to find a common class when given no argument

st_cast(parts)
## Simple feature collection with 2 features and 1 field
## geometry type:  MULTIPOINT
## dimension:      XY
## bbox:           xmin: 0 ymin: 1 xmax: 3 ymax: 3
## epsg (SRID):    NA
## proj4string:    NA
##   a             geometry
## 1 1      MULTIPOINT(0 1)
## 2 1 MULTIPOINT(1 1, 3 3)

empty geometries

Empty geometries server the role of NA, or missing values, for geometries, but they are typed:

st_point()
## POINT(NA NA)
st_geometrycollection()
## GEOMETRYCOLLECTION()

They can be detected by st_dimension returning NA:

st_sfc(st_point(0:1), st_point(), st_geometrycollection()) %>%
    st_dimension
## [1]  0 NA NA

Pipe-based workflows and tidyverse

When both sf and dplyr are loaded, a number of generic methods in the tidyverse become available for sf objects. They are:

m1 = attr(methods(class = "sf"), "info")$generic
library(tidyverse)
## + ggplot2 2.2.1.9000        Date: 2017-07-04
## + tibble  1.3.3                R: 3.4.0
## + tidyr   0.6.1               OS: Ubuntu 16.04.2 LTS
## + readr   1.1.0              GUI: X11
## + purrr   0.2.2.9000      Locale: en_US.UTF-8
## + dplyr   0.7.1               TZ: Europe/Berlin
## + stringr 1.2.0           
## + forcats 0.2.0
## Conflicts -----------------------------------------------------------------
## * filter(),  from dplyr, masks stats::filter()
## * lag(),     from dplyr, masks stats::lag()
m2 = attr(methods(class = "sf"), "info")$generic
noquote(setdiff(m2, m1))
##  [1] anti_join   arrange_    arrange     distinct_   distinct   
##  [6] filter_     filter      full_join   gather_     group_by_  
## [11] group_by    inner_join  left_join   mutate_     mutate     
## [16] nest_       rename_     rename      right_join  sample_frac
## [21] sample_n    select_     select      semi_join   separate_  
## [26] slice_      slice       spread_     summarise_  summarise  
## [31] transmute_  transmute   ungroup     unite_

Most of these do little more than simply make sure the returned object is of class sf again, but some of them do more, e.g. summarise by default aggregates (unions) the summarised geometries:

cent <- st_geometry(nc) %>% st_centroid %>% st_coordinates 
nc1 <- nc %>% 
  mutate(area = as.numeric(st_area(.)), longitude = cent[,1]) %>%
  group_by(cut(longitude,4))  %>% 
  summarise(area = sum(area), BIR79 = sum(BIR79), dens = mean(BIR79/area))
ggplot(nc1) + geom_sf(aes(fill = dens))

Array data: rasters, spatial time series

Although sp has some simple infrastructure for them in the Spatial class framework, raster data in R are best handled by the raster package. It is feature-rich, well integrated with the Spatial class framework, and can deal with massive rasters, as long as they fit on the local disk. It does not integrate particularly well with the tidyverse, or with simple features. Neither does it handle four- or higher-dimensional dimensional data (e.g. x,y.t,color), or time series data for features.

A follow-up project of the “simple features for R” project tries to address a number of these issues

Summary/outlook

In comparison to sp, package sf

What’s really new, and better in sf, compared to sp?

Still to do: