The R-markdown source of the tutorial is found here.

Required packages:

install.packages(c("sf", "tidyverse", "devtools"))

A short history of handling spatial data in R

Simple feature access in R: package sf

Simple feature access is an ISO standard that is widely adopted. It is used in spatial databases, GIS, open source libraries, GeoJSON, GeoSPARQL, etc.

What is this about?

library(sf)
## Linking to GEOS 3.6.2, GDAL 2.2.3, proj.4 4.9.3

links to GEOS, GDAL, Proj.4, liblwgeom; see also the package vignettes

access refers to standardised encodings, such as well-known text (WKT):

(pt = st_point(c(2,4)))
## POINT (2 4)

and well-known binary,

(pt_bin = st_as_binary(pt))
##  [1] 01 01 00 00 00 00 00 00 00 00 00 00 40 00 00 00 00 00 00 10 40

the binary form in which spatial databases put geometries in BLOBs, binary large objects, converted back by

st_as_sfc(list(pt_bin))[[1]]
## POINT (2 4)

as well as to names of functions to manipulate objects, e.g.

st_dimension(pt)
## [1] 0
st_intersects(pt, pt, sparse = FALSE)
##      [,1]
## [1,] TRUE

package sf uses simple R structures to store geometries:

str(pt)
##  'XY' num [1:2] 2 4
str(st_linestring(rbind(c(0,0), c(0,1), c(1,1))))
##  'XY' num [1:3, 1:2] 0 0 1 0 1 1
str(st_polygon(list(rbind(c(0,0), c(0,1), c(1,1), c(0,0)))))
## List of 1
##  $ : num [1:4, 1:2] 0 0 1 0 0 1 1 0
##  - attr(*, "class")= chr [1:3] "XY" "POLYGON" "sfg"

Tidyverse, list-columns

According to the “tidy data” paper (Wickham 2014), data is tidy when

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

but it is not directly clear how this maps to geometries: should a coordinate dimension (e.g. latitude) form a column, should each coordinate (x,y pair) form a colum, or should a whole geometry (e.g. polygon, with holes), form a column? Early attempts (and ggplot2 up to version 2.2.1) wanted simple columns, meaning each coordinate split over two columns. An approach called fortify would mold (tidy?) polygons into simple data.frames. It is well known that this approach has its limitations when polygons have holes.

Since the UseR! 2016 keynote of Hadley, list-columns have been declared tidy. One of the arguments for this was exactly this: polygons with holes are hard to represent in simple data.frames. Other cases are: nested data.frames, or columns that contain, for each record, a model object e.g. obtained from lm.

The tidy data rule for simple feature means: we have a data.frame where each feature forms a row. A single column (a list-column) contains the geometry for each observation. This resembles spatial databases, such as PostGIS.

Package sf puts features in sf tables deriving from data.frame or tbl_df, which have geometries in a list-column of class sfc, where each list element is a single feature’s geometry of class sfg. Feature geometries are represented in R by

(all other classes also fall in one of these categories)

Other tidy aspects of sf:

Reference systems

If one wants to know which position of the Earth we refer to, coordinates of geospatial data require a reference system:

To handle coordinate reference systems, to convert coordinates (projections) and do datum transformations, Proj.4 is the code base most widely adopted, and actively maintained, by the open source geospatial community.

Package sf has crs objects that register coordinate reference systems:

st_crs("+proj=longlat +datum=WGS84")  # "Proj.4 string"
## Coordinate Reference System:
##   EPSG: 4326 
##   proj4string: "+proj=longlat +datum=WGS84 +no_defs"
st_crs(3857)                          # EPSG code
## Coordinate Reference System:
##   EPSG: 3857 
##   proj4string: "+proj=merc +a=6378137 +b=6378137 +lat_ts=0.0 +lon_0=0.0 +x_0=0.0 +y_0=0 +k=1.0 +units=m +nadgrids=@null +wktext +no_defs"
st_crs(3857)$units                    # reveal units
## [1] "m"
st_crs(NA)                            # unknown (assumed planar/Cartesian)
## Coordinate Reference System: NA

crs objects are registered as an attribute of geometry collections.

sf: handling real data

reading and writing spatial data

fname = system.file("shape/nc.shp", package = "sf")
nc = read_sf(fname)
print(nc, n = 3)
## Simple feature collection with 100 features and 14 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## epsg (SRID):    4267
## proj4string:    +proj=longlat +datum=NAD27 +no_defs
## # A tibble: 100 x 15
##    AREA PERIMETER CNTY_ CNTY_ID NAME  FIPS  FIPSNO CRESS_ID BIR74 SID74
##   <dbl>     <dbl> <dbl>   <dbl> <chr> <chr>  <dbl>    <int> <dbl> <dbl>
## 1 0.114      1.44  1825    1825 Ashe  37009  37009        5  1091     1
## 2 0.061      1.23  1827    1827 Alle… 37005  37005        3   487     0
## 3 0.143      1.63  1828    1828 Surry 37171  37171       86  3188     5
## # ... with 97 more rows, and 5 more variables: NWBIR74 <dbl>, BIR79 <dbl>,
## #   SID79 <dbl>, NWBIR79 <dbl>, geometry <MULTIPOLYGON [°]>
plot(nc)
## Warning: plotting the first 10 out of 14 attributes; use max.plot = 14 to
## plot all

coordinate transformation/conversion

  • st_transform: transforms or converts coordinates to new reference system
(a1 = st_area(nc[1,]))                      # area, using geosphere::areaPolygon
## 1137388604 m^2
(a2 = st_area(st_transform(nc[1,], 32119))) # NC state plane, m
## 1137598162 m^2
(a3 = st_area(st_transform(nc[1,], 2264)))  # NC state plane, US foot
## 12244955726 US_survey_foot^2
units::set_units(a1, km^2)
## 1137.389 km^2
units::set_units(a2, km^2)
## 1137.598 km^2
units::set_units(a3, km^2)
## 1137.598 km^2

From here on, we will work with the nc dataset in projected coordinates:

nc = st_transform(nc, 32119) # NC state plane, m

Methods for simple features

A complete list of methods for a particular class is obtained, after loading the class, by

methods(class = "sf")

book keeping, low-level I/O

  • st_as_text: convert to WKT
  • st_as_binary: convert to WKB
  • st_as_sfc: convert geometries to sfc (e.g., from WKT, WKB)
  • as(x, "Spatial"): convert to Spatial*
  • st_as_sf: convert to sf (e.g., convert from Spatial*)

logical binary geometry predicates

  • st_intersects: touch or overlap
  • st_disjoint: !intersects
  • st_touches: touch
  • st_crosses: cross (don’t touch)
  • st_within: within
  • st_contains: contains
  • st_overlaps: overlaps
  • st_covers: cover
  • st_covered_by: covered by
  • st_equals: equals
  • st_equals_exact: equals, with some fuzz

returns an object of class sgbp (sparse geometry binary predicate); there is an as.matrix method for these objects to get the dense logical matrix (which may become huge when you have many features):

nc1 = nc[1:5,]
(i = st_intersects(nc1, nc1))
## Sparse geometry binary predicate list of length 5, where the predicate was `intersects'
##  1: 1, 2
##  2: 1, 2, 3
##  3: 2, 3
##  4: 4
##  5: 5
as.matrix(i)[1:5,1:5]
##       [,1]  [,2]  [,3]  [,4]  [,5]
## [1,]  TRUE  TRUE FALSE FALSE FALSE
## [2,]  TRUE  TRUE  TRUE FALSE FALSE
## [3,] FALSE  TRUE  TRUE FALSE FALSE
## [4,] FALSE FALSE FALSE  TRUE FALSE
## [5,] FALSE FALSE FALSE FALSE  TRUE

geometry generating logical operators

  • st_union: union of several geometries
  • st_intersection: intersection of pairs of geometries
  • st_difference: difference between pairs of geometries
  • st_sym_difference: symmetric difference (xor)
opar = par(mfrow = c(1,2))
ncg = st_geometry(nc[1:3,])
plot(ncg, col = sf.colors(3, categorical = TRUE))
u = st_union(ncg)
plot(u, lwd = 2)
plot(st_intersection(ncg[1], ncg[2]), col = 'red', add = TRUE)
plot(st_buffer(u, 10000), border = 'blue', add = TRUE)
# st_buffer(u, units::set_unit(10, km)) # with sf devel
plot(st_buffer(u, -5000), border = 'green', add = TRUE)

par(opar)

higher-level intersections and differences

If you want not only intersections of differences for pairs of geometries, but e.g. all unique, non-intersecting polygons resulting from higher-order intersections, or unique differences from sequential differencing, you can use st_intersection or st_difference with a single argument.

See the examples in the documentation.

This means that st_intersection(x) does something different from st_intersection(x,x), which may confuse sometimes.

higher-level operations: summarise, interpolate, aggregate, st_join

  • aggregate and summarise use st_union (by default) to group feature geometries
  • st_interpolate_aw: area-weighted interpolation, uses st_intersection to interpolate or redistribute attribute values, based on area of overlap:
  • st_join uses one of the logical binary geometry predicates (default: st_intersects) to join records in table pairs
g = st_make_grid(nc, n = c(20,10))
a1 = st_interpolate_aw(nc["BIR74"], g, extensive = FALSE)
## Warning in st_interpolate_aw(nc["BIR74"], g, extensive = FALSE):
## st_interpolate_aw assumes attributes are constant over areas of x
sum(a1$BIR74) / sum(nc$BIR74) # not close to one: property is assumed spatially intensive
## [1] 1.430389
a2 = st_interpolate_aw(nc["BIR74"], g, extensive = TRUE)
## Warning in st_interpolate_aw(nc["BIR74"], g, extensive = TRUE):
## st_interpolate_aw assumes attributes are constant over areas of x
sum(a2$BIR74) / sum(nc$BIR74)
## [1] 1
#a1$intensive = a1$BIR74
#a1$extensive = a2$BIR74
#plot(a1[c("intensive", "extensive")])
a1$what = "intensive"
a2$what = "extensive"
# devtools::install_github("tidyverse/ggplot2")
library(ggplot2)
l = st_cast(nc, "MULTILINESTRING")
ggplot() + geom_sf(data = rbind(a1,a2), aes(fill = BIR74)) + 
    geom_sf(data = l, col = 'lightgray') + facet_grid(what~.) +
    scale_fill_gradientn(colors = sf.colors(10))

Example of st_join:

# find county matching a coordinate:
pt = st_point(c(472038, 210385))
pt_sfc = st_sfc(pt, crs = st_crs(nc))
pt_sf = st_sf(a = 1, geom = pt_sfc)
st_join(pt_sf, nc)
## Simple feature collection with 1 feature and 15 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 472038 ymin: 210385 xmax: 472038 ymax: 210385
## epsg (SRID):    32119
## proj4string:    +proj=lcc +lat_1=36.16666666666666 +lat_2=34.33333333333334 +lat_0=33.75 +lon_0=-79 +x_0=609601.22 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
##   a  AREA PERIMETER CNTY_ CNTY_ID  NAME  FIPS FIPSNO CRESS_ID BIR74 SID74
## 1 1 0.134      1.59  1980    1980 Rowan 37159  37159       80  4606     3
##   NWBIR74 BIR79 SID79 NWBIR79                  geom
## 1    1057  6427     8    1504 POINT (472038 210385)
st_join(nc, pt_sf, left = FALSE) # full join
## Simple feature collection with 1 feature and 15 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 449767.8 ymin: 194348 xmax: 502410 ymax: 234481
## epsg (SRID):    32119
## proj4string:    +proj=lcc +lat_1=36.16666666666666 +lat_2=34.33333333333334 +lat_0=33.75 +lon_0=-79 +x_0=609601.22 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
##     AREA PERIMETER CNTY_ CNTY_ID  NAME  FIPS FIPSNO CRESS_ID BIR74 SID74
## 50 0.134      1.59  1980    1980 Rowan 37159  37159       80  4606     3
##    NWBIR74 BIR79 SID79 NWBIR79 a                       geometry
## 50    1057  6427     8    1504 1 MULTIPOLYGON (((491842.7 19...
# join-with-self:
nrow(nc)
## [1] 100
x = st_join(nc, nc)
nrow(x) # neighbours AND each state with itself
## [1] 590
x = st_join(nc, nc, join = st_touches)
nrow(x) # neighbours, now excluding itself
## [1] 490
st_rook = function(a, b, prepared) st_relate(a, b, pattern = "F***1****")
x = st_join(nc, nc, join = st_rook)
nrow(x) # "rook" neighbours: touch over a line, not in a point
## [1] 462
# which states touch another state in a point only?
sel = unlist(st_relate(nc, nc, pattern = "F***0****"))
plot(st_geometry(nc))
plot(st_geometry(nc[sel,]), add = TRUE, col = 'grey')

see also this issue.

Which points touch four states?

pts = st_intersection(nc, nc)
## Warning: attribute variables are assumed to be spatially constant
## throughout all geometries
pts = pts[st_dimension(pts) == 0, ]
plot(st_geometry(nc))
plot(st_geometry(nc[sel,]), add = TRUE, col = 'grey')
plot(st_geometry(pts), add = TRUE, col = '#ff000044', pch = 16)

# how many neighbours does each of these points have?
unique(lengths(st_intersects(pts, nc)))
## [1] 4

manipulating geometries

  • st_line_merge: merges lines
  • st_segmentize: adds points to straight lines
  • st_voronoi: creates voronoi tesselation
  • st_centroid: gives centroid of geometry
  • st_convex_hull: creates convex hull of set of points
  • st_triangulate: triangulates set of points (not constrained)
  • st_polygonize: creates polygon from lines that form a closed ring
  • st_simplify: simplifies lines by removing vertices
  • st_split: split a polygon given line geometry
  • st_buffer: compute a buffer around this geometry/each geometry
  • st_make_valid: tries to make an invalid geometry valid (requires lwgeom)
  • st_boundary: return the boundary of a geometry

convenience functions

  • st_zm: sets or removes z and/or m geometry
  • st_coordinates: retrieve coordinates in a matrix or data.frame
  • st_geometry: set, or retrieve sfc from an sf object
  • st_is: check whether geometry is of a particular type

handling mixes: GEOMETRY, GEOMETRYCOLLECTION

We can have mixes of geometries with different types at two levels: at the feature level, and at the feature set level. At the feature level:

p = st_point(c(0,1))
mp = st_multipoint(rbind(c(1,1), c(3,3)))
ls = st_linestring(rbind(c(1,1), c(3,3)))
(gc = st_geometrycollection(list(p, mp)))
## GEOMETRYCOLLECTION (POINT (0 1), MULTIPOINT (1 1, 3 3))
(g = st_sfc(gc))
## Geometry set for 1 feature 
## geometry type:  GEOMETRYCOLLECTION
## dimension:      XY
## bbox:           xmin: 0 ymin: 1 xmax: 3 ymax: 3
## epsg (SRID):    NA
## proj4string:    NA
## GEOMETRYCOLLECTION (POINT (0 1), MULTIPOINT (1 ...
st_collection_extract(g, "POINT")
## Geometry set for 2 features 
## geometry type:  MULTIPOINT
## dimension:      XY
## bbox:           xmin: 0 ymin: 1 xmax: 3 ymax: 3
## epsg (SRID):    NA
## proj4string:    NA
## MULTIPOINT (0 1)
## MULTIPOINT (1 1, 3 3)

Where do these come from? For instance from st_intersection:

opar <- par(mfrow = c(1, 2))
a <- st_polygon(list(cbind(c(0,0,7.5,7.5,0),c(0,-1,-1,0,0))))
b <- st_polygon(list(cbind(c(0,1,2,3,4,5,6,7,7,0),c(1,0,.5,0,0,0.5,-0.5,-0.5,1,1))))
plot(a, ylim = c(-1,1))
title("intersecting two polygons:")
plot(b, add = TRUE, border = 'red')
(i <- st_intersection(a,b))
## GEOMETRYCOLLECTION (POINT (1 0), LINESTRING (4 0, 3 0), POLYGON ((5.5 0, 7 0, 7 -0.5, 6 -0.5, 5.5 0)))
plot(a, ylim = c(-1,1))
title("GEOMETRYCOLLECTION")
plot(b, add = TRUE, border = 'red')
plot(i, add = TRUE, col = 'green', lwd = 2)

par(opar)

At the feature set level we end up with GEOMETRY objects:

(gm = st_sfc(p, mp, ls))
## Geometry set for 3 features 
## geometry type:  GEOMETRY
## dimension:      XY
## bbox:           xmin: 0 ymin: 1 xmax: 3 ymax: 3
## epsg (SRID):    NA
## proj4string:    NA
## POINT (0 1)
## MULTIPOINT (1 1, 3 3)
## LINESTRING (1 1, 3 3)

How to deal with such messy data? One thing we could do is select:

sf = st_sf(a = 1:3, geom = gm)
st_dimension(sf)
## [1] 0 0 1
sf[st_dimension(sf) == 1,]
## Simple feature collection with 1 feature and 1 field
## geometry type:  LINESTRING
## dimension:      XY
## bbox:           xmin: 1 ymin: 1 xmax: 3 ymax: 3
## epsg (SRID):    NA
## proj4string:    NA
##   a                  geom
## 3 3 LINESTRING (1 1, 3 3)

another is to cast geometries to a common class:

st_cast(sf, "MULTIPOINT")
## Simple feature collection with 3 features and 1 field
## geometry type:  MULTIPOINT
## dimension:      XY
## bbox:           xmin: 0 ymin: 1 xmax: 3 ymax: 3
## epsg (SRID):    NA
## proj4string:    NA
##   a                  geom
## 1 1      MULTIPOINT (0 1)
## 2 2 MULTIPOINT (1 1, 3 3)
## 3 3 MULTIPOINT (1 1, 3 3)

which converts the LINESTRING to a MULTIPOINT as well.

st_cast also tries to unravel GEOMETRYCOLLECTION geometries in their parts:

(parts = st_cast(st_sf(a = 1, geometry = st_sfc(gc))))
## Warning in st_cast.sf(st_sf(a = 1, geometry = st_sfc(gc))): repeating
## attributes for all sub-geometries for which they may not be constant
## Simple feature collection with 2 features and 1 field
## geometry type:  GEOMETRY
## dimension:      XY
## bbox:           xmin: 0 ymin: 1 xmax: 3 ymax: 3
## epsg (SRID):    NA
## proj4string:    NA
##     a              geometry
## 1   1           POINT (0 1)
## 1.1 1 MULTIPOINT (1 1, 3 3)

potentially repeating associated attributes, and tries to find a common class when given no argument

st_cast(parts)
## Simple feature collection with 2 features and 1 field
## geometry type:  MULTIPOINT
## dimension:      XY
## bbox:           xmin: 0 ymin: 1 xmax: 3 ymax: 3
## epsg (SRID):    NA
## proj4string:    NA
##     a              geometry
## 1   1      MULTIPOINT (0 1)
## 1.1 1 MULTIPOINT (1 1, 3 3)

empty geometries

Empty geometries server the role of NA, or missing values, for geometries, but they are typed:

st_point()
## POINT EMPTY
st_geometrycollection()
## GEOMETRYCOLLECTION EMPTY

They can be detected by st_is_empty:

st_sfc(st_point(0:1), st_point(), st_geometrycollection()) %>% st_is_empty
## [1] FALSE  TRUE  TRUE

Pipe-based workflows and tidyverse

When both sf and dplyr are loaded, a number of generic methods in the tidyverse become available for sf objects. They are:

m1 = attr(methods(class = "sf"), "info")$generic
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ tibble  1.4.2     ✔ purrr   0.2.5
## ✔ tidyr   0.8.1     ✔ dplyr   0.7.6
## ✔ readr   1.1.1     ✔ stringr 1.3.1
## ✔ tibble  1.4.2     ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
m2 = attr(methods(class = "sf"), "info")$generic
noquote(setdiff(m2, m1))
##  [1] anti_join   arrange     distinct    full_join   gather     
##  [6] group_by    inner_join  left_join   mutate      nest       
## [11] rename      right_join  sample_frac sample_n    select     
## [16] semi_join   separate    slice       spread      summarise  
## [21] transmute   ungroup     unite       unnest

Most of these do little more than simply make sure the returned object is of class sf again, but some of them do more, e.g. summarise by default aggregates (unions) the summarised geometries:

cent <- st_geometry(nc) %>% st_centroid %>% st_coordinates 
nc1 <- nc %>% 
  mutate(area = as.numeric(st_area(.)), longitude = cent[,1]) %>%
  group_by(cut(longitude,4))  %>% 
  summarise(area = sum(area), BIR79 = sum(BIR79), dens = mean(BIR79/area))
ggplot(nc1) + geom_sf(aes(fill = dens))

Array data: rasters, spatial time series

Although sp has some simple infrastructure for them in the Spatial class framework, raster data in R are best handled by the raster package. It is feature-rich, well integrated with the Spatial class framework, and can deal with massive rasters, as long as they fit on the local disk. It does not integrate particularly well with the tidyverse, or with simple features. Neither does it handle four- or higher-dimensional dimensional data (e.g. x,y.t,color), or time series data for features.

A follow-up project of the “simple features for R” project tries to address a number of these issues

Summary/outlook

In comparison to sp, package sf

What’s really new, and better in sf, compared to sp?

Exercises

Load library mapview by

library(mapview)

possibly after installing it. Consider the datasets breweries, trails and franconia, which are available after loading mapview. (Note that the trails dataset is a subset from a larger dataset, based on constraining distances to existing breweries.)

  1. For each of the datasets, describe which geometry type it has, how many features (observations) and attributes (variables) it has.
  2. Plot each of the three objects.
  3. Export the franconia dataset to a geopackage file (franconia.gpkg) and try to view that file in QGIS (if you have that available on your computer).
  4. compare the coordinate reference systems (CRS) of the three objects. In case of differing coordinate reference systems, convert all of them to a single CRS.
  5. For each of the trails, find out which brewery is closest to the trail.
  6. For each of the trails, find out the distance to the clostest brewery.
  7. For each of the trails, find out how many breweries it has that are less than 500 m away from the trail, per km of trail.
  8. For each of the regions in franconia, find out how many breweries it contains, and add this number as a field to the franconia object.
  9. For each of the regions in franconia, count the total number.of.types of beers brewed by the all the breweries in the region.
  10. Count the total length of the trails per region in franconia