10/16/2019, Spatial Data Science Conference, NY; https://edzer.github.io/sdsc19/ (repo)

Data science: the table view

CNTY_  NAME         BIR74  SID74  NWBIR74  BIR79  SID79  NWBIR79
1825   Ashe          1091      1       10   1364      0       19
1827   Alleghany      487      0       10    542      3       12
1828   Surry         3188      5      208   3616      6      260
1831   Currituck      508      1      123    830      2      145
1832   Northampton   1421      9     1066   1606      3     1197
1833   Hertford      1452      7      954   1838      5     1237
1834   Camden         286      0      115    350      2      139

… with coordinates

CNTY_  NAME         BIR74  SID74  longitude  latitude
1825   Ashe          1091      1  -81.49826  36.43140
1827   Alleghany      487      0  -81.12515  36.49101
1828   Surry         3188      5  -80.68575  36.41252
1831   Currituck      508      1  -76.02750  36.40728
1832   Northampton   1421      9  -77.41056  36.42228
1833   Hertford      1452      7  -76.99478  36.36145
1834   Camden         286      0  -76.23435  36.40120

… with POINT geometries

CNTY_  NAME         BIR74  SID74  geometry
1825   Ashe          1091      1  POINT (-81.49826 36.4314)
1827   Alleghany      487      0  POINT (-81.12515 36.49101)
1828   Surry         3188      5  POINT (-80.68575 36.41252)
1831   Currituck      508      1  POINT (-76.0275 36.40728)
1832   Northampton   1421      9  POINT (-77.41056 36.42228)
1833   Hertford      1452      7  POINT (-76.99478 36.36145)
1834   Camden         286      0  POINT (-76.23435 36.4012)
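
The step from coordinate columns to a geometry column can be sketched with st_as_sf. This is a minimal illustration, not the code used for the slides: the data frame df is hypothetical (the first three rows of the table above typed in by hand), and crs = 4267 (NAD27) is an assumption about the coordinates.

```r
library(sf)

# hypothetical data frame: first three rows of the coordinate table, typed in by hand
df = data.frame(
  NAME = c("Ashe", "Alleghany", "Surry"),
  BIR74 = c(1091, 487, 3188),
  SID74 = c(1, 0, 5),
  longitude = c(-81.49826, -81.12515, -80.68575),
  latitude = c(36.43140, 36.49101, 36.41252)
)

# replace the two coordinate columns by a single POINT geometry column;
# crs = 4267 (NAD27) is an assumption about these coordinates
pts_sf = st_as_sf(df, coords = c("longitude", "latitude"), crs = 4267)
pts_sf
```

The attribute columns are untouched; only longitude and latitude are folded into the geometry column, which also carries the coordinate reference system.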

… with POLYGON geometries

CNTY_  NAME         BIR74  SID74  geometry
1825   Ashe          1091      1  MULTIPOLYGON (((-81.47276 36.23436, -81.54084 3…
1827   Alleghany      487      0  MULTIPOLYGON (((-81.23989 36.36536, -81.24069 3…
1828   Surry         3188      5  MULTIPOLYGON (((-80.45634 36.24256, -80.47639 3…
1831   Currituck      508      1  MULTIPOLYGON (((-76.00897 36.3196, -76.01735 36…
1832   Northampton   1421      9  MULTIPOLYGON (((-77.21767 36.24098, -77.23461 3…
1833   Hertford      1452      7  MULTIPOLYGON (((-76.74506 36.23392, -76.98069 3…
1834   Camden         286      0  MULTIPOLYGON (((-76.00897 36.3196, -75.95718 36…

the "shapefile" view

## Coordinate Reference System:
##   EPSG: 4267 
##   proj4string: "+proj=longlat +datum=NAD27 +no_defs"

Coordinate reference systems

  • are the "measurement units" of spatial coordinates
  • relate a location to a particular reference ellipsoid ("datum")
  • may describe a 2-D projection

https://xkcd.com/977/

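library(sf)
# assumption: nc is the North Carolina SIDS dataset shipped with the sf package
nc = read_sf(system.file("gpkg/nc.gpkg", package = "sf"))
pts = st_centroid(st_geometry(nc))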
## Warning in st_centroid.sfc(st_geometry(nc)): st_centroid does not give
## correct centroids for longitude/latitude data
st_crs(pts)
## Coordinate Reference System:
##   EPSG: 4267 
##   proj4string: "+proj=longlat +datum=NAD27 +no_defs"
pts[[1]]
## POINT (-81.49826 36.4314)
st_transform(pts[1], "+proj=longlat +datum=WGS84")[[1]]
## POINT (-81.49809 36.43152)
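
The centroid warning above appears because the coordinates are longitude/latitude rather than planar. One possible way around it, sketched here under assumptions, is to project first and compute centroids in the projected coordinates. The choice of EPSG:32119 (NAD83 / North Carolina, metres) is an assumption made for illustration, as is loading nc from the file shipped with sf.

```r
library(sf)

# assumption: nc is the North Carolina dataset shipped with sf
nc = read_sf(system.file("gpkg/nc.gpkg", package = "sf"))

# EPSG:32119 (NAD83 / North Carolina, metres) is an assumed projected CRS
nc_proj = st_transform(nc, 32119)

# centroids computed in planar (projected) coordinates
pts_proj = st_centroid(st_geometry(nc_proj))

# transform the first centroid back to the original CRS for comparison
st_transform(pts_proj[1], st_crs(nc))[[1]]
```

The result will generally differ slightly from the centroid computed directly on longitude/latitude values, since the two are computed in different coordinate spaces.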

Joining tables, spatially

  • if longitude and latitude are in columns (fields), then a regular join on those columns will work, but only where the coordinate values match exactly
  • if you have POINT geometries in a column, use st_equals (or st_is_within_distance for near matches)
  • if you have other geometries, there are a lot of options:
    • st_intersects, st_disjoint
    • st_covers, st_covered_by, st_touches, st_within
    • st_relate with pattern
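
A spatial join with one of these predicates can be sketched with st_join. This is an illustration under assumptions: nc is loaded from the file shipped with sf, EPSG:32119 is an assumed projected CRS, and the query points are hypothetical (one point on the surface of each of the first five counties).

```r
library(sf)

# assumption: nc is the North Carolina dataset shipped with sf,
# projected to EPSG:32119 (an assumed choice) so distances are in metres
nc = read_sf(system.file("gpkg/nc.gpkg", package = "sf"))
nc_proj = st_transform(nc, 32119)

# hypothetical query points: one point on the surface of each of the first five counties
query = st_sf(
  id = 1:5,
  geometry = st_point_on_surface(st_geometry(nc_proj)[1:5])
)

# spatial join: attach the county NAME to every point that intersects a county polygon
joined = st_join(query, nc_proj["NAME"], join = st_intersects)
joined

# the predicate used on its own returns, for each point, the indices of matching polygons
idx = st_intersects(query, nc_proj)

# near matches between points: which points lie within 1000 m of each other
near = st_is_within_distance(query, query, dist = 1000)
```

Swapping the join argument for another predicate (st_within, st_touches, ...) changes which pairs of rows are matched, much like changing the key condition of a regular join.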

Machine learning with coordinates as features

Potential issues:

  • you are trying to explain variability by where something happens, instead of (or in addition to) what is going on there (feature variables); this may give competitive performance, but possibly for the wrong reason (extrapolation?)
  • if you have clustered training data and use naive (random) cross validation, location may identify the cluster almost perfectly, resulting in overly optimistic performance measures
  • as with regression modelling strategies, check that the models are not sensitive to translations and rotations of the coordinates (in the plane \(\mathbb{R}^2\), on the sphere \(S^2\), or both?)
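
The cross validation issue can be illustrated by comparing random folds with spatially blocked folds. This is a sketch only: the clustered data are simulated, and using k-means on the coordinates to form blocks is one possible blocking strategy assumed here, not one prescribed by the slides.

```r
set.seed(42)

# hypothetical clustered training data: 5 clusters of 40 points each
centers = cbind(x = runif(5, 0, 100), y = runif(5, 0, 100))
cl = rep(1:5, each = 40)
d = data.frame(
  x = centers[cl, "x"] + rnorm(200),
  y = centers[cl, "y"] + rnorm(200)
)

# naive cross validation: folds assigned at random, so points from one
# spatial cluster are spread over the training and test folds
d$fold_random = sample(rep(1:5, length.out = nrow(d)))

# spatially blocked cross validation: folds follow k-means clusters of the
# coordinates, so held-out points are spatially separated from training points
d$fold_block = kmeans(d[, c("x", "y")], centers = 5, nstart = 10)$cluster

# fold sizes under both schemes
table(d$fold_random)
table(d$fold_block)
```

A model evaluated on fold_block has to predict for locations away from its training data, which tends to give a less optimistic and more honest performance estimate when the goal is prediction at new locations.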

GIS and the curse of two-dimensionality