10/16/2019, Spatial Data Science Conference, NY; https://edzer.github.io/sdsc19/ (repo)

Data science: the table view

CNTY_ NAME BIR74 SID74 NWBIR74 BIR79 SID79 NWBIR79
1825 Ashe 1091 1 10 1364 0 19
1827 Alleghany 487 0 10 542 3 12
1828 Surry 3188 5 208 3616 6 260
1831 Currituck 508 1 123 830 2 145
1832 Northampton 1421 9 1066 1606 3 1197
1833 Hertford 1452 7 954 1838 5 1237
1834 Camden 286 0 115 350 2 139

… with coordinates

CNTY_ NAME BIR74 SID74 longitude latitude
1825 Ashe 1091 1 -81.49826 36.43140
1827 Alleghany 487 0 -81.12515 36.49101
1828 Surry 3188 5 -80.68575 36.41252
1831 Currituck 508 1 -76.02750 36.40728
1832 Northampton 1421 9 -77.41056 36.42228
1833 Hertford 1452 7 -76.99478 36.36145
1834 Camden 286 0 -76.23435 36.40120

… with POINT geometries

CNTY_ NAME BIR74 SID74 geometry
1825 Ashe 1091 1 POINT (-81.49826 36.4314)
1827 Alleghany 487 0 POINT (-81.12515 36.49101)
1828 Surry 3188 5 POINT (-80.68575 36.41252)
1831 Currituck 508 1 POINT (-76.0275 36.40728)
1832 Northampton 1421 9 POINT (-77.41056 36.42228)
1833 Hertford 1452 7 POINT (-76.99478 36.36145)
1834 Camden 286 0 POINT (-76.23435 36.4012)

… with MULTIPOLYGON geometries

CNTY_ NAME BIR74 SID74 geometry
1825 Ashe 1091 1 MULTIPOLYGON (((-81.47276 36.23436, -81.54084 3…
1827 Alleghany 487 0 MULTIPOLYGON (((-81.23989 36.36536, -81.24069 3…
1828 Surry 3188 5 MULTIPOLYGON (((-80.45634 36.24256, -80.47639 3…
1831 Currituck 508 1 MULTIPOLYGON (((-76.00897 36.3196, -76.01735 36…
1832 Northampton 1421 9 MULTIPOLYGON (((-77.21767 36.24098, -77.23461 3…
1833 Hertford 1452 7 MULTIPOLYGON (((-76.74506 36.23392, -76.98069 3…
1834 Camden 286 0 MULTIPOLYGON (((-76.00897 36.3196, -75.95718 36…

the "shapefile" view

## Coordinate Reference System:
##   EPSG: 4267 
##   proj4string: "+proj=longlat +datum=NAD27 +no_defs"

Coordinate reference systems

  • are the "measurement units" of spatial coordinates
  • relate a location to a particular reference ellipsoid ("datum")
  • may describe a 2-D projection
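As an illustration of such a 2-D projection (a Python sketch; `to_web_mercator` is a made-up name, the formulas are the standard spherical Web Mercator ones):

```python
import math

R = 6378137.0  # WGS84 semi-major axis, metres

def to_web_mercator(lon_deg, lat_deg):
    """Spherical Web Mercator (EPSG:3857) forward projection:
    degrees in, metres out."""
    x = R * math.radians(lon_deg)
    y = R * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y

# one degree of longitude at the equator maps to about 111.3 km
x, y = to_web_mercator(1.0, 0.0)
```

The projected "metres" are only true-to-scale at the equator, which is exactly why the measurement-unit analogy matters.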

https://xkcd.com/977/

pts = st_centroid(st_geometry(nc))
## Warning in st_centroid.sfc(st_geometry(nc)): st_centroid does not give
## correct centroids for longitude/latitude data
st_crs(pts)
## Coordinate Reference System:
##   EPSG: 4267 
##   proj4string: "+proj=longlat +datum=NAD27 +no_defs"
pts[[1]]
## POINT (-81.49826 36.4314)
st_transform(pts[1], "+proj=longlat +datum=WGS84")[[1]]
## POINT (-81.49809 36.43152)

Joining tables, spatially

  • if longitude and latitude are in columns (fields), then a regular (equality) join will work
  • if you have POINT geometries in a column, use st_equals (or st_is_within_distance)
  • if you have other geometries, there are a lot of options:
    • st_intersects, st_disjoint
    • st_covers, st_covered_by, st_touches, st_within
    • st_relate with pattern
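Under the hood, such predicates reduce (for planar data) to computational-geometry tests. A minimal Python sketch of a point-in-polygon join using the even-odd ray-casting rule (no holes or edge cases handled; all names are illustrative):

```python
def point_in_polygon(pt, ring):
    """Even-odd ray casting: count edges crossed by a ray going east
    from pt; an odd count means the point is inside the ring."""
    x, y = pt
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        if (y1 > y) != (y2 > y):               # edge straddles the ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                inside = not inside
    return inside

# join each point to the (first) polygon that contains it
polygons = {"A": [(0, 0), (2, 0), (2, 2), (0, 2)],
            "B": [(2, 0), (4, 0), (4, 2), (2, 2)]}
points = [(1, 1), (3, 1)]
join = [(p, next((k for k, r in polygons.items() if point_in_polygon(p, r)), None))
        for p in points]
```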

Machine learning with coordinates as features

Potential issues:

  • you are trying to explain variability by where something happens, instead of (or in addition to) what is going on there (feature variables); this may be competitive, but possibly for the wrong reason (extrapolation?)
  • if you have clustered training data and use naive (random) cross-validation, location may identify the cluster quite well, resulting in overly optimistic performance measures
  • as with regression modelling strategies, check that the models are not sensitive to translations and rotations of the coordinates (in \(R^2\), or on \(S^2\), or both?)
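One remedy for the cross-validation issue is spatial blocking: make folds follow spatial clusters rather than random assignment. A hypothetical stdlib-only sketch (function and variable names are made up):

```python
import math

def spatial_folds(coords, block_size):
    """Assign each point the id of its spatial block; using blocks as
    cross-validation folds keeps members of one cluster together, so a
    test fold is not 'predicted' from neighbouring training points."""
    block_ids = {}
    folds = []
    for x, y in coords:
        key = (math.floor(x / block_size), math.floor(y / block_size))
        block_ids.setdefault(key, len(block_ids))
        folds.append(block_ids[key])
    return folds

# two spatial clusters end up in two folds
coords = [(0.1, 0.2), (0.3, 0.1), (5.2, 5.1), (5.4, 5.3)]
folds = spatial_folds(coords, block_size=1.0)
```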

GIS and the curse of two-dimensionality

g[c(1,36)] %>% st_touches()
## although coordinates are longitude/latitude, st_touches assumes that they are planar
## Sparse geometry binary predicate list of length 2, where the predicate was `touches'
##  1: (empty)
##  2: (empty)

g[c(1,36)] %>% st_transform(3031) %>% st_touches()
## Sparse geometry binary predicate list of length 2, where the predicate was `touches'
##  1: (empty)
##  2: (empty)
g[c(1,36)] %>% st_transform(3031) %>% st_set_precision(units::set_units(1, mm)) %>% st_touches()
## Sparse geometry binary predicate list of length 2, where the predicate was `touches'
##  1: 2
##  2: 1

Now we have three problems:

  1. are polygons 1 (red) and 36 (green) square, or triangular?
  2. do they touch each other, or not?
  3. when should we round polygon coordinates?

Is this a real problem?

  • treating geodetic coordinates (degrees) as 2-D (GIS) often works fine, in particular for small areas not close to the poles
  • if you accept that the coordinates are interpreted equirectangularly, all is fine!
  • writing software that warns only in case of problems is very hard: when is an answer wrong enough to trigger a warning or error?
  • writing software that emits warnings (like sf) invites downstream developers to silence these warnings…
  • PostGIS ("geography", and sf through pkg lwgeom) currently take a half-hearted approach:
    • do many things "right" (areas, distances)
    • for complicated things (st_intersects, st_buffer): don't offer them, emulate them, or warn
  • modern systems (s2geometry/h3; Google BigQuery GIS, GEE) take a fully global approach, which makes a lot of sense to me
  • modellers map the world from 0 to 360 degrees; many plots call for including the antimeridian; many model grids are in long-lat after rotating poles
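To see how large the planar error can get away from the equator: a sketch comparing a naive "degrees as planar coordinates" distance with the great-circle (haversine) distance, assuming a spherical Earth (function names are illustrative):

```python
import math

R = 6371.0  # mean Earth radius in km (spherical assumption)

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance between two lon/lat points, in km."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * R * math.asin(math.sqrt(a))

def equirect_km(lon1, lat1, lon2, lat2):
    """Distance when degrees are naively treated as planar coordinates."""
    return math.radians(1) * R * math.hypot(lon2 - lon1, lat2 - lat1)

# near the equator the two agree; at 80 degrees latitude the planar
# answer is more than five times too large
d_plane = equirect_km(0, 80, 10, 80)
d_sphere = haversine_km(0, 80, 10, 80)
```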

GeoJSON (RFC 7946)

GeoJSON bakes the GIS legacy problem into the standard: lines or polygons should never cross the antimeridian (180°E = 180°W):

Clause 3.1.9. Antimeridian Cutting:

  • In representing Features that cross the antimeridian, interoperability is improved by modifying their geometry. Any geometry that crosses the antimeridian SHOULD be represented by cutting it in two such that neither part's representation crosses the antimeridian.
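A sketch of what such cutting involves for a single segment, with latitude interpolated linearly at the crossing (real implementations interpolate along a great circle or geodesic; the function name is made up):

```python
import math

def cut_at_antimeridian(p1, p2):
    """Split a lon/lat segment that crosses the antimeridian into two
    parts, as RFC 7946 recommends. Latitude at the crossing is
    interpolated linearly, a simplification of the real geometry."""
    (lon1, lat1), (lon2, lat2) = p1, p2
    if abs(lon2 - lon1) <= 180:
        return [(p1, p2)]                      # does not cross
    # unwrap lon2 so the segment takes the short way around
    lon2u = lon2 + 360 if lon2 < lon1 else lon2 - 360
    t = (math.copysign(180, lon1) - lon1) / (lon2u - lon1)
    lat_c = lat1 + t * (lat2 - lat1)
    b = 180.0 if lon1 > 0 else -180.0          # boundary longitude per side
    return [(p1, (b, lat_c)), ((-b, lat_c), p2)]

parts = cut_at_antimeridian((170, 0), (-170, 10))   # two pieces, cut at lat 5
```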

What does the polygon mean?

for the polygon

## POLYGON ((-180 -90, -170 -90, -170 -80, -180 -80, -180 -90))

the OGC Simple Feature Access standard says that we interpret this as points, connected by straight-line interpolation. What does a straight line mean here?

  • Equirectangular: follow lines of constant longitude or latitude (think graticule)
  • Constant direction (loxodrome): a straight line on a Mercator projection
  • On the sphere:
    • real straight lines: take you from \(S^2\) to \(R^3\), meaning go subsurface (and require 3-D coordinates)
    • great circle arcs (orthodrome): the arc of the circle passing through the two points and the Earth's center
    • such an arc, but on an ellipsoid
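The gap between the equirectangular and the great-circle interpretation shows up already at a segment's midpoint, which can be computed by averaging the two unit vectors in \(R^3\) and projecting back to the sphere (a sketch, assuming a spherical Earth):

```python
import math

def gc_midpoint(lon1, lat1, lon2, lat2):
    """Midpoint of the great-circle arc: average the two unit vectors
    in 3-D and project the result back onto the sphere."""
    def to_xyz(lon, lat):
        lon, lat = math.radians(lon), math.radians(lat)
        return (math.cos(lat) * math.cos(lon),
                math.cos(lat) * math.sin(lon),
                math.sin(lat))
    x1, y1, z1 = to_xyz(lon1, lat1)
    x2, y2, z2 = to_xyz(lon2, lat2)
    x, y, z = x1 + x2, y1 + y2, z1 + z2
    n = math.sqrt(x * x + y * y + z * z)
    return math.degrees(math.atan2(y, x)), math.degrees(math.asin(z / n))

# between (0, 60) and (90, 60): the equirectangular midpoint stays at
# latitude 60, but the great-circle arc bulges poleward to about 67.8
lon_m, lat_m = gc_midpoint(0, 60, 90, 60)
```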

st_area(p1) %>% units::set_units(km^2) # segmentized along parallel
## 108570.1 [km^2]
st_area(p2) %>% units::set_units(km^2) # segmentized along great circle
## 108033.6 [km^2]
## 0.4966177 [%]

w[7,] %>% st_geometry() %>% st_is_valid()
## [1] TRUE
w[7,] %>% st_geometry() %>% st_transform(3031) %>% st_is_valid()
## [1] FALSE

Floating point operations

1/3 == (1 - 2/3)
## [1] FALSE
(x = c(1/3, 1 - 2/3))
## [1] 0.3333333 0.3333333
print(x, digits = 20)
## [1] 0.33333333333333331483 0.33333333333333337034
diff(x)
## [1] 5.551115e-17

8-byte doubles for coordinates: what does it mean?

  • 1 degree is approximately 110 km
  • 8-byte doubles can distinguish relative differences down to roughly 2.2e-16:
.Machine$double.eps
## [1] 2.220446e-16
2.2e-16 * units::set_units(110, km) %>% units::set_units(nm)
## 0.0242 [nm]
  • unlike integers, floating point numbers are not uniformly distributed
  • this often leads to numerical problems (invalid geometries) in geometrical operations, e.g. with GEOS
  • a solution may be to limit precision (round coordinates, e.g. to mm), but this is often trial-and-error work
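The same back-of-the-envelope, and the precision-limiting remedy, as a Python sketch (math.ulp needs Python ≥ 3.9; `snap` is a made-up helper, analogous in spirit to st_set_precision):

```python
import math

# spacing between adjacent doubles just below 180 degrees ...
ulp_deg = math.ulp(180.0)                      # = 2**-45, about 2.8e-14 degrees
# ... which on a sphere of radius 6371 km is nanometres on the ground
ulp_m = math.radians(ulp_deg) * 6371000.0      # about 3e-9 m

def snap(coord, grid=0.001):
    """Limit precision by snapping to a grid (mm for metric coordinates),
    the trial-and-error remedy mentioned above."""
    return round(coord / grid) * grid
```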

Spatial data: Support

  • land use: point support (n is large)
  • population density: polygon support (n is 1)

  • downscaling, point support:
    • simply by sampling
  • downscaling, area support:
    • assume constant/uniform (naive!), or use dasymetric mapping
    • decide whether the variable is extensive (divide) or intensive (average)
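The extensive/intensive distinction is plain arithmetic (a toy sketch; dasymetric mapping would replace the naive area shares with weights from ancillary data):

```python
def downscale(value, area_fractions, extensive):
    """Assign an areal value to sub-areas: an extensive variable (e.g. a
    population count) is split in proportion to area; an intensive one
    (e.g. population density) is carried over unchanged."""
    if extensive:
        return [value * f for f in area_fractions]
    return [value for _ in area_fractions]

pop  = downscale(1000, [0.25, 0.75], extensive=True)    # counts: divide
dens = downscale(40.0, [0.25, 0.75], extensive=False)   # per km2: copy
```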

Spatial data: raster

  • is "the coordinate of a raster cell" the raster center, or a corner?
  • which corners, which sides, belong to a raster cell?
  • is a raster cell:
    • a small polygon with constant value
    • a small polygon with an aggregated value
    • the value of the variable at the cell center?
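One common convention, written out as a sketch; the point of the questions above is precisely that other rasters answer them differently:

```python
def cell_center(origin, res, col, row):
    """Center of cell (col, row), assuming `origin` is the outer corner
    of cell (0, 0), `res` the cell size, and rows counting downward;
    other rasters take 'the coordinate' to be a corner instead."""
    x0, y0 = origin
    dx, dy = res
    return (x0 + (col + 0.5) * dx, y0 - (row + 0.5) * dy)

c = cell_center(origin=(100.0, 200.0), res=(10.0, 10.0), col=0, row=0)
```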

Spatiotemporal data: 1. irregular

  • events: earthquakes, disease cases, births, bird sightings, etc.
    • in addition to a geometry, every event has a time stamp (or period)
    • interest is in frequencies, densities, space/time interactions, the generating process, …
  • movement data: life lines, bus tracks, migration patterns
    • sequences of (location, time) pairs form a track
    • collections of tracks are often analysed jointly
    • ecology: home range / utilisation distribution estimation
    • interest is in density, traffic, space/time interactions, the generating process, alibi problems, …

Many operations (aggregation, densities) lead to regular spatiotemporal data structures

Spatiotemporal data: 2. regular (data cubes)

For every location, the same time sequence is available:

  • (static) sensor data: weather or air quality monitoring, streamflow measurement, traffic sensors
  • imagery from fixed cameras (video)
  • satellite imagery
  • weather and climate modelling

One can use tables for this, but it often makes more sense to use multidimensional arrays, which map (combinations of) dimension values to attribute values; for this reason such data are called data cubes

  • AQ: location, time, parameter \(\rightarrow\) value (vector data cube)
  • satellite: row, col, wavelength (color) \(\rightarrow\) value (raster data cube)
  • weather model: long, lat, time of forecast, time to forecast, pressure (altitude) \(\rightarrow\) temperature, wind, humidity, …
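The "dimensions map to values" idea is just multidimensional array indexing; a dependency-free sketch of a tiny vector data cube, stored as a flat list with row-major offset arithmetic (all names and values are dummies):

```python
# a tiny vector data cube: dimensions (station, time, parameter), stored
# as one flat list in row-major order, like a multidimensional array
stations, times, params = ["S1", "S2"], ["t0", "t1", "t2"], ["NO2", "O3"]
values = list(range(2 * 3 * 2))                # 12 dummy measurements

def at(i_station, i_time, i_param):
    """Map cube indices to the flat offset: ((i * ntime) + j) * nparam + k."""
    return values[(i_station * len(times) + i_time) * len(params) + i_param]

v = at(1, 2, 0)                                # station S2, time t2, NO2
```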

raster data cube 1

raster data cube 2

vector data cube 1

vector data cube 2

Using data cubes

A lot of weather, climate model, and satellite imagery is available for free

  • ERA5, global reanalysis 1979-2019, 0.25\(^\circ\) grid, global, 37 pressure levels (9 PB)
  • Climate model intercomparison programs: CMIP4, CMIP5 (3 PB), CMIP6 (15 PB)
  • Satellite imagery: NASA, Copernicus data (…)

If a small dataset is needed, it can be downloaded; for larger analysis this quickly becomes impractical.

  • on cloud platforms serving the data, going through the files is often still tedious
  • a platform that offers high-level and fast (data cube based) access is Google Earth Engine (GEE)
  • GEE is, however, closed source and its free tier has usage restrictions
  • we (openEO.org) developed an open API (and several clients) which uses a data cube interface to analyse data on a variety of cloud platform architectures (Geotrellis/GeoPySpark, Grass GIS/actinia, WCPS, GEE, …)

Summarizing

  1. spatial data can be seen as
    • tables, extended with geometries that have a coordinate reference system (DS view)
    • geometry sets, with attributes in tables (GIS view)
  2. for DS, many new operators become available (predicates, measures, intersection/buffer/union)
  3. it matters whether you conceive your data as flat, or on a sphere, for instance for what a "straight line" means
  4. the legacy 2-D flat/square GIS view may have had its day
  5. floating point ops often cause trouble in geometrical operations
  6. spatial data have a support, which is important for meaningfully analysing them, but many data formats/services don't reveal information about support
  7. spatiotemporal data can be irregular or regular
    • irregular spatiotemporal data is often handled in tables (or nested tables)
    • regular spatiotemporal data can be handled as vector or raster data cubes
  8. there is a wealth of cube data openly available, still challenging to analyse, but with huge potential