- data science, open data science
- the Earth is round
- Spatial data have support
- big data, interoperability, and luxury problems?
Jul 12, 2019, Spatial Statistics Conference; https://edzer.github.io/spasta/
Combining … into a single person. And then, what they do.
Data Science has at least one advantage over Statistics, which partially explains its existence. Re-defining an existing field like Statistics is terribly difficult, whereas it’s much easier to define something new from scratch.
We see a substantial mismatch between what is needed to learn from data and the much smaller subset of activity that is structurally rewarded in academic statistics today.
From: Jennifer Bryan and Hadley Wickham, 2017. Data Science: A Three Ring Circus or a Big Tent? Journal of Computational and Graphical Statistics 26 (4) (part of a collection of discussion pieces on David Donoho’s paper "50 Years of Data Science".)
(Julia Stewart Lowndes, "R for better science in less time", UseR!2019, Jul 10)
You are going to meet for three days with 100 (other) experts to solve problems you have in common, and get feedback on your work. How would you organize these three days?
Besides conferences, we also communicate research findings by publishing research papers. How many comments or questions do you receive about the contents of a published paper, on average?
Waldo Tobler's First Law of Geography? ("Everything is related to everything else, but near things are more related than distant things"). Check. Fisher knew.
And most spatial data today come with geographical (long/lat) coordinates, in degrees, usually referenced to WGS84.
library(sf)
## Linking to GEOS 3.7.0, GDAL 2.4.0, PROJ 5.2.0
(line = st_linestring(rbind(c(0,10), c(10,10))))
## LINESTRING (0 10, 10 10)
(point = st_point(c(5, 10)))
## POINT (5 10)
st_intersection(st_sfc(line, crs = 4326), st_sfc(point, crs = 4326))[[1]]
## although coordinates are longitude/latitude, st_intersection assumes that they are planar
## POINT (5 10)
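On the sphere this answer is wrong: the great circle through (0 10) and (10 10) arcs slightly north of the 10° parallel, so it does not pass through (5 10). A minimal sketch of the spherical check, assuming sf >= 1.0.0 built with the s2 package (not yet available when this was written), reusing the line and point objects created above:

sf_use_s2(TRUE)   # enable spherical (s2) geometry; assumes sf >= 1.0.0 with s2
st_intersection(st_sfc(line, crs = 4326), st_sfc(point, crs = 4326))
## expected: an empty geometry set, because the point lies south of the great circle
sf_use_s2(FALSE)  # restore planar behaviour so the examples below match their output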
suppressPackageStartupMessages(library(spatstat))
suppressPackageStartupMessages(library(maptools))
p = rbind(c(-110, 85), c(120, 87), c(140, 86))
pts = SpatialPoints(p, proj4string = CRS("+proj=longlat"))
as.ppp(pts)
## Planar point pattern: 3 points
## window: rectangle = [-110, 140] x [85, 87] units
bb = st_as_sfc(st_bbox(st_as_sfc(pts)))
bb2 = st_transform(st_set_crs(st_segmentize(st_set_crs(bb, NA), 1), 4326), 3995)
plot(bb2, border = 'red', lwd = 2, graticule = TRUE)
plot(st_transform(st_as_sfc(pts), 3995), add = TRUE, pch = 16, cex = 3)
POINT (-179 50) and POINT (179 51) are close on the sphere, but their planar distance, computed naively from the longitude/latitude values, is much larger.
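A small sketch of that contrast (the numbers in the comments are approximate, not taken from the original):

p1 = st_sfc(st_point(c(-179, 50)), crs = 4326)
p2 = st_sfc(st_point(c( 179, 51)), crs = 4326)
st_distance(p1, p2)                                  # geodetic distance, roughly 180 km
st_distance(st_set_crs(p1, NA), st_set_crs(p2, NA))  # planar distance, about 358 "degrees"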
nc = read_sf(system.file("gpkg/nc.gpkg", package="sf")) # read as sf-tibble
agr = c(AREA = "aggregate", PERIMETER = "aggregate", CNTY_ = "identity",
  CNTY_ID = "identity", NAME = "identity", FIPS = "identity",
  FIPSNO = "identity", CRESS_ID = "identity", BIR74 = "aggregate",
  SID74 = "aggregate", NWBIR74 = "aggregate", BIR79 = "aggregate",
  SID79 = "aggregate", NWBIR79 = "aggregate")
st_agr(nc) = agr
nc[c(9:11, 15)]
## Simple feature collection with 100 features and 3 fields
## Attribute-geometry relationship: 0 constant, 3 aggregate, 0 identity
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## epsg (SRID): 4267
## proj4string: +proj=longlat +datum=NAD27 +no_defs
## # A tibble: 100 x 4
##    BIR74 SID74 NWBIR74                                                  geom
##    <dbl> <dbl>   <dbl>                                    <MULTIPOLYGON [°]>
##  1  1091     1      10 (((-81.47276 36.23436, -81.54084 36.27251, -81.5619…
##  2   487     0      10 (((-81.23989 36.36536, -81.24069 36.37942, -81.2628…
##  3  3188     5     208 (((-80.45634 36.24256, -80.47639 36.25473, -80.5368…
##  4   508     1     123 (((-76.00897 36.3196, -76.01735 36.33773, -76.03288…
##  5  1421     9    1066 (((-77.21767 36.24098, -77.23461 36.2146, -77.29861…
##  6  1452     7     954 (((-76.74506 36.23392, -76.98069 36.23024, -76.9947…
##  7   286     0     115 (((-76.00897 36.3196, -75.95718 36.19377, -75.98134…
##  8   420     0     254 (((-76.56251 36.34057, -76.60424 36.31498, -76.6482…
##  9   968     4     748 (((-78.30876 36.26004, -78.28293 36.29188, -78.3212…
## 10  1612     1     160 (((-80.02567 36.25023, -80.45301 36.25709, -80.4353…
## # … with 90 more rows
pt = st_as_sfc("POINT (-78.25073 34.07663)")
st_intersection(nc, st_sfc(pt, crs = st_crs(nc)))
## although coordinates are longitude/latitude, st_intersection assumes that they are planar
## Warning: attribute variables are assumed to be spatially constant
## throughout all geometries
## Simple feature collection with 1 feature and 14 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: -78.25073 ymin: 34.07663 xmax: -78.25073 ymax: 34.07663
## epsg (SRID): 4267
## proj4string: +proj=longlat +datum=NAD27 +no_defs
## # A tibble: 1 x 15
##    AREA PERIMETER CNTY_ CNTY_ID NAME  FIPS  FIPSNO CRESS_ID BIR74 SID74
##   <dbl>     <dbl> <dbl>   <dbl> <chr> <chr>  <dbl>    <int> <dbl> <dbl>
## 1 0.212      2.02  2241    2241 Brun… 37019  37019       10  2181     5
## # … with 5 more variables: NWBIR74 <dbl>, BIR79 <dbl>, SID79 <dbl>,
## #   NWBIR79 <dbl>, geom <POINT [°]>
i = st_intersection(nc["CNTY_"], st_sfc(pt, crs = st_crs(nc)))
## although coordinates are longitude/latitude, st_intersection assumes that they are planar
nc1 = st_transform(nc, 2264) # NC state plane, US feet
pt1 = st_transform(st_sfc(pt, crs = st_crs(nc)), 2264)
i1 = st_intersection(nc1["CNTY_"], pt1)
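A quick check, not in the original, that the projected and longitude/latitude intersections identify the same county:

i$CNTY_ == i1$CNTY_  # expected TRUE for this point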
New(er) data sources
Now that source code is predominantly open, the next lock-in is that of (open) standards and cloud platforms: what is our answer to that?
If (spatial) data science is to become a new academic discipline, to what extent are we going to manage it?