- data science, open data science
- the Earth is round
- Spatial data have support
- big data, interoperability, and luxury problems?
Jul 12, 2019, Spatial Statistics Conference; https://edzer.github.io/spasta/
Combining … into a single person. And then, what they do.
Data Science has at least one advantage over Statistics, which partially explains its existence. Re-defining an existing field like Statistics is terribly difficult, whereas it’s much easier to define something new from scratch.
We see a substantial mismatch between what is needed to learn from data and the much smaller subset of activity that is structurally rewarded in academic statistics today.
From: Jennifer Bryan and Hadley Wickham, 2017. Data Science: A Three Ring Circus or a Big Tent? Journal of Computational and Graphical Statistics 26 (4) (part of a collection of discussion pieces on David Donoho’s paper "50 Years of Data Science".)
(Julia Stewart Lowndes, "R for better science in less time", UseR!2019, Jul 10)
You are going to meet for three days with 100 (other) experts to solve problems you have in common, and get feedback on your work. How would you organize these three days?
Besides conferences, we also communicate research findings by publishing research papers. How many comments or questions do you receive about the contents of a published paper, on average?
Waldo Tobler's First Law of Geography? ("Everything is related to everything else, but near things are more related than distant things"). Check. Fisher knew.
And most spatial data today come with geographical (long/lat) coordinates, in degrees, usually referenced to WGS84.
library(sf)
## Linking to GEOS 3.7.0, GDAL 2.4.0, PROJ 5.2.0
(line = st_linestring(rbind(c(0,10), c(10,10))))
## LINESTRING (0 10, 10 10)
(point = st_point(c(5, 10)))
## POINT (5 10)
st_intersection(st_sfc(line, crs = 4326), st_sfc(point, crs = 4326))[[1]]
## although coordinates are longitude/latitude, st_intersection assumes that they are planar
## POINT (5 10)
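On the sphere this answer is wrong: the great circle through (0 10) and (10 10) arcs slightly north of the 10° parallel, so it does not pass through (5 10). A minimal sketch of the spherical check, assuming sf >= 1.0.0 built with the s2 package (not yet available when this was written), reusing the line and point objects created above:

sf_use_s2(TRUE)   # enable spherical (s2) geometry; assumes sf >= 1.0.0 with s2
st_intersection(st_sfc(line, crs = 4326), st_sfc(point, crs = 4326))
## expected: an empty geometry set, because the point lies south of the great circle
sf_use_s2(FALSE)  # restore planar behaviour so the examples below match their output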
suppressPackageStartupMessages(library(spatstat))
suppressPackageStartupMessages(library(maptools))
p = rbind(c(-110, 85), c(120, 87), c(140, 86))
pts = SpatialPoints(p, proj4string = CRS("+proj=longlat"))
as.ppp(pts)
## Planar point pattern: 3 points
## window: rectangle = [-110, 140] x [85, 87] units
bb = st_as_sfc(st_bbox(st_as_sfc(pts)))
bb2 = st_transform(st_set_crs(st_segmentize(st_set_crs(bb, NA), 1), 4326), 3995)
plot(bb2, border = 'red', lwd = 2, graticule = TRUE)
plot(st_transform(st_as_sfc(pts), 3995), add = TRUE, pch = 16, cex = 3)
POINT (-179 50) and POINT (179 51) are close on the sphere, but their planar distance, computed naively from the longitude/latitude values, is much larger.
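A small sketch of that contrast (the numbers in the comments are approximate, not taken from the original):

p1 = st_sfc(st_point(c(-179, 50)), crs = 4326)
p2 = st_sfc(st_point(c( 179, 51)), crs = 4326)
st_distance(p1, p2)                                  # geodetic distance, roughly 180 km
st_distance(st_set_crs(p1, NA), st_set_crs(p2, NA))  # planar distance, about 358 "degrees"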
nc = read_sf(system.file("gpkg/nc.gpkg", package="sf")) # read as sf-tibble
agr = c(AREA = "aggregate", PERIMETER = "aggregate", CNTY_ = "identity",
  CNTY_ID = "identity", NAME = "identity", FIPS = "identity",
  FIPSNO = "identity", CRESS_ID = "identity", BIR74 = "aggregate",
  SID74 = "aggregate", NWBIR74 = "aggregate", BIR79 = "aggregate",
  SID79 = "aggregate", NWBIR79 = "aggregate")
st_agr(nc) = agr
nc[c(9:11, 15)]
## Simple feature collection with 100 features and 3 fields
## Attribute-geometry relationship: 0 constant, 3 aggregate, 0 identity
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## epsg (SRID): 4267
## proj4string: +proj=longlat +datum=NAD27 +no_defs
## # A tibble: 100 x 4
##    BIR74 SID74 NWBIR74                                                  geom
##    <dbl> <dbl>   <dbl>                                    <MULTIPOLYGON [°]>
##  1  1091     1      10 (((-81.47276 36.23436, -81.54084 36.27251, -81.5619…
##  2   487     0      10 (((-81.23989 36.36536, -81.24069 36.37942, -81.2628…
##  3  3188     5     208 (((-80.45634 36.24256, -80.47639 36.25473, -80.5368…
##  4   508     1     123 (((-76.00897 36.3196, -76.01735 36.33773, -76.03288…
##  5  1421     9    1066 (((-77.21767 36.24098, -77.23461 36.2146, -77.29861…
##  6  1452     7     954 (((-76.74506 36.23392, -76.98069 36.23024, -76.9947…
##  7   286     0     115 (((-76.00897 36.3196, -75.95718 36.19377, -75.98134…
##  8   420     0     254 (((-76.56251 36.34057, -76.60424 36.31498, -76.6482…
##  9   968     4     748 (((-78.30876 36.26004, -78.28293 36.29188, -78.3212…
## 10  1612     1     160 (((-80.02567 36.25023, -80.45301 36.25709, -80.4353…
## # … with 90 more rows
pt = st_as_sfc("POINT (-78.25073 34.07663)")
st_intersection(nc, st_sfc(pt, crs = st_crs(nc)))
## although coordinates are longitude/latitude, st_intersection assumes that they are planar
## Warning: attribute variables are assumed to be spatially constant
## throughout all geometries
## Simple feature collection with 1 feature and 14 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: -78.25073 ymin: 34.07663 xmax: -78.25073 ymax: 34.07663
## epsg (SRID): 4267
## proj4string: +proj=longlat +datum=NAD27 +no_defs
## # A tibble: 1 x 15
##    AREA PERIMETER CNTY_ CNTY_ID NAME  FIPS  FIPSNO CRESS_ID BIR74 SID74
##   <dbl>     <dbl> <dbl>   <dbl> <chr> <chr>  <dbl>    <int> <dbl> <dbl>
## 1 0.212      2.02  2241    2241 Brun… 37019  37019       10  2181     5
## # … with 5 more variables: NWBIR74 <dbl>, BIR79 <dbl>, SID79 <dbl>,
## #   NWBIR79 <dbl>, geom <POINT [°]>
i = st_intersection(nc["CNTY_"], st_sfc(pt, crs = st_crs(nc)))
## although coordinates are longitude/latitude, st_intersection assumes that they are planar
nc1 = st_transform(nc, 2264) # NC state plane, US feet
pt1 = st_transform(st_sfc(pt, crs = st_crs(nc)), 2264)
i1 = st_intersection(nc1["CNTY_"], pt1)
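A quick check, not in the original, that the projected and longitude/latitude intersections identify the same county:

i$CNTY_ == i1$CNTY_  # expected TRUE for this point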
New(er) data sources
Now that source code is predominantly open, the next lock-in is that of (open) standards and cloud platforms: what is our answer to that?
If (spatial) data science is to become a new academic discipline, to what extent are we going to manage it?