```{r echo=FALSE}
ev <- Sys.getenv("USER") == "edzer"
```
### Analysing lattice data; big geospatial datasets
Understand
- what lattice data are, and which concepts, tools and models are used to analyse them
- how big big datasets are, or can be
- the difference between big vector and raster datasets, and datacubes
- how to find data in complex programs
- how to work with big data: download, or compute in the cloud near the data?
### Reading materials
From [Spatial Data Science: with applications in R](https://r-spatial.org/book/):
- Chapters 14-17: Lattice data analysis
- Chapter 9: Large data and cloud native

Also: [stars vignette 2: proxy objects](https://r-spatial.github.io/stars/articles/stars2.html)
## Exercises you could do to prepare for today
- Exercises of Ch 14: Lattice data
- Exercises of Ch 9: Big Data and Cloud Native
::: {.callout-tip title="Summary"}
- What is big?
- Raster or vector?
- How to access large data sets?
- Spatial statistics on large datasets
:::
## Analysing lattice data: neighbours, weights, models
```{r}
library(sf)
data(pol_pres15, package = "spDataLarge")
pol_pres15 |>
  subset(select = c(TERYT, name, types)) |>
  head()
library(tmap, warn.conflicts = FALSE)
tm_shape(pol_pres15) + tm_fill("types")
```
We need to make the geometries valid first,
```{r}
st_is_valid(pol_pres15) |> all()
pol_pres15 <- st_make_valid(pol_pres15)
st_is_valid(pol_pres15) |> all()
```
First, we consider polygons in relation to their direct (contiguity-based) neighbours:
```{r}
library(spdep)
pol_pres15 |> poly2nb(queen = TRUE) -> nb_q
nb_q
```
Is the graph connected?
```{r}
(nb_q |> n.comp.nb())$nc
```
```{r}
par(mar = rep(0, 4))
pol_pres15 |>
  st_geometry() |>
  st_centroid(of_largest_polygon = TRUE) -> coords
plot(st_geometry(pol_pres15), border = 'grey')
plot(nb_q, coords = coords, add = TRUE, points = FALSE)
```
Alternative approaches to form neighbourhood matrices (a code sketch for each follows after this list):
- based on distance, e.g. setting a distance threshold or selecting a fixed number of nearest neighbours
- based on triangulating points, for instance polygon centroids
- sphere of influence, a modification of triangulation
- include neighbours from neighbours
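A sketch of these alternatives, using the `coords` (centroids) and `nb_q` objects created above; the distance threshold and `k` below are only illustrative choices:
```{r eval=FALSE}
# distance band: all neighbours within 30 km (distance units follow the CRS)
nb_d <- dnearneigh(coords, d1 = 0, d2 = 30000)
# k nearest neighbours, here with an arbitrary k = 5
nb_k <- knearneigh(coords, k = 5) |> knn2nb()
# triangulation of the polygon centroids
nb_tri <- tri2nb(coords)
# sphere of influence: a pruned version of the triangulation graph
nb_soi <- soi.graph(nb_tri, coords) |> graph2nb()
# higher-order neighbours: neighbours of neighbours, up to lag 2
nb_lag2 <- nblag(nb_q, maxlag = 2)
```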
### Weights matrices
Weights matrices are needed in the analysis: they determine how observations
(or residuals) are weighted in a regression model.
```{r}
(nb_q |> nb2listw(style = "B") -> lw_q_B)
```
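The `style = "B"` above gives binary (0/1) weights; row-standardised weights, where each observation's neighbour weights sum to one, are a common alternative. A minimal sketch:
```{r eval=FALSE}
# row-standardised ("W") weights instead of binary ("B") weights
(nb_q |> nb2listw(style = "W") -> lw_q_W)
```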
### Spatial correlation: Moran's I
Moran's I is defined as
$$
I = \frac{n \sum_{(2)} w_{ij} z_i z_j}{S_0 \sum_{i=1}^{n} z_i^2}
$$
where $x_i, i=1, \ldots, n$ are the $n$ observations on the numeric variable of interest, $z_i = x_i - \bar{x}$, $\bar{x} = \sum_{i=1}^{n} x_i / n$, $\sum_{(2)} = \sum_{i=1}^{n} \sum_{j=1,\, j \neq i}^{n}$ denotes the double sum over all pairs with $i \neq j$, $w_{ij}$ are the spatial weights, and $S_0 = \sum_{(2)} w_{ij}$.
We can compute it as
```{r}
pol_pres15$I_turnout |>
  moran.test(lw_q_B, randomisation = FALSE,
             alternative = "two.sided")
plot(pol_pres15["I_turnout"])
```
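The test above relies on the normality assumption (`randomisation = FALSE`); a permutation-based check can be obtained with `moran.mc()`. A minimal sketch, with the number of simulations chosen arbitrarily:
```{r eval=FALSE}
# Monte Carlo (permutation) test for Moran's I
moran.mc(pol_pres15$I_turnout, lw_q_B, nsim = 999,
         alternative = "greater")
```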
A simple linear regression model, assuming independent observations, can be carried out using `lm`:
```{r}
summary(pol_pres15$I_entitled_to_vote)
(lm0 <- lm(I_turnout ~ I_entitled_to_vote, pol_pres15)) |> summary()
pol_pres15$res = residuals(lm0)
plot(pol_pres15["res"])
```
A spatial linear regression model (SEM: spatial error model), which allows
for spatially correlated errors, can be fitted using `errorsarlm`:
```{r}
form = I_turnout ~ I_entitled_to_vote
library(spatialreg)
SEM_pres <- errorsarlm(form, data = pol_pres15, Durbin = FALSE,
                       listw = lw_q_B, zero.policy = TRUE)
SEM_pres |> summary()
```
## Exercises
1. Compare the results of the simple linear regression with those of the spatial error model.
2. Compare the maps of residuals of both models.
3. Fit a spatial Durbin error model (SDEM), using `Durbin = TRUE` in the same call to `errorsarlm`; compare the output of the spatial Durbin model with that of the error model.
4. Carry out a likelihood ratio test to compare the SEM and SDEM models (`lmtest::lrtest()`, see the SDS book Ch 17); a starting-point sketch follows below.
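A possible starting point for exercises 3 and 4 (a sketch; `form`, `lw_q_B` and `SEM_pres` are defined in the chunks above):
```{r eval=FALSE}
# spatial Durbin error model: also includes spatially lagged covariates
SDEM_pres <- errorsarlm(form, data = pol_pres15, Durbin = TRUE,
                        listw = lw_q_B, zero.policy = TRUE)
# likelihood ratio test of the nested SEM against the SDEM
lmtest::lrtest(SEM_pres, SDEM_pres)
```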
## Big data: resource constraints in data science projects
Constraints concern the availability of:
- time (your time, time of team members)
- compute (pc's, cluster, private cloud)
- money (e.g. to hire and/or (re)train people, or to rent public cloud infrastructure)
Public clouds provide:
- infinite (in practice) compute
- infinite (in practice) storage
but come at a cost:
- real money for usage (compute, storage, network/data access)
- staff capacity to set them up and maintain them
### There is no cloud!
it's just someone else's computer!
- which is true: the computers have a different shape, but are just like your laptop:
- they have a CPU, main memory, hard drive, possibly a GPU
- quite often you will find yourself on a virtual machine, which acts as a normal computer
- but see below: they have object storage!
## What is a big dataset?
- What is big?
- too big to handle comfortably in main memory, because operations make copies (GB)
- too big to fit in memory (20 GB)
- too big to download (TB)
- too big to fit on the hard drive, or local file storage (10 TB)
- too big to move (copy) to your institution (100 TB - PB)
## R for big, tabular datasets
- In-memory solutions: `data.table`, `duckdb`, `polars` improve speed (use indexes)
- Out-of-memory solutions: `DBI`, or the `tidyverse` via `dbplyr`, connecting to (a minimal sketch follows after this list)
  - a local, on-disk database like MariaDB, PostgreSQL, or MySQL
  - cloud-based databases like Google BigQuery or Snowflake
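A minimal sketch of the out-of-memory route, using an in-process `duckdb` database as a stand-in for a remote database server (for a real server, swap the driver, e.g. `RPostgres::Postgres()` or `bigrquery::bigquery()`, and supply connection details):
```{r eval=FALSE}
library(DBI)
con <- dbConnect(duckdb::duckdb())
dbWriteTable(con, "mtcars", mtcars)
# the aggregation runs inside the database engine, not in R's memory
dbGetQuery(con, "SELECT cyl, AVG(mpg) AS mean_mpg FROM mtcars GROUP BY cyl")
dbDisconnect(con, shutdown = TRUE)
```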
## Big geospatial
- Large **vector** datasets, examples:
  - all building footprints of a continent, [link](https://github.com/microsoft/USBuildingFootprints)
  - all rivers, e.g. of the US, [link](https://www.arcgis.com/home/item.html?id=1e29e33360c8441bbb018663273a046e)
  - OpenStreetMap, [link](https://wiki.openstreetmap.org/wiki/Downloading_data)
  - all agricultural parcels of a continent, e.g. [EuroCrops](https://github.com/maja601/EuroCrops)
- Large **raster** datasets, image collections and data cubes:
  - [ERA-5](https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-single-levels?tab=overview)
  - [CMIP-6](https://pcmdi.llnl.gov/CMIP6/), partly on [Google](https://console.cloud.google.com/marketplace/product/noaa-public/cmip6?project=erudite-gate-281920) and [AWS](https://registry.opendata.aws/cmip6/)
  - Copernicus (Sentinel-1, 2, 3, 5p, etc.), e.g. on [CDSE](https://dataspace.copernicus.eu/)
  - Landsat, MODIS, ... [download](https://earthexplorer.usgs.gov/)
- Cloud solutions, cloud platforms, *with* platform lock-in:
  - [ArcGIS Online](https://www.esri.com/en-us/arcgis/products/arcgis-online/overview)
  - [Sentinel Hub](https://www.sentinel-hub.com/)
  - [Google Earth Engine](https://earthengine.google.com/)
  - [Microsoft Planetary Computer](https://planetarycomputer.microsoft.com/)
  - [Earth on Amazon](https://aws.amazon.com/earth/) (AWS US-west Oregon: COGs + STAC for S1 + S2)
  - Copernicus Data Space Ecosystem (but has [openEO](https://openeo.org/): a fully open standard and open source software stack)
::: callout-tip
## Clouds and object storage
Object storage abstracts away hard drives and file systems!
- e.g. S3 bucket (AWS/OpenStack):
  - total size is _unlimited_
  - maximum object size 5 TB (AWS S3)
  - idea: write once, read many times
  - large objects: write piece-wise
  - partial reads via HTTP range requests
  - price depends on size, access speed, number of requests
- tabular data: Parquet
- large data processing: collocate processing and storage
  - avoid network traffic between locations / data centers
  - network inside a data center is fast / cheap
:::
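A minimal sketch of reading only selected columns from a Parquet file with the `arrow` package; the file is local here, but a similar pattern works against object storage (e.g. via `arrow::s3_bucket()`):
```{r eval=FALSE}
library(arrow)
write_parquet(mtcars, "mtcars.parquet")
# Parquet is columnar: only the requested columns need to be read
read_parquet("mtcars.parquet", col_select = c("mpg", "cyl")) |> head()
```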
## Access mechanism
- API:
  - process: [openEO cloud](https://openeo.cloud/), [openEO on CDSE](https://openeo.dataspace.copernicus.eu/)
  - select, download, process: [Climate Data Store](https://cds.climate.copernicus.eu/#!/home)
  - find "assets" (files): [STAC](https://stacspec.org/en), [stacindex](https://stacindex.org/)
- partial reads of data cubes: variable, bounding box, strided (low resolution), time period (a sketch follows after this list)
- vector tiles: pmtiles, flatgeobuf
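A minimal sketch of lazy, partial reading with `stars` proxy objects (see the proxy-objects vignette in the reading list); a small file shipped with `stars` is used here, but the mechanism is the same for large (remote) files:
```{r eval=FALSE}
library(stars)
tif <- system.file("tif/L7_ETMs.tif", package = "stars")
# proxy = TRUE: only metadata is read, no pixel values
(p <- read_stars(tif, proxy = TRUE))
# pixel values are fetched (and downsampled) only when needed, e.g. when plotting
plot(p)
```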
::: callout-tip
## Cloud-optimized, cloud-native geospatial
- Cloud-optimized formats let you read *sections* of large, remote files using HTTP range requests
- examples: Cloud-optimized GeoTIFF (COG), GeoZarr, GeoParquet
- These are described in the [Cloud-Optimized Geospatial Formats Guide](https://guide.cloudnativegeo.org/)
:::
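A sketch of what such a range-request read could look like from R; the URL is hypothetical and should be replaced by a real, publicly readable COG:
```{r eval=FALSE}
library(stars)
# hypothetical URL; the /vsicurl/ prefix makes GDAL read the remote file
# with HTTP range requests rather than downloading it entirely
url <- "https://example.com/some_image_cog.tif"
read_stars(paste0("/vsicurl/", url), proxy = TRUE)
```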
## Examples openEO
Two videos of mine, recorded during the 2023 OpenGeoHub Summer School on the topic "Cloud-based analysis of Earth Observation data using openEO Platform, R and Python", can be found here:
- [part 1](https://www.youtube.com/watch?v=NurpU0V6JG8)
- [part 2](https://www.youtube.com/watch?v=mrY8VKOoz3c)
## Example [`rstac`](https://cran.r-project.org/web/packages/rstac/index.html)
Using [Sentinel-2 COGs at AWS](https://registry.opendata.aws/sentinel-2-l2a-cogs/), and its [stac](https://earth-search.aws.element84.com/v1):
```{r}
library(rstac) # modified from the package docs:
s_obj = stac("https://earth-search.aws.element84.com/v1")
collections(s_obj) |> get_request()
it_obj <- s_obj |>
  stac_search(collections = "sentinel-2-l2a",
              bbox = c(-47.02148, -17.35063, -42.53906, -12.98314),
              datetime = "2022-02-12T00:00:00Z/2022-03-18T00:00:00Z",
              limit = 1) |>
  get_request()
it_obj
```
then, download (here only one item):
```{r eval=FALSE}
download_items <- it_obj |>
  assets_download(assets_name = "thumbnail", items_max = 1, overwrite = TRUE)
```
(this does not work here; it probably needs access to S3 from within a cloud environment)
and examine
```{r eval=ev}
library(sf)
tif = "sentinel-s2-l2a-cogs/23/K/MA/2022/3/S2A_23KMA_20220317_0_L2A/B04.tif"
gdal_utils("info", tif)
library(stars)
read_stars(tif) |> plot()
```
::: {.callout-note title="Breakout session 2"}
Discuss:
- Have you used any cloud platforms for processing geospatial data?
- What is your position with respect to platform lock-in?
:::
## Further examples from r-spatial.org:
- [Cloud-based processing of satellite image collections in R using STAC, COGs, and on-demand data cubes](https://r-spatial.org/r/2021/04/23/cloud-based-cubes.html)
- [Reading Zarr files with R package stars](https://r-spatial.org/r/2022/09/13/zarr.html)
- [Processing large scale satellite imagery with openEO Platform and R](https://r-spatial.org/r/2022/11/24/openeo.html)
## Examples `/vsixxx`
```{r eval=ev}
curl::curl_download(
  "https://github.com/paleolimbot/geoarrow-data/releases/download/v0.0.1/nshn_water_line.gpkg",
  "nshn_water_line.gpkg"
)
```
```{r}
library(sf)
```
```{r eval=ev}
(w <- read_sf("nshn_water_line.gpkg"))
```
From https://github.com/microsoft/USBuildingFootprints we downloaded [Maine.geojson.zip](https://minedbuildings.z5.web.core.windows.net/legacy/usbuildings-v2/Maine.geojson.zip), and read it with
```{r eval=ev}
(m = read_sf("/vsizip/Maine.geojson.zip")) # /vsizip: indicates data source is a zipped file
```
or read directly from github into R:
```{r eval=FALSE}
m = st_read("/vsizip/vsicurl/https://minedbuildings.z5.web.core.windows.net/legacy/usbuildings-v2/Maine.geojson.zip")
# /vsicurl: indicates data source is a URL
```
## "Simple" analysis on large datasets
- process full archives, compute *in* the cloud
- select subsets, download, process locally:
  - spatial subset
  - temporal subset
  - sampled at lower resolution (spatially, temporally); see the sketch after this list
  - aggregated (= processed?) to lower resolution
- in some disciplines (Earth Observation?) there seems to be a belief that processing at the *full* resolution is the only thing that produces real science
- there is surprisingly little literature on the loss of information when processing at lower resolution, e.g. when the goal is to create a curve of yearly deforestation over an area as large as Brazil
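As an illustration of reading raster data at reduced resolution, a sketch using the `RasterIO` argument of `stars::read_stars()` on a small sample file; the buffer size (100 x 100) is arbitrary:
```{r eval=FALSE}
library(stars)
tif <- system.file("tif/L7_ETMs.tif", package = "stars")
# request the full 349 x 352 extent, but read it into a 100 x 100 buffer:
# a strided, lower-resolution read
rio <- list(nXOff = 1, nYOff = 1, nXSize = 349, nYSize = 352,
            nBufXSize = 100, nBufYSize = 100)
read_stars(tif, RasterIO = rio)
```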
## Spatial statistics on large datasets
### Geostatistics
- [RandomForestsGLS](https://cran.r-project.org/web/packages/RandomForestsGLS/)
- [spNNGP](https://cran.r-project.org/web/packages/spNNGP/index.html)
- [FRK](https://cran.r-project.org/web/packages/FRK/index.html)
A key paper comparing different approaches is Heaton, Matthew J., Abhirup Datta, Andrew O. Finley, Reinhard Furrer, Joseph Guinness, Rajarshi Guhaniyogi, Florian Gerber, et al. 2018. “A Case Study Competition Among Methods for Analyzing Large Spatial Data.” Journal of Agricultural, Biological and Environmental Statistics, December. [DOI](https://doi.org/10.1007/s13253-018-00348-w).
## Exercises:
1. Discuss: Have you used datasets obtained from cloud storage? For which case(s)?
2. Discuss: Have you used cloud processing? For which case(s)?
3. Run
```r
granule = system.file("sentinel/S2A_MSIL1C_20180220T105051_N0206_R051_T32ULE_20180221T134037.zip", package = "starsdata")
s2 = paste0("SENTINEL2_L1C:/vsizip/", granule, "/S2A_MSIL1C_20180220T105051_N0206_R051_T32ULE_20180221T134037.SAFE/MTD_MSIL1C.xml:10m:EPSG_32632")
```
and open `s2` using `stars::read_stars()`. Look at the summary; is the image data actually loaded in memory?
4. Write the stars object to a GeoTIFF file, using `write_stars()`. How long does it take, and how large is the resulting file?
5. Compute the difference between the first two bands of the Sentinel-2 image, and write the result to a GeoTIFF file.
In which command(s) are the image values actually loaded into memory?
6. Only if time is left: try the [Exercises of SDS chapter 9](https://r-spatial.org/book/09-Large.html#exercises)