5  Big spatial datasets

Learning goals

Understand

  • how big “big” datasets are, or can be
  • the difference between big vector and raster datasets, and datacubes
  • how to find data in complex programs
  • how to work with big data: download, or compute in the cloud near the data?

Reading materials

From Spatial Data Science: with applications in R:

  • Chapter 9: Large data and cloud native

stars vignette 2: proxy objects

5.1 Exercises for Today

  • Exercises of Ch 9: Large data and cloud native

Summary

  • What is big?
  • Raster or vector?
  • How to access large data sets?
  • Spatial statistics on large datasets

5.2 Resource constraints in data science projects

Constraints concern the availability of:

  • time (your time, time of team members)
  • compute (pc’s, cluster, private cloud)
  • money (e.g. to hire and/or (re)train people, or to rent public cloud infrastructure)

Public clouds provide:

  • infinite (in practice) compute
  • infinite (in practice) storage

but they cost:

  • hard money to use (compute, storage, network/data access)
  • people capacity to set up and maintain

There is no cloud!

It’s just someone else’s computer!

  • which is true: the computers have a different shape, but are just like your laptop:
    • they have a CPU, main memory, hard drive, possibly a GPU
    • quite often you will find yourself on a virtual machine, which acts as a normal computer
    • but see below: they have object storage!

5.3 What is a big dataset?

  • What is big?
    • too big to handle in main memory once operations make copies (GBs)
    • too big to fit in memory (20 GB)
    • too big to download (TBs)
    • too big to fit on the hard drive, or local file storage (10 TB)
    • too big to move (copy) to your institution (100 TB - PB)

Breakout session 1

Discuss:

  • Have you used datasets obtained from cloud storage? For which case(s)?
  • Have you used cloud processing? For which case(s)?

5.4 R for big, tabular datasets

  • In-memory solutions: data.table, duckdb, polars improve speed (use indexes)
  • Out-of-memory solutions: DBI, or the tidyverse via dbplyr, connecting to
    • a local, on-disc database like MariaDB, PostgreSQL, or MySQL
    • cloud-based databases like Google BigQuery or Snowflake (see the sketch below)
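
A minimal sketch of the out-of-memory route, assuming the duckdb, DBI and dplyr packages (with dbplyr installed); the database file and demo table are just for illustration:

library(DBI)
library(duckdb)
library(dplyr)
con = dbConnect(duckdb(), dbdir = "demo.duckdb")    # on-disc DuckDB database file
dbWriteTable(con, "cars", mtcars, overwrite = TRUE) # small demo table
tbl(con, "cars") |>        # lazy table: no data is pulled into R yet
  group_by(cyl) |>
  summarise(mean_mpg = mean(mpg)) |>
  collect()                # aggregation runs in the database; only the result comes back
dbDisconnect(con, shutdown = TRUE)

The same dplyr code runs largely unchanged against a cloud database such as BigQuery; only the connection object changes.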

5.5 Big geospatial

Clouds and object storage

Object storage abstracts away hard drives and file systems!

  • e.g. S3 bucket (AWS/OpenStack):
    • total size is unlimited
    • maximum object size 5 TB (AWS S3)
    • idea: write once, read many times
    • large objects: write piece-wise
    • HTTP range requests
    • price depends on size, access speed, number of requests
    • tabular data: Parquet (see the sketch after this list)
  • large data processing: co-locate processing and storage
    • avoid network traffic between locations / data centers
    • network inside a data center is fast / cheap
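
As a sketch of the Parquet-on-object-storage combination (the bucket, path and column name below are placeholders, not a real dataset), the arrow package can open a partitioned Parquet dataset on S3 and push filters and aggregations down, so that only the required row groups are transferred:

library(arrow) # needs a build with S3 support
library(dplyr)
ds = open_dataset("s3://some-bucket/some-parquet-dataset/") # placeholder URI
ds |>
  filter(year == 2022) |>  # 'year' is an assumed partitioning column
  summarise(n = n()) |>
  collect()                # only now is (a small amount of) data actually read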

5.6 Access mechanism

Cloud-optimized, cloud-native geospatial

  • Cloud-optimized formats let you read sections of large, remote files using HTTP range requests (illustrated below)
  • examples: Cloud-optimized GeoTIFF (COG), GeoZarr, GeoParquet
  • These are described in the Cloud-Optimized Geospatial Formats Guide
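
The underlying mechanism can be illustrated with a plain HTTP range request; a sketch using the httr package, where the URL is a placeholder for any server that supports range requests:

library(httr)
url = "https://example.com/some_large_cog.tif"      # placeholder URL
r = GET(url, add_headers(Range = "bytes=0-16383"))  # request the first 16 kB only
status_code(r)                  # 206 ("Partial Content") if ranges are supported
length(content(r, as = "raw"))  # at most 16384 bytes were transferred

GDAL does essentially this when a COG is opened through /vsicurl/ (see the examples below): it first fetches the file header, and then only the tiles or overviews that are actually needed.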

5.7 Examples openEO

Two videos of mine, recorded during the 2023 OpenGeoHub Summer School on the topic “Cloud-based analysis of Earth Observation data using openEO Platform, R and Python”, can be found here:
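
For reference, an untested sketch with the openeo R client; the endpoint, collection id, band names and extents are assumptions that differ per back-end:

library(openeo)
con = connect(host = "https://openeo.dataspace.copernicus.eu") # assumed back-end
login()          # opens browser-based authentication
p = processes()  # process graph builders offered by this back-end
cube = p$load_collection(
  id = "SENTINEL2_L2A",                  # assumed collection id
  spatial_extent = list(west = 7.5, south = 51.9, east = 7.7, north = 52.0),
  temporal_extent = list("2022-06-01", "2022-06-30"),
  bands = c("B04", "B08"))
result = p$save_result(data = cube, format = "GTiff")
compute_result(graph = result, output = "s2_june.tif") # runs in the cloud, downloads the result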

5.8 Example rstac

Using Sentinel-2 COGs on AWS, and their STAC API:

library(rstac) # modified from the package docs:
s_obj = stac("https://earth-search.aws.element84.com/v1")
collections(s_obj) |> get_request()
# ###Collections
# - collections (8 item(s)):
#   - cop-dem-glo-30
#   - naip
#   - sentinel-2-l2a
#   - sentinel-2-l1c
#   - cop-dem-glo-90
#   - landsat-c2-l2
#   - sentinel-1-grd
#   - sentinel-2-c1-l2a
# - field(s): collections, links, context
it_obj <- s_obj |>
  stac_search(collections = "sentinel-2-l2a",
              bbox = c(-47.02148, -17.35063, -42.53906, -12.98314),
              datetime = "2022-02-12T00:00:00Z/2022-03-18T00:00:00Z",
              limit = 1) |> 
  get_request()
it_obj
# ###Items
# - matched feature(s): 361
# - features (1 item(s) / 360 not fetched):
#   - S2A_23KMA_20220317_0_L2A
# - assets: 
# aot, aot-jp2, blue, blue-jp2, coastal, coastal-jp2, granule_metadata, green, green-jp2, nir, nir-jp2, nir08, nir08-jp2, nir09, nir09-jp2, red, red-jp2, rededge1, rededge1-jp2, rededge2, rededge2-jp2, rededge3, rededge3-jp2, scl, scl-jp2, swir16, swir16-jp2, swir22, swir22-jp2, thumbnail, tileinfo_metadata, visual, visual-jp2, wvp, wvp-jp2
# - item's fields: 
# assets, bbox, collection, geometry, id, links, properties, stac_extensions, stac_version, type

then download (here only the thumbnail asset of a single item):

download_items <- it_obj |>
  assets_download(assets_name = "thumbnail", items_max = 1, overwrite = TRUE)

and examine:

library(sf)
# Linking to GEOS 3.12.1, GDAL 3.8.3, PROJ 9.3.1; sf_use_s2() is TRUE
tif = "sentinel-s2-l2a-cogs/23/K/MA/2022/3/S2A_23KMA_20220317_0_L2A/B04.tif"
gdal_utils("info", tif)
# Driver: GTiff/GeoTIFF
# Files: sentinel-s2-l2a-cogs/23/K/MA/2022/3/S2A_23KMA_20220317_0_L2A/B04.tif
# Size is 10980, 10980
# Coordinate System is:
# PROJCRS["WGS 84 / UTM zone 23S",
#     BASEGEOGCRS["WGS 84",
#         DATUM["World Geodetic System 1984",
#             ELLIPSOID["WGS 84",6378137,298.257223563,
#                 LENGTHUNIT["metre",1]]],
#         PRIMEM["Greenwich",0,
#             ANGLEUNIT["degree",0.0174532925199433]],
#         ID["EPSG",4326]],
#     CONVERSION["UTM zone 23S",
#         METHOD["Transverse Mercator",
#             ID["EPSG",9807]],
#         PARAMETER["Latitude of natural origin",0,
#             ANGLEUNIT["degree",0.0174532925199433],
#             ID["EPSG",8801]],
#         PARAMETER["Longitude of natural origin",-45,
#             ANGLEUNIT["degree",0.0174532925199433],
#             ID["EPSG",8802]],
#         PARAMETER["Scale factor at natural origin",0.9996,
#             SCALEUNIT["unity",1],
#             ID["EPSG",8805]],
#         PARAMETER["False easting",500000,
#             LENGTHUNIT["metre",1],
#             ID["EPSG",8806]],
#         PARAMETER["False northing",10000000,
#             LENGTHUNIT["metre",1],
#             ID["EPSG",8807]]],
#     CS[Cartesian,2],
#         AXIS["(E)",east,
#             ORDER[1],
#             LENGTHUNIT["metre",1]],
#         AXIS["(N)",north,
#             ORDER[2],
#             LENGTHUNIT["metre",1]],
#     USAGE[
#         SCOPE["Navigation and medium accuracy spatial referencing."],
#         AREA["Between 48°W and 42°W, southern hemisphere between 80°S and equator, onshore and offshore. Brazil."],
#         BBOX[-80,-48,0,-42]],
#     ID["EPSG",32723]]
# Data axis to CRS axis mapping: 1,2
# Origin = (399960.000000000000000,8100040.000000000000000)
# Pixel Size = (10.000000000000000,-10.000000000000000)
# Metadata:
#   AREA_OR_POINT=Area
#   OVR_RESAMPLING_ALG=AVERAGE
# Image Structure Metadata:
#   COMPRESSION=DEFLATE
#   INTERLEAVE=BAND
#   PREDICTOR=2
# Corner Coordinates:
# Upper Left  (  399960.000, 8100040.000) ( 45d56'26.60"W, 17d10'56.12"S)
# Lower Left  (  399960.000, 7990240.000) ( 45d56'45.24"W, 18d10'28.55"S)
# Upper Right (  509760.000, 8100040.000) ( 44d54'29.58"W, 17d11' 3.94"S)
# Lower Right (  509760.000, 7990240.000) ( 44d54'27.77"W, 18d10'36.85"S)
# Center      (  454860.000, 8045140.000) ( 45d25'32.30"W, 17d40'48.86"S)
# Band 1 Block=1024x1024 Type=UInt16, ColorInterp=Gray
#   NoData Value=0
#   Overviews: 5490x5490, 2745x2745, 1373x1373, 687x687
library(stars)
# Loading required package: abind
read_stars(tif) |> plot()
# downsample set to 8

Breakout session 2

Discuss:

  • Have you used any cloud platforms for processing geospatial data?
  • What is your position with respect to platform lock-in?

5.9 Further examples from r-spatial.org:

5.10 Examples /vsixxx

curl::curl_download(
  "https://github.com/paleolimbot/geoarrow-data/releases/download/v0.0.1/nshn_water_line.gpkg",
  "nshn_water_line.gpkg"
)
(w <- read_sf("nshn_water_line.gpkg"))
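
Instead of downloading the whole file first, the same GeoPackage can be opened remotely through /vsicurl/, provided the server supports HTTP range requests; the layer name in the OGR SQL query below is an assumption, for illustration:

# read only a few features, without downloading the full file:
w10 = read_sf(
  "/vsicurl/https://github.com/paleolimbot/geoarrow-data/releases/download/v0.0.1/nshn_water_line.gpkg",
  query = "SELECT * FROM nshn_water_line LIMIT 10") # assumed layer name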

From https://github.com/microsoft/USBuildingFootprints we downloaded Maine.geojson.zip, and read it with

m = st_read("/vsizip/Maine.geojson.zip") # /vsizip: indicates data source is a zipped file

or read it directly from its download URL into R:

m = st_read("/vsizip/vsicurl/https://usbuildingdata.blob.core.windows.net/usbuildings-v2/Maine.geojson.zip")
# /vsicurl: indicates data source is a URL

5.11 “Simple” analysis on large datasets

  • process full archives, compute in the cloud
  • select subsets, download, process locally:
    • spatial subset
    • temporal subset
    • sampled at lower resolution (spatially, temporally) (see the sketch after this list)
    • aggregated (= processed?) to lower resolution
  • in some disciplines (Earth Observation?) there seems to be a belief that only processing at full resolution produces real science
  • there is surprisingly little literature on the information lost when processing at lower resolution, e.g. when the goal is a curve of yearly deforestation over an area as large as Brazil
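
To illustrate the “lower resolution” option, a sketch using stars proxy objects, reusing the B04.tif path from the rstac example above (a remote “/vsicurl/…” path works the same way):

library(stars)
tif = "sentinel-s2-l2a-cogs/23/K/MA/2022/3/S2A_23KMA_20220317_0_L2A/B04.tif"
p = read_stars(tif, proxy = TRUE)   # reads metadata only, no pixel values yet
r = st_as_stars(p, downsample = 10) # realise the data at roughly 10x lower resolution
dim(r)                              # much smaller than the native 10980 x 10980
plot(r)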

5.12 Spatial statistics on large datasets

Geostatistics

A key paper comparing different approaches is Heaton, Matthew J., Abhirup Datta, Andrew O. Finley, Reinhard Furrer, Joseph Guinness, Rajarshi Guhaniyogi, Florian Gerber, et al. 2018. “A Case Study Competition Among Methods for Analyzing Large Spatial Data.” Journal of Agricultural, Biological and Environmental Statistics, December. DOI.

5.13 If time is left