Large datasets
When is a dataset large?
- Who of you works with (analyzes) datasets that do not fit on your local hard drive?
- How large are they?
- Does anyone of you do analysis that is distributed over more than one machine?
- Who of you thinks this will happen in the next five years?
Large Feature datasets
- e.g., openstreetmaps, the raw data
- PostGIS? What if it is larger than local drivers?
- Spark?
- GeoMesa?
- What if we have long, dense time series for a set of sensor stations? Would you store the station geometry with each observation?
Large Raster datasets
- global digital elevation models
- CMIP 4, 5, 6 (20-40 Pb)
- remote sensing: Landsat, sentinel, …
- the “set of layers” (
RasterStack
) model does not really work fo this
- how can we analyse these datasets without downloading them?
Current limitations
- How does a petabyte look like?
- Network capacity: although we think that networks are fast, at some stage (and pretty soon), shipping hard drives is faster than down/uploading.
- Handling imagery data may involve
- storing tiles (scenes),
- indexing them,
- mozaicing them? (replacing the original files?)
- harmonizing tiles from different sensors (e.g., MODIS, L8, S2)
- running atmospheric correction models?
- cloud removal? etc.
- these actions are pretty much the same for everyone, however you would like to know how this works, and control details of it
- put (or keep) data in files, or put them in a database?
- if database, how to backup? leave original files in place?
Google Earth Engine
It is good that:
- a graduate student can work with massive EO imagery after 1 day of training
- we can apply a lot of methods, from computing indexes, time series analysis, to machine learning
- we can work with the data as if it were a data cube:
- compute on the grid cells you see, rather than the native imagery resolution
- pick a temporal resolution
- regridding, downsampling, spatial and temporal, happens on the fly
Not trivial for
- working with datasets not in GEE (but integrates with google bigtable)
What you cannot do:
- run arbitrary, custom code on the imagery (python, R, javascript); restricted to GEE api
- see the source code of the GEE
- ensure reproducibility (but this is always relative!)
- have a guarantee that you get the capacity you would like to have (no SLA)
- fit arbitrary (R, python) models and use them to predict, in the cloud
Alternatives:
- TEPs (thematic exploration platforms): by theme, independently
- The upcoming Copernicus Data and Information Access Services (DIAS)
Cloud challenges
- Object storage (e.g. S3, GCS, document stores/object storage) vs. dedicated formats (HDF5, netcdf, sqlite): both compete for the same smartness, expect essential clashes (Cloud-optimized GeoTIFF, NetCDF seems to move now)
- How to distribute your computation over nodes (who has done this?); should users do this?
- How to make sure your data are “close” enough to your compute nodes (same data center? SDD of your node or the object storage?)
- these are challenges that the average (majority) of data scientists would like to not have to deal with
“openEO develops an open API to connect R, python and javascript clients to big Earth observation cloud back-ends in a simple and unified way.”
What is an API? A contract, a language
In some sense, this is similar to
- providing packages that provide APIs for spatial data in R, or
- what dplyr/dbplyr does with scaling up computing on tables
Large challenges of openEO: allow user-defined functions (UDFs) in R, python or JavaScript to be carried out by the backend.
openEO does NOT prescribe how back-ends should store their data, or organize their computations.
openEO challenges
Approach
- develop an API to access working systems, rather than try to compose one from existing standards (WPS, WCS, WCPS, CSW, OAuth2, …)
Analyse:
- build a process graph which is essentially a nested expression or function call;
- evaluate it lazily (i.e., only when asked for: show pixels on the screen, provide download link)
- ability to combine different openEO back-ends (show results, use one as input to the other) (poster)
- irrespective whether the data is stored as a data cube, provide a data cube view to the user
Use:
- user management
- where can users put data, and retrieve data from
- cost management
- cost estimation
Discover/describe data:
- how are image collections described (e.g. GEE: S2, L8), how are their bands described?
- how can results be published, so they can be reused downstream?