records are in rows and correspond to observations
variables are in columns, and are of homogeneous type
type can be integer, floating point, Date, text, …
variables have names, records usually not
Spatial data in tables is usually stored by adding a special column, the geometry column. Geometry is often stored as simple feature geometry (POINT, LINESTRING, MULTILINESTRING, POLYGON, MULTIPOLYGON etc), in which case records correspond to simple features (things with geometry and further attributes in non-geometry columns).
Columnar storage means that values in a column are stored continuously in memory. Instead of storing record 2 directly after record 1, the value 2 of var_a is stored directly after 1 and before 3. Using R, we can fake a row-wise storage, by making every record a list:
The first function extracts row elements in a for loop, the second uses sapply instead of that loop, the third works directly on the data.frame column.
We see that columnar storage (like a data.frame has) is much faster if the operation is columnar. A lot of statistical operations are columnar!
Other implementations
DuckDB
Google Big Query
SAP HANA
Why not use a regular database, like PostGIS?
Because:
very large databases are hard to manage (dump, upgrade software, restore)
databases require software to be running
running cloud instances is more expensive that storage
administrative burden to set it up scalable (storage, parallel)
Google BigQuery: stateless DB, “serverless data warehouse”
store data in big files (e.g. collections of csv or Parquet)
run query involves: read data, process, write output, quit
costs: storage + runtime of the query, not a running service
optimizing costs:
how can storage be reduced?
how can read & write be done faster?
More efficient than .csv:
binary
indexed
columnar storage model
analyse by variable, rather than by record
Parquet:
binary, columnar storage
indexed, compressed, chunked
meant for storage (write once, read often)
Arrow:
binary, columnar,
focus on memory layout,
meant for performance, not (primarily) for storage
backed by Voltrondata (VC)
Both Arrow and Parquet are Apache projects (open source, cloud oriented) Both have