Tabular data, columnar storage

Author

Edzer Pebesma

Published

August 28, 2022

Tables

Tabular data are data in tables, such as those in

  • CSV files
  • relational databases
  • (simple) Excel spreadsheets
  • R: data.frame
  • Python: pandas DataFrame

Tables have the following properties:

  • row/column layout:
    • records are in rows and correspond to observations
    • variables are in columns, and are of homogeneous type
    • type can be integer, floating point, Date, text, …
    • variables have names, records usually not

Spatial data in tables are usually stored by adding a special column, the geometry column. Geometry is often stored as simple feature geometry (POINT, LINESTRING, MULTILINESTRING, POLYGON, MULTIPOLYGON, etc.), in which case records correspond to simple features (things with a geometry and further attributes in the non-geometry columns).
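A minimal sketch of such a geometry column using the sf package (coordinates and names here are made up):

library(sf)
# two POINT geometries in a geometry column; attributes in regular columns
pts = st_sfc(st_point(c(7.60, 51.96)), st_point(c(7.61, 51.95)), crs = 4326)
sf_df = st_sf(name = c("a", "b"), geometry = pts)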

Consider the following .csv file:

"var_a","date","name"
1,2022-08-30,"Alice"
2,2022-08-31,"Bob"
3,2022-09-01,"Charlie"
df = read.csv("example.csv", colClasses = c("numeric", "Date", "character"))
df
  var_a       date    name
1     1 2022-08-30   Alice
2     2 2022-08-31     Bob
3     3 2022-09-01 Charlie
df |> str()
'data.frame':   3 obs. of  3 variables:
 $ var_a: num  1 2 3
 $ date : Date, format: "2022-08-30" "2022-08-31" ...
 $ name : chr  "Alice" "Bob" "Charlie"
df |> unclass() |> class()
[1] "list"
df |> unclass()
$var_a
[1] 1 2 3

$date
[1] "2022-08-30" "2022-08-31" "2022-09-01"

$name
[1] "Alice"   "Bob"     "Charlie"

attr(,"row.names")
[1] 1 2 3
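Since a data.frame is a list of column vectors, column access is simply list access; a quick check:

identical(df$var_a, df[["var_a"]]) # TRUE: a column is just a list element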

Row vs. columnar storage

Columnar storage means that the values in a column are stored contiguously in memory: instead of storing record 2 directly after record 1, the value 2 of var_a is stored directly after 1 and before 3. Using R, we can fake row-wise storage by putting every record (row) in its own list element:

row_ify = function(x) {
  # one list element per record: a one-row data.frame
  lapply(seq_len(nrow(x)), function(i) x[i, ])
}
row_ify(df)
[[1]]
  var_a       date  name
1     1 2022-08-30 Alice

[[2]]
  var_a       date name
2     2 2022-08-31  Bob

[[3]]
  var_a       date    name
3     3 2022-09-01 Charlie
n = 100000
df = data.frame(x = runif(n), char = "foo", date = Sys.Date())
head(df)
          x char       date
1 0.6181779  foo 2022-08-29
2 0.3484395  foo 2022-08-29
3 0.9613860  foo 2022-08-29
4 0.8451987  foo 2022-08-29
5 0.8439952  foo 2022-08-29
6 0.8793984  foo 2022-08-29
df.rows = row_ify(df)
# iterate over rows, accumulating the first field of each record
f0 = function(x) {
  m0 = 0.0
  for (i in seq_along(x))
    m0 = m0 + x[[i]][[1]]
  m0 / length(x)
}
bench::mark(
  f0(df.rows), 
  mean(sapply(df.rows, `[[`, 1)),
  mean(df$x)
)
Warning: Some expressions had a GC in every iteration; so filtering is disabled.
# A tibble: 3 × 6
  expression                          min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>                     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 f0(df.rows)                       499ms    499ms      2.00   24.95KB     20.0
2 mean(sapply(df.rows, `[[`, 1))    509ms    509ms      1.96    3.31MB     21.6
3 mean(df$x)                        147µs    151µs   6448.          0B      0  

The first function extracts row elements in a for loop, the second uses sapply instead of that loop, and the third works directly on the data.frame column.

We see that columnar storage (as used by data.frame) is much faster when the operation itself is columnar. Many statistical operations are columnar!
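For instance (a small illustration on df): each of these touches one column, i.e. one contiguous vector, and never iterates over records:

sapply(df, class) # per-column types
quantile(df$x)    # a columnar summary statistic
table(df$char)    # a columnar frequency table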

Other implementations

  • DuckDB (see the sketch below)
  • Google BigQuery
  • SAP HANA
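Of these, DuckDB runs in-process and is easy to try from R; a minimal sketch, assuming the duckdb and DBI packages are installed:

library(DBI)
con = dbConnect(duckdb::duckdb())
duckdb::duckdb_register(con, "df", df)    # expose the data.frame as a virtual table
dbGetQuery(con, "SELECT AVG(x) FROM df")  # a columnar scan of a single column
dbDisconnect(con, shutdown = TRUE)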

Why not use a regular database, like PostGIS?

Because:

  • very large databases are hard to manage (dump, upgrade software, restore)
  • databases require a server process to be running
  • running cloud instances is more expensive than storage alone
  • setting it up to scale (storage, parallelism) is an administrative burden

Google BigQuery: stateless DB, “serverless data warehouse”

  • store data in big files (e.g. collections of csv or Parquet)
  • running a query involves: read data, process, write output, quit
  • costs: storage + runtime of the query, not a running service
  • optimizing costs:
    • how can storage be reduced?
    • how can read & write be done faster?
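From R, such a query can be sent with the bigrquery package; a sketch in which the project, dataset and table names are placeholders:

library(bigrquery)
# "my-project" and "mydata.df" are hypothetical names
tb = bq_project_query("my-project", "SELECT AVG(x) AS mean_x FROM mydata.df")
bq_table_download(tb) # only the (small) query result is downloaded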

More efficient than .csv:

  • binary
  • indexed
  • columnar storage model
  • analyse by variable, rather than by record
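As a quick check of the storage gain, we can write df both ways and compare file sizes, using Parquet (introduced below) as the binary columnar format; the exact ratio depends on the data:

write.csv(df, "df.csv", row.names = FALSE)
arrow::write_parquet(df, "df.parquet")
file.size("df.csv") / file.size("df.parquet") # well above 1 for data like df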

Parquet:

  • binary, columnar storage
  • indexed, compressed, chunked
  • meant for storage (write once, read often)
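Because a Parquet file is columnar and chunked, a reader can fetch a single column without scanning the rest; a sketch reusing the df.parquet file written above:

x_only = arrow::read_parquet("df.parquet", col_select = "x") # reads only column x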

Arrow:

  • binary, columnar
  • focus on memory layout
  • meant for performance, not (primarily) for storage
  • backed by Voltron Data (a venture-capital funded company)
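In R, the arrow package exposes this memory layout directly; a small sketch (arrow_table() assumes a recent arrow version):

tab = arrow::arrow_table(df) # an in-memory, columnar Arrow Table
tab$x                        # one column, held as a ChunkedArray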

Both Arrow and Parquet are Apache projects (open source, cloud oriented). Both have:

  • GEO extensions (geometry column, simple features+)
  • (bleeding edge) GDAL drivers (GDAL >= 3.5), plus a GDAL columnar read interface
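With such a GDAL build, sf can read and write (Geo)Parquet directly; a sketch reusing the sf_df object from the geometry example above (driver availability depends on how your GDAL was compiled):

# requires GDAL >= 3.5 built with the Parquet driver
sf::st_write(sf_df, "pts.parquet", driver = "Parquet")
sf::st_read("pts.parquet")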

How to leverage parallel computing, distributing over

  • cores
  • machines
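Within a single machine, a columnar layout makes distributing work over cores straightforward: give each core a column. A minimal sketch with the parallel package (mclapply forks, so it is not available on Windows):

library(parallel)
# a data.frame is a list of columns, so mclapply hands one column to each worker
mclapply(df, summary, mc.cores = 2)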