UCSB Thinkspatial Brown Bag, February 6, 2018

Overview

  • Motivation
  • What are BIPM, GUM, and VIM?
  • review of quantities, units, and values (according to VIM)
  • software
  • spatial data science

General frustrations I have:

  • we can describe (simple) feature geometries pretty well, and we can describe feature attributes pretty well, but not how the two relate (possible exceptions: coverages and the CF conventions)

General fears I have:

  • data science and citizen data science imply that everyone now tries to do anything, without spatial experts involved, and with varying motivations

  • data scientists like to think longitude and latitude are just two more variables

Lack of unit checking in practice:

(apples = c(5,8,12,3))
## [1]  5  8 12  3
(oranges = c(4,2,8,11))
## [1]  4  2  8 11
apples + oranges # meaningless?
## [1]  9 10 20 14
(speed1 = 55) # mile/hr
## [1] 55
(speed2 = 34.5) # km/hr
## [1] 34.5
speed1 + speed2 # wrong:
## [1] 89.5
with(mtcars[1:3,], mpg + cyl) # wrong and meaningless:
## [1] 27.0 27.0 26.8

BIPM, VIM, GUM

  • BIPM: Bureau International des Poids et Mesures
  • manages SI, the International System of Units (briefly: what is a meter, what is a kg, and so on)

The Joint Committee for Guides in Metrology (JCGM) has responsibility for the following two publications:

  • Guide to the Expression of Uncertainty in Measurement (known as the GUM); and
  • International Vocabulary of Metrology – Basic and General Concepts and Associated Terms (known as the VIM).

(The following 7 slides are copied from the VIM)

What is a quantity?

  • quantity: property of a phenomenon, body, or substance, where the property has a magnitude that can be expressed as a number and a reference

  • NOTE 2: A reference can be a measurement unit, a measurement procedure, a reference material, or a combination of such.

  • system of quantities: set of quantities together with a set of noncontradictory equations relating those quantities

  • base quantity: quantity in a conventionally chosen subset of a given system of quantities, where no subset quantity can be expressed in terms of the others

Base quantity             | Quantity symbol   | SI base unit | Unit symbol
length                    | \(l, x, r\), etc. | meter        | m
mass                      | \(m\)             | kilogram     | kg
time, duration            | \(t\)             | second       | s
electric current          | \(I, i\)          | ampere       | A
thermodynamic temperature | \(T\)             | kelvin       | K
amount of substance       | \(n\)             | mole         | mol
luminous intensity        | \(I_v\)           | candela      | cd

quantities

  • derived quantity: quantity, in a system of quantities, defined in terms of the base quantities of that system

  • (quantity) dimension: expression of the dependence of a quantity on the base quantities of a system of quantities as a product of powers of factors corresponding to the base quantities, omitting any numerical factor

  • quantity of dimension one (dimensionless quantity): quantity for which all the exponents of the factors corresponding to the base quantities in its quantity dimension are zero
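
A few standard worked examples may make these definitions concrete (they are textbook cases, not taken from the VIM slides). Writing \(L\), \(M\) and \(T\) for the dimensions of length, mass and time:

\[ \mbox{dim}~ v = L T^{-1}, \qquad \mbox{dim}~ F = L M T^{-2}, \qquad \mbox{dim}~ \theta = L L^{-1} = L^0 M^0 T^0 = 1 \]

Here velocity \(v\) and force \(F\) are derived quantities, and the plane angle \(\theta\) (arc length divided by radius) is a quantity of dimension one.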

units

  • measurement unit: real scalar quantity, defined and adopted by convention, with which any other quantity of the same kind can be compared to express the ratio of the two quantities as a number

  • base unit: measurement unit that is adopted by convention for a base quantity (e.g., m, kg)

NOTE 3: For number of entities, the number one, symbol 1, can be regarded as a base unit in any system of units.

values

  • quantity value (or value): number and reference together expressing magnitude of a quantity, e.g. \(15 ~ m^2\) which is short for \(15 \times 1 m^2\).

measurement, measurement error

  • measured quantity value (measured value): quantity value representing a measurement result

  • random measurement error: component of measurement error that in replicate measurements varies in an unpredictable manner

  • … and so on

computing with units

the dimension of a quantity Q is denoted by

\[ \mbox{dim}~ Q = L^\alpha M^\beta T^\gamma I^\delta \Theta^\epsilon N^\zeta J^\eta \]

where the exponents \(\alpha, \ldots, \eta\), named dimensional exponents, are positive, negative, or zero.

  • two values can be compared if and only if their dimensional exponents are identical
  • for addition/subtraction, units may need to be converted (e.g., km/h to m/s)
  • the dimensional exponents of a product (ratio) of two values are the sums (differences) of their dimensional exponents
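
The rules above can be sketched in a few lines of base R. This is only an illustration under simplifying assumptions: a dimension is stored as a named vector of the seven exponents, and the names `dimension()` and `comparable()` are hypothetical helpers made up for this sketch, not functions from any of the packages mentioned in this talk.

```r
# Hypothetical representation: a dimension is a named integer vector
# holding the seven dimensional exponents.
dims <- c("L", "M", "T", "I", "Theta", "N", "J")

dimension <- function(...) {
  e <- setNames(integer(length(dims)), dims)  # all exponents zero
  given <- c(...)
  stopifnot(all(names(given) %in% dims))
  e[names(given)] <- as.integer(given)
  e
}

speed    <- dimension(L = 1, T = -1)  # e.g. m/s or km/h
duration <- dimension(T = 1)          # e.g. s or h
mass     <- dimension(M = 1)          # e.g. kg

# Two values can be compared only if all exponents are identical
comparable <- function(a, b) identical(a, b)

comparable(speed, speed)  # TRUE
comparable(speed, mass)   # FALSE

# Product: exponents add; speed * duration has the dimension of length
distance <- speed + duration
comparable(distance, dimension(L = 1))  # TRUE

# Ratio: exponents subtract; speed / speed is of dimension one
all((speed - speed) == 0)  # TRUE
```

Note that identical exponents only establish comparability; whether a unit conversion (say km/h to m/s) is still needed before adding is a separate question that this sketch does not address.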

software for units

shall

  • contain a database of units, derived units, prefixes and so on
  • provide functions to verify two values are compatible
  • provide functions to convert values, if they are compatible
  • help users avoid mistakes related to units and dimensions
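
A toy version of the first three requirements, in base R, might look as follows. The table `unit_db` and the functions `lookup()`, `compatible()` and `convert()` are hypothetical names for this sketch only; a real database such as udunits also has to handle derived units, prefixes and offsets (e.g. for temperature scales).

```r
# Toy unit table (illustrative only): to_si is the factor that converts
# a value in this unit to the corresponding SI unit.
unit_db <- data.frame(
  unit      = c("m", "km", "mile", "s", "h", "kg"),
  dimension = c("length", "length", "length", "time", "time", "mass"),
  to_si     = c(1, 1000, 1609.344, 1, 3600, 1),
  stringsAsFactors = FALSE
)

lookup <- function(u) {
  row <- unit_db[unit_db$unit == u, ]
  if (nrow(row) != 1) stop("unknown unit: ", u)
  row
}

# Two units are compatible if they measure the same kind of quantity
compatible <- function(from, to) {
  lookup(from)$dimension == lookup(to)$dimension
}

# Convert only when compatible; otherwise fail loudly
convert <- function(x, from, to) {
  if (!compatible(from, to)) stop("cannot convert ", from, " into ", to)
  x * lookup(from)$to_si / lookup(to)$to_si
}

convert(5, "km", "m")  # 5000
convert(2, "h", "s")   # 7200

# Incompatible units give an error instead of a silently wrong number
tryCatch(convert(10, "kg", "m"), error = function(e) conditionMessage(e))
```

The design choice that matters here is the last one: a conversion between incompatible units stops with an error rather than returning a number.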

Examples:

  • udunits (UNIDATA): C, actively supported
  • Unified Code for Units of Measure (UCUM), preferred by OGC (but why?)
  • R package units (which uses udunits)
  • Python, C++ (boost), Julia, many more…

Examples

suppressPackageStartupMessages(library(units))
(a = set_units(1, m/s)) 
## 1 m/s
(b = set_units(1, km/h))
## 1 km/h
a + b
## 1.277778 m/s
b + a
## 4.6 km/h
a * b
## 1 km*m/h/s
(c = set_units(10, kg))
## 10 kg
a + c # You can't be serious...
## Error: cannot convert kg into m/s

However,

a = set_units(15, g/g)
(b = set_units(33)) # unitless
## 33 1
a + b
## 48 1
c = set_units(12, m/m) # or rad
a + c
## 27 1

David Flater proposes to extend unitless units to keep track of which units were cancelled out, or of what is being counted, to catch such cases.

And we can do this:

install_symbolic_unit("apples")
install_symbolic_unit("oranges")
install_conversion_constant("apples", "oranges", 1.5)
set_units(5, oranges) + set_units(5, apples)
## 12.5 oranges

(but it is not trivial to get right).

library(sf)
## Linking to GEOS 3.5.1, GDAL 2.2.1, proj.4 4.9.3
demo(nc, echo = FALSE, ask = FALSE)
## Reading layer `nc.gpkg' from data source `/home/edzer/R/x86_64-pc-linux-gnu-library/3.4/sf/gpkg/nc.gpkg' using driver `GPKG'
## Simple feature collection with 100 features and 14 fields
## Attribute-geometry relationship: 0 constant, 8 aggregate, 6 identity
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## epsg (SRID):    4267
## proj4string:    +proj=longlat +datum=NAD27 +no_defs
nc[1:2,] %>% st_transform(2264) %>% st_area # NC state plane, us_ft
## Units: US_survey_foot^2
## [1] 12244955726  6578960164
nc[1:2,] %>% st_transform(2264) %>% st_area %>% set_units(m^2)
## Units: m^2
## [1] 1137598162  611207844
st_area(nc[1:2,]) # ellipsoidal surface, NAD27
## Units: m^2
## [1] 1137388604  611077263

nc <- nc %>% st_transform(2264)
g = st_make_grid(nc, n = c(20,10))
plot(st_geometry(nc), border = "#ff5555", lwd = 2)
plot(g, add = TRUE, border = "#0000bb")

st_agr(nc) = c("BIR74" = "constant")
a1 = st_interpolate_aw(nc["BIR74"], g, extensive = FALSE)
sum(a1$BIR74) / sum(nc$BIR74) # not close to one: spatially intensive
## [1] 1.191945
a2 = st_interpolate_aw(nc["BIR74"], g, extensive = TRUE)
sum(a2$BIR74) / sum(nc$BIR74)
## [1] 1
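
The difference between the two results can be reproduced without sf in a one-dimensional analogue, with intervals in place of polygons and overlap length in place of overlap area. The intervals and values below are made up for illustration, and the code is a sketch of the general idea of area-weighted interpolation, not of how `st_interpolate_aw` is implemented.

```r
# Source intervals [0,4] and [4,10], each carrying one value
src <- data.frame(from = c(0, 4), to = c(4, 10), value = c(8, 30))
# Target intervals [0,5] and [5,10]
tgt <- data.frame(from = c(0, 5), to = c(5, 10))

# overlap[i, j]: length shared by target i and source j
overlap <- outer(seq_len(nrow(tgt)), seq_len(nrow(src)), function(i, j)
  pmax(0, pmin(tgt$to[i], src$to[j]) - pmax(tgt$from[i], src$from[j])))

# Extensive (e.g. a count): each source value is split over the targets
# in proportion to the share of the source interval they cover
extensive <- as.vector(overlap %*% (src$value / (src$to - src$from)))

# Intensive (e.g. a density): each target gets the overlap-weighted
# mean of the source values
intensive <- as.vector(overlap %*% src$value / rowSums(overlap))

extensive                          # 13 25
intensive                          # 12.4 30.0
sum(extensive) / sum(src$value)    # 1: totals are preserved
sum(intensive) / sum(src$value)    # not 1: totals are not preserved
```

Only the extensive variant preserves the total, which is why the ratio of sums equals one in that case alone.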

Can measurement units help discriminate intensive from extensive?

This only matters when a value is distributed over a geometry, i.e. for 1- or 2-dimensional, flat geometries (lines and polygons).

Speculating:

  • always intensive: temperature, color, density, …
  • always extensive: volume, mass, energy, heat capacity, …
  • dimensionless, but not a ratio \(\Rightarrow\) count: extensive
  • anything extensive divided by area: intensive

However:

  • length: total length of roads in a polygon: extensive
  • length: altitude: intensive

Height is again extensive when measured along vertical geometries.

st_agr(nc)
##      AREA PERIMETER     CNTY_   CNTY_ID      NAME      FIPS    FIPSNO 
## aggregate aggregate  identity  identity  identity  identity  identity 
##  CRESS_ID     BIR74     SID74   NWBIR74     BIR79     SID79   NWBIR79 
##  identity  constant aggregate aggregate aggregate aggregate aggregate 
## Levels: constant aggregate identity

agr: attribute-geometry-relationship:

  • constant: attribute is constant throughout geometry
  • identity: attribute identifies geometry (and hence is constant)
  • aggregate: attribute is an aggregation over the geometry
  • the sf methods for aggregate and summarise set the agr field to aggregate
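
How such a field could be used is sketched below in base R. The helper `safe_to_copy()` is hypothetical and not part of sf; it assumes that only constant and identity attributes may be carried over unchanged to a sub-geometry (such as a point inside a polygon), and that an unknown (NA) relationship is treated like aggregate.

```r
# Hypothetical helper (not part of sf): TRUE for attributes whose value
# can be copied unchanged to any part of the geometry
safe_to_copy <- function(agr) {
  !is.na(agr) & agr %in% c("constant", "identity")
}

# Made-up agr vector; SID79 has an unknown relationship
agr <- c(NAME = "identity", BIR74 = "constant",
         BIR79 = "aggregate", SID79 = NA)

safe_to_copy(agr)
#  NAME BIR74 BIR79 SID79
#  TRUE  TRUE FALSE FALSE

# Report the attributes that would need a warning
for (a in names(agr)[!safe_to_copy(agr)])
  message(a, ": value describes the whole geometry, not each part of it")
```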

pt = st_sfc(st_point(c(1260982, 994957)), crs = st_crs(nc))
x = st_intersection(nc["BIR79"], pt)
## Warning: attribute variables are assumed to be spatially constant
## throughout all geometries
y = st_intersection(nc["BIR74"], pt) # no warning: BIR74 was forged to "constant" above
z = st_intersection(nc["NAME"], pt) # no warning: NAME is an identity attribute

Concluding

  • units of measurement deserve (more) attention in data science
  • Flater's paper proposes a better way to deal with unitless units, which might be an opportunity to absorb (parts of) ontologies
  • the relationship between attribute and geometry is neglected in most spatial information systems (See also Scheider et al.)
  • for subsampling, resampling, downscaling, upscaling etc. one needs to know whether the variable in question is extensive or intensive
  • units of measure may help in this regard