```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, collapse = TRUE, error = TRUE)
```

## Overview

- Motivation
- What are BIPM, GUM, and VIM
- review of quantities, units, and values (according to VIM)
- software
- spatial data science

---

General frustrations I have:

- we can describe (simple) feature geometries pretty well, and we can describe feature attributes pretty well, but not how the two relate (exceptions maybe: coverage, and CF conventions)

General fears I have:

- data science and citizen data science imply that everyone now tries to do anything, without spatial experts involved, and with varying motivations
- data scientists like to think longitude and latitude are just two more variables

---

Lack of unit checking in practice:

```{r,echo=TRUE}
(apples = c(5,8,12,3))
(oranges = c(4,2,8,11))
apples + oranges # meaningless?
(speed1 = 55) # mile/hr
(speed2 = 34.5) # km/hr
speed1 + speed2 # wrong:
with(mtcars[1:3,], mpg + cyl) # wrong and meaningless:
```

## BIPM, VIM, GUM

- BIPM: _Bureau International des Poids et Mesures_
- manages _SI_, the International System of Units (briefly: what is a meter, what is a kg, and so on)

The _Joint Committee for Guides in Metrology_ (JCGM) has responsibility for the following two publications:

- Guide to the Expression of Uncertainty in Measurement (known as the **GUM**); and
- International Vocabulary of Metrology – Basic and General Concepts and Associated Terms (known as the **VIM**).

(The following 7 slides are copied from the VIM)

## What is a quantity?

- **quantity**: property of a phenomenon, body, or substance, where the property has a magnitude that can be expressed as a number and a reference
    - NOTE 2: A reference can be a measurement unit, a measurement procedure, a reference material, or a combination of such.
- **system of quantities**: set of quantities together with a set of noncontradictory equations relating those quantities
- **base quantity**: quantity in a conventionally chosen subset of a given system of quantities, where no subset quantity can be expressed in terms of the others

----

| Base quantity             | Symbol         | SI base unit | Symbol |
|---------------------------|----------------|--------------|--------|
| length                    | $l, x, r,$ etc. | meter        | m      |
| mass                      | $m$            | kilogram     | kg     |
| time, duration            | $t$            | second       | s      |
| electric current          | $I, i$         | ampere       | A      |
| thermodynamic temperature | $T$            | kelvin       | K      |
| amount of substance       | $n$            | mole         | mol    |
| luminous intensity        | $I_v$          | candela      | cd     |

## quantities

- **derived quantity**: quantity, in a system of quantities, defined in terms of the base quantities of that system
- **(quantity) dimension**: expression of the dependence of a quantity on the base quantities of a system of quantities as a product of powers of factors corresponding to the base quantities, omitting any numerical factor
- **quantity of dimension one** (dimensionless quantity): quantity for which all the exponents of the factors corresponding to the base quantities in its quantity dimension are zero

## units

- **measurement unit**: real scalar quantity, defined and adopted by convention, with which any other quantity of the same kind can be compared to express the ratio of the two quantities as a number
- **base unit**: measurement unit that is adopted by convention for a base quantity (e.g., m, kg)

NOTE 3: For number of entities, the number one, symbol 1, can be regarded as a base unit in any system of units.

## values

- **quantity value** (or value): number and reference together expressing magnitude of a quantity, e.g. $15~\mathrm{m}^2$, which is short for $15 \times 1~\mathrm{m}^2$.
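
A minimal sketch of this definition in R, assuming the `units` package (introduced later in these slides) is installed. It splits a quantity value into its number and its reference, and then expresses the same quantity with a different reference. The variable names `area` and `area_cm2` are illustrative only.

```{r echo=TRUE}
library(units)

# quantity value: number (15) and reference (m^2) together
area <- set_units(15, m^2)

# the number alone, with the reference dropped
drop_units(area)

# the reference alone, as a character string
deparse_unit(area)

# same quantity, different reference: the number changes accordingly
area_cm2 <- set_units(area, cm^2)
drop_units(area_cm2)
```

The quantity itself does not change under conversion; only the pairing of number and reference does.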
## measurement, measurement error

- **measured quantity value** (measured value): quantity value representing a measurement result
- **random measurement error**: component of measurement error that in replicate measurements varies in an unpredictable manner
- ... and so on

## computing with units

The dimension of a quantity $Q$ is denoted by

$$
\mbox{dim}~Q = L^\alpha M^\beta T^\gamma I^\delta \Theta^\epsilon N^\zeta J^\eta
$$

where the exponents $\alpha,...,\eta$, named dimensional exponents, are positive, negative, or zero.

- two _values_ can be compared if and only if their dimensional exponents are identical
- for addition/subtraction, units may need to be converted (e.g., km/h to m/s)
- the dimensional exponents of a product (ratio) of two values are the sums (differences) of their dimensional exponents

## software for units shall

- contain a database of units, derived units, prefixes and so on
- provide functions to verify two values are compatible
- provide functions to convert values, if they are compatible
- help users avoid making mistakes related to units and dimensions

Examples:

- udunits (UNIDATA): C, actively supported
- Unified Code for Units of Measure (UCUM), preferred by OGC (but why?)
- R package `units` (which uses udunits)
- Python, C++ (Boost), Julia, many more...

## Examples

```{r echo=TRUE}
suppressPackageStartupMessages(library(units))
(a = set_units(1, m/s))
(b = set_units(1, km/h))
a + b
b + a
a * b
(c = set_units(10, kg))
a + c # You can't be serious...
```

## However,

```{r echo = TRUE}
a = set_units(15, g/g)
(b = set_units(33)) # unitless
a + b
c = set_units(12, m/m) # or rad
a + c
```

----

[David Flater](https://doi.org/10.1016/j.csi.2017.10.002) proposes to extend unitless units to keep track of which units were cancelled out, or of what is being counted, to catch such cases.

---

## and we can do this

```{r echo=TRUE}
install_symbolic_unit("apples")
install_symbolic_unit("oranges")
install_conversion_constant("apples", "oranges", 1.5)
set_units(5, oranges) + set_units(5, apples)
```

(but it is not trivial to get right).

---

```{r echo=TRUE}
library(sf)
demo(nc, echo = FALSE, ask = FALSE)
nc[1:2,] %>% st_transform(2264) %>% st_area # NC state plane, us_ft
nc[1:2,] %>% st_transform(2264) %>% st_area %>% set_units(m^2)
st_area(nc[1:2,]) # ellipsoidal surface, NAD27
```

---

```{r echo=TRUE}
nc <- nc %>% st_transform(2264)
g = st_make_grid(nc, n = c(20,10))
plot(st_geometry(nc), border = "#ff5555", lwd = 2)
plot(g, add = TRUE, border = "#0000bb")
```

---

```{r echo=TRUE}
st_agr(nc) = c("BIR74" = "constant")
a1 = st_interpolate_aw(nc["BIR74"], g, extensive = FALSE)
sum(a1$BIR74) / sum(nc$BIR74) # not close to one: spatially intensive
a2 = st_interpolate_aw(nc["BIR74"], g, extensive = TRUE)
sum(a2$BIR74) / sum(nc$BIR74)
```

----

```{r echo=FALSE}
a1$BIR74_int = a1$BIR74
a1$BIR74_ext = a2$BIR74
suppressPackageStartupMessages(library(tidyverse))
a <- a1 %>%
  select(BIR74_int, BIR74_ext) %>%
  gather(VAR, BIR74, -geometry)
ggplot() +
  geom_sf(data = a, aes(fill = BIR74)) +
  facet_wrap(~VAR, ncol = 1) +
  scale_fill_gradientn(colors = sf.colors(20)) +
  theme(panel.grid.major = element_line(color = "white"))
```

## Can measurement units help discriminate intensive from extensive?

This distinction only matters when an attribute value is distributed over a geometry, so for 1- or 2-dimensional, flat geometries.

Speculating:

- always intensive: temperature, color, density, ...
- always extensive: volume, mass, energy, heat capacity, ...
- dimensionless, but not a ratio $\Rightarrow$ count: extensive
- anything extensive divided by area: intensive

## However:

- length: total length of roads in a polygon: extensive
- length: altitude: intensive

Height is again extensive when measured along vertical geometries.

---

```{r echo=TRUE, collapse=FALSE}
st_agr(nc)
```

`agr`: attribute-geometry-relationship:

- _constant_: attribute is constant throughout geometry
- _identity_: attribute identifies geometry (and hence is constant)
- _aggregate_: attribute is an aggregation over the geometry

> - function `st_aggregate` and the `sf` method of `summarise` set the `agr` field to _aggregate_

---

```{r echo=TRUE, collapse=FALSE}
st_agr(nc)
pt = st_sfc(st_point(c(1260982, 994957)), crs = st_crs(nc))
x = st_intersection(nc["BIR79"], pt)
y = st_intersection(nc["BIR74"], pt) # forged
z = st_intersection(nc["NAME"], pt)
```

## Concluding

- units of measurement deserve (more) attention in data science
- Flater's paper proposes a better way to deal with unitless units, which might be an opportunity to absorb (parts of) ontologies
- the relationship between attribute and geometry is neglected in most spatial information systems (see also [Scheider et al.](https://dx.doi.org/10.1080/13658816.2016.1151520))
- for subsampling, resampling, downscaling, upscaling etc. one needs to know whether the variable in question is extensive or intensive
- units of measure may help in this regard