The source R-Markdown (.Rmd) file for this html document is found here; here is the zip file with included figures.
A list of package is found at the end of this page. The packages this script attaches can be installed by running
pkg = c("sp", "xts", "zoo", "spacetime", "trajectories",
"maptools", "maps", "mapview", "leaflet", "rglwidget",
"rgl", "RColorBrewer", "ggmap", "ggplot2", "dplyr",
"rmarkdown", "units")
install.packages(pkg)
To re-create (reproduce) this html, using knit html
in Rstudio, have all packages installed, and then:
setwd("/path/to/this_dir")
tutorial.Rmd
file and click on the button knit html
Two quotes from Cobb and Moore, 1997,
Data are not just numbers, they are numbers with a context
and
in data analysis, context provides meaning
illustrate that data analysis without context is essentially meaningless. What is context? We can think of
1
refer to?We can think of the who, how and why question referring to pragmatics of data collection, where when, where and what refer to data semantics. Reference systems relate numbers (or words, or symbols) to the real world; much of this tutorial is about reference systems for space, time and attributes.
At a higher level, reference systems can describe phenomena in terms of how attributes and identity relate to space and time; in particular whether they are continuous or discrete matters for what we can meaningfully do with data obtained from these phenomena (Scheider et al., 2016).
Base R has some infrastructure to annotate measurement units, in particular for time and date information:
now = Sys.time() + c(0, 3600)
today = Sys.Date() + 0:1
which represent numeric values pointing to the number of seconds or days elapsed since Jan 1, 1970, 00:00 UTC:
as.numeric(now)
## [1] 1474359906 1474363506
as.numeric(today)
## [1] 17064 17065
The class
attribute of these objects
class(now)
## [1] "POSIXct" "POSIXt"
class(today)
## [1] "Date"
make it behave meaningfully like time:
now * now
## Error in Ops.POSIXt(now, now): '*' not defined for "POSIXt" objects
now + now
## Error in `+.POSIXt`(now, now): binary '+' is not defined for "POSIXt" objects
(dt = now[2] - now[1])
## Time difference of 1 hours
This last quantity, dt
has a class
class(dt)
## [1] "difftime"
but also a units
attribute, which can be retrieved by the units
method:
units(dt)
## [1] "hours"
but can also be set to other values, e.g.
units(dt) = "days"
dt
## Time difference of 0.04166667 days
units(dt) = "secs"
dt
## Time difference of 3600 secs
Beyond time and time differences, more general support for SI units and derived units is provided by CRAN package units, which builds upon CRAN package udunits2 and the external udunits library, from UNIDATA. It checks compatibility of units, does unit conversion on the fly
units_file = system.file("share/udunits2.xml", package="udunits2")
if (units_file == "") stop("install package udunits2 first")
Sys.setenv(UDUNITS2_XML_PATH = units_file) # cope with bug in udunits2 0.8.1 on Win & Mac
library(units)
m = with(ud_units, m)
s = with(ud_units, s)
km = with(ud_units, km)
h = with(ud_units, h)
x = 1:3 * m/s
xkmh = x
units(xkmh) = km/h # explicit conversion
xkmh
## Units: km/h
## [1] 3.6 7.2 10.8
(y = 1:3 * km/h) # something different
## Units: km/h
## [1] 1 2 3
x + y # implicit conversions
## Units: m/s
## [1] 1.277778 2.555556 3.833333
y + x
## Units: km/h
## [1] 4.6 9.2 13.8
c(x, y)
## Units: m/s
## [1] 1.0000000 2.0000000 3.0000000 0.2777778 0.5555556 0.8333333
it creates derived units where appropriate
x * y
## Units: km*m/h/s
## [1] 1 4 9
x / y
## Units: 1
## [1] 3.6 3.6 3.6
log(x)
## Units: ln(m/s)
## [1] 0.0000000 0.6931472 1.0986123
and also gives meaningful warnings when units are not compatible:
x + x * y
## Error in `units<-.units`(`*tmp*`, value = structure(list(numerator = "m", : cannot convert km*m/h/s into m/s
as udunits parses units, they can become arbitrarily complex:
u1 = make_unit(paste0(rep("m", 100), collapse = "*"))
u2 = m^100
1:3 * u2 + 1:3 * u1
## Units: m^100
## [1] 2 4 6
Summarizing, measurement units
and provide, in that sense, part of the context of data.
Measures of physical quantities such as length, mass, duration have a natural, absolute zero value. When measuring absolute time (when) and location (where), we need a reference.
For time, we have Coordinated Universal Time (UTC); different time zones and DST rules are defined with respect to UTC.
For space, different geodetic datums exist; the best known being the WGS84 ellipsoid (``world geodetic system-1984’’)
WGS84 is a global reference system, and has larger errors (deviations from the geoid, the `mean sea level’ surface) than ellipsoids that are fitted locally. For that reason, local ellipsoids might better fit the Earth locally, and are prefered for particular surveying projects. Coordinate transformations move from one geodetic datum (ellispoid) to another; these transformation may be non-unique, non-invertible, and approximate.
Projecting longitude/latitude degrees to a flat representation (such as UTM, or Web Mercator) is called coodinate conversion; this process is usually accurate and invertible.
For a doctor, the identity and age of the patient are more relevant than the location of the patient and date of the examination. An MRI scan of a patient’s brain is understood with reference to the skull and other reference points, not with reference to the geographical location f the MRI scanner.
Many data sets come without information about space and time; this usually indicates that these aspects were not considered relevant. For the co2
dataset,
plot(co2)
the location of measurements (Mauna Loa) can be derived from the URL mentioned in ?co2
, or even better, from the information the URL points to. For the mtcars
dataset, ?mtcars
reveals when the car models described were current (1973-4).
We have seen so far mostly time information, but not information that is associated with time, or time series data.
The co2
data set is an example of this, as a time series:
class(co2)
## [1] "ts"
summary(co2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 313.2 323.5 335.2 337.1 350.3 366.8
The ts
class, part of base R, accomodates for regular time series, and does not use POSIXt
for time:
attributes(co2)
## $tsp
## [1] 1959.000 1997.917 12.000
##
## $class
## [1] "ts"
Many of the methods for ts
do not deal well with semi-regular time series, which are regular but contain missing values.
Approaches to deal with irregular time series are found in packages its, zoo and xts. Package xts builds on POSIXt
as the time index, and extends zoo
. The combination has nice features for
What follows is an example where the string “2000:2007” selects all data for these three years, and the aggregate
method (from zoo
) uses the function as.yearmon
to cut a time period into months, convert these into years, and to compute yearly averages:
data(air, package = "spacetime")
library(zoo)
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(xts)
pm10 = xts(t(air), dates)
as.year <- function(x) as.numeric(floor(as.yearmon(x)))
plot(aggregate(pm10["2000::2007", 1], as.year, mean, na.rm = TRUE),
ylab = "PM10 (ppm)")
title("Yearly averaged PM10 values, 2000-2007")
Physical properties can be extensive or intensive, depending on whether the property changes when the size (extent) of the system changes:
In the aggregation example above, we computed the yearly mean values, but when examining ?aggregate.zoo
, we’d find out that the default aggregation function is sum
, so
a = aggregate(pm10["2000::2007", 1], as.year, na.rm = TRUE)
mean(a)
## [1] 7176.552
computes yearly total pm10, from monthly averages, which is not physically meaningful for an intensive property like pm10.
Time series data, as represented by the packages mentioned above, do not distinguish events from continuous processes: ``values’’ of something are simply associated with a time stamp. Yet, for meaningful analysis, their difference could not be larger:
Many time series data (co2
and pm10
above being two examples) concern temporally aggregated values. Yet, their value is not explicitly registered with the aggregation period, but rather with the starting time of this interval. If, however, instantaneous measurements of the same variables would be available (co2 or pm10 measured at the start of the month), they would be stored identically. If time periods would be registered explicitly, we would in addition have to specify how the associated (co2, pm10) value relates to this interval, possibilities are:
A large majority of spatial data constitute, roughly, of points, lines, polygons or grids. As described in Bivand, Pebesma and Gomez-Rubio (2013), package sp provides basic infrastructure for this, and supports the following classes:
The geometry-only classes, SpatialXxx
, derive from Spatial
and share a bounding box and coordinate reference system (proj4string
); all the SpatialXxxDataFrame
classes extend this with a data.frame
carrying attributes associated with the geometries.
Important reasons why many think that using sp
is a good idea include:
sp
supports reading and writing to and from all 142 raster and 84 vector formats supported by GDALsp
objects can be used for all DE-9IM intersections, including topological predicates like touches, overlaps, intersects, contains, etc., and also create buffers, unite polygons etc.`Single’ geometry entries for polygons and lines can consist of multiple polygons or lines, to accomodate real-word data. In the following example we see a state consisting of multiple polygons, one of them containing a hole:
data(air, package = "spacetime")
library(sp)
nds = DE_NUTS1["Niedersachsen",]
library(ggmap)
## Loading required package: ggplot2
bgMap = get_map(as.vector(bbox(nds)), source = "google", zoom = 7)
## Warning: bounding box given to google - spatial extent only approximate.
## converting bounding box to center/zoom specification. (experimental)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=52.632009,9.116023&zoom=7&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
par(mar = rep(0,4))
merc = CRS("+init=epsg:3857")
plot(spTransform(nds, merc), bgMap = bgMap, col = grey(.5, alpha = .5))
We can examine the data interactively, thanks to the excellent mapview package:
library(mapview)
## Loading required package: leaflet
mapview(nds)
## Warning in spCheckObject(x): Columns NL_NAME_1, CC_1, REMARKS_1 in
## attribute table contain only NA values and are dropped.
When properly plotting data, we need to know which hole polygon belongs to which enclosing polygon, and this makes it difficult (if not impossible) to encode polygons in simple tables such as propagated by the tidy data framework. The fortify
method in ggplot2
destroys this structure, and the ggplot2 wiki explains why ggplot2
cannot properly plot polygons with holes.
In the plot above, we can see that the border of nds
does not correspond very well with that of openstreetmap. This may be due to an outdated version of DE_NUTS1
in the spacetime
package.
DEU_adm1
, and register in which directory they arelibrary(rgdal); x = readOGR(".", "DEU_adm1")
, where you replace .
with the right directorymapview(x[9,])
)As we’ve seen above, we can use the [
subset operator on Spatial
objects to select geometries/features (rows), or particular attributes, similar to how this is done for data.frame
objects. Further functionality is obtained when we use a Spatial
object as selector, as in
par(mar = c(4,4,4,1))
plot(DE_NUTS1[nds,], col = "#FFDDDD", axes = TRUE) # all states touching Lower Saxony;
plot(nds, col = grey(.5), add = TRUE) # .. which is grey.
this carries out an intersection, and returns intersecting geometries, for any Spatial object type.
We can compute yearly mean pm10 values by station, by
yearmeans = aggregate(pm10, as.year, mean, na.rm = TRUE)
# merge mean matrix (rows: year; cols: station ) with station data:
s = SpatialPointsDataFrame(stations, data.frame(t(yearmeans)))
spplot(s, names.attr = index(yearmeans), sp.layout = list(DE_NUTS1, col = grey(.8)),
as.table = TRUE, colorkey = TRUE)
Suppose that we want state mean pm10 concentrations by year; we can do this by spatial aggregation:
a = aggregate(s, DE_NUTS1, mean, na.rm = TRUE)
spplot(a, names.attr = index(yearmeans), as.table = TRUE,
main = "state mean pm10 of rural background stations")
Here, the by
argument (2nd arg) of aggregate
is a Spatial
object; spatial intersection yields the aggregation predicate.
Methods for statistically analyzing spatial data are well described in the spatial statistics literature. Other R packages that deal with handling or analyzing spatial data are categorized and described in the CRAN Task View on Spatial Data.
We’ve seen above how temporal data can be stored in Spatial
objects – by storing variables associated with different times as different attributes. This is not optimal, because
There is also a CRAN Task View on SpatialTemporal Data that categorizes and describes R packages for handling and analyzing spatiotemporal data, including movement data.
Building on xts
and sp
, R package spacetime provides classes for spatiotemporal data that solve both issues. The class diagram is given here:
STF
and STS
objects cater spatial panel data (or longitudinal data), where each spatial entity has an identical set of time referenced obvservations associated with it; examples are
Note that these classes only require a space time lattice layout:
Package spacetime
comes with routines to read data from space-wide tables (where columns reflect different spatial features), time-wide tables (where columns reflect different times), and long tables (where each space/time combination has a row entry); examples of all three are given in Pebesma, 2012. What follows is an example reading from a long table.
A long table with panel data is found in package plm:
data("Produc", package = "plm")
Produc[1:5,1:9]
## state year region pcap hwy water util pc gsp
## 1 ALABAMA 1970 6 15032.67 7325.80 1655.68 6051.20 35793.80 28418
## 2 ALABAMA 1971 6 15501.94 7525.94 1721.02 6254.98 37299.91 29375
## 3 ALABAMA 1972 6 15972.41 7765.42 1764.75 6442.23 38670.30 31303
## 4 ALABAMA 1973 6 16406.26 7907.66 1742.41 6756.19 40084.01 33430
## 5 ALABAMA 1974 6 16762.67 8025.52 1734.85 7002.29 42057.31 33749
Since the data contains no spatial information (states) or temporal information as POSIXct
, we need to construct these – states:
library(maps)
states.m = map('state', plot=FALSE, fill=TRUE)
IDs <- sapply(strsplit(states.m$names, ":"), function(x) x[1])
library(maptools)
## Checking rgeos availability: TRUE
states = map2SpatialPolygons(states.m, IDs=IDs)
years:
yrs = 1970:1986
time = as.POSIXct(paste(yrs, "-01-01", sep=""), tz = "GMT")
When combining all this information, we do not need to reorder states because states
and Produc
order states alphabetically. We need to de-select District of Columbia, which is not present in Produc
:
# deselect District of Columbia, polygon 8, which is not present in Produc:
library(spacetime)
Produc.st = STFDF(states[-8], time, Produc[order(Produc[2], Produc[1]),])
library(RColorBrewer)
stplot(Produc.st[,,"unemp"], yrs, col.regions = brewer.pal(9, "YlOrRd"), cuts = 9)
As can be seen from the CRAN Task View on SpatialTemporal Data, there are quite a few R packages dealing with movement data, many of which originate in the ecology domain.
The trajectories tries to take a more general approach, but uses many of the ideas e.g. from adehabitatLT. In particular, it acknowleges that, when we observe movement, we often end up having
Track
, burst, trip),Tracks
)Tracks
are collected (TracksCollection
)Think of registering a person’s movement with a mobile phone or GPS receiver:
For this reason, trajectories
offers Track
, Tracks
and TracksCollection
objects to organize such data:
Examples of open trajectory data of various kinds are
Except for the Argo buoys, trajectories
contains demo scripts for all these examples.
trajectories
has, using command demo
(and look into its help page)Methods that Track
, Tracks
or TracksCollection
share with the spatial family include
[[
, $
, [
retrieve or replace attributes, or select Track
or Tracks
stplot
create space-time plotaggregate
spatially aggregate track properties (coercing fixes to points)bbox
retrieve spatial bounding boxcoordinates
retrieve coordinates of fixescoordnames
retrieve coordinate names of fixesover
intersect spatially (coercing tracks to SpatialLines
)plot
simple plot methodsproj4string
retrieve coordinate reference systemspTransform
(re)project coordinates to new coordinate reference system, or unprojectsummary
print simple summaryMethods that are unique for the Track
family are
compare
compares two Track
objects: for the common time period, a difftrack
object is created with all distances between them (approximating linearly when times of fixes don’t match)dists
compares two Tracks
(two sets of Track
elements) using mean distance, or Frechet distancedownsample
remove fixes from a Track
, starting with the most densely sampled onesfrechetDist
compute Frechet distance between two Track
objectsstcube
plot space-time cube for Track
objects, possibly with openstreetmap basegeneralize
resample track to lower freqency or minimal distancestbox
print the space-time bounding boxr = read.csv("http://pebesma.staff.ifgi.de/pm10/3_2a.txt")
time = as.POSIXct(strptime(paste(r$Date, r$Time), "%Y/%m/%d %H:%M:%S"))
require(sp)
pts = SpatialPoints(r[c("Long", "Lat")], CRS("+init=epsg:4326")) # WGS84 long/lat
library(spacetime)
library(trajectories)
tr = Track(STIDF(pts, time, r["pm10"]))
stbox(tr)
## Long Lat time
## min 7.594638 51.94583 2010-03-23 10:02:24
## max 7.636554 51.98394 2010-03-23 12:52:05
Some basic plots:
plot(as(tr, "STIDF"))
gives a space-time plot, where space indicats the index of each spatial item; the plot indicates that there were a few short breaks during the data collection.
Plotting the Track
, as
plot(tr)
shows the spatial path followed, without anything else. With stplot
stplot(tr)
we get the plot of the STIDF
object; it cuts time in 6 equal pieces, and plots those.
library(rgl)
library(rglwidget)
stcube(tr)
rglwidget(elementId="plot3d")
shows the 3-D, interactive space-time track, which even looks nicer when you try
stcube(tr, showMap = TRUE)
The mapview
of the complete trajectory shows a wealth of points:
require(mapview)
mapview(as(tr, "SpatialPointsDataFrame"), zcol = "pm10", legend = TRUE)