All the material presented here, to the extent it is original, is available under CC-BY-SA.
R is a program that runs a REPL (read-evaluate-print loop). If you start it interactively, you see a prompt; if you type a command at the prompt and press “enter”, the command is evaluated and the result is printed:
1+2
## [1] 3
a <- 2
The second command is an assignment: 2 is assigned to a. The result is not printed. We can force printing by surrounding the assignment with parentheses:
(a <- 2)
## [1] 2
An alternative to assignment using the “left arrow” <- is using =
(a = 2)
## [1] 2
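The two are, however, not always exchangeable: in a function call, = matches an argument by name, whereas <- assigns in the calling environment. A small illustration:
mean(y <- 1:5) # assigns 1:5 to y, which is then passed as the first argument
## [1] 3
# mean(y = 1:5) would be an error: the argument x of mean would be missing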
RStudio is an application (program) that provides an IDE (integrated development environment) for R; it makes it easy to manage R data, R scripts, and R Markdown files, and to develop R packages. The basic RStudio is free of charge; commercial versions provide all kinds of extras, such as business integration, user management, collaboration, and cloud deployment.
RStudio uses R, but is a separate program. It makes sense to try to understand whether a particular problem is caused by RStudio or by R; sometimes this is difficult.
In his book “Extending R”, John Chambers summarizes R with two slogans:

Everything that exists in R is an object.
Everything that happens in R is a function call.
This also applies to infix expressions such as 2 + 3: these are just an alternative notation for the function call
`+`(2, 3)
## [1] 5
of the function + with arguments 2 and 3.
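Even indexing is a function call; the following two expressions are equivalent:
c(10, 20, 30)[2]
## [1] 20
`[`(c(10, 20, 30), 2)
## [1] 20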
We can also make our lives difficult by
`+` = `-` # replace + by -
3 + 2
## [1] 1
but we will not do this, and remove the bad plus by:
rm(`+`)
3 + 2 # check:
## [1] 5
See ?typeof for the complete list of object types. Some of them are concerned with language: you can manipulate language objects, e.g. modify expressions before they are evaluated (executed). But that is advanced.
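As a small taste of this: quote() creates an unevaluated expression, which we can modify before evaluating it with eval():
e = quote(1 + 2)
e[[1]] = as.name("-") # replace the function being called: + becomes -
eval(e)
## [1] -1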
The most interesting objects are data objects, and functions.
Objects always have a class:
(a = 3:1)
## [1] 3 2 1
class(a)
## [1] "integer"
and data objects have a length and a mode:
length(a)
## [1] 3
mode(a)
## [1] "numeric"
R does not have scalars: single numbers are vectors of length 1:
(a = 1)
## [1] 1
length(a)
## [1] 1
We can index (select) vector elements using [:
(a = c(1,2,3,4,5,10,11,12))
## [1] 1 2 3 4 5 10 11 12
a[3]
## [1] 3
a[3:5]
## [1] 3 4 5
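Other forms of indexing exist: negative indices drop elements, and a logical vector selects the elements where it is TRUE:
a[-(1:5)] # drop the first five elements
## [1] 10 11 12
a[a > 4] # select all elements larger than 4
## [1] 5 10 11 12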
Each data type has a special value, NA, denoting a “not available” (missing) value:
c(1,3,NA,5)
## [1] 1 3 NA 5
c(TRUE,TRUE,NA,FALSE)
## [1] TRUE TRUE NA FALSE
c("alice", "bob", NA, "dylan")
## [1] "alice" "bob" NA "dylan"
R has both integer and double representations of numeric, but you rarely need to know this.
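When it does matter, typeof() reveals the representation; an L suffix forces an integer constant:
typeof(1) # a double, even though it prints as 1
## [1] "double"
typeof(1L)
## [1] "integer"
typeof(1:3) # the colon operator creates integers
## [1] "integer"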
Data objects can have attributes, which carry metadata about the object:
a = 3
attr(a, "foo") = "bar"
a
## [1] 3
## attr(,"foo")
## [1] "bar"
Attributes with predefined semantics include names, class, and dim:
a = 1:4
attr(a, "names") = c("first", "second", "third", "fourth")
a
## first second third fourth
## 1 2 3 4
attr(a, "dim") = c(2, 2) # now a is interpreted as a 2 x 2 matrix:
a
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
## attr(,"names")
## [1] "first" "second" "third" "fourth"
Matrices and arrays are created from vectors, by setting the number of rows/columns, or the dimensions. They have a dim attribute, and the function dim() can be used to get or set it:
(m = matrix(1:10, nrow = 2, ncol = 5))
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 3 5 7 9
## [2,] 2 4 6 8 10
m[1:2, 2:3]
## [,1] [,2]
## [1,] 3 5
## [2,] 4 6
dim(m)
## [1] 2 5
dim(m) = c(5, 2) # reshapes (column-wise), but is not a transpose!
attributes(m)
## $dim
## [1] 5 2
m
## [,1] [,2]
## [1,] 1 6
## [2,] 2 7
## [3,] 3 8
## [4,] 4 9
## [5,] 5 10
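A true transpose is obtained with t():
t(m) # the transpose of the reshaped 5 x 2 matrix
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 2 3 4 5
## [2,] 6 7 8 9 10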
(a = array(1:24, c(2,3,4)))
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
##
## , , 3
##
## [,1] [,2] [,3]
## [1,] 13 15 17
## [2,] 14 16 18
##
## , , 4
##
## [,1] [,2] [,3]
## [1,] 19 21 23
## [2,] 20 22 24
a[,2,] # second slice for dimension 2, retain all others:
## [,1] [,2] [,3] [,4]
## [1,] 3 9 15 21
## [2,] 4 10 16 22
Mixing types in a single vector doesn’t go well: elements are coerced to the type that can ultimately hold everything:
c(1, FALSE)
## [1] 1 0
c(1, FALSE, "foo")
## [1] "1" "FALSE" "foo"
For combining objects of arbitrary type, lists can be used:
(a = list(1, c(TRUE, FALSE, TRUE, NA), c("foo", "bar")))
## [[1]]
## [1] 1
##
## [[2]]
## [1] TRUE FALSE TRUE NA
##
## [[3]]
## [1] "foo" "bar"
class(a)
## [1] "list"
Indexing lists is special: a single [ returns a list:
a[1]
## [[1]]
## [1] 1
a[2:1]
## [[1]]
## [1] TRUE FALSE TRUE NA
##
## [[2]]
## [1] 1
and a double [[ retrieves the contents of a single list element:
a[[1]]
## [1] 1
a[[2]]
## [1] TRUE FALSE TRUE NA
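List elements can be given names, and can then also be selected by name, using $ or [[:
names(a) = c("x", "y", "z")
a$y
## [1] TRUE FALSE TRUE NA
a[["y"]]
## [1] TRUE FALSE TRUE NA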
data.frame objects are VERY common in R, and are used to represent tabular data, with table columns potentially of different type:
(d = data.frame(a = 1:3, b = c("alice", "bob", "charly"), t = as.Date("2019-08-31") + c(1,3,5)))
## a b t
## 1 1 alice 2019-09-01
## 2 2 bob 2019-09-03
## 3 3 charly 2019-09-05
d[1:2, ] # first two rows
## a b t
## 1 1 alice 2019-09-01
## 2 2 bob 2019-09-03
d[, 2:3] # last two columns
## b t
## 1 alice 2019-09-01
## 2 bob 2019-09-03
## 3 charly 2019-09-05
d[2, 3] # single element, as value
## [1] "2019-09-03"
d[2, 3, drop = FALSE] # single element, but as data.frame
## t
## 2 2019-09-03
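Columns are commonly selected by name using $, and rows by a logical condition:
d$a
## [1] 1 2 3
d[d$a > 1, ] # all rows for which a exceeds 1
## a b t
## 2 2 bob 2019-09-03
## 3 3 charly 2019-09-05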
Functions are also objects; if you type their name, they are printed:
sd
## function (x, na.rm = FALSE)
## sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x),
## na.rm = na.rm))
## <bytecode: 0x55aecf1d0bc8>
## <environment: namespace:stats>
For many functions, the result is a bit obscure, e.g. mean:
mean
## function (x, ...)
## UseMethod("mean")
## <bytecode: 0x55aecf50edf0>
## <environment: namespace:base>
The UseMethod indicates that mean is a generic function, which has methods that depend on the class of the first argument; we list its methods by
methods(mean)
## [1] mean.Date mean.default mean.difftime mean.POSIXct mean.POSIXlt
## see '?methods' for accessing help and source code
and mean.default is the default method, called when mean is called with anything else (so, for example, a numeric vector). We can then list mean.default:
mean.default
## function (x, trim = 0, na.rm = FALSE, ...)
## {
## if (!is.numeric(x) && !is.complex(x) && !is.logical(x)) {
## warning("argument is not numeric or logical: returning NA")
## return(NA_real_)
## }
## if (na.rm)
## x <- x[!is.na(x)]
## if (!is.numeric(trim) || length(trim) != 1L)
## stop("'trim' must be numeric of length one")
## n <- length(x)
## if (trim > 0 && n) {
## if (is.complex(x))
## stop("trimmed means are not defined for complex data")
## if (anyNA(x))
## return(NA_real_)
## if (trim >= 0.5)
## return(stats::median(x, na.rm = FALSE))
## lo <- floor(n * trim) + 1
## hi <- n + 1 - lo
## x <- sort.int(x, partial = unique(c(lo, hi)))[lo:hi]
## }
## .Internal(mean(x))
## }
## <bytecode: 0x55aecfa28310>
## <environment: namespace:base>
and see that it calls, after optional trimming and handling of NA values, an .Internal version of mean, which in the end, for real numbers, calls a C function in the R source code (one that does more than a simple sum(x)/length(x)!)
This trick does not always work:
methods(quantile)
## [1] quantile.default* quantile.ecdf* quantile.POSIXt*
## see '?methods' for accessing help and source code
quantile.default
## Error in eval(expr, envir, enclos): object 'quantile.default' not found
We see that the method is marked with a *: use ?quantile to discover that quantile comes from package stats, then obtain the function with
stats:::quantile.default # (output suppressed)
What happens here is that package stats exports the default method for quantile, but not the function quantile.default; ::: allows one to peek into non-exported functions, for those who are curious what is going on.
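An alternative is getAnywhere() from package utils, which retrieves objects from any loaded namespace, exported or not:
getAnywhere("quantile.default") # (output suppressed)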
We can also create functions:
mean_plus_one = function(x) {
mean(x) + 1
}
R does not need a return statement: the result of the last expression in a function is the value returned:
mean_plus_one(c(1,2,3))
## [1] 3
If we want other arguments (like trim and na.rm) to also work here, we can pass them on using the ... trick:
mean_plus_one = function(x, ...) {
mean(x, ...) + 1
}
mean_plus_one(c(1,2,NA,3))
## [1] NA
mean_plus_one(c(1,2,NA,3), na.rm = TRUE)
## [1] 3
We can also create functions that have no name:
function(x){mean(x)+1}
## function(x){mean(x)+1}
and call them
(function(x){mean(x)+1})(c(5,6,7))
## [1] 7
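Since R 4.1.0, there is also a shorthand notation for anonymous functions, \(x):
(\(x) mean(x) + 1)(c(5, 6, 7))
## [1] 7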
We can also pass functions as arguments:
furniture = data.frame(
what = rep(c("table", "chair"), each = 2),
weight = c(25, 27, 8, 11))
aggregate(x = furniture["weight"], by = furniture["what"], FUN = max)
## what weight
## 1 chair 11
## 2 table 27
Here, max is passed as an argument to aggregate, and is applied to the x values of each group defined by by.
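An anonymous function can be passed just as well; for instance, to compute the weight range within each group:
aggregate(x = furniture["weight"], by = furniture["what"],
  FUN = function(x) max(x) - min(x))
## what weight
## 1 chair 3
## 2 table 2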
Where does R find things? In environments, similar to (in-memory) directories in your search path:
search()
## [1] ".GlobalEnv" "package:stats" "package:graphics"
## [4] "package:grDevices" "package:utils" "package:datasets"
## [7] "package:methods" "Autoloads" "package:base"
library(sf)
## Linking to GEOS 3.7.1, GDAL 2.4.2, PROJ 5.2.0
search()
## [1] ".GlobalEnv" "package:sf" "package:stats"
## [4] "package:graphics" "package:grDevices" "package:utils"
## [7] "package:datasets" "package:methods" "Autoloads"
## [10] "package:base"
As you see, loading a package puts it on the search path, right behind .GlobalEnv. .GlobalEnv is the global environment, where things like
a = 5:10
ls()
## [1] "a" "d" "furniture" "m"
## [5] "mean_plus_one"
a
## [1] 5 6 7 8 9 10
are put, and retrieved from.
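Functions evaluate in their own environment: variables assigned inside a function do not end up in .GlobalEnv:
f = function() { tmp = 2; tmp }
f()
## [1] 2
exists("tmp") # tmp was local to the evaluation of f
## [1] FALSE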
You can use R (RStudio) in interactive mode (called the console in RStudio), which helps with trying things out and learning, but you should not do serious work this way. Although R (optionally) saves your .GlobalEnv before it shuts down, reads it back in at the next startup, and also saves/restores your command history, working this way has large disadvantages:

- Objects in .GlobalEnv do not have a history, i.e. you cannot unambiguously find out how they were created.
- Everything is kept in one single environment, .GlobalEnv, which is like always doing all your work in your home directory.

Instead, a better way of working is to start every session with an empty workspace, and to keep your commands in a script that recreates everything from scratch.
This also means that the script needs to contain all the steps of your analysis: loading the required packages, reading the data, doing the computations, and writing the results.
Together with your data, this script provides a completely reproducible process!
In RStudio, you can make not reading/saving the workspace the default via Tools - Global Options - General - Workspace: un-check “Restore .RData into workspace at startup”, and set “Save workspace to .RData on exit” to “Never”.
R will not warn you before you push it to its limits; it only raises an error once you do:
x = 1:1e12
object.size(x) # clearly doesn't exist as such!
## 8000000000048 bytes
x + 1
## Error: cannot allocate vector of size 7450.6 Gb
By default, R does everything in main memory. If you do a computation that gradually exhausts memory, your computer will become very slow once it starts using virtual memory, and the operating system may react to further memory shortage by shutting down processes (not necessarily starting with R).
R will not tell you in advance how long a computation will take, unless someone has implemented a progress bar or counter. Waiting very long is often not very productive; try working with smaller datasets, then somewhat larger ones, and make some timings to get a feel for how long something will take:
system.time(a <- sin(runif(1e6))) # here <- is needed, = has a different meaning in a function call
## user system elapsed
## 0.040 0.000 0.041
system.time(a <- sin(runif(1e7)))
## user system elapsed
## 0.406 0.020 0.425
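To get a feel for the scaling, one can time a sequence of increasing sizes (a rough sketch; timings are machine-dependent, output omitted):
for (n in 10^(5:7))
  print(c(n = n, system.time(a <- sin(runif(n)))["elapsed"]))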