All the material presented here, to the extent it is original, is available under CC-BY-SA.
R is a program that runs a REPL (read-evaluate-print loop). If you start it interactively, you see a prompt; if you type a command at the prompt and press “enter”, the command is evaluated and the result is printed:
1+2
## [1] 3
a <- 2
The second command is an assignment: 2 is assigned to a. The result is not printed. We can force printing by surrounding the assignment with parentheses:
(a <- 2)
## [1] 2
An alternative to assignment using the “left arrow” <- is using =
(a = 2)
## [1] 2
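The two are, however, not always exchangeable: in a function call, = matches an argument by name, whereas <- assigns in the calling environment. A small illustration:
mean(y <- 1:5) # assigns 1:5 to y, which is then passed as the first argument
## [1] 3
# mean(y = 1:5) would be an error: the argument x of mean would be missing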
RStudio is an application (program) that provides an IDE (integrated development environment) for R; it makes it easy to manage R data, R scripts, and R Markdown files, and to develop R packages. The basic RStudio is free of charge; commercial versions provide all kinds of extras, such as business integration, user management, collaboration, and cloud deployment.
RStudio uses R, but is a separate program. It makes sense to try to understand whether a particular problem is caused by RStudio or by R; sometimes this is difficult.
In his book “Extending R”, John Chambers summarizes R with two slogans:

Everything that exists in R is an object.
Everything that happens in R is a function call.
This also applies to infix expressions such as 2 + 3: these are just an alternative notation for the function call
`+`(2, 3)
## [1] 5
of the function + with arguments 2 and 3.
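Even indexing is a function call; the following two expressions are equivalent:
c(10, 20, 30)[2]
## [1] 20
`[`(c(10, 20, 30), 2)
## [1] 20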
We can also make our lives difficult by
`+` = `-` # replace + by -
3 + 2
## [1] 1
but we will not do this, and remove the bad plus by:
rm(`+`)
3 + 2 # check:
## [1] 5
See ?typeof for the complete list of object types. Some of them are concerned with language: you can manipulate language objects, e.g. modify expressions before they are evaluated (executed). But that is advanced.
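As a small taste of this: quote() creates an unevaluated expression, which we can modify before evaluating it with eval():
e = quote(1 + 2)
e[[1]] = as.name("-") # replace the function being called: + becomes -
eval(e)
## [1] -1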
The most interesting objects are data objects, and functions.
Objects always have a class:
(a = 3:1)
## [1] 3 2 1
class(a)
## [1] "integer"
and data objects have a length and a mode:
length(a)
## [1] 3
mode(a)
## [1] "numeric"
R does not have scalars: single numbers are vectors of length 1:
(a = 1)
## [1] 1
length(a)
## [1] 1
We can index (select) vector elements using [:
(a = c(1,2,3,4,5,10,11,12))
## [1] 1 2 3 4 5 10 11 12
a[3]
## [1] 3
a[3:5]
## [1] 3 4 5
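Other forms of indexing exist: negative indices drop elements, and a logical vector selects the elements where it is TRUE:
a[-(1:5)] # drop the first five elements
## [1] 10 11 12
a[a > 4] # select all elements larger than 4
## [1] 5 10 11 12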
Each data type has a special value, NA, denoting a “not available” (missing) value:
c(1,3,NA,5)
## [1] 1 3 NA 5
c(TRUE,TRUE,NA,FALSE)
## [1] TRUE TRUE NA FALSE
c("alice", "bob", NA, "dylan")
## [1] "alice" "bob" NA "dylan"
R has both integer and double representations of numeric, but you rarely need to know this.
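When it does matter, typeof() reveals the representation; an L suffix forces an integer constant:
typeof(1) # a double, even though it prints as 1
## [1] "double"
typeof(1L)
## [1] "integer"
typeof(1:3) # the colon operator creates integers
## [1] "integer"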
Data objects can have attributes, which carry metadata about the object:
a = 3
attr(a, "foo") = "bar"
a
## [1] 3
## attr(,"foo")
## [1] "bar"
Attributes with predefined semantics include names, class, and dim:
a = 1:4
attr(a, "names") = c("first", "second", "third", "fourth")
a
## first second third fourth
## 1 2 3 4
attr(a, "dim") = c(2, 2) # now a is interpreted as a 2 x 2 matrix:
a
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
## attr(,"names")
## [1] "first" "second" "third" "fourth"
Matrices and arrays are created from vectors, by setting the number of rows/columns, or the dimensions. They have a dim attribute, and the function dim() can be used to get or set it:
(m = matrix(1:10, nrow = 2, ncol = 5))
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 3 5 7 9
## [2,] 2 4 6 8 10
m[1:2, 2:3]
## [,1] [,2]
## [1,] 3 5
## [2,] 4 6
dim(m)
## [1] 2 5
dim(m) = c(5, 2) # reshapes (column-wise), but is not a transpose!
attributes(m)
## $dim
## [1] 5 2
m
## [,1] [,2]
## [1,] 1 6
## [2,] 2 7
## [3,] 3 8
## [4,] 4 9
## [5,] 5 10
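A true transpose is obtained with t():
t(m) # the transpose of the reshaped 5 x 2 matrix
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 2 3 4 5
## [2,] 6 7 8 9 10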
(a = array(1:24, c(2,3,4)))
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
##
## , , 3
##
## [,1] [,2] [,3]
## [1,] 13 15 17
## [2,] 14 16 18
##
## , , 4
##
## [,1] [,2] [,3]
## [1,] 19 21 23
## [2,] 20 22 24
a[,2,] # second slice for dimension 2, retain all others:
## [,1] [,2] [,3] [,4]
## [1,] 3 9 15 21
## [2,] 4 10 16 22
Mixing types in a single vector doesn’t go well: elements are coerced to the type that can ultimately hold everything:
c(1, FALSE)
## [1] 1 0
c(1, FALSE, "foo")
## [1] "1" "FALSE" "foo"
For combining objects of arbitrary type, lists can be used:
(a = list(1, c(TRUE, FALSE, TRUE, NA), c("foo", "bar")))
## [[1]]
## [1] 1
##
## [[2]]
## [1] TRUE FALSE TRUE NA
##
## [[3]]
## [1] "foo" "bar"
class(a)
## [1] "list"
Indexing lists is special: a single [ returns a list:
a[1]
## [[1]]
## [1] 1
a[2:1]
## [[1]]
## [1] TRUE FALSE TRUE NA
##
## [[2]]
## [1] 1
and a double [[ retrieves the contents of a single list element:
a[[1]]
## [1] 1
a[[2]]
## [1] TRUE FALSE TRUE NA
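List elements can be given names, and can then also be selected by name, using $ or [[:
names(a) = c("x", "y", "z")
a$y
## [1] TRUE FALSE TRUE NA
a[["y"]]
## [1] TRUE FALSE TRUE NA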
data.frame objects are VERY common in R, and are used to represent tabular data, with table columns potentially of different type:
(d = data.frame(a = 1:3, b = c("alice", "bob", "charly"), t = as.Date("2019-08-31") + c(1,3,5)))
## a b t
## 1 1 alice 2019-09-01
## 2 2 bob 2019-09-03
## 3 3 charly 2019-09-05
d[1:2, ] # first two rows
## a b t
## 1 1 alice 2019-09-01
## 2 2 bob 2019-09-03
d[, 2:3] # last two columns
## b t
## 1 alice 2019-09-01
## 2 bob 2019-09-03
## 3 charly 2019-09-05
d[2, 3] # single element, as value
## [1] "2019-09-03"
d[2, 3, drop = FALSE] # single element, but as data.frame
## t
## 2 2019-09-03
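Columns are commonly selected by name using $, and rows by a logical condition:
d$a
## [1] 1 2 3
d[d$a > 1, ] # all rows for which a exceeds 1
## a b t
## 2 2 bob 2019-09-03
## 3 3 charly 2019-09-05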
Functions are also objects; if you type their name, they are printed:
sd
## function (x, na.rm = FALSE)
## sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x),
## na.rm = na.rm))
## <bytecode: 0x55aecf1d0bc8>
## <environment: namespace:stats>
For many functions, the result is a bit obscure, e.g. mean:
mean
## function (x, ...)
## UseMethod("mean")
## <bytecode: 0x55aecf50edf0>
## <environment: namespace:base>
The UseMethod indicates that mean is a generic function, which has methods that depend on the class of the first argument; we list its methods by
methods(mean)
## [1] mean.Date mean.default mean.difftime mean.POSIXct mean.POSIXlt
## see '?methods' for accessing help and source code
and mean.default is the default method, called when mean is called with anything else (so, for example, a numeric vector). We can then list mean.default:
mean.default
## function (x, trim = 0, na.rm = FALSE, ...)
## {
## if (!is.numeric(x) && !is.complex(x) && !is.logical(x)) {
## warning("argument is not numeric or logical: returning NA")
## return(NA_real_)
## }
## if (na.rm)
## x <- x[!is.na(x)]
## if (!is.numeric(trim) || length(trim) != 1L)
## stop("'trim' must be numeric of length one")
## n <- length(x)
## if (trim > 0 && n) {
## if (is.complex(x))
## stop("trimmed means are not defined for complex data")
## if (anyNA(x))
## return(NA_real_)
## if (trim >= 0.5)
## return(stats::median(x, na.rm = FALSE))
## lo <- floor(n * trim) + 1
## hi <- n + 1 - lo
## x <- sort.int(x, partial = unique(c(lo, hi)))[lo:hi]
## }
## .Internal(mean(x))
## }
## <bytecode: 0x55aecfa28310>
## <environment: namespace:base>
and see that it calls, after optional trimming and handling of NA values, an .Internal version of mean, which in the end, for real numbers, calls a C function in the R source code (one that does more than a simple sum(x)/length(x)!)
This trick does not always work:
methods(quantile)
## [1] quantile.default* quantile.ecdf* quantile.POSIXt*
## see '?methods' for accessing help and source code
quantile.default
## Error in eval(expr, envir, enclos): object 'quantile.default' not found
We see that the method is marked with a *: use ?quantile to discover that quantile comes from package stats, then obtain the function with
stats:::quantile.default # (output suppressed)
What happens here is that package stats exports the default method for quantile, but not the function quantile.default; ::: allows one to peek into non-exported functions, for those who are curious what is going on.
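An alternative is getAnywhere() from package utils, which retrieves objects from any loaded namespace, exported or not:
getAnywhere("quantile.default") # (output suppressed)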
We can also create functions:
mean_plus_one = function(x) {
mean(x) + 1
}
R does not need a return statement: the result of the last expression in a function is the value returned:
mean_plus_one(c(1,2,3))
## [1] 3
If we want other arguments (like trim and na.rm) to also work here, we can pass them on using the ... trick:
mean_plus_one = function(x, ...) {
mean(x, ...) + 1
}
mean_plus_one(c(1,2,NA,3))
## [1] NA
mean_plus_one(c(1,2,NA,3), na.rm = TRUE)
## [1] 3
We can also create functions that have no name:
function(x){mean(x)+1}
## function(x){mean(x)+1}
and call them
(function(x){mean(x)+1})(c(5,6,7))
## [1] 7
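Since R 4.1.0, there is also a shorthand notation for anonymous functions, \(x):
(\(x) mean(x) + 1)(c(5, 6, 7))
## [1] 7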
We can also pass functions as arguments:
furniture = data.frame(
what = rep(c("table", "chair"), each = 2),
weight = c(25, 27, 8, 11))
aggregate(x = furniture["weight"], by = furniture["what"], FUN = max)
## what weight
## 1 chair 11
## 2 table 27
Here, max is passed as an argument to aggregate, and is applied to the x values of each group defined by by.
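An anonymous function can be passed just as well; for instance, to compute the weight range within each group:
aggregate(x = furniture["weight"], by = furniture["what"],
  FUN = function(x) max(x) - min(x))
## what weight
## 1 chair 3
## 2 table 2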
Where does R find things? In environments, similar to (in-memory) directories in your search path:
search()
## [1] ".GlobalEnv" "package:stats" "package:graphics"
## [4] "package:grDevices" "package:utils" "package:datasets"
## [7] "package:methods" "Autoloads" "package:base"
library(sf)
## Linking to GEOS 3.7.1, GDAL 2.4.2, PROJ 5.2.0
search()
## [1] ".GlobalEnv" "package:sf" "package:stats"
## [4] "package:graphics" "package:grDevices" "package:utils"
## [7] "package:datasets" "package:methods" "Autoloads"
## [10] "package:base"
As you see, loading a package puts it on the search path, right behind .GlobalEnv. .GlobalEnv is the global environment, where things like
a = 5:10
ls()
## [1] "a" "d" "furniture" "m"
## [5] "mean_plus_one"
a
## [1] 5 6 7 8 9 10
are put, and retrieved from.
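Functions evaluate in their own environment: variables assigned inside a function do not end up in .GlobalEnv:
f = function() { tmp = 2; tmp }
f()
## [1] 2
exists("tmp") # tmp was local to the evaluation of f
## [1] FALSE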
You can use R (RStudio) in interactive mode (called the console in RStudio), which helps with trying things out and learning, but you should not do serious work this way. Although R (optionally) saves your .GlobalEnv before it shuts down, reads it back in at the next startup, and also saves/restores your command history, working this way has large disadvantages:

- Objects in .GlobalEnv do not have a history, i.e. you cannot unambiguously find out how they were created.
- Everything is kept in one single environment, .GlobalEnv, which is like always doing all your work in your home directory.

Instead, a better way of working is to start every session with an empty workspace, and to keep your commands in a script that recreates everything from scratch.
This also means that the script needs to contain all the steps of your analysis: loading the required packages, reading the data, doing the computations, and writing the results.
Together with your data, this script provides a completely reproducible process!
In RStudio, you can make not reading/saving the workspace the default via Tools - Global Options - General - Workspace: un-check “Restore .RData into workspace at startup”, and set “Save workspace to .RData on exit” to “Never”.
R will not warn you before you push it to its limits; it only raises an error once you do:
x = 1:1e12
object.size(x) # clearly doesn't exist as such!
## 8000000000048 bytes
x + 1
## Error: cannot allocate vector of size 7450.6 Gb
By default, R does everything in main memory. If you do a computation that gradually exhausts memory, your computer will become very slow once it starts using virtual memory, and the operating system may react to further memory shortage by shutting down processes (not necessarily starting with R).
R will not tell you in advance how long a computation will take, unless someone has implemented a progress bar or counter. Waiting very long is often not very productive; try working with smaller datasets, then somewhat larger ones, and make some timings to get a feel for how long something will take:
system.time(a <- sin(runif(1e6))) # here <- is needed, = has a different meaning in a function call
## user system elapsed
## 0.040 0.000 0.041
system.time(a <- sin(runif(1e7)))
## user system elapsed
## 0.406 0.020 0.425
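To get a feel for the scaling, one can time a sequence of increasing sizes (a rough sketch; timings are machine-dependent, output omitted):
for (n in 10^(5:7))
  print(c(n = n, system.time(a <- sin(runif(n)))["elapsed"]))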