Base R

R is another data science language, geared more specifically towards statistics. It is an implementation of the S language, which was created by John Chambers; R itself was originally written by Ross Ihaka and Robert Gentleman. The R Core Team, the R Foundation and a very large number of contributors continually maintain and improve R.

In this course, we’re going to cover very little base R and focus mainly on the so-called tidyverse. However, a little base R goes a long way. You can install R as a kernel for Jupyter notebooks and then select it as an option. Alternatively, RStudio is a well-known R IDE, as is ESS in Emacs.

R installation

You can install R into a conda environment or directly from CRAN. On Windows, make sure to install Rtools as well. On a Mac, I always made sure that I had the developer tools installed. On Linux, it’s a good idea to have the development versions of R’s system dependencies installed.

# Comments in R start with the # sign, just like Python
# Arithmetic works like you would expect
1 + 2 * 3
7
# Assign a variable; note that y is a separate entity from x (try the same example in Python)
x = 5
y = x
x = 10
y
5
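
The independence of x and y is easier to see with a vector that is modified in place. The following is a minimal sketch that reuses the names x and y; in Python, the analogous example with a list would change y as well, because both names would refer to the same list.

```r
x = c(1, 2, 3)
y = x        # y behaves as its own copy of the values
x[1] = 10    # modify the first element of x only
x            # now 10, 2, 3
y            # still 1, 2, 3
```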

List the variables that we’ve created.

ls()
  1. 'x'
  2. 'y'

Honestly, the c function is perhaps one of the most important in R. It combines its arguments into a single vector.

z = c(1, 5, 8)
# Most operations are elementwise
z + 5 * z
  1. 6
  2. 30
  3. 48

But be careful: R will guess what you want to do. In this case, adding a vector of length 3 to a vector of length 6 recycles the shorter vector, repeating it twice to match the longer one.

z + c(z, z)
  1. 2
  2. 10
  3. 16
  4. 2
  5. 10
  6. 16
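
Recycling also happens when the longer length is not a multiple of the shorter one, but then R emits a warning. A small sketch of that case, where the variable w is just an illustrative name:

```r
z = c(1, 5, 8)
# Lengths 3 and 2: c(1, 2) is recycled to 1, 2, 1 and R warns that the
# longer length is not a multiple of the shorter length
w = z + c(1, 2)
w        # 2, 7, 9
# A single number is a vector of length 1, so it is recycled silently
z + 5    # 6, 10, 13
```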

R’s Boolean values are TRUE and FALSE.

3 == 4
3 == 3
FALSE
TRUE

Control flow

R has for loops, while loops and if/else statements. The syntax a : b creates a vector starting at a and ending at b. Note that R delimits blocks with curly braces rather than indentation: indentation doesn’t mean anything, so a multi-line body of a for, if or while statement must be wrapped in curly braces.

for (i in 1 : 6){
    if (i <= 3) {
        print("i is small")
    } else {
        print("i is large")
    }
}
[1] "i is small"
[1] "i is small"
[1] "i is small"
[1] "i is large"
[1] "i is large"
[1] "i is large"
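
The text mentions while loops without showing one, so here is a brief sketch. It also illustrates one pitfall of the a : b syntax: when b is smaller than a, the sequence counts down. The base function seq_len avoids this for loop ranges.

```r
# Count down from 3 with a while loop
i = 3
while (i > 0) {
    print(i)
    i = i - 1
}

# 1 : n counts down when n is 0, giving the vector 1, 0
n = 0
length(1 : n)       # 2
# seq_len(n) gives an empty vector instead, so a for loop over it runs zero times
length(seq_len(n))  # 0
```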

Data structures

R has three workhorse data structures. Its generic container is the list, created with the command list. Its matrix structure is the matrix, created with the command matrix. Its tabular structure is the data frame, created with the command data.frame.

x = list(a = 1 : 3, b = "character", c = list(a = 1 : 4, b = "character2"))

Now x is a list containing three elements: a is a vector, b is a string and c is itself another list. You can reference elements of x with $ or with double brackets.

x$a
x$b
x[[1]]
x[[2]]
  1. 1
  2. 2
  3. 3
'character'
  1. 1
  2. 2
  3. 3
'character'
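
Single and double brackets behave differently on lists, which the following sketch makes concrete. The names lst and nested are hypothetical and used only for illustration.

```r
lst = list(a = 1 : 3, b = "character")
class(lst[1])     # "list": single brackets return a sub-list
class(lst[[1]])   # "integer": double brackets return the element itself

# Elements of nested lists can be reached by chaining $ and brackets
nested = list(inner = list(a = 1 : 4))
nested$inner$a[2] # 2
```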

A brief technicality: x[1] returns a list containing the first element of x, whereas x[[1]] returns the element itself. Also note that R starts counting at 1 (unlike Python, which starts at 0). Let’s create a data frame.

x = data.frame(index = 3 : 7, letter = letters[3 : 7])
x
A data.frame: 5 × 2
index  letter
<int>  <chr>
3      c
4      d
5      e
6      f
7      g

The $ operator also works on data frames, as does bracket notation with row and column indices.

x[,1]
x[1 : 2,]
x[1,2]
  1. 3
  2. 4
  3. 5
  4. 6
  5. 7
A data.frame: 2 × 2
   index  letter
   <int>  <chr>
1  3      c
2  4      d
'c'
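
A few more ways of pulling pieces out of a data frame, as a rough sketch. The name df is hypothetical; the data are the same as above.

```r
df = data.frame(index = 3 : 7, letter = letters[3 : 7])
df$letter            # one column, returned as a vector
df[["index"]]        # the same idea with double brackets and a column name
df[df$index > 5, ]   # keep only the rows where index exceeds 5
nrow(df)             # 5 rows
ncol(df)             # 2 columns
```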

Finally, let’s cover matrices.

x = matrix( 1 : 6, 3, 2)
x
y = matrix( 1 : 6, 2, 3)
y
A matrix: 3 × 2 of type int
1  4
2  5
3  6
A matrix: 2 × 3 of type int
1  3  5
2  4  6
x[1,]
x[,1]
x[1, 2]
  1. 1
  2. 4
  1. 1
  2. 2
  3. 3
4
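
As the output shows, matrix fills its entries column by column. The sketch below shows the byrow argument of matrix, which fills row by row instead, along with a few common matrix operations from base R.

```r
x = matrix(1 : 6, 3, 2)
y = matrix(1 : 6, 2, 3)
# Fill across rows rather than down columns
xr = matrix(1 : 6, 3, 2, byrow = TRUE)
xr[1, 2]    # 2, whereas x[1, 2] is 4
dim(x)      # 3 2
t(x)        # transpose, a 2 x 3 matrix
p = x %*% y # matrix product, a 3 x 3 matrix
p[1, 1]     # 1 * 1 + 4 * 2 = 9
x * x       # elementwise product, not a matrix product
```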

Functions

R has functions and uses so-called lexical scoping. Arguments in a function call can be passed by position or by name, and named arguments can appear in any order. But, just like in Python, don’t get too cute with this.

pow = function(x, n) {
    x ^ n
}
pow(2, 3)
pow(x = 2, n = 3)
pow(n = 3, x = 2)
pow(n = 3, 2)
pow(3, 2)
8
8
8
8
9

Functions can be arguments to other functions. The ... argument collects any additional arguments so that they can be passed along, here to f.

doublefunc = function(f, x, ...){
    f(x, ...) * 2
}
doublefunc(pow, 2, 3)
doublefunc(exp, 2)
16
14.7781121978613
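
Here is a short sketch of a few related features: named arguments passing through ..., anonymous functions, and default argument values. The function pow2 is a hypothetical variant introduced only for this illustration.

```r
pow = function(x, n) {
    x ^ n
}
doublefunc = function(f, x, ...){
    f(x, ...) * 2
}
# Named arguments travel through ... unchanged
doublefunc(pow, 2, n = 3)          # 16
# An anonymous function can be passed directly
doublefunc(function(v) v + 1, 4)   # 10
# A default value is used when the argument is omitted
pow2 = function(x, n = 2) {
    x ^ n
}
pow2(3)                            # 9
pow2(3, 3)                         # 27
```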

Variables live within the scope of the environment in which they are defined. In the code below, the variable c1 isn’t found at the top level because it is defined only within f’s environment. Similarly, e isn’t found within f because it is defined only within the function g, which is itself defined within f. Finally, since c is a predefined function, I named the variable c1 rather than c. R will let you define c with no problem, but doing so creates confusion. Check whether a name is already in use before assigning to it.

a = 2
f = function(b){
    c1 = 3
    g = function(d){
        e = 4
        return(1)
    }
    # DOESN'T WORK: e exists only inside g
    #print(e)
    return(1)
}
# DOESN'T WORK: c1 exists only inside f
#print(c1)
f(1)
1
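
Lexical scoping also works in the other direction: a function can see variables defined in the environment that encloses it. The following is a minimal sketch; make_counter and counter are hypothetical names used only for illustration.

```r
a = 2
f = function(b){
    # a is not defined inside f, so R looks it up in the enclosing environment
    a + b
}
f(1)    # 3

make_counter = function(){
    count = 0
    function(){
        # <<- assigns to count in the enclosing environment, not a new local variable
        count <<- count + 1
        count
    }
}
counter = make_counter()
counter()   # 1
counter()   # 2
```

Each call to make_counter creates a fresh environment, so separate counters keep separate counts.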