Week 2: R Basics

POP77001 Computer Programming for Social Scientists

Tom Paskhalis

Overview

  • Backstory
  • R operators and objects
  • Data structures and types
  • Indexing and subsetting
  • Attributes

Introduction to R

R Background

  • S (for statistics) is a programming language for statistical analysis developed in 1976 at AT&T Bell Labs.
  • Original S language and its extension S-PLUS were closed source.
  • In 1991 Ross Ihaka and Robert Gentleman began developing R, an open-source alternative to S.

R Overview

  • R is an interpreted language (like Python and Stata).
  • It is geared towards statistical analysis.
  • R is often used for interactive data analysis (one command at a time).
  • But it can also execute entire scripts in batch mode.
# Generate 5 random numbers from a standard normal distribution 
rnorm(5)
[1]  0.07951163 -0.97540480  1.66309412 -0.38210749  1.61278360

Operations

Operators

Key operators (infix functions) in R are:

  • Assignment (<-, <<-, =)
  • Arithmetic (+, -, *, ^, /, %/%, %%, %*%)
  • Boolean (&, &&, |, ||, !)
  • Relational (==, !=, >, >=, <, <=)
  • Membership (%in%)

Mathematical Operations

Arithmetic operations in R:

1 + 1
[1] 2
5 - 3
[1] 2
6 / 2
[1] 3
4 * 4
[1] 16
## Exponentiation, note that 2 ** 4 also works, but is not recommended
2 ^ 4
[1] 16

Advanced mathematical operations:

# Integer division
7 %/% 3
[1] 2
# Modulo operation (remainder of division)
7 %% 3
[1] 1
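
The relationship between the two operators can be sketched as follows. This is a minimal illustration with made-up values (the names x and y are not from the slides): the quotient and remainder reconstruct the original number, and with a negative dividend R rounds the quotient down, so the remainder takes the sign of the divisor.

```r
x <- 7
y <- 3
# Quotient and remainder reconstruct the original value
(x %/% y) * y + x %% y
# [1] 7
# With a negative dividend the quotient is rounded down (floored)
-7 %/% 3
# [1] -3
-7 %% 3
# [1] 2
# A common use of modulo: testing whether numbers are even
c(1, 2, 3, 4) %% 2 == 0
# [1] FALSE  TRUE FALSE  TRUE
```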

Logical Operations

3 != 1 # Not equal
[1] TRUE
3 > 3 # Greater than
[1] FALSE
# OR - TRUE if either first or second operand is TRUE, FALSE otherwise
FALSE | TRUE
[1] TRUE
# R also treats F and T as Boolean, but
# it is not recommended due to poor legibility
F | T
[1] TRUE
# Longer form '&&' (AND) and '||' (OR) evaluate
# to a single value and are preferable in programming control flow
FALSE && TRUE
[1] FALSE
# Shorter form '|' ('&') performs element-wise comparison
# and is often used in vectorised operations
c(FALSE, TRUE) | c(FALSE, FALSE) 
[1] FALSE  TRUE
# Starting from R version 4.3.0, the longer forms raise an error
# when given vectors of length greater than 1
c(FALSE, TRUE) || c(FALSE, FALSE) 
Error in c(FALSE, TRUE) || c(FALSE, FALSE): 'length = 2' in coercion to 'logical(1)'
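
When a single TRUE/FALSE is needed from a whole vector, one option is to reduce it first with any() or all(). The sketch below uses a hypothetical vector called votes; it also illustrates that the longer forms stop evaluating as soon as the result is known.

```r
votes <- c(FALSE, TRUE, TRUE)  # hypothetical logical vector
# Reduce a logical vector to a single value before using it in control flow
any(votes)
# [1] TRUE
all(votes)
# [1] FALSE
# '&&' and '||' short-circuit: the right-hand side is evaluated only if needed,
# so the stop() call below is never reached
FALSE && stop("this is never evaluated")
# [1] FALSE
# Both operands here have length 1, so '&&' is appropriate
if (length(votes) > 0 && any(votes)) {
  message("at least one element is TRUE")
}
```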

Operator Precedence

Operators are listed from highest to lowest precedence:

Operator             Description
:: :::               access variables in a namespace
$ @                  component / slot extraction
[ [[                 indexing
^                    exponentiation (right to left)
- +                  unary minus and plus
:                    sequence operator
%any% |>             special operators (including %% and %/%)
* /                  multiply, divide
+ -                  (binary) add, subtract
< > <= >= == !=      ordering and comparison
!                    negation
& &&                 and
| ||                 or
~                    as in formulae
-> ->>               rightwards assignment
<- <<-               assignment (right to left)
=                    assignment (right to left)
?                    help (unary and binary)

Operator Precedence: Examples

# Effectively, 5 > (3 * 2)
5 > 3 * 2
[1] FALSE
(5 > 3) * 2
[1] 2
# Effectively, 3 + (2 ^ 1)/2
3 + 2 ^ 1/2
[1] 4
3 + 2 ^ (1/2)
[1] 4.414214
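
A few further cases where precedence tends to surprise, given as a short sketch (the variable n is illustrative). When in doubt, parentheses make the intended order explicit.

```r
# Exponentiation binds tighter than unary minus: -(2 ^ 2)
-2 ^ 2
# [1] -4
(-2) ^ 2
# [1] 4
# Exponentiation is evaluated right to left: 2 ^ (3 ^ 2)
2 ^ 3 ^ 2
# [1] 512
# The sequence operator binds tighter than binary arithmetic: (1:n) - 1
n <- 3
1:n - 1
# [1] 0 1 2
1:(n - 1)
# [1] 1 2
# Negation binds more loosely than comparison: !(0 > 2), not (!0) > 2
!0 > 2
# [1] TRUE
```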

Assignment

R Objects

Everything that exists in R is an object.

John Chambers

  • Fundamentally, everything you are dealing with in R is an object.
  • That includes individual variables, datasets, functions and many other classes of objects.
  • The key reference to an object is its name.
  • Typically, the reference is established through assignment operation.

Assignment Operation

  • Assignment is the most important operation in R.
  • It is used to bind an object to a name.
  • <- is the standard assignment operator in R.
  • While = is also supported, it is not recommended.
x <- 3


x
[1] 3

Objects and Names

x <- 3
  • The way to read this is
    • “create an object (numeric vector of length 1) with element 3 and bind it to the name x”.
  • Thus, assignment operation does 2 things:
    • Creates an object.
    • Binds it to a name.

[Diagram: the name x points to an object holding the value 3 at a single memory address.]

Aliases

  • Creating (binding) another name (alias) to an object does not create a copy of the object.
# We can use tracemem() function to trace the memory address of an object
tracemem(x)
[1] "<0x6398ede5ada8>"
y <- x
tracemem(y)
[1] "<0x6398ede5ada8>"

[Diagram: the names x and y both point to the same object holding the value 3 at a single memory address.]

Copies

  • R will create a copy of an object when it is modified through one of the names bound to it.
  • This is also known as copy-on-modify semantics.
  • While it might seem innocuous, it can have implications for large objects.
x <- 5
# Pointing to the original object
tracemem(y)
[1] "<0x6398ede5ada8>"
# Pointing to a new object
tracemem(x)
[1] "<0x6398e99567e8>"
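
Copy-on-modify can also be observed without inspecting memory addresses. The sketch below uses hypothetical names a and b: modifying an element through one name leaves the object reachable through the other name untouched. On builds of R with memory profiling enabled, tracemem() would additionally report the moment the copy is made.

```r
a <- c(1, 2, 3)   # hypothetical names, chosen to avoid clashing with x and y
b <- a            # b is an alias: no copy is made yet
b[1] <- 10        # modifying through b makes R copy the object first
a                 # the original object is unchanged
# [1] 1 2 3
b
# [1] 10  2  3
```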

[Diagram: x points to an object holding the value 5 and y points to an object holding the value 3, at two separate memory addresses.]

Vector

Dimensionality & Homogeneity

Base R data structures can be classified along their:

  • dimensionality
  • homogeneity

5 main built-in data structures in R:

  • Atomic vector (vector)
  • Matrix (matrix)
  • Array (array)
  • List (list)
  • Data frame (data.frame)

Data Structures in R

Structure    Description                         Dimensionality   Data Type
vector       Atomic vector (scalar)              1d               homogeneous
matrix       Matrix                              2d               homogeneous
array        One-, two- or n-dimensional array   1d/2d/nd         homogeneous
list         List                                1d               heterogeneous
data.frame   Rectangular data                    2d               heterogeneous

Vectors in R

  • Atomic Vectors
    • Numeric
      • Integer
      • Double
    • Character
    • Logical
  • List

Atomic Vectors

  • Vector is the core building block of R
  • Vectors can be created with c() function (short for combine)
v <- c(1, 2, 3)
v
[1] 1 2 3
v <- c(v, 4) # Vectors are always flattened (even when nested)
v
[1] 1 2 3 4

Scalars

  • R has no scalars.
  • Single values are just vectors of length 1.
1[1]
[1] 1

Data Types

Main Data Types

4 common data types that are contained in R structures:

  • Character (character)
  • Double (double, also numeric)
  • Integer (integer, also numeric)
  • Logical/boolean (logical)

Character Vector

char_vec <- c("apple", "banana", "watermelon")
char_vec
[1] "apple"      "banana"     "watermelon"
# length() function gives the length of an R object
length(char_vec) 
[1] 3
# typeof() function returns the type of an R object
typeof(char_vec)
[1] "character"
# is.character() tests whether R object (vector/array/matrix)
# contains elements of type 'character'
is.character(char_vec)
[1] TRUE

Double Vector

# Note that even without decimal part R treats these numbers as double
dbl_vec <- c(300, 200, 4)
dbl_vec
[1] 300 200   4
typeof(dbl_vec)
[1] "double"
is.double(dbl_vec)
[1] TRUE

Integer Vector

# Note the 'L' suffix to make sure you get an integer rather than double
int_vec <- c(300L, 200L, 4L)
int_vec
[1] 300 200   4
typeof(int_vec)
[1] "integer"
is.integer(int_vec)
[1] TRUE
# Note that is.numeric() function is a generic way of testing
# whether vector contains numbers - either integers or double
is.numeric(int_vec)
[1] TRUE
is.numeric(dbl_vec)
[1] TRUE

Integer vs Double

  • Integers are used to store whole numbers (e.g. counts)
  • Unsigned 32-bit integer: \(2^{32} - 1 = 4,294,967,295\)
    • Signed 32-bit integer: \([-2,147,483,648 \mathrel{{.}\,{.}} 2,147,483,647]\)
  • Unsigned 64-bit integer: \(2^{64} - 1 = 18,446,744,073,709,551,615\)
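
What these limits mean in practice can be sketched with R's built-in .Machine constants. R's integer type is a signed 32-bit integer; going beyond its maximum in integer arithmetic gives NA, while doubles can represent larger whole numbers (exactly up to 2^53) but store many decimal fractions only approximately. The warning is suppressed here only to keep the sketch quiet.

```r
# Largest integer R can store (signed 32-bit)
.Machine$integer.max
# [1] 2147483647
# Going beyond it in integer arithmetic yields NA (with a warning)
overflow <- suppressWarnings(.Machine$integer.max + 1L)
overflow
# [1] NA
# Adding a double instead promotes the result to double
.Machine$integer.max + 1
# [1] 2147483648
# Doubles are binary floating point, so some decimals are inexact
0.1 + 0.2 == 0.3
# [1] FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))
# [1] TRUE
```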

Logical Vector

log_vec <- c(FALSE, FALSE, TRUE)
log_vec
[1] FALSE FALSE  TRUE
# While more concise, using T/F instead of TRUE/FALSE can be confusing
log_vec2 <- c(F, F, T)
log_vec2
[1] FALSE FALSE  TRUE
typeof(log_vec)
[1] "logical"
is.logical(log_vec)
[1] TRUE

Type Coercion

  • All elements of a vector must be of the same type.
  • If you try to combine vectors of different types, their elements will be coerced to the most flexible type.
# Note that logical vectors get coerced to 0/1 for FALSE/TRUE
c(dbl_vec, log_vec)
[1] 300 200   4   0   0   1
c(char_vec, int_vec)
[1] "apple"      "banana"     "watermelon" "300"        "200"       
[6] "4"         
# If no natural way of type conversion exists, NAs are introduced
as.numeric(char_vec)
[1] NA NA NA
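
The ordering from least to most flexible type can be checked directly with typeof(). The following is a small sketch (the vector answers is hypothetical data); it also shows why coercion of logicals to numbers is convenient for counting.

```r
# Coercion follows the order logical -> integer -> double -> character
typeof(c(TRUE, 1L))
# [1] "integer"
typeof(c(1L, 1.5))
# [1] "double"
typeof(c(1.5, "a"))
# [1] "character"
# Coercion of logicals to numbers makes counting TRUE values easy
answers <- c(TRUE, FALSE, TRUE, TRUE)  # hypothetical data
sum(answers)
# [1] 3
mean(answers)
# [1] 0.75
# Explicit conversion works when the strings look like numbers
as.integer(c("1", "20"))
# [1]  1 20
```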

Implicit Type Coercion

[Embedded post from Twitter (X) illustrating implicit type coercion; not reproduced here]

NA and NULL

  • R makes a distinction between:
    • NA - value exists, but is unknown (e.g. survey non-response)
    • NULL - object does not exist
  • NAs are used in data sets (missing data).
  • NULLs are used in function calls (optional arguments).
NA
[1] NA
NULL
NULL

NA and NULL: Example

na <- c(NA, NA, NA)
na
[1] NA NA NA
length(na)
[1] 3
null <- c(NULL, NULL, NULL)
null
NULL
length(null)
[1] 0
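
The difference also shows up in how the two behave inside vectors. As a brief sketch: NA is a value with a type and occupies a position, whereas NULL simply disappears when combined with other values.

```r
# NA has a type; a bare NA is logical but is coerced like any other element
typeof(NA)
# [1] "logical"
typeof(c(1.5, NA))
# [1] "double"
typeof(NA_character_)
# [1] "character"
# NULL has no length and vanishes when combined with other values
c(1, NULL, 2)
# [1] 1 2
# is.null() tests for NULL, is.na() tests for NA
is.null(NULL)
# [1] TRUE
is.na(NA)
# [1] TRUE
```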

Working with NAs

# Presence of NAs can lead to unexpected results
v_na <- c(1, 2, 3, NA, 5)
mean(v_na)
[1] NA
# NAs should be treated specially
mean(v_na, na.rm = TRUE)
[1] 2.75
# Remember NAs are missing values
# Thus result of comparing them is unknown
NA == NA
[1] NA
# is.na() is a special function that checks whether value is missing (NA)
is.na(v_na)
[1] FALSE FALSE FALSE  TRUE FALSE
# We can use such logical vectors for subsetting (more below)
v_na[!is.na(v_na)]
[1] 1 2 3 5
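
A few more idioms for handling missing values, sketched on a hypothetical survey item called responses. The last three lines illustrate that a logical operation returns NA only when the answer genuinely depends on the unknown value.

```r
responses <- c(4, NA, 2, NA, 5)  # hypothetical survey item with non-response
# Count and detect missing values
sum(is.na(responses))
# [1] 2
anyNA(responses)
# [1] TRUE
# Many summary functions accept na.rm = TRUE
sum(responses, na.rm = TRUE)
# [1] 11
# Logical operations with NA
NA & FALSE
# [1] FALSE
NA | TRUE
# [1] TRUE
NA & TRUE
# [1] NA
```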

Subsetting

Vector Indexing and Subsetting

  • Indexing in R starts from 1.
  • To subset a vector, use [] to index the elements you would like to select.
  • If you would like to select only a single element you can also use [[]].
vector[index]
vector[[index]]
dbl_vec[[1]]
[1] 300
dbl_vec[c(1,3)]
[1] 300   4

Summary of Vector Subsetting

Value               Example             Description
Positive integers   v[c(3, 1)]          Returns elements at specified positions
Negative integers   v[-c(3, 1)]         Omits elements at specified positions
Logical vectors     v[c(FALSE, TRUE)]   Returns elements where corresponding logical value is TRUE
Character vector    v[c("c", "a")]      Returns elements with matching names (only for named vectors)
Nothing             v[]                 Returns the original vector
0 (Zero)            v[0]                Returns a zero-length vector
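
Each row of the table can be tried out on a small named vector. The sketch below uses a hypothetical vector named_v; the comments describe what each call returns.

```r
named_v <- c(a = 10, b = 20, c = 30, d = 40)  # hypothetical named vector
named_v[c(3, 1)]         # positive integers: elements c and a, in that order
named_v[-c(3, 1)]        # negative integers: everything except positions 3 and 1
named_v[c(FALSE, TRUE)]  # logical: recycled to FALSE TRUE FALSE TRUE, so b and d
named_v[c("c", "a")]     # character: elements matched by name
named_v[]                # nothing: the original vector
length(named_v[0])       # zero: an empty vector
# [1] 0
```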

Generating Sequences

  • You can use : operator to generate vectors of indices for subsetting.
  • seq() function provides a generalisation of : for generating arithmetic progressions.
2:4
[1] 2 3 4
seq(from = 1, to = 4, by = 2)
[1] 1 3
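
Two related helpers, seq_len() and seq_along(), are worth knowing when sequences are used as indices. The sketch below (with a hypothetical empty vector) shows why they are often preferred to 1:length(x): the : operator counts downwards when the vector is empty.

```r
# seq() can also produce a fixed number of evenly spaced values
seq(from = 0, to = 1, length.out = 5)
# [1] 0.00 0.25 0.50 0.75 1.00
# With an empty vector, 1:length(x) counts down from 1 to 0
empty <- character(0)  # hypothetical empty vector
1:length(empty)
# [1] 1 0
# seq_along() correctly returns an empty sequence
seq_along(empty)
# integer(0)
seq_len(3)
# [1] 1 2 3
```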

Vector Subsetting: Examples

v
[1] 1 2 3 4
v[2:4]
[1] 2 3 4
# Argument names can be omitted for matching by position
v[seq(1,4,2)]
[1] 1 3
# All but the last element
v[-length(v)]
[1] 1 2 3
# Reverse order
v[seq(length(v), 1, -1)]
[1] 4 3 2 1

Vector Recycling

For operations that require vectors to be of the same length R recycles (reuses) the shorter one.

c(0, 1) + c(1, 2, 3, 4)
[1] 1 3 3 5
5 * c(1, 2, 3, 4)
[1]  5 10 15 20
c(1, 2, 3, 4)[c(TRUE, FALSE)]
[1] 1 3
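
Recycling also happens when the longer length is not a multiple of the shorter one, in which case R issues a warning. A minimal sketch (the warning is suppressed here only to keep the output clean):

```r
# Lengths 2 and 3: the shorter vector is recycled as 1, 2, 1
res <- suppressWarnings(c(1, 2) + c(1, 2, 3))
res
# [1] 2 4 4
# Recycling a length-1 vector is what makes comparison with a single value work
c(1, 2, 3, 4) > 2
# [1] FALSE FALSE  TRUE  TRUE
```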

which() function

Returns indices of TRUE elements in a vector.

char_vec
[1] "apple"      "banana"     "watermelon"
char_vec == "watermelon"
[1] FALSE FALSE  TRUE
which(char_vec == "watermelon")
[1] 3
dbl_vec[char_vec == "watermelon"]
[1] 4
dbl_vec[which(char_vec == "watermelon")]
[1] 4
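
Logical subsetting and which() give the same result above, but they differ when missing values are involved. A short sketch with a hypothetical vector scores:

```r
scores <- c(5, NA, 7)   # hypothetical vector with a missing value
# Logical subsetting keeps an NA for every comparison that is unknown
scores[scores > 6]
# [1] NA  7
# which() returns only the positions that are TRUE, so the NA is dropped
which(scores > 6)
# [1] 3
scores[which(scores > 6)]
# [1] 7
# which.max() returns the position of the largest (non-missing) value
which.max(scores)
# [1] 3
```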

Membership Operation

Operator %in% returns TRUE if an object on the left side is present in a sequence on the right.

"a" %in% "abc" # Note that R strings are not sequences
[1] FALSE
3 %in% c(1, 2, 3) # c(1, 2, 3) is a vector
[1] TRUE
!(3 %in% c(1, 2, 3))
[1] FALSE
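
Since %in% is vectorised over its left-hand side, it can test several values at once. The sketch below also shows match(), which returns positions rather than TRUE/FALSE, and grepl() as one way to test for a substring, which %in% does not do.

```r
# %in% is vectorised over its left-hand side
c("apple", "kiwi") %in% c("apple", "banana", "watermelon")
# [1]  TRUE FALSE
# match() returns the positions of the matches instead (NA if absent)
match(c("apple", "kiwi"), c("apple", "banana", "watermelon"))
# [1]  1 NA
# To test for a substring, use a pattern-matching function such as grepl()
grepl("a", "abc", fixed = TRUE)
# [1] TRUE
```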

Lists

Lists

  • As opposed to vectors, lists can contain elements of any type.
  • A list can also contain nested lists.
  • Outputs of most statistical models are lists.
  • Lists are constructed using list() function in R.
# We can combine different data types in a list and, optionally, name elements.
l <- list(
  "linear regression",
  model = "y ~ x",
  coefs = c(1.2, 0.5),
  list(
    x = 1:10,
    y = 1.2 + 1:10 * 0.5
  )
)
l
[[1]]
[1] "linear regression"

$model
[1] "y ~ x"

$coefs
[1] 1.2 0.5

[[4]]
[[4]]$x
 [1]  1  2  3  4  5  6  7  8  9 10

[[4]]$y
 [1] 1.7 2.2 2.7 3.2 3.7 4.2 4.7 5.2 5.7 6.2

R Object Structure

  • str() - one of the most useful functions in R.
  • It shows the structure of an arbitrary R object.
str(l)
List of 4
 $      : chr "linear regression"
 $ model: chr "y ~ x"
 $ coefs: num [1:2] 1.2 0.5
 $      :List of 2
  ..$ x: int [1:10] 1 2 3 4 5 6 7 8 9 10
  ..$ y: num [1:10] 1.7 2.2 2.7 3.2 3.7 4.2 4.7 5.2 5.7 6.2

List Subsetting

  • As with vectors, you can use [] to subset lists.
  • This always returns a list (of length one when a single element is selected).
  • Components of the list can be individually extracted using [[ and $ operators.
list[index]
list[[index]]
list$name
l[3]
$coefs
[1] 1.2 0.5
str(l[3])
List of 1
 $ coefs: num [1:2] 1.2 0.5
l[[3]]
[1] 1.2 0.5
# Only works with named elements
l$coefs
[1] 1.2 0.5
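
The three operators can be combined to reach nested elements. The following sketch uses a smaller hypothetical list called fit rather than l, so that it is self-contained.

```r
# A smaller list in the spirit of 'l' (names are illustrative)
fit <- list(model = "y ~ x", coefs = c(1.2, 0.5), data = list(x = 1:3))
# [ ] with several indices returns a shorter list
length(fit[c("model", "coefs")])
# [1] 2
# [[ ]] and $ can be chained to reach nested elements
fit[["data"]][["x"]]
# [1] 1 2 3
fit$data$x[2]
# [1] 2
# [[ ]] accepts a name stored in a variable, which $ does not
wanted <- "coefs"
fit[[wanted]]
# [1] 1.2 0.5
# Asking for a name that does not exist returns NULL
fit$residuals
# NULL
```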

Attributes

Attributes

  • All R objects can have attributes associated with them.
  • Attributes contain metadata that can be attached to any R object.
  • Technically, attributes can be thought of as named lists.
  • Names, dimensions and class are common examples of attributes.
  • They (and some others) have special functions for getting and setting them.

Attributes: Examples

v <- c(1, 2, 3, 4)
# Setting attributes as a named list
attributes(v) <- list(some_info = "This is a vector")
attributes(v)
$some_info
[1] "This is a vector"
# Setting individual attributes
attr(v, "more_info") <- "This vector contains numbers"
attr(v, "more_info")
[1] "This vector contains numbers"
attributes(v)
$some_info
[1] "This is a vector"

$more_info
[1] "This vector contains numbers"
# To set names for vector elements we can use names() function
names(v) <- c("a", "b", "c", "d")
v
a b c d 
1 2 3 4 
attr(,"some_info")
[1] "This is a vector"
attr(,"more_info")
[1] "This vector contains numbers"
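
One caveat worth illustrating: custom attributes are fragile. In the sketch below (the object w and its note attribute are hypothetical), structure() sets attributes at creation time, and subsetting then keeps the names but drops the custom attribute.

```r
# structure() creates an object and sets its attributes in one step
w <- structure(c(1, 2, 3), names = c("a", "b", "c"), note = "hypothetical attribute")
# The object now carries both a names and a note attribute
names(attributes(w))
# Subsetting keeps names but drops the custom attribute
names(attributes(w[1:2]))
# [1] "names"
attr(w[1:2], "note")
# NULL
```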

Factors

  • Factors form the basis of categorical data analysis in R
  • Values of nominal variables represent categories rather than numeric data
  • Examples are abundant in social sciences (gender, party, region, etc.)
  • Internally, in R factor variables are represented by integer vectors
  • With 2 additional attributes:
    • class() attribute which is set to factor
    • levels() attribute which defines allowed values

Factors: Example

gender <- c("male", "female", "female", "non-binary", "male")
gender
[1] "male"       "female"     "female"     "non-binary" "male"      
typeof(gender)
[1] "character"
# We use factor() function to convert character vector into factor
# Each unique element of the character vector becomes a level
gender <- factor(gender)
gender
[1] male       female     female     non-binary male      
Levels: female male non-binary
class(gender)
[1] "factor"
# Note that the data type of this vector is integer (and not character)
typeof(gender)
[1] "integer"

Factors: Example Continued

# Note that R automatically sorted the categories alphabetically
levels(gender)
[1] "female"     "male"       "non-binary"
# You can change the reference category using relevel() function
gender <- relevel(gender, ref = "male")
levels(gender)
[1] "male"       "female"     "non-binary"
# Or define an arbitrary ordering of levels
# using 'levels' argument in factor() function
gender <- factor(gender, levels = c("non-binary", "male", "female"))
levels(gender)
[1] "non-binary" "male"       "female"    
# Under the hood factors continue to be integer vectors
as.integer(gender)
[1] 2 3 3 1 2
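
Because factors are integer vectors under the hood, converting them to numbers can give surprising results. A minimal sketch with hypothetical data (age and party are illustrative names):

```r
# Factor built from numbers stored as text (hypothetical data)
age <- factor(c("30", "18", "30", "65"))
levels(age)
# [1] "18" "30" "65"
# as.numeric() returns the internal integer codes, not the original numbers
as.numeric(age)
# [1] 2 1 2 3
# Convert via character to recover the values
as.numeric(as.character(age))
# [1] 30 18 30 65
# Values outside the declared levels become NA
party <- factor(c("left", "right", "green"), levels = c("left", "right"))
is.na(party)
# [1] FALSE FALSE  TRUE
```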

Tabulation

  • table() function is very useful for describing discrete data.
  • It can be used for:
    • tabulating a single variable
    • creating contingency tables (crosstabs).
  • Implicitly, R treats tabulated variables as factors.
var_1 <- sample(c("female", "male", "non-binary"), size = 50, replace = TRUE)
var_2 <- sample(c(-1, 0, 1), size = 50, replace = TRUE)
table(var_1, var_2)
            var_2
var_1        -1  0  1
  female      6 11  3
  male        5  5  5
  non-binary  4  7  4

Factors in Crosstabs

var_2 <- factor(
  var_2, 
  levels = c(0, -1, 1)
)
table(var_2)
var_2
 0 -1  1 
23 15 12 
var_2 <- factor(
  var_2, 
  levels = c(0, -1, 1), 
  labels = c("centre", "left", "right")
)
table(var_1, var_2)
            var_2
var_1        centre left right
  female         11    6     3
  male            5    5     5
  non-binary      7    4     4

Arrays

Arrays and Matrices

  • Arrays are vectors with an added dimensionality attribute, which also determines their class
  • Dimensions and class can be accessed using dim() and class() functions
  • Arrays can have an arbitrary number of dimensions
  • Matrices are special cases of arrays that have just two dimensions
  • Arrays and matrices can be created using array() and matrix() functions
  • Or by adding dimension attribute with dim() function

Array: Example

# : operator can be used to generate vectors of sequential numbers
a <- 1:12
a
 [1]  1  2  3  4  5  6  7  8  9 10 11 12
class(a)
[1] "integer"
dim(a) <- c(3, 2, 2)
a
, , 1

     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

, , 2

     [,1] [,2]
[1,]    7   10
[2,]    8   11
[3,]    9   12
class(a)
[1] "array"

Matrix: Example

m <- 1:12
dim(m) <- c(3, 4)
m
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
# Alternatively, we could use matrix() function
m <- matrix(1:12, nrow = 3, ncol = 4)
m
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
# Note that length() function displays the length of underlying vector
length(m)
[1] 12
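
By default matrix() fills values column by column, as in the output above. The sketch below (m_row is a hypothetical name) shows the byrow argument, which changes this, along with a few functions for inspecting dimensions.

```r
# byrow = TRUE fills the matrix row by row instead of column by column
m_row <- matrix(1:6, nrow = 2, byrow = TRUE)  # hypothetical name
m_row
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6
# dim() returns the number of rows and columns
dim(m_row)
# [1] 2 3
# t() transposes a matrix
dim(t(m_row))
# [1] 3 2
# The dimensions are stored as an attribute of the underlying vector
attributes(m_row)
# $dim
# [1] 2 3
```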

Array and Matrix Subsetting

  • Subsetting higher-dimensional (> 1) structures is a generalisation of vector subsetting
  • But, since they are built upon vectors there is a nuance (albeit uncommon)
  • They are usually subset in 2 ways:
    • with multiple vectors, where each vector is a sequence of elements in that dimension
    • with 1 vector, in which case subsetting happens from the underlying vector
array[vector_1, vector_2, ..., vector_n]
array[vector]

Array Subsetting: Example

a
, , 1

     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

, , 2

     [,1] [,2]
[1,]    7   10
[2,]    8   11
[3,]    9   12
# Most common way
a[1, 2, 2]
[1] 10
# Specifying drop = FALSE after indices retains the original dimensionality of matrix/array
a[1, 2, 2, drop = FALSE]
, , 1

     [,1]
[1,]   10
# Here elements are subset from underlying vector (with repetition)
a[c(1, 2, 2)]
[1] 1 2 2

Matrix Subsetting: Example

m
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
# As with arrays, drop = FALSE prevents the result from being collapsed into a 1-dimensional vector
m[, 1, drop = FALSE]
     [,1]
[1,]    1
[2,]    2
[3,]    3
# Subset all rows, first two columns
m[1:nrow(m), 1:2]
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
# Note that vector recycling also applies here
m[c(TRUE, FALSE), -3]
     [,1] [,2] [,3]
[1,]    1    4   10
[2,]    3    6   12
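
How the two ways of subsetting relate can be sketched as follows. Since elements are stored column by column, the position of an element in the underlying vector is row + (column - 1) * number of rows. The matrix m is recreated here so that the example is self-contained.

```r
m <- matrix(1:12, nrow = 3, ncol = 4)
# Row 2, column 3 ...
m[2, 3]
# [1] 8
# ... is the same element as position 2 + (3 - 1) * 3 = 8 in the underlying vector
m[2 + (3 - 1) * nrow(m)]
# [1] 8
# A logical matrix of the same shape selects from the underlying vector
m[m > 9]
# [1] 10 11 12
# Without drop = FALSE a single column collapses to a vector
dim(m[, 1])
# NULL
dim(m[, 1, drop = FALSE])
# [1] 3 1
```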

R packages

  • R’s flexibility comes from its rich package ecosystem
  • Comprehensive R Archive Network (CRAN) is the official repository of R packages
  • At the moment it contains ~20K external packages
  • Use install.packages("<package_name>") function (note the quotes) to install packages that were released on CRAN
  • Check devtools package if you need to install a package from other sources (e.g. GitHub, Bitbucket, etc.)
  • Type library(<package_name>) to load installed packages
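
A common pattern is to install a package only when it is missing and then load it. The sketch below uses stats as a placeholder package name: it ships with R, so nothing is actually downloaded; in practice the name of a CRAN package would go in its place.

```r
pkg <- "stats"  # placeholder: 'stats' ships with R, so nothing is downloaded here
# Install the package only if it is not already available
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}
# character.only = TRUE lets library() take the name from a variable
library(pkg, character.only = TRUE)
# A function can also be called without attaching its package, using ::
stats::median(c(1, 3, 10))
# [1] 3
```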

Help!

R has an inbuilt help facility which provides more information about any function:

?length
help(dim)
  • The quality of documentation varies a lot across packages.
  • Stack Overflow is a good resource for many standard tasks.
  • For custom packages it is often helpful to check the issues page on GitHub.
  • E.g. for ggplot2: https://github.com/tidyverse/ggplot2/issues
  • Or, indeed, any search engine #LMDDGTFY

Next

  • Tutorial: R objects, attributes and subsetting
  • Next week: Control Flow in R