Week 2: Data Structures

POP88162 Introduction to Quantitative Research Methods

Tom Paskhalis

Department of Political Science, Trinity College Dublin

So Far

R is a free and open-source programming language with a focus on data analysis.
Code can be distributed as R scripts and R Markdown.
Everything in R is an object, the references to which are established through assignment.

Topics for Today

Data structures
Data types
Missing values
Vector subsetting

Review: Assignment Operations

<- is the standard assignment operator in R
While = is also supported, it is not recommended

x <- 3
x

[1] 3

democracy_2020 <- read.csv("../data/democracy_2020.csv")
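
One reason to prefer <- can be sketched as follows: inside a function call, = names an argument, whereas <- performs an assignment. The function square() and the names value, out1 and out2 are hypothetical and used only for illustration.

```r
# Hypothetical helper used only for this illustration
square <- function(value) value^2

# Inside a call, `=` only matches the argument name; no new object is created
out1 <- square(value = 4)
out1   # 16

# `<-` inside a call assigns `value` in the calling environment, then passes it on
out2 <- square(value <- 4)
out2   # 16
value  # 4
```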

Extra

R Documentation on assignment

Data Structures

Base R data structures can be classified along their:

dimensionality
homogeneity

5 main built-in data structures in R:

Atomic vector (vector)
Matrix (matrix)
Array (array)
List (list)
Data frame (data.frame)

Summary of Data Structures in R

Structure	Description	Dimensionality	Data Type
`vector`	Atomic vector (scalar)	1d	homogeneous
`matrix`	Matrix	2d	homogeneous
`array`	One-, two- or n-dimensional array	1d/2d/nd	homogeneous
`list`	List	1d	heterogeneous
`data.frame`	Rectangular data	2d	heterogeneous
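
As a minimal sketch of the table above, one object of each structure can be created and its dimensions inspected. All object names and values here are hypothetical.

```r
# Hypothetical example objects, one per base R data structure
vec <- c(1.5, 2.5, 3.5)                               # atomic vector: 1d, homogeneous
mat <- matrix(1:6, nrow = 2, ncol = 3)                # matrix: 2d, homogeneous
arr <- array(1:24, dim = c(2, 3, 4))                  # array: nd, homogeneous
lst <- list(1L, "a", TRUE)                            # list: 1d, heterogeneous
df  <- data.frame(id = 1:3, name = c("a", "b", "c"))  # data frame: 2d, heterogeneous

dim(vec)     # NULL (atomic vectors carry no dim attribute)
dim(mat)     # 2 3
dim(arr)     # 2 3 4
dim(df)      # 3 2
length(lst)  # 3
```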

Atomic Vectors

The vector is the core building block of R.
Vectors can be created with the c() function (short for combine).
Individual variables are stored as vectors.

v <- c(8, 10, 12)
v

[1]  8 10 12

v <- c(v, 14) # Vectors are always flattened (even when nested)
v

[1]  8 10 12 14
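
Two points from this slide can be checked directly: a single value (a "scalar") is simply a vector of length one, and nested calls to c() yield a flat vector. The names s and nested are illustrative.

```r
s <- 42
length(s)     # 1
is.vector(s)  # TRUE

# Nested calls to c() are flattened into a single vector
nested <- c(1, c(2, c(3, 4)))
nested          # 1 2 3 4
length(nested)  # 4
```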

Data Types

4 common data types that are contained in R structures:

Character (character)
Integer (integer)
Double/numeric (double/numeric)
Logical/boolean (logical)
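
A short sketch of how the four types can be told apart with typeof(). The integer case relies on the L suffix; int_vec is an illustrative name.

```r
typeof("text")  # "character"
typeof(5L)      # "integer" (the L suffix requests an integer)
typeof(5)       # "double" (numbers are stored as double by default)
typeof(TRUE)    # "logical"

# Both integers and doubles count as numeric
int_vec <- c(1L, 2L, 3L)
is.numeric(int_vec)  # TRUE
is.integer(int_vec)  # TRUE
is.double(int_vec)   # FALSE
```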

Character Vector

char_vec <- c("apple", "banana", "watermelon")
char_vec

[1] "apple"      "banana"     "watermelon"

length(char_vec) # length() function gives the length of an R object

[1] 3

is.character(char_vec)

[1] TRUE

Numeric Vector

num_vec <- c(300, 200, 4)
num_vec

[1] 300 200   4

typeof(num_vec) # typeof() function returns the type of an R object

[1] "double"

is.numeric(num_vec)

[1] TRUE

Logical Vector

log_vec <- c(FALSE, FALSE, TRUE)
log_vec

[1] FALSE FALSE  TRUE

# T and F are more concise, but they are ordinary variables that can be overwritten, so prefer TRUE/FALSE
log_vec2 <- c(F, F, T)
log_vec2

[1] FALSE FALSE  TRUE

typeof(log_vec)

[1] "logical"

is.logical(log_vec)

[1] TRUE

# Compare two vectors element-by-element
log_vec == log_vec2

[1] TRUE TRUE TRUE

Type Coercion in Vectors

All elements of a vector must be of the same type
If you try to combine vectors of different types, their elements will be coerced to the most flexible type

# Note that logical vectors get coerced to 0/1 for FALSE/TRUE
c(num_vec, log_vec)

[1] 300 200   4   0   0   1

c(char_vec, num_vec)

[1] "apple"      "banana"     "watermelon" "300"        "200"       
[6] "4"

# If no natural way of type conversion exists, NAs are introduced
as.numeric(char_vec)

[1] NA NA NA
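
The ordering of "flexibility" can be sketched as logical, then integer, then double, then character. The example below also shows a common use of logical-to-numeric coercion, namely counting TRUE values. The vector answers is hypothetical.

```r
# Combining two types yields the more flexible one
typeof(c(TRUE, 1L))   # "integer"
typeof(c(1L, 2.5))    # "double"
typeof(c(2.5, "a"))   # "character"

# Coercion of logicals makes counting and proportions straightforward
answers <- c(TRUE, FALSE, TRUE, TRUE)
sum(answers)   # 3
mean(answers)  # 0.75

# Explicit conversion works when the text looks like a number
as.numeric("3.14")  # 3.14
```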

Missing Values in R

Missing values are represented by a special value (NA) in R.
R makes a distinction between:
- NA - value exists, but is unknown (e.g. survey non-response)
- NULL - object does not exist

Extra

R Documentation on NA

NA and NULL Example

na <- c(NA, NA, NA)
na

[1] NA NA NA

length(na)

[1] 3

null <- c(NULL, NULL, NULL)
null

NULL

length(null)

[1] 0
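
The difference also shows up when NA and NULL are mixed with ordinary values: NA occupies a position, while NULL disappears. A minimal sketch with illustrative names:

```r
with_na   <- c(1, NA, 3)
with_null <- c(1, NULL, 3)

length(with_na)    # 3 (the NA keeps its place)
length(with_null)  # 2 (the NULL is dropped)

is.na(with_na)  # FALSE TRUE FALSE
is.null(NULL)   # TRUE
is.null(NA)     # FALSE

# NA adapts to the type of the vector it sits in
typeof(NA)        # "logical"
typeof(with_na)   # "double"
typeof(NULL)      # "NULL"
```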

Working with NAs

# Presence of NAs can lead to unexpected results
v_na <- c(1, 2, 3, NA, 5)
mean(v_na)

[1] NA

# NAs should be treated specially
mean(v_na, na.rm = TRUE)

[1] 2.75

# Remember NAs are missing values
# Thus result of comparing them is unknown
NA == NA

[1] NA

# is.na() is a special function that checks whether each value is missing (NA)
is.na(v_na)

[1] FALSE FALSE FALSE  TRUE FALSE

# We can use such logical vectors for subsetting (more below)
v_na[!is.na(v_na)]

[1] 1 2 3 5
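
Before dropping missing values, it is often useful to know how many there are. The sketch below uses a hypothetical vector ages; the functions are all from base R.

```r
ages <- c(23, NA, 31, 45, NA)  # hypothetical values with two missing entries

anyNA(ages)        # TRUE
sum(is.na(ages))   # 2 (number of missing values)
mean(is.na(ages))  # 0.4 (share of missing values)

# Many summary functions accept na.rm = TRUE, just like mean()
sum(ages, na.rm = TRUE)  # 99
max(ages, na.rm = TRUE)  # 45
```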

Vector Indexing and Subsetting

To subset a vector, use [] to index the elements you would like to select:

vector[index]

num_vec[1]

[1] 300

num_vec[c(1,3)]

[1] 300   4

Summary of Vector Subsetting

Value	Example	Description
Positive integers	`v[c(3, 1)]`	Returns elements at specified positions
Negative integers	`v[-c(3, 1)]`	Omits elements at specified positions
Logical vectors	`v[c(FALSE, TRUE)]`	Returns elements where corresponding logical value is `TRUE`
Character vector	`v[c("c", "a")]`	Returns elements with matching names (only for named vectors)
Nothing	`v[]`	Returns the original vector
0 (Zero)	`v[0]`	Returns a zero-length vector
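
Each row of the table can be tried on a small named vector. The vector scores and its names are hypothetical; a named vector is used so that the character-vector row also works.

```r
scores <- c(a = 8, b = 10, c = 12)  # hypothetical named vector

scores[c(3, 1)]         # elements 3 and 1, in that order: c = 12, a = 8
scores[-c(3, 1)]        # everything except elements 3 and 1: b = 10
scores[c(FALSE, TRUE)]  # logical index is recycled to FALSE TRUE FALSE: b = 10
scores[c("c", "a")]     # elements named "c" and "a": c = 12, a = 8
scores[]                # the original vector
scores[0]               # a zero-length vector
```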

Generating Sequences for Subsetting

You can use the : operator to generate vectors of indices for subsetting.
The seq() function provides a generalization of : for generating arithmetic progressions.

2:4

[1] 2 3 4

seq(from = 1, to = 4, by = 2)

[1] 1 3
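
A few further variations, offered as a sketch: seq() can take a target length or a negative step, and seq_along() generates the indices of an existing vector. The names items and empty are illustrative.

```r
seq(from = 0, to = 1, length.out = 5)  # 0.00 0.25 0.50 0.75 1.00
seq(10, 1, by = -3)                    # 10 7 4 1

items <- c("x", "y", "z")
seq_along(items)  # 1 2 3

# seq_along() is safer than 1:length() when a vector may be empty
empty <- character(0)
seq_along(empty)  # integer(0)
1:length(empty)   # 1 0 (counts down from 1 to 0)
```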

Vector Subsetting Examples

# v <- c(8, 10, 12, 14)
v[2:4]

[1] 10 12 14

# Argument names in seq() can be omitted
v[seq(1,4,2)]

[1]  8 12

# All but the last element
v[-length(v)]

[1]  8 10 12

# Reverse order
v[seq(length(v),1,-1)]

[1] 14 12 10  8
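
Subsetting by a logical condition follows the same pattern: a comparison produces a logical vector, which then selects the matching elements. The vector w below holds the same values as v and is redefined only so that the sketch runs on its own.

```r
w <- c(8, 10, 12, 14)

w > 10          # FALSE FALSE TRUE TRUE
w[w > 10]       # 12 14
w[w %% 4 == 0]  # 8 12 (values divisible by 4)
which(w > 10)   # 3 4 (positions rather than values)

# rev() is a shorter way to reverse the order
rev(w)          # 14 12 10 8
```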

Vector Recycling

For operations that require vectors to be of the same length, R recycles (reuses) the shorter one:

c(0, 1) + c(1, 2, 3, 4)

[1] 1 3 3 5

5 * c(1, 2, 3, 4)

[1]  5 10 15 20

c(1, 2, 3, 4)[c(TRUE, FALSE)]

[1] 1 3
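
When the longer length is not a multiple of the shorter one, arithmetic still recycles but R issues a warning. A minimal sketch with illustrative names a and b:

```r
a <- c(1, 2, 3)
b <- c(10, 20)

# b is recycled to 10 20 10; the result is computed, with a warning
res <- suppressWarnings(a + b)
res  # 11 22 13

# Confirm that the unsuppressed operation does raise a warning
warned <- tryCatch({ a + b; FALSE }, warning = function(w) TRUE)
warned  # TRUE
```

Silent recycling can hide mistakes, so it is worth checking lengths with length() when vectors are expected to match.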

Lists

As opposed to vectors, lists can contain elements of any type.
A list can also contain nested lists.
Lists are constructed using the list() function in R.
Data frames are just lists with some special characteristics!

# We can combine different data types in a list and, optionally, name elements (e.g. B below)
l <- list(2:4, list("a"), B = c(TRUE, FALSE, FALSE))
l

[[1]]
[1] 2 3 4

[[2]]
[[2]][[1]]
[1] "a"


$B
[1]  TRUE FALSE FALSE
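
The claim that data frames are lists can be checked directly: each column is one list element. The data frame toy and its contents are hypothetical.

```r
toy <- data.frame(id = 1:3, passed = c(TRUE, FALSE, TRUE))  # hypothetical data

is.list(toy)  # TRUE
typeof(toy)   # "list"
length(toy)   # 2 (one element per column)
names(toy)    # "id" "passed"

# Columns can be extracted like list elements
toy$passed    # TRUE FALSE TRUE
toy[["id"]]   # 1 2 3
```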

R Object Structure

str() is one of the most useful functions in R.
It shows the structure of an arbitrary R object.

str(l)

List of 3
 $  : int [1:3] 2 3 4
 $  :List of 1
  ..$ : chr "a"
 $ B: logi [1:3] TRUE FALSE FALSE

List Subsetting

As with vectors, you can use [] to subset lists.
This always returns a list (of length one if a single element is selected).
Components of the list can be individually extracted using the [[ and $ operators.

list[index]
list[[index]]
list$name

List Subsetting Examples

l

[[1]]
[1] 2 3 4

[[2]]
[[2]][[1]]
[1] "a"


$B
[1]  TRUE FALSE FALSE

l[3]

$B
[1]  TRUE FALSE FALSE

str(l[3])

List of 1
 $ B: logi [1:3] TRUE FALSE FALSE

l[[3]]

[1]  TRUE FALSE FALSE

# Only works with named elements
l$B

[1]  TRUE FALSE FALSE
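
The operators can also be chained to reach into nested elements. In the sketch below, l2 has the same contents as l and is redefined only so that the code runs on its own.

```r
l2 <- list(2:4, list("a"), B = c(TRUE, FALSE, FALSE))

# [] with several indices returns a longer list
length(l2[c(1, 3)])  # 2

# [[ ]] can be chained to reach into a nested list
l2[[2]][[1]]  # "a"

# Once a component is extracted, ordinary vector subsetting applies
l2[[1]][2]    # 3
l2$B[1]       # TRUE

# [[ ]] also accepts a name, equivalent to $
l2[["B"]]     # TRUE FALSE FALSE
```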

Next

Tutorial:
- Working with variables
1 R Assignment due:
- 08:59 Tuesday, 4 February
Next week:
- Probability distributions
