Week 4: Data Frames

POP88162 Introduction to Quantitative Research Methods

Tom Paskhalis

Department of Political Science, Trinity College Dublin

So Far

The vector is the core data structure in R.
A vector holds elements of a single data type (e.g. character, numeric or logical).
Probabilities for critical values can be calculated using R.

Topics for Today

Lists
Data frames
Working with data frames

Review: Data Structures

Structure	Description	Dimensionality	Data Type
`vector`	Atomic vector (scalar)	1d	homogeneous
`matrix`	Matrix	2d	homogeneous
`array`	One-, two- or n-dimensional array	1d/2d/nd	homogeneous
`list`	List	1d	heterogeneous
`data.frame`	Rectangular data	2d	heterogeneous
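
The dimensionality and data type columns of the table can be checked directly in R. The following is a minimal sketch; the object names and values are made up for illustration only.

```r
# Illustrative objects, one per structure in the table
v <- c(1.5, 2, 3)
m <- matrix(1:6, nrow = 2)
a <- array(1:24, dim = c(2, 3, 4))
l <- list(1, "a", TRUE)
d <- data.frame(x = 1:2, y = c("a", "b"))

# Vectors and lists are one-dimensional: they carry no dim attribute
dim(v)  # NULL
dim(l)  # NULL

# Matrices, arrays and data frames report their dimensions
dim(m)  # 2 3
dim(a)  # 2 3 4
dim(d)  # 2 2

# Homogeneous structures coerce mixed input to a common type
mixed <- c(1, "a", TRUE)
class(mixed)  # "character"
```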

Lists

As opposed to vectors, lists can contain elements of any type.
A list can also contain other (nested) lists.
Lists are constructed using the list() function in R.
Data frames are just lists with some special characteristics!

# We can combine different data types in a list and, optionally, name elements (e.g. B below)
l <- list(2:4, list("a"), B = c(TRUE, FALSE, FALSE))
l

[[1]]
[1] 2 3 4

[[2]]
[[2]][[1]]
[1] "a"


$B
[1]  TRUE FALSE FALSE

List Subsetting

As with vectors, you can use [] to subset lists.
This always returns a list (of length one when a single element is selected).
Individual components of a list can be extracted using the [[ and $ operators.

list[index]
list[[index]]
list$name
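
The difference between the three forms is easiest to see side by side. A minimal sketch, reusing the list `l` from the previous slide:

```r
# Same list as on the "Lists" slide
l <- list(2:4, list("a"), B = c(TRUE, FALSE, FALSE))

# Single brackets always return a list (here of length one)
first_as_list <- l[1]
class(first_as_list)   # "list"
length(first_as_list)  # 1

# Double brackets extract the element itself
first_element <- l[[1]]
class(first_element)   # "integer"
first_element          # 2 3 4

# $ extracts a named element and is equivalent to l[["B"]]
l$B
identical(l$B, l[["B"]])  # TRUE

# Nested elements can be reached by chaining [[ ]]
l[[2]][[1]]               # "a"
```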

Data Frames

The data frame is the workhorse of data analysis in R.
Despite their matrix-like appearance, data frames are lists of equal-length vectors.
Data frames can be created with the data.frame() function, using named vectors as input.

df <- data.frame(
    x = 1:4,
    y = c("a", "b", "c", "d"),
    z = c(TRUE, FALSE, FALSE, TRUE)
)
df

  x y     z
1 1 a  TRUE
2 2 b FALSE
3 3 c FALSE
4 4 d  TRUE

Data Frames: Example

# str() applied to a data frame is useful for checking the type of each variable
str(df)

'data.frame':   4 obs. of  3 variables:
 $ x: int  1 2 3 4
 $ y: chr  "a" "b" "c" "d"
 $ z: logi  TRUE FALSE FALSE TRUE

# dim() behaves as it does for a matrix, returning the number of rows and columns, respectively
dim(df)

[1] 4 3

# If we want to explicitly extract the number of rows
nrow(df)

[1] 4

# Which is equivalent to
dim(df)[1]

[1] 4

# If we want to explicitly extract the number of columns
ncol(df)

[1] 3

# In contrast to a matrix, length() of a data frame returns the length of the underlying list,
# which is simply the number of columns
length(df)

[1] 3
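
That length() returns the number of columns follows from the data frame being a list under the hood. A short sketch that makes this visible, rebuilding the same `df` so the block runs on its own:

```r
df <- data.frame(
    x = 1:4,
    y = c("a", "b", "c", "d"),
    z = c(TRUE, FALSE, FALSE, TRUE)
)

# A data frame is a list with class "data.frame"
is.list(df)  # TRUE
typeof(df)   # "list"
class(df)    # "data.frame"

# Each column is one element of the list, and all elements have the same length
col_lengths <- sapply(df, length)
col_lengths

# So list-style extraction works on columns
df[["y"]]
```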

Creating Data Frame: Example

l <- list(x = 1:5, y = letters[1:5], z = rep(c(TRUE, FALSE), length.out = 5))
l

$x
[1] 1 2 3 4 5

$y
[1] "a" "b" "c" "d" "e"

$z
[1]  TRUE FALSE  TRUE FALSE  TRUE

df <- data.frame(l)
df

  x y     z
1 1 a  TRUE
2 2 b FALSE
3 3 c  TRUE
4 4 d FALSE
5 5 e  TRUE

str(df)

'data.frame':   5 obs. of  3 variables:
 $ x: int  1 2 3 4 5
 $ y: chr  "a" "b" "c" "d" ...
 $ z: logi  TRUE FALSE TRUE FALSE TRUE
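
The conversion works here because all elements of the list have the same length. As a rough illustration of what happens otherwise, the sketch below uses a hypothetical list `bad` with elements of lengths 5 and 3, and wraps the call in tryCatch() so that the script keeps running:

```r
# Hypothetical list whose elements have lengths 5 and 3
bad <- list(x = 1:5, y = letters[1:3])

# data.frame() signals an error because 3 does not divide 5 evenly
result <- tryCatch(
  data.frame(bad),
  error = function(e) e
)
inherits(result, "error")  # TRUE
conditionMessage(result)

# A length-one element, by contrast, is recycled to fill the column
ok <- data.frame(list(x = 1:5, flag = TRUE))
nrow(ok)                   # 5
```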

Subsetting Data Frame

In subsetting data frames the techniques of subsetting matrices and lists are combined:
- If you subset with a single vector, it behaves as a list
```
data_frame[column_indices]
```
```
data_frame[column_name(s)]
```
```
data_frame$column_name
```
- If you subset with two vectors, it behaves as a matrix.
```
data_frame[row_indices, column_indices]
```
```
data_frame[row_indices, column_name(s)]
```

Subsetting Columns: Example

# Like a list
df[c("x", "z")]

  x     z
1 1  TRUE
2 2 FALSE
3 3  TRUE
4 4 FALSE
5 5  TRUE

# Like a matrix
df[,c("x", "z")]

  x     z
1 1  TRUE
2 2 FALSE
3 3  TRUE
4 4 FALSE
5 5  TRUE
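
With several columns the two styles give the same result, but they differ when a single column is selected. A minimal sketch of that difference, rebuilding `df` so the block is self-contained:

```r
df <- data.frame(x = 1:5, y = letters[1:5], z = rep(c(TRUE, FALSE), length.out = 5))

# List-style: a single column stays a data frame
class(df["x"])                  # "data.frame"

# Matrix-style: a single column is simplified to a vector by default
class(df[, "x"])                # "integer"

# drop = FALSE keeps the data frame structure
class(df[, "x", drop = FALSE])  # "data.frame"

# $ and [[ ]] both return the underlying vector
identical(df$x, df[["x"]])      # TRUE
```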

Subsetting Rows: Example

# Subsetting on an existing binary variable coded as logical
df[df$z == TRUE,]

  x y    z
1 1 a TRUE
3 3 c TRUE
5 5 e TRUE

# Subsetting on a dynamically created logical vector
df[df$y == "b",]

  x y     z
2 2 b FALSE

# Note that the inner expression evaluates to a logical vector
df$y == "b"

[1] FALSE  TRUE FALSE FALSE FALSE
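
Because the row index is just a logical vector, conditions can be combined with the usual logical operators. A small sketch, again rebuilding `df`; the particular conditions are chosen only for illustration:

```r
df <- data.frame(x = 1:5, y = letters[1:5], z = rep(c(TRUE, FALSE), length.out = 5))

# Rows where z is TRUE and x is greater than 1
both <- df[df$z & df$x > 1, ]
both$x        # 3 5

# Rows where y is either "a" or "e"
either <- df[df$y %in% c("a", "e"), ]
either$x      # 1 5

# A logical column can be used directly; comparing it with TRUE is not needed
identical(df[df$z, ], df[df$z == TRUE, ])  # TRUE
```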

Manipulating Columns

# Columns can be created or modified by assignment
# (provided the right-hand side object has the correct length)
df["r"] <- rnorm(5, mean = 3, sd = 2)
df

  x y     z         r
1 1 a  TRUE  6.429766
2 2 b FALSE -1.264687
3 3 c  TRUE  2.687503
4 4 d FALSE  2.000350
5 5 e  TRUE  5.147361

# Individual columns can also be selected (and created) with the $ operator
df$r_standardised <- (df$r - mean(df$r)) / sd(df$r)
df

  x y     z         r r_standardised
1 1 a  TRUE  6.429766      1.1486898
2 2 b FALSE -1.264687     -1.4283635
3 3 c  TRUE  2.687503     -0.1046822
4 4 d FALSE  2.000350     -0.3348260
5 5 e  TRUE  5.147361      0.7191820
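
A quick way to check a standardised variable is to confirm that its mean is 0 and its standard deviation is 1. The sketch below assumes an arbitrary seed (the slides do not set one), so the draws differ from those shown above; it also compares the result with base R's scale().

```r
set.seed(42)  # arbitrary seed, only to make the draw reproducible
r <- rnorm(5, mean = 3, sd = 2)

r_standardised <- (r - mean(r)) / sd(r)

# Up to floating-point error, the mean is 0 and the standard deviation is 1
abs(mean(r_standardised)) < 1e-12
abs(sd(r_standardised) - 1) < 1e-12

# scale() performs the same centring and scaling (it returns a one-column matrix)
same_as_scale <- all.equal(as.numeric(scale(r)), r_standardised)
same_as_scale
```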

Column Names

# colnames() or names() returns the column names of a data frame
colnames(df)

[1] "x"              "y"              "z"              "r"             
[5] "r_standardised"

colnames(df)[4] <- "rand"
colnames(df)[5] <- "rand_standardised"
df

  x y     z      rand rand_standardised
1 1 a  TRUE  6.429766         1.1486898
2 2 b FALSE -1.264687        -1.4283635
3 3 c  TRUE  2.687503        -0.1046822
4 4 d FALSE  2.000350        -0.3348260
5 5 e  TRUE  5.147361         0.7191820
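
Renaming by position relies on knowing where a column sits. As an alternative sketch, a column can be renamed by matching its current name; the small data frame and its values here are hypothetical.

```r
# Hypothetical data frame with arbitrary values
df <- data.frame(x = 1:3, r = c(0.5, 1.2, -0.3))

# Rename the column currently called "r", wherever it is located
names(df)[names(df) == "r"] <- "rand"
colnames(df)                         # "x" "rand"

# For data frames names() and colnames() return the same thing
identical(names(df), colnames(df))   # TRUE
```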

Merging Data Frames

Oftentimes, analysis involves more than one dataset.
This requires merging (joining) data frames together.
The R function merge() can be used for this purpose.
Assuming the identifier column has the same name in both data frames:
```
merge(x, y, by = "column_name")
```
If the column names differ, the by.x and by.y arguments can be used:
```
merge(x, y, by.x = "column_name_x", by.y = "column_name_y")
```
Note that in either case the two data frames must share an identifier that uniquely identifies each observation.

Merging Data Frames: Example 1

Data Frame 1 (candidates1) and Data Frame 2 (candidates2)

candidates1 <- data.frame(
  id = 1:4,
  name = c(
    "Sean M", "Orla G", "Rian P", "Fiona F"
  ),
  party = c(
    "Left", "Right", "Left", "Right"
  )
)
candidates1

  id    name party
1  1  Sean M  Left
2  2  Orla G Right
3  3  Rian P  Left
4  4 Fiona F Right

candidates2 <- data.frame(
  id = 1:4,
  yob = c(
    1971, 1985, 1977, 1962
  ),
  experience = c(
    TRUE, FALSE, FALSE, TRUE
  )
)
candidates2

  id  yob experience
1  1 1971       TRUE
2  2 1985      FALSE
3  3 1977      FALSE
4  4 1962       TRUE

Merging Data Frames: Example 1

candidates_merged <- merge(candidates1, candidates2, by = "id")

candidates_merged

  id    name party  yob experience
1  1  Sean M  Left 1971       TRUE
2  2  Orla G Right 1985      FALSE
3  3  Rian P  Left 1977      FALSE
4  4 Fiona F Right 1962       TRUE

When the identifier column has the same name in both data frames, the order of arguments in merge() does not change which rows are matched, but it does affect the order of columns in the merged data frame.

candidates_merged <- merge(candidates2, candidates1, by = "id")

candidates_merged

  id  yob experience    name party
1  1 1971       TRUE  Sean M  Left
2  2 1985      FALSE  Orla G Right
3  3 1977      FALSE  Rian P  Left
4  4 1962       TRUE Fiona F Right

Merging Data Frames: Example 2

Data Frame 1 (candidates1) and Data Frame 2 (candidates2)

candidates1 <- data.frame(
  candidate_id = 1:4,
  name = c(
    "Sean M", "Orla G", "Rian P", "Fiona F"
  ),
  party = c(
    "Left", "Right", "Left", "Right"
  )
)
candidates1

  candidate_id    name party
1            1  Sean M  Left
2            2  Orla G Right
3            3  Rian P  Left
4            4 Fiona F Right

candidates2 <- data.frame(
  id = 1:4,
  yob = c(
    1971, 1985, 1977, 1962
  ),
  experience = c(
    TRUE, FALSE, FALSE, TRUE
  )
)
candidates2

  id  yob experience
1  1 1971       TRUE
2  2 1985      FALSE
3  3 1977      FALSE
4  4 1962       TRUE

Merging Data Frames: Example 2

candidates_merged <- merge(
  candidates1,
  candidates2,
  by.x = "candidate_id",
  by.y = "id"
)

candidates_merged

  candidate_id    name party  yob experience
1            1  Sean M  Left 1971       TRUE
2            2  Orla G Right 1985      FALSE
3            3  Rian P  Left 1977      FALSE
4            4 Fiona F Right 1962       TRUE

But it is important to keep track of which dataset is x and which is y:

candidates_merged <- merge(
  candidates2,
  candidates1,
  by.x = "id",
  by.y = "candidate_id"
)

candidates_merged

  id  yob experience    name party
1  1 1971       TRUE  Sean M  Left
2  2 1985      FALSE  Orla G Right
3  3 1977      FALSE  Rian P  Left
4  4 1962       TRUE Fiona F Right
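
In both examples every identifier appears in both data frames. When that is not the case, merge() by default keeps only the matching rows; its all, all.x and all.y arguments change this. A minimal sketch with hypothetical data, in which candidate 4 has no background record and record 5 (with a made-up year of birth) has no candidate:

```r
# Hypothetical data with partially overlapping ids
candidates <- data.frame(
  id = 1:4,
  name = c("Sean M", "Orla G", "Rian P", "Fiona F")
)
background <- data.frame(
  id = c(1L, 2L, 3L, 5L),
  yob = c(1971, 1985, 1977, 1990)
)

# Default (inner join): only ids present in both data frames are kept
inner <- merge(candidates, background, by = "id")
nrow(inner)  # 3

# all.x = TRUE (left join): keep every row of x, filling gaps with NA
left <- merge(candidates, background, by = "id", all.x = TRUE)
left$yob     # 1971 1985 1977 NA

# all = TRUE (full join): keep every row of both data frames
full <- merge(candidates, background, by = "id", all = TRUE)
nrow(full)   # 5
```

Comparing nrow() before and after a merge is a cheap way to notice rows that were silently dropped.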

Next

Tutorial:
- Data frames & Plotting
Next week:
- Factor variables
