15/03/2019

You can find the materials here:

- Assignment: tom.paskhal.is/MY591/exercises.html
- Slides: tom.paskhal.is/MY591/intro_to_r.html
- Source Code: github.com/tpaskhalis/MY591_Introduction_to_R

Both the assignment and the slides were created using R and RStudio.

Because:

- R is free
- R is open source and has an excellent package ecosystem
- R is very versatile
- R is *very* good for data visualisation
- R can accommodate more than one data frame in memory at the same time

On the other hand:

- R has a steep(er) learning curve

More discussion on this can be found here.

- As in STATA, you can either use a script (do file) to save commands and then run them, or write commands directly into the console
- It is perfectly possible to use the base R environment for everything, but most people do not do this
- Today we will mostly be using RStudio, which is a graphical user interface (GUI) for R
- RStudio has a host of useful features which make it an excellent environment for learning R:
  - Intuitive interface
  - Syntax highlighting
  - Easy data overview
  - Project management

1 + 1

## [1] 2

5 - 3

## [1] 2

6 / 2

## [1] 3

4 * 4

## [1] 16

- Other common mathematical functions are also built in:

sum(<numbers>)     # Sum
mean(<numbers>)    # Mean
median(<numbers>)  # Median
sd(<numbers>)      # Standard deviation
log(<number>)      # Natural logarithm
exp(<number>)      # Exponential
sqrt(<number>)     # Square root

R also understands logical operators:

- `<`: less than
- `>`: greater than
- `==`: equal to (note, not `=`)
- `>=`: greater than or equal to
- `<=`: less than or equal to
- `!=`: not equal to
- `&`: and
- `|`: or
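As a quick illustration of these operators, here is a minimal sketch you can type into the console. The numbers are arbitrary example values, not taken from the assignment.

```r
# Comparisons return a logical value: TRUE or FALSE
3 < 5     # TRUE
3 == 5    # FALSE
3 != 5    # TRUE
5 >= 5    # TRUE

# & (and) is TRUE only if both sides are TRUE
(3 < 5) & (3 > 4)   # FALSE

# | (or) is TRUE if at least one side is TRUE
(3 < 5) | (3 > 4)   # TRUE
```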

`objects`

- The entities that R creates and manipulates are known as objects
- These may be vectors, matrices, character strings, functions
- Objects are created using the *assignment operator*: `<-`
- Once created, an object is stored in your current *global environment*

x1 <- 10
print(x1)

## [1] 10

x2 <- 4
x3 <- x1 - x2
print(x3)

## [1] 6

`vector` - the basic building block of R.

A vector is a collection of elements which all share the same type.

We use the `c()` function to concatenate observations into a single vector.

- Numeric

num_vec <- c(150, 178, 67.7, 905, 12)
print(num_vec)

## [1] 150.0 178.0 67.7 905.0 12.0

num_vec

## [1] 150.0 178.0 67.7 905.0 12.0

There are no scalars (atomic units) in R. A scalar is just a vector of length 1.

- Character

char_vec <- c("apple", "pear", "plum", "pineapple", "strawberry")
length(char_vec)

## [1] 5

char_scalar <- "banana"
length(char_scalar)

## [1] 1

When creating new objects, keep in mind reserved words and try to avoid using them in names.

- Logical (Boolean)

log_vec <- c(FALSE, FALSE, FALSE, TRUE, TRUE)
log_vec

## [1] FALSE FALSE FALSE TRUE TRUE

log_vec2 <- c(F, F, F, T, T)
log_vec2

## [1] FALSE FALSE FALSE TRUE TRUE

- Factor

A factor is similar to a character vector, but here each unique element is associated with a numerical value which represents a category:

fac_vec <- as.factor(c("a", "b", "c", "b", "b", "c"))
fac_vec

## [1] a b c b b c
## Levels: a b c

as.numeric(fac_vec)

## [1] 1 2 3 2 2 3

- Date

A date is similar to a numeric vector, as it stores the number of days since 1 January 1970, but it can be displayed in a variety of formats:

date_vec <- as.Date(c("2001/07/06", "2005/05/05", "2010/06/05", "2015/07/05", "2017/08/06"))
date_vec

## [1] "2001-07-06" "2005-05-05" "2010-06-05" "2015-07-05" "2017-08-06"

as.numeric(date_vec)

## [1] 11509 12908 14765 16621 17384

format(date_vec, "%d/%m/%Y")

## [1] "06/07/2001" "05/05/2005" "05/06/2010" "05/07/2015" "06/08/2017"
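Because dates are stored as day counts, ordinary arithmetic works on them. The following is a small sketch using two of the dates from above; the object names `d1` and `d2` are just illustrative.

```r
d1 <- as.Date("2001-07-06")
d2 <- as.Date("2005-05-05")

# Day zero of the internal count is 1 January 1970
as.numeric(as.Date("1970-01-01"))   # 0

# Subtracting two dates gives the number of days between them
d2 - d1                # Time difference of 1399 days
as.numeric(d2 - d1)    # 1399

# Adding a number shifts the date by that many days
d1 + 30                # "2001-08-05"
```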

`vector` subsetting

To subset a `vector`, use square brackets to index the elements you would like via `object[index]`:

- Numerical subsetting

char_vec

## [1] "apple" "pear" "plum" "pineapple" "strawberry"

char_vec[1:3]

## [1] "apple" "pear" "plum"

Instead of passing indices, it is also possible to use a logical vector to specify the required elements:

- Logical subsetting

log_vec

## [1] FALSE FALSE FALSE TRUE TRUE

char_vec[log_vec]

## [1] "pineapple" "strawberry"
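In practice the logical vector is often produced by a comparison rather than typed by hand. A minimal sketch, re-creating the example vectors so that it runs on its own (the threshold of 100 is an arbitrary choice):

```r
num_vec  <- c(150, 178, 67.7, 905, 12)
char_vec <- c("apple", "pear", "plum", "pineapple", "strawberry")

# TRUE wherever the weight exceeds 100
heavy <- num_vec > 100
heavy                     # TRUE TRUE FALSE TRUE FALSE

# Keep only the fruit for which the condition holds
char_vec[heavy]           # "apple" "pear" "pineapple"

# The same, without the intermediate object
char_vec[num_vec > 100]
```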

`vector` operations

In R, mathematical operations on vectors occur elementwise:

fib <- c(1, 1, 2, 3, 5, 8, 13, 21)
fib[1:7]

## [1] 1 1 2 3 5 8 13

fib[2:8]

## [1] 1 2 3 5 8 13 21

fib[1:7] + fib[2:8]

## [1] 2 3 5 8 13 21 34

It is also possible to perform logical operations on vectors:

fib <- c(1, 1, 2, 3, 5, 8, 13, 21)
fib_gt_5 <- fib > 5
fib_gt_5

## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE

We can also combine logical operators:

fib_gt_5_or_ls_2 <- fib > 5 | fib < 2
fib_gt_5_or_ls_2

## [1] TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE
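Logical vectors like these are handy for counting and selecting. The sketch below relies on the fact that R treats `TRUE` as 1 and `FALSE` as 0 in arithmetic.

```r
fib <- c(1, 1, 2, 3, 5, 8, 13, 21)

# How many elements are greater than 5?
sum(fib > 5)               # 3

# What share of the elements is greater than 5?
mean(fib > 5)              # 0.375

# Which elements satisfy the combined condition?
fib[fib > 5 | fib < 2]     # 1 1 8 13 21
```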

`matrix`

A matrix is just a collection of vectors! I.e. it is a 2-dimensional vector that has 2 additional attributes: the number of rows and columns

As with vectors, all elements of a matrix must be of the same type.

mat <- matrix(data = 1:25, nrow = 5, ncol = 5)
mat

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    6   11   16   21
## [2,]    2    7   12   17   22
## [3,]    3    8   13   18   23
## [4,]    4    9   14   19   24
## [5,]    5   10   15   20   25

`list`

A `list` is a collection of R objects, which can be of different types.

l <- list(element1 = num_vec, element2 = mat[1:2, 1:2], element3 = "Farmers' Market")
l

## $element1
## [1] 150.0 178.0  67.7 905.0  12.0
##
## $element2
##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
##
## $element3
## [1] "Farmers' Market"

`list` subsetting

Individual elements of a `list` object can be subset using the `$` or `[[` operators:

l$element2

##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7

l[[2]]

##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
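Single square brackets also work on lists, but they behave differently from `[[`. A small sketch with a cut-down version of the list above (the shortened contents are only for illustration):

```r
l <- list(element1 = c(150, 178), element3 = "Farmers' Market")

# [[ and $ return the element itself
l[["element3"]]          # "Farmers' Market"
class(l[["element3"]])   # "character"

# [ returns a (shorter) list that still wraps the element
l["element3"]
class(l["element3"])     # "list"
length(l["element3"])    # 1
```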

`data.frame` - the workhorse of data analysis.

A `data.frame` is an R object in which the columns can be of different types.

Despite their matrix-like appearance, data frames in R are lists, where all elements are of equal length!

df <- data.frame(weight = num_vec, fruit = char_vec, berry = log_vec)
df

##   weight      fruit berry
## 1  150.0      apple FALSE
## 2  178.0       pear FALSE
## 3   67.7       plum FALSE
## 4  905.0  pineapple  TRUE
## 5   12.0 strawberry  TRUE
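You can check the "data frames are lists" claim directly. The sketch below rebuilds the same data frame so that it runs on its own, then inspects it with list tools.

```r
df <- data.frame(
  weight = c(150, 178, 67.7, 905, 12),
  fruit  = c("apple", "pear", "plum", "pineapple", "strawberry"),
  berry  = c(FALSE, FALSE, FALSE, TRUE, TRUE)
)

is.list(df)       # TRUE: a data frame is a list under the hood
length(df)        # 3: one list element per column
nrow(df)          # 5: every column has the same length

# List-style access returns a column as a vector
df[["weight"]]
df$weight
```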

`matrix` and `data.frame` subsetting

To subset a `matrix` or `data.frame`, you need to specify both rows and columns:

mat[1:3, 1:3]

##      [,1] [,2] [,3]
## [1,]    1    6   11
## [2,]    2    7   12
## [3,]    3    8   13

df[1, ]

##   weight fruit berry
## 1    150 apple FALSE

We can also subset to remove rows or columns by using the `-` operator applied to the `c` function:

mat[-c(1:3), -c(1:3)]

##      [,1] [,2]
## [1,]   19   24
## [2,]   20   25

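In the code, `1:3` creates a vector of the integers 1, 2 and 3, and the `-` operator negates these. We wrap the sequence in the `c` function because `-` binds more tightly than `:`. Without the wrapper, `-1:3` would be read as `(-1):3`, i.e. the sequence from -1 to 3, rather than the negation of every element.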
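A quick check in the console shows how `-` and `:` interact. Plain parentheses have the same effect as `c()` here.

```r
# Without a wrapper, the minus sign applies to the 1 only
-1:3         # -1  0  1  2  3

# With c() or parentheses, the whole sequence is negated
-c(1:3)      # -1 -2 -3
-(1:3)       # -1 -2 -3

# Negative indices drop elements
x <- c(10, 20, 30, 40, 50)
x[-c(1:3)]   # 40 50
```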

`functions` - the backbone of R operations.

All operations on objects in R are done with functions.

fun(arg1, arg2, ...)

Where

- `fun` is the name of the function
- `arg1` is the first argument passed to the function
- `arg2` is the second argument passed to the function

`function` example

Let's consider the `mean()` function. This function takes two main arguments:

mean(x, na.rm = FALSE)

Where `x` is a numeric vector, and `na.rm` is a logical value that indicates whether we'd like to remove missing values (`NA`).

vec <- c(1, 2, 3, 4, 5)
mean(x = vec, na.rm = FALSE)

## [1] 3

vec <- c(1, 2, 3, NA, 5)
mean(x = vec, na.rm = TRUE)

## [1] 2.75
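To see why `na.rm` matters, compare what happens when the missing value is left in. A minimal sketch using the same vector:

```r
vec <- c(1, 2, 3, NA, 5)

# With the default na.rm = FALSE, a single NA makes the result NA
mean(vec)                  # NA

# Dropping the missing value gives the mean of the remaining four numbers
mean(vec, na.rm = TRUE)    # 2.75

# is.na() flags missing values, so we can count them
sum(is.na(vec))            # 1
```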

We can also perform calculations on the output of a function:

mean(num_vec) * 3

## [1] 787.62

Which means that we can also have nested functions:

log(mean(num_vec))

## [1] 5.570403

We can also assign the output of any function to a new object for use later:

log_fruit <- log(mean(num_vec)) # The logarithm of average fruit weight

`functions` in disguise!

All the basic operations on R `objects` that we have encountered so far are, in fact, `functions`:

`+`(1,1)

## [1] 2

`-`(5,3)

## [1] 2

`[`(char_vec, 1:3)

## [1] "apple" "pear" "plum"

`functions`

Functions are also objects, and we can create our own. We define a function as follows:

sum2 <- function(a, b){
  return(a + b)
}
res <- sum2(a = 5, b = 50)
res

## [1] 55

Note that the function itself, and the result returned by the function are both objects!
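One way to convince yourself of this is to inspect both objects. In the sketch below, `add` is a hypothetical second name introduced only for illustration.

```r
sum2 <- function(a, b){
  return(a + b)
}
res <- sum2(a = 5, b = 50)

class(sum2)   # "function"
class(res)    # "numeric"

# Like any other object, a function can be assigned to a new name
add <- sum2
add(1, 2)     # 3
```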

R has an inbuilt help facility which provides more information about any function:

?summary
help(summary)

The quality of the help varies *hugely* across packages.

Stack Overflow is a good resource for many standard tasks.

For custom packages it is often helpful to check the **issues** page on GitHub.

E.g. for `ggplot2`: https://github.com/tidyverse/ggplot2/issues

Or, indeed, any search engine #LMDDGTFY.

sim.df <- read.csv(file = "file.csv")

- `sim.df` is an R data.frame object (you could call this anything)
- `file.csv` is a .csv file with your data
- `<-` is the assignment operator
- In order for R to access `file.csv`, it will have to be saved in your current working directory
- Until recently, the standard advice was to use `getwd()` and `setwd()`. **Try to avoid it!** Instead, create a new RStudio project and use relative paths:

sim.df <- read.csv(file = "./data/file.csv")

- More details can be found here

install.packages("haven") # Install the "haven" package
library("haven") # Loads an additional package that deals with STATA files
sim.df <- haven::read_dta(file = "./data/file.dta")
sim.df <- haven::read_sav(file = "./data/file.sav")

- The `haven` package is not automatically installed in R so we need to do that ourselves
- `sim.df` is still an R data.frame object
- `file.dta` is a .dta (used by STATA) file with your data
- `file.sav` is a .sav (used by SPSS) file with your data
- `<-` is the usual assignment operator

`functions`

R includes a number of helpful functions for exploring your data:

View(sim.df)          # View data in spreadsheet-style
names(sim.df)         # Names of the variables in your data
head(sim.df)          # First n (six, by default) rows of the data
tail(sim.df)          # Last n (six, by default) rows of the data
str(sim.df)           # "Structure" of any R object
summary(sim.df)       # Summary statistics for each of the columns in the data
dim(sim.df)           # Dimensions of data
mean(sim.df$var)      # Mean of a numeric vector
sd(sim.df$var)        # Standard deviation of a numeric vector
range(sim.df$var)     # Range of a numeric vector
quantile(sim.df$var)  # Quantiles of a numeric vector
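If you do not have the data file to hand, you can try these functions on a simulated stand-in. The sketch below is only an assumption about what such data might look like: 1000 rows, three numeric columns and a six-level factor. The seed and the distributions are arbitrary choices, so the numbers will not match the output shown on these slides.

```r
set.seed(591)   # arbitrary seed, used only so the example is reproducible
n <- 1000

# Hypothetical stand-in for the data loaded from file
sim.df <- data.frame(
  x = rnorm(n),               # standard normal
  y = rnorm(n, mean = 0.5),   # normal, shifted upwards
  z = runif(n),               # uniform between 0 and 1
  g = factor(sample(letters[1:6], n, replace = TRUE), levels = letters[1:6])
)

dim(sim.df)      # 1000 rows, 4 columns
names(sim.df)    # "x" "y" "z" "g"
head(sim.df, n = 3)
```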

`head`

By varying the argument `n`, we can adjust the number of top rows we want to inspect:

head(sim.df, n = 5)

##             x          y         z g
## 1 -0.01725208 -1.0153648 0.2339631 b
## 2  0.24816398 -0.1925178 0.4941757 e
## 3 -2.42759848  0.6091226 0.6428747 e
## 4 -0.23537544  2.5924181 0.2784789 a
## 5 -1.21939703  0.8898623 0.1184253 e

`head()` can also be used in combination with `View()`:

View(head(sim.df, n = 50))

`str`

`str()` applied to data frames is particularly useful for checking the types of the variables in the data:

str(sim.df)

## 'data.frame':    1000 obs. of  4 variables:
##  $ x: num  -0.0173 0.2482 -2.4276 -0.2354 -1.2194 ...
##  $ y: num  -1.015 -0.193 0.609 2.592 0.89 ...
##  $ z: num  0.234 0.494 0.643 0.278 0.118 ...
##  $ g: Factor w/ 6 levels "a","b","c","d",..: 2 5 5 1 5 3 3 5 5 1 ...

`summary`

summary(sim.df)

##        x                  y                 z              g
##  Min.   :-3.79564   Min.   :-2.6862   Min.   :0.001063   a:157
##  1st Qu.:-0.60080   1st Qu.:-0.2939   1st Qu.:0.218175   b:163
##  Median : 0.02066   Median : 0.4897   Median :0.475239   c:176
##  Mean   : 0.04743   Mean   : 0.4851   Mean   :0.482757   d:159
##  3rd Qu.: 0.69988   3rd Qu.: 1.2484   3rd Qu.:0.751304   e:190
##  Max.   : 3.61659   Max.   : 4.5249   Max.   :0.999785   f:155

`pairs`

`pairs()` produces a matrix of bivariate scatterplots:

pairs(sim.df)

`range`

Suppose you would like to find the range of a variable in your `data.frame`. The following will not work:

range(sim.df)

## Error in FUN(X[[i]], ...): only defined on a data frame with all numeric variables

Right, we cannot take the range of the entire data! Some of the variables are not numeric, and therefore do not have a range…

Instead, we need to access one variable. There are three ways to do this:

- We can subset using square brackets, or
- We can use the `[[` operator and access the variable of interest by position, or
- We can use the `$` operator and access the variable of interest by name

Note that the last two treat `data.frame` as a list with variables as elements.

range(sim.df[,1])
range(sim.df[[1]])

range(sim.df$x)

## [1] -3.795643 3.616594

- Plots are one of the great strengths of R
- There are two main frameworks for plotting:
  - Base R graphics
  - `ggplot`
- Which should you use? It is a matter of preference!

The basic plotting syntax is very simple. `plot(x_var, y_var)` will give you a scatter plot:

plot(sim.df$x, sim.df$y)

Hmm, let's work on that.

The plot function takes a number of arguments (`?plot` for a full list). The fewer you specify, the uglier your plot:

plot(x = sim.df$x, y = sim.df$y,
     xlab = "X variable",
     ylab = "Y variable",
     main = "Awesome plot title",
     pch = 19,       # Solid points
     cex = 0.5,      # Smaller points
     bty = "n",      # Remove surrounding box
     col = sim.df$g  # Colour by grouping variable
)

The default behaviour of `plot()` depends on the type of input variables for the `x` and `y` arguments. If `x` is a factor variable, and `y` is numeric, then R will produce a boxplot:

plot(x = sim.df$g, y = sim.df$x)

`ggplot`

A very popular alternative to base R plots is the `ggplot2` package (the *2* in the name refers to the second iteration, which is the standard). It is not a part of the base R environment and has to be installed separately, but it is very widely used.

- Based on the Grammar of Graphics data visualisation scheme:

Wilkinson, L. (2005). *The Grammar of Graphics* 2nd Ed. Heidelberg: Springer. https://doi.org/10.1007/0-387-28695-0

Wickham, H. (2010). A Layered Grammar of Graphics. *Journal of Computational and Graphical Statistics*, 19(1), 3–28. https://doi.org/10.1198/jcgs.2009.07098

Graphs are broken into multiple layers

Layers can be recycled across multiple plots

`ggplot`

Let's recreate the previous scatter plot using `ggplot`:

library("ggplot2")
ggplot(data = sim.df, aes(x = x, y = y, col = g)) +
  # Add scatterplot
  geom_point() +
  # Change axes labels and plot title
  labs(x = "X variable", y = "Y variable", title = "Awesome plot title") +
  # Change default grey theme to black and white
  theme_bw()

`ggplot`