Introduction to R Programming

15/05/2019

R basics

Materials

You can find the materials for today's course on my website. Either go to the Moodle page for MY591 and click on the links, or:

Assignment: tom.paskhal.is/MY591/exercises.html
Slides: tom.paskhal.is/MY591/intro_to_r.html
Source Code: github.com/tpaskhalis/MY591_Introduction_to_R

It is worth noting that both the assignment and the slides were all created using R and RStudio.

I have used STATA/SPSS for years, why learn R?

Because:

R is free
R is open source and has excellent package ecosystem
R is very versatile
R is very good for data visualisation
R can accomodate more than one data frame in memory at the same time

On the other hand:

R has a steep(er) learning curve

Popularity of data analysis software

More discussion on this can be found here.

R and RStudio

As in STATA, you can either use a script (do file) to save commands and then run them, or write commands directly into the console
It is perfectly possible to use the base R environment for everything, but most people do not do this
Today we will mostly be using RStudio, which is a graphical user interface (GUI) for R
RStudio has a host of useful features which make it an excellent environment for learning R
- Intuitive interface
- Syntax highlighting
- Easy data overview
- Project management

RStudio demonstration

Basic mathematical operations in R

1 + 1

## [1] 2

5 - 3

## [1] 2

6 / 2

## [1] 3

4 * 4

## [1] 16

Basic mathematical operations in R

Other typical mathematical functions are also hard-coded:

sum(<numbers>) # Sum
mean(<numbers>) # Mean
median(<numbers>) # Median
sd(<numbers>) # Standard Deviation
log(<number>) # Logarithm
exp(<number>) # Exponent
sqrt(<number>) # Square root

Basic operations in R

R also understands logical operators:

< : less than
> : greater than
== : equal to (note, not =)
>= : greater than or equal to
<= : less than or equal to
!= : not equal to
& : and
| : or

R `objects`

The entities that R creates and manipulates are known as objects
These may be vectors, matrices, character strings, functions
Objects are created using the assignment operator: <-
Once created, an object is stored in your current global environment

x1 <- 10
print(x1)

## [1] 10

x2 <- 4
x3 <- x1 - x2
print(x3)

## [1] 6

Data Structures

`vector` - the basic building block of R.

A vector is a collection of elements which all share the same type.

We use the c() function to concatenate observations into a single vector.

Numeric

num_vec <- c(150, 178, 67.7, 905, 12)
print(num_vec)

## [1] 150.0 178.0  67.7 905.0  12.0

num_vec

## [1] 150.0 178.0  67.7 905.0  12.0

`vector` - the basic building block of R.

There are no scalars (atomic units) in R

A scalar is just a vector of length 1

Character

char_vec <- c("apple", "pear", "plum", "pineapple", "strawberry")
length(char_vec)

## [1] 5

char_scalar <- "banana"
length(char_scalar)

## [1] 1

`vector` - the basic building block of R.

When creating new objects, keep in mind reserved words and try to avoid using them in names

Logical (Boolean)

log_vec <- c(FALSE, FALSE, FALSE, TRUE, TRUE)
log_vec

## [1] FALSE FALSE FALSE  TRUE  TRUE

log_vec2 <- c(F, F, F, T, T)
log_vec2

## [1] FALSE FALSE FALSE  TRUE  TRUE

`vector` - the basic building block of R.

Factor

A factor is similar to a character vector, but here each unique element is associated with a numerical value which represents a category:

fac_vec <- as.factor(c("a", "b", "c", "b", "b", "c"))
fac_vec

## [1] a b c b b c
## Levels: a b c

as.numeric(fac_vec)

## [1] 1 2 3 2 2 3

`vector` - the basic building block of R.

Date

A date is similar to a numeric vector, as it stores the number of days, but it can assume a variety of forms:

date_vec <- as.Date(c("2001/07/06", "2005/05/05", "2010/06/05", "2015/07/05", "2017/08/06"))
date_vec

## [1] "2001-07-06" "2005-05-05" "2010-06-05" "2015-07-05" "2017-08-06"

as.numeric(date_vec)

## [1] 11509 12908 14765 16621 17384

format(date_vec, "%d/%m/%Y")

## [1] "06/07/2001" "05/05/2005" "05/06/2010" "05/07/2015" "06/08/2017"

`vector` subsetting

To subset a vector, use square parenthesis to index the elements you would like via object[index]:

Numerical subsetting

char_vec

## [1] "apple"      "pear"       "plum"       "pineapple"  "strawberry"

char_vec[1:3]

## [1] "apple" "pear"  "plum"

`vector` subsetting

Instead of passing indices, it is also possible to use a logical vector to specify the required elements:

Logical subsetting

log_vec

## [1] FALSE FALSE FALSE  TRUE  TRUE

char_vec[log_vec]

## [1] "pineapple"  "strawberry"

`vector` operations

In R, mathematical operations on vectors occur elementwise:

fib <- c(1, 1, 2, 3, 5, 8, 13, 21)
fib[1:7]

## [1]  1  1  2  3  5  8 13

fib[2:8]

## [1]  1  2  3  5  8 13 21

fib[1:7] + fib[2:8]

## [1]  2  3  5  8 13 21 34

`vector` operations

It is also possible to perform logical operations on vectors:

fib <- c(1, 1, 2, 3, 5, 8, 13, 21)
fib_gt_5 <- fib > 5
fib_gt_5

## [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE

We can also combine logical operators

fib_gt_5_or_ls_2 <- fib > 5 | fib < 2
fib_gt_5_or_ls_2

## [1]  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE

`matrix`

A matrix is just a collection of vectors! I.e. it is a 2-dimensional vector that has 2 additional attributes: the number of rows and columns

As for vectors all elements of a matrix must be of the same type

mat <- matrix(data = 1:25, nrow = 5, ncol = 5)
mat

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    6   11   16   21
## [2,]    2    7   12   17   22
## [3,]    3    8   13   18   23
## [4,]    4    9   14   19   24
## [5,]    5   10   15   20   25

`list`

A list is a collection of R objects of different types

l <- list(element1 = num_vec, 
          element2 = mat[1:2,1:2], 
          element3 = "Farmers' Market")
l

## $element1
## [1] 150.0 178.0  67.7 905.0  12.0
## 
## $element2
##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
## 
## $element3
## [1] "Farmers' Market"

`list` subsetting

Individual elements of a list object can be subset using $ or [[ operators:

l$element2

##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7

l[[2]]

##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7

`data.frame` - the workhorse of data analysis.

A data.frame is an R object in which the columns can be of different types

Despite their matrix-like appearance, data frames in R are lists, where all elements are of equal length!

df <- data.frame(weight = num_vec, fruit = char_vec, berry = log_vec)
df

##   weight      fruit berry
## 1  150.0      apple FALSE
## 2  178.0       pear FALSE
## 3   67.7       plum FALSE
## 4  905.0  pineapple  TRUE
## 5   12.0 strawberry  TRUE

`matrix` and `data.frame` subsetting

To subset a matrix or data.frame, you need to specify both rows and columns:

mat[1:3, 1:3]

##      [,1] [,2] [,3]
## [1,]    1    6   11
## [2,]    2    7   12
## [3,]    3    8   13

df[1, ]

##   weight fruit berry
## 1    150 apple FALSE

`matrix` and `data.frame` subsetting

We can also subset to remove rows or columns by using the - operator applied to the c function:

mat[-c(1:3), -c(1:3)]

##      [,1] [,2]
## [1,]   19   24
## [2,]   20   25

In the code, 1:3 creates a vector of the integers 1, 2 and 3, and the - operator negates these. We wrap the vector in the c function so that - applies to each element, and not just the first element.

`functions` - the backbone of R operations.

All operations on objects in R are done with functions.

fun(arg1, arg2, ...)

Where

fun is the name of the function
arg1 is the first argument passed to the function
arg2 is the second argument passed to the function

`function` example

Let's consider the mean() function. This function takes two main arguments:

mean(x, na.rm = FALSE)

Where x is a numeric vector, and na.rm is a logical value that indicates whether we'd like to remove missing values (NA).

vec <- c(1, 2, 3, 4, 5)
mean(x = vec, na.rm = FALSE)

## [1] 3

vec <- c(1, 2, 3, NA, 5)
mean(x = vec, na.rm = TRUE)

## [1] 2.75

`function` example

We can also perform calculations on the output of a function:

mean(num_vec) * 3

## [1] 787.62

Which means that we can also have nested functions:

log(mean(num_vec))

## [1] 5.570403

We can also assign the output of any function to a new object for use later:

log_fruit <- log(mean(num_vec)) # The logarithm of average fruit weight

Basic operations are `functions` in disguise!

All the basic operations on R objects that we have encountered so far are, in fact, functions:

`+`(1,1)

## [1] 2

`-`(5,3)

## [1] 2

`[`(char_vec, 1:3)

## [1] "apple" "pear"  "plum"

User defined `functions`

Functions are also objects, and we can create our own. We define a function as follows:

sum2 <- function(a, b){
  return(a + b)
}

res <- sum2(a = 5, b = 50)
res

## [1] 55

Note that the function itself, and the result returned by the function are both objects!

Help!

R has an inbuilt help facility which provides more information about any function:

?summary
help(summary)

The quality of the help varies hugely across packages.

Stackoverflow is a good resource for many standard tasks.

For custom packages it is often helpful to check the issues page on the GitHub.

E.g. for ggplot2: https://github.com/tidyverse/ggplot2/issues

Or, indeed, any search engine #LMDDGTFY.

Exercise One

Data manipulation and exploration

Reading data into R (.csv)

sim.df <- read.csv(file = "file.csv")

df is an R data.frame object (you could call this anything)
file.csv is a .csv file with your data
<- is the assignment operator
In order for R to access file.csv, it will have to be saved in your current working directory
Before recently, the standard advice was to use getwd() and setwd().
Try to avoid it! Instead, create new RStudio project and use relative paths:

sim.df <- read.csv(file = "./data/file.csv")

More details can be found here

Reading data into R (STATA .dta, SPSS .sav)

install.packages("haven") # Install the "haven" package
library("haven") ## Loads an additional package that deals with STATA files
sim.df <- haven::read_dta(file = "./data/file.dta") 
sim.df <- haven::read_sav(file = "./data/file.sav")

The haven package is not automatically installed in R so we need to do that ourselves
df is still a R data.frame objects
file.dta is a .dta (used by STATA) file with your data
file.sav is a .sav (used by SPSS) file with your data
<- is the usual assignment operator

Descriptive statistics `functions`

R includes a number of helpful functions for exploring your data:

View(sim.df) # View data in spreadsheet-style
names(sim.df) # Names of the variables in your data
head(sim.df) # First n (six, by default) rows of the data
tail(sim.df) # Last n (six, by default) rows of the data
str(sim.df) # "Structure" of any R object
summary(sim.df) # Summary statistics for each of the columns in the data
dim(sim.df) # Dimensions of data
mean(sim.df$var) # Mean of a numeric vector
sd(sim.df$var) # Standard deviation of a numeric vector
range(sim.df$var) # Range of a numeric vector
quantile(sim.df$var) # Quantiles of a numeric vector

`head`

By varying argument n, we can adjust the number of top rows we want to inspect:

head(sim.df, n = 5)

##            x         y          z g
## 1  1.8068729 1.1338624 0.08688987 d
## 2 -1.2486526 0.5960640 0.27517432 e
## 3 -0.1355884 0.8004329 0.70459357 a
## 4  0.9505454 2.2648623 0.61059223 a
## 5 -1.2711097 0.8887924 0.88313813 c

Can also be used in combination with View():

View(head(sim.df, n = 50))

`str`

str() applied to data frames is particularly useful in inferring types of variables in the data

str(sim.df)

## 'data.frame':    1000 obs. of  4 variables:
##  $ x: num  1.807 -1.249 -0.136 0.951 -1.271 ...
##  $ y: num  1.134 0.596 0.8 2.265 0.889 ...
##  $ z: num  0.0869 0.2752 0.7046 0.6106 0.8831 ...
##  $ g: Factor w/ 6 levels "a","b","c","d",..: 4 5 1 1 3 6 1 3 6 5 ...

`summary`

summary(sim.df)

##        x                    y                 z             g      
##  Min.   :-2.7467901   Min.   :-3.2166   Min.   :0.0007827   a:177  
##  1st Qu.:-0.6505946   1st Qu.:-0.2343   1st Qu.:0.2568557   b:161  
##  Median : 0.0003546   Median : 0.4703   Median :0.4932571   c:173  
##  Mean   : 0.0161324   Mean   : 0.4774   Mean   :0.4954730   d:141  
##  3rd Qu.: 0.6626705   3rd Qu.: 1.1694   3rd Qu.:0.7341048   e:177  
##  Max.   : 3.0778088   Max.   : 3.5594   Max.   :0.9994499   f:171

`pairs`

pairs() produces a matrix of bivariate scatterplots

pairs(sim.df)

`range`

Suppose you would like to find the range of a variable in your data.frame. The following will not work:

range(sim.df)

## Error in FUN(X[[i]], ...): only defined on a data frame with all numeric variables

Right, we cannot take the range of the entire data! Some of the variables are not numeric, and therefore do not have a range…

`range`

Instead, we need to access one variable. There are two ways to do this.

We can subset using square parenthesis, or
We can use the [[ operator and access the variable of interest by position, or
We can use the $ operator and access the variable of interest by name

Note that the last two treat data.frame as a list with variables as elements

range(sim.df[,1])
range(sim.df[[1]])

range(sim.df$x)

## [1] -2.746790  3.077809

Plots and graphs

Introduction

Plots are one of the great strenghts of R
There are two main frameworks for plotting
Base R graphics
ggplot
Which should you use? It is a matter of preference!

Base R plots

The basic plotting syntax is very simple. plot(x_var, y_var) will give you a scatter plot:

plot(sim.df$x, sim.df$y)

Base R plots

Hmm, let's work on that.

Base R plots

The plot function takes a number of arguments (?plot for a full list). The fewer you specify, the uglier your plot:

plot(x = sim.df$x, y = sim.df$y, 
     xlab = "X variable", 
     ylab = "Y variable", 
     main = "Awesome plot title",
     pch = 19, # Solid points
     cex = 0.5, # Smaller points
     bty = "n", # Remove surrounding box
     col = sim.df$g # Colour by grouping variable
     )

Base R plots

The default behaviour of plot() depends on the type of input variables for the x and y arguments. If x is a factor variable, and y is numeric, then R will produce a boxplot:

plot(x = sim.df$g, y = sim.df$x)

Base R plots

`ggplot`

A very popular alternative to base R plots is the ggplot2 library (the 2 in the name refers to the second iteration, which is the standard). This is a separate package (i.e. it is not a part of the base R environment) but is very widely used.

Based on the Grammar of Graphics data visualisation scheme:

Wilkinson, L. (2005). The Grammar of Graphics 2nd Ed. Heidelberg: Springer. https://doi.org/10.1007/0-387-28695-0

Wickham, H. (2010). A Layered Grammar of Graphics. Journal of Computational and Graphical Statistics, 19(1), 3–28. https://doi.org/10.1198/jcgs.2009.07098

Graphs are broken into multiple layers
Layers can be recycled across multiple plots

`ggplot`

Let's recreate the previous scatter plot using ggplot:

library("ggplot2")
ggplot(data = sim.df, aes(x = x, y = y, col = g)) +
  # Add scatterplot
  geom_point() +
  # Change axes labels and plot title
  labs(x = "X variable",
       y = "Y variable",
       title = "Awesome plot title") +
  # Change default grey theme to black and white
  theme_bw()

`ggplot`

One nice feature of ggplot is that it is very easy to create facet plots:

library("ggplot2")
ggplot(data = sim.df, aes(x = x, y = y, col = g)) + 
  geom_point() +
  labs(x = "X variable",
       y = "Y variable",
       title = "Awesome plot title") +
  theme_bw() +
  # Separate plots by variable g
  facet_wrap(~ g)

`ggplot`

Other examples of pretty R graphics

Exercise Two

Linear Regression Models

`lm`

Linear regression models in R are implemented using the lm function.

lm.fit <- lm(formula = y ~ x, data = sim.df)

The formula argument is the specification of the model, and the data argument is the data on which you would like the model to be estimated.

lm.fit

## 
## Call:
## lm(formula = y ~ x, data = sim.df)
## 
## Coefficients:
## (Intercept)            x  
##      0.4722       0.3269

`lm`

We can specify multivariate models:

lm.multi.fit <- lm(formula = y ~ x + z, data = sim.df)

Interaction models:

lm.inter.fit <- lm(formula = y ~ x * z, data = sim.df)

Note that direct effects of x and z are also included, when interaction term is specified.

Fixed-effect models:

lm.fe.fit <- lm(formula = y ~ x + g, data = sim.df)

And many more!

`lm`

The output of the lm function is a long list of interesting output.

When we call the fitted object (e.g. lm.fit), we are presented only with the estimated coefficients.

For some more information of the estimated model, use summary(fitted.model):

lm.fit.summary <- summary(lm.fit)
lm.fit.summary

`lm`

## 
## Call:
## lm(formula = y ~ x, data = sim.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2203 -0.6610 -0.0249  0.6988  2.8859 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.47217    0.03202  14.747   <2e-16 ***
## x            0.32686    0.03300   9.906   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.012 on 998 degrees of freedom
## Multiple R-squared:  0.08953,    Adjusted R-squared:  0.08862 
## F-statistic: 98.14 on 1 and 998 DF,  p-value: < 2.2e-16

`lm`

As with any other function, summary(fitted.model) returns an object. Here, it is a list. What is saved as the output of this function?

names(lm.fit.summary)

##  [1] "call"          "terms"         "residuals"     "coefficients" 
##  [5] "aliased"       "sigma"         "df"            "r.squared"    
##  [9] "adj.r.squared" "fstatistic"    "cov.unscaled"

If we want to extract other information of interest from the fitted model object, we can use the $ operator to do so:

lm.fit.summary$r.squared

## [1] 0.08952873

`lm`

Accessing elements from saved models can be very helpful in making comparisons across models.

Suppose we want to extract and compare $R^2$ across different models.

lm.r2 <- summary(lm.fit)$r.squared
lm.multi.r2 <- summary(lm.multi.fit)$r.squared
lm.inter.r2 <- summary(lm.inter.fit)$r.squared

r2.compare <- data.frame(
  model = c("bivariate", "multivariate", "interaction"), 
  r.squared = c(lm.r2, 
                lm.multi.r2, 
                lm.inter.r2)
)

`lm`

We can print the data frame containing values of $R^2$:

r2.compare

##          model  r.squared
## 1    bivariate 0.08952873
## 2 multivariate 0.08976396
## 3  interaction 0.10851819

Or we can plot them:

ggplot(r2.compare, aes(x = model, y = r.squared))+
  geom_point(size = 4) +
  # Use `expression` to add 2 as a superscript to R
  ggtitle(expression(paste(R^{2}, " ", "Comparison"))) +
  theme_bw()

`lm`

`lm` diagnostics

There are a number of functions that are helpful in producing model diagnostics:

residuals(fitted.model) extracts the residuals from a fitted model
coefficients(fitted.model) extracts coefficients
fitted(fitted.model) extracts fitted values
plot(fitted.model) is a convenience function for producing a number of useful diagnostics plots

`lm` residual plot

We can easily plot the residuals from a fitted model against an explanatory variable of interest.

Here we are going to exploit the modular nature of ggplot:

# First, let's create common lower layers
first.layer <- ggplot() +
  # Draw horizontal line with y-intercept of 0
  geom_hline(yintercept = 0) +
  labs(y = "residuals") +
  theme_bw()

`lm` residual plot

# Now, we will create the first plot by adding a scatterplot of residuals for x
first.layer +
  geom_point(aes(x = sim.df$x,
                 y = residuals(lm.multi.fit)),
             # Change colour to grey
             colour = "grey") +
  labs(x = "x")
# Create another scatterplot for z
first.layer + 
  geom_point(aes(x = sim.df$z,
                 y = residuals(lm.multi.fit)),
             colour = "grey") +
  labs(x = "z")

`lm` residual plot

Alternatively, we can do that in base R:

par(mfrow = c(1,2)) # Divide the plotting region into 1 row and 2 cols
plot(x = sim.df$x, y = residuals(lm.multi.fit), 
     xlab = "x", ylab = "residuals",
     pch = 19, # Filled circles
     col = "grey", # Change colour
     cex = 0.5) # Make point size smaller
abline(h = 0) # Add a horizontal line with y-intercept of 0
plot(x = sim.df$z, y = residuals(lm.multi.fit), 
     xlab = "z", ylab = "residuals",
     pch = 19, # Filled circles
     col = "grey", # Change colour
     cex = 0.5) # Make point size smaller
abline(h = 0) # Add a horizontal line with y-intercept of 0

`lm` residual plot

Generalized Linear Models

`glm`

To estimate a range of generalized linear models, the glm function is particularly helpful.

First, let us transform our outcome variable from a continuous measure to a binary measure:

sim.df$y_bin <- as.numeric(sim.df$y > median(sim.df$y))
table(sim.df$y_bin)

## 
##   0   1 
## 500 500

`glm`

Now we can estimate our model:

logit.fit <- glm(formula = y_bin ~ x + z, data = df, family = "binomial")

Where:

formula is the model specification
data is our simulated data
family is a description of the error distribution and link function to be used
- binomial, poisson, gaussian etc…

`glm`

summary(logit.fit)

## 
## Call:
## glm(formula = y_bin ~ x + z, family = "binomial", data = df)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.80978  -1.12580  -0.01668   1.12790   1.72617  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.09573    0.13140  -0.729    0.466    
## x            0.55053    0.07159   7.690 1.47e-14 ***
## z            0.17674    0.22990   0.769    0.442    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1386.3  on 999  degrees of freedom
## Residual deviance: 1320.9  on 997  degrees of freedom
## AIC: 1326.9
## 
## Number of Fisher Scoring iterations: 4

`glm`

Ok, but no-one actually thinks in terms of log-odds, so let's translate that into something more meaningful:

logit.OR <- exp(cbind(OR = coef(logit.fit), confint(logit.fit)))

coef extracts the coefficients
confint extracts the confidence intervals
cbind binds the vectors together as separate columns
exp exponentiates the log-odds ratios

round(logit.OR, digits = 4)

##                 OR  2.5 % 97.5 %
## (Intercept) 0.9087 0.7021 1.1755
## x           1.7342 1.5100 1.9997
## z           1.1933 0.7606 1.8740

`glm`

Almost all of the convenience functions that we used for lm are also applicable to glm models:

summary(logit.fit)
plot(logit.fit)
residuals(logit.fit)
coefficients(logit.fit)
fitted(logit.fit)

Other models

There are a number of external packages that can make fitting other model types (relatively) straightforward:

lmer4 - Linear, generalised linear, and nonlinear hierarchical models
mcgv - generalised additive models
survival - survival analysis
glmnet - lasso and elastic net regression models
randomForest - random forest models from machine learning
rjags and rstan - Bayesian models

To use a package that is not a part of the base R installation, use:

install.packages("lmer4")
library("lmer4")

`predict`

We can retrieve the fitted values from the model using fitted(), but we may be interested in calculating predicted values for arbitrary levels of our covariates.

sim_data <- data.frame(x = c(0, 1))
y_hat <- predict(object = lm.fit, newdata = sim_data)
y_hat

##         1         2 
## 0.4721722 0.7990340

Here, I am creating a data.frame with two observations of one variable (x).

I am then using the predict function, where

object = my_lm tells R the model object for which prediction is desired
newdata = sim_data tells R the values for which I would like predictions

`predict`

We can use the same syntax to retrieve predictions for (marginally) more interesting models:

sim.data <- data.frame(x = c(0, 0, 1, 1), z = c(0, 1, 0, 1))
sim.data

##   x z
## 1 0 0
## 2 0 1
## 3 1 0
## 4 1 1

y.hat <- predict(lm.multi.fit, newdata = sim.data)
y.hat

##         1         2         3         4 
## 0.4439052 0.5009562 0.7707532 0.8278042

`predict`

This can be especially useful when trying to visualise interactive models:

sim.data.z0 <- data.frame(x = seq(from = -2, to = 2, by = 0.01), z = 0)
sim.data.z1 <- data.frame(x = seq(from = -2, to = 2, by = 0.01), z = 1)
y.hat.z0 <- predict(lm.inter.fit, newdata = sim.data.z0)
y.hat.z1 <- predict(lm.inter.fit, newdata = sim.data.z1)

seq generates a regular sequences from one value to another value by given increments
I am creating two data frames for prediction, in both cases varying the value of x, but first setting z to 0, and then setting z to 1

`predict`

# Create a plot of the data
plot(sim.df$x, sim.df$y, cex = 0.5, pch = 19, 
     col = "gray", bty = "n",
     xlab = "X", ylab = "Y", 
     main = "Fitted values for sim_data")

# Add a prediction line for z = 0
lines(x = sim.data.z0$x, y = y.hat.z0, lwd = 2)

# Add a prediction line for z = 1
lines(x = sim.data.z1$x, y = y.hat.z1, lwd = 2, col = "red")

# Add a legend
legend("bottomright", legend = c("z = 0", "z = 1"), col = c("black", "red"), lty = 1)

`predict`

Can we just use `margins`?

Yes! Thomas Leeper has created a package that implements the functionality of STATA margins command:

library("margins")

margins::cplot(lm.inter.fit, x = "x", dx = "z", what = "effect")

cplot is equivalent to marginsplot in STATA.

The specification above tell R that we want to plot the marginal effect of z on y across the range of x.

Can we just use `margins`?

Thank you!

Any questions?

Other helpful functions

objects() & ls() # Which objects are currently loaded in my working environment?
rm() # Remove objects from my current environment
save() # Save R object(s) to disk
is.na() # Is this object a missing value? Or, which elements in this vector are missing?
rnorm() # Generate random numbers from a normal distribution
runif() # Generate random numbers from a uniform distribution

Other helpful packages

Check tidyverse collection of packages:

library("readr")
library("dplyr")
library("tidyr")

If you need to get the most speed out of data.frame operations:

library("data.table")

For dealing with time series data:

library("zoo")
library("lubridate")

R basics

Materials

I have used STATA/SPSS for years, why learn R?

Popularity of data analysis software

R and RStudio

RStudio demonstration

Basic mathematical operations in R

Basic mathematical operations in R

Basic operations in R

R objects

Data Structures

vector - the basic building block of R.

vector - the basic building block of R.

vector - the basic building block of R.

vector - the basic building block of R.

vector - the basic building block of R.

vector subsetting

vector subsetting

vector operations

vector operations

matrix

list

list subsetting

data.frame - the workhorse of data analysis.

matrix and data.frame subsetting

matrix and data.frame subsetting

functions - the backbone of R operations.

function example

function example

Basic operations are functions in disguise!

User defined functions

Help!

Exercise One

Data manipulation and exploration

Reading data into R (.csv)

Reading data into R (STATA .dta, SPSS .sav)

Descriptive statistics functions

head

str

summary

pairs

range

range

Plots and graphs

Introduction

Base R plots

Base R plots

Base R plots

Base R plots

Base R plots

Base R plots

ggplot

ggplot

ggplot

ggplot

ggplot

Other examples of pretty R graphics

Other examples of pretty R graphics

Exercise Two

Linear Regression Models

lm

lm

lm

lm

lm

lm

lm

lm

lm diagnostics

lm residual plot

lm residual plot

lm residual plot

lm residual plot

lm residual plot

Generalized Linear Models

glm

glm

glm

glm

glm

R `objects`

`vector` - the basic building block of R.

`vector` - the basic building block of R.

`vector` - the basic building block of R.

`vector` - the basic building block of R.

`vector` - the basic building block of R.

`vector` subsetting

`vector` subsetting

`vector` operations

`vector` operations

`matrix`

`list`

`list` subsetting

`data.frame` - the workhorse of data analysis.

`matrix` and `data.frame` subsetting

`matrix` and `data.frame` subsetting

`functions` - the backbone of R operations.

`function` example

`function` example

Basic operations are `functions` in disguise!

User defined `functions`

Descriptive statistics `functions`

`head`

`str`

`summary`

`pairs`

`range`

`range`

`ggplot`

`ggplot`

`ggplot`

`ggplot`

`ggplot`

`lm`

`lm`

`lm`

`lm`

`lm`

`lm`

`lm`

`lm`

`lm` diagnostics

`lm` residual plot

`lm` residual plot

`lm` residual plot

`lm` residual plot

`lm` residual plot

`glm`

`glm`

`glm`

`glm`

`glm`

`predict`

`predict`

`predict`

`predict`

`predict`

Can we just use `margins`?

Can we just use `margins`?