15/05/2019

R basics

Materials

I have used STATA/SPSS for years, why learn R?

Because:

  • R is free
  • R is open source and has an excellent package ecosystem
  • R is very versatile
  • R is very good for data visualisation
  • R can accommodate more than one data frame in memory at the same time

On the other hand:

  • R has a steep(er) learning curve

Popularity of data analysis software

More discussion on this can be found here.

R and RStudio

  • As in STATA, you can either use a script (do file) to save commands and then run them, or write commands directly into the console
  • It is perfectly possible to use the base R environment for everything, but most people do not do this
  • Today we will mostly be using RStudio, which is an integrated development environment (IDE) for R
  • RStudio has a host of useful features which make it an excellent environment for learning R
    • Intuitive interface
    • Syntax highlighting
    • Easy data overview
    • Project management

RStudio demonstration

Basic mathematical operations in R

1 + 1
## [1] 2
5 - 3
## [1] 2
6 / 2
## [1] 3
4 * 4
## [1] 16

Basic mathematical operations in R

  • Other common mathematical functions are also built in (a few examples follow below):
sum(<numbers>) # Sum
mean(<numbers>) # Mean
median(<numbers>) # Median
sd(<numbers>) # Standard Deviation
log(<number>) # Natural logarithm
exp(<number>) # Exponential function
sqrt(<number>) # Square root
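A few quick examples:

sum(c(1, 2, 3))
## [1] 6
mean(c(1, 2, 3, 4))
## [1] 2.5
sqrt(16)
## [1] 4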

Basic operations in R

R also understands logical operators:

  • < : less than
  • > : greater than
  • == : equal to (note, not =)
  • >= : greater than or equal to
  • <= : less than or equal to
  • != : not equal to
  • & : and
  • | : or
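A few quick examples:

2 < 3
## [1] TRUE
2 == 3
## [1] FALSE
(2 < 3) & (2 == 3)
## [1] FALSE
(2 < 3) | (2 == 3)
## [1] TRUE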

R objects

  • The entities that R creates and manipulates are known as objects
  • These may be vectors, matrices, character strings, or functions
  • Objects are created using the assignment operator: <-
  • Once created, an object is stored in your current global environment
x1 <- 10
print(x1)
## [1] 10
x2 <- 4
x3 <- x1 - x2
print(x3)
## [1] 6

Data Structures

vector - the basic building block of R.

A vector is a collection of elements which all share the same type.

We use the c() function to concatenate observations into a single vector.

  • Numeric
num_vec <- c(150, 178, 67.7, 905, 12)
print(num_vec)
## [1] 150.0 178.0  67.7 905.0  12.0
num_vec
## [1] 150.0 178.0  67.7 905.0  12.0

vector - the basic building block of R.

There are no scalars (atomic units) in R

A scalar is just a vector of length 1

  • Character
char_vec <- c("apple", "pear", "plum", "pineapple", "strawberry")
length(char_vec)
## [1] 5
char_scalar <- "banana"
length(char_scalar)
## [1] 1

vector - the basic building block of R.

When creating new objects, keep reserved words in mind and avoid using them as names (see the note after the code below)

  • Logical (Boolean)
log_vec <- c(FALSE, FALSE, FALSE, TRUE, TRUE)
log_vec
## [1] FALSE FALSE FALSE  TRUE  TRUE
log_vec2 <- c(F, F, F, T, T)
log_vec2
## [1] FALSE FALSE FALSE  TRUE  TRUE
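The logical values used above are themselves reserved words, so they cannot be reused as object names; ?Reserved lists them all:

# TRUE <- 1   # This would produce an error: TRUE is reserved
?Reserved     # Help page listing R's reserved words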

vector - the basic building block of R.

  • Factor

A factor is similar to a character vector, but here each unique element is associated with a numerical value which represents a category:

fac_vec <- as.factor(c("a", "b", "c", "b", "b", "c"))
fac_vec
## [1] a b c b b c
## Levels: a b c
as.numeric(fac_vec)
## [1] 1 2 3 2 2 3

vector - the basic building block of R.

  • Date

A date is similar to a numeric vector, as internally it stores the number of days since 1 January 1970, but it can be displayed in a variety of formats:

date_vec <- as.Date(c("2001/07/06", "2005/05/05", "2010/06/05", "2015/07/05", "2017/08/06"))
date_vec
## [1] "2001-07-06" "2005-05-05" "2010-06-05" "2015-07-05" "2017-08-06"
as.numeric(date_vec)
## [1] 11509 12908 14765 16621 17384
format(date_vec, "%d/%m/%Y")
## [1] "06/07/2001" "05/05/2005" "05/06/2010" "05/07/2015" "06/08/2017"

vector subsetting

To subset a vector, use square brackets to index the elements you would like via object[index]:

  • Numerical subsetting
char_vec
## [1] "apple"      "pear"       "plum"       "pineapple"  "strawberry"
char_vec[1:3]
## [1] "apple" "pear"  "plum"

vector subsetting

Instead of passing indices, it is also possible to use a logical vector to specify the required elements:

  • Logical subsetting
log_vec
## [1] FALSE FALSE FALSE  TRUE  TRUE
char_vec[log_vec]
## [1] "pineapple"  "strawberry"

vector operations

In R, mathematical operations on vectors occur elementwise:

fib <- c(1, 1, 2, 3, 5, 8, 13, 21)
fib[1:7]
## [1]  1  1  2  3  5  8 13
fib[2:8]
## [1]  1  2  3  5  8 13 21
fib[1:7] + fib[2:8]
## [1]  2  3  5  8 13 21 34

vector operations

It is also possible to perform logical operations on vectors:

fib <- c(1, 1, 2, 3, 5, 8, 13, 21)
fib_gt_5 <- fib > 5
fib_gt_5
## [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE

We can also combine logical operators

fib_gt_5_or_ls_2 <- fib > 5 | fib < 2
fib_gt_5_or_ls_2
## [1]  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE

matrix

A matrix is just a collection of vectors! I.e. it is a 2-dimensional vector with two additional attributes: the number of rows and the number of columns

As with vectors, all elements of a matrix must be of the same type

mat <- matrix(data = 1:25, nrow = 5, ncol = 5)
mat
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    6   11   16   21
## [2,]    2    7   12   17   22
## [3,]    3    8   13   18   23
## [4,]    4    9   14   19   24
## [5,]    5   10   15   20   25

list

A list is a collection of R objects of different types

l <- list(element1 = num_vec, 
          element2 = mat[1:2,1:2], 
          element3 = "Farmers' Market")
l
## $element1
## [1] 150.0 178.0  67.7 905.0  12.0
## 
## $element2
##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
## 
## $element3
## [1] "Farmers' Market"

list subsetting

Individual elements of a list object can be subset using $ or [[ operators:

l$element2
##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
l[[2]]
##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7

data.frame - the workhorse of data analysis.

A data.frame is an R object in which the columns can be of different types

Despite their matrix-like appearance, data frames in R are lists, where all elements are of equal length!

df <- data.frame(weight = num_vec, fruit = char_vec, berry = log_vec)
df
##   weight      fruit berry
## 1  150.0      apple FALSE
## 2  178.0       pear FALSE
## 3   67.7       plum FALSE
## 4  905.0  pineapple  TRUE
## 5   12.0 strawberry  TRUE
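We can check this: a data.frame really is a list whose elements (the columns) all have the same length:

is.list(df)
## [1] TRUE
length(df) # Number of columns, i.e. list elements
## [1] 3
names(df)
## [1] "weight" "fruit"  "berry"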

matrix and data.frame subsetting

To subset a matrix or data.frame, specify rows and columns as object[rows, columns]; leaving either index blank selects everything along that dimension:

mat[1:3, 1:3]
##      [,1] [,2] [,3]
## [1,]    1    6   11
## [2,]    2    7   12
## [3,]    3    8   13
df[1, ]
##   weight fruit berry
## 1    150 apple FALSE

matrix and data.frame subsetting

We can also remove rows or columns by applying the - operator to a vector of indices:

mat[-c(1:3), -c(1:3)]
##      [,1] [,2]
## [1,]   19   24
## [2,]   20   25

In the code, 1:3 creates a vector of the integers 1, 2 and 3, and the - operator negates each of these. We wrap the sequence in c() (parentheses would also work) so that : is evaluated before the negation; writing -1:3 would instead produce the sequence from -1 to 3.
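The parentheses matter because of operator precedence:

-(1:3) # Build the sequence first, then negate each element
## [1] -1 -2 -3
-1:3   # Here - applies to 1 before : builds the sequence
## [1] -1  0  1  2  3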

functions - the backbone of R operations.

All operations on objects in R are done with functions.

fun(arg1, arg2, ...)

Where

  • fun is the name of the function
  • arg1 is the first argument passed to the function
  • arg2 is the second argument passed to the function

function example

Let's consider the mean() function. This function takes two main arguments:

mean(x, na.rm = FALSE)

Where x is a numeric vector, and na.rm is a logical value that indicates whether we'd like to remove missing values (NA).

vec <- c(1, 2, 3, 4, 5)
mean(x = vec, na.rm = FALSE)
## [1] 3
vec <- c(1, 2, 3, NA, 5)
mean(x = vec, na.rm = TRUE)
## [1] 2.75

function example

We can also perform calculations on the output of a function:

mean(num_vec) * 3
## [1] 787.62

Which means that we can also have nested functions:

log(mean(num_vec))
## [1] 5.570403

We can also assign the output of any function to a new object for use later:

log_fruit <- log(mean(num_vec)) # The logarithm of average fruit weight

Basic operations are functions in disguise!

All the basic operations on R objects that we have encountered so far are, in fact, functions:

`+`(1,1)
## [1] 2
`-`(5,3)
## [1] 2
`[`(char_vec, 1:3)
## [1] "apple" "pear"  "plum"

User defined functions

Functions are also objects, and we can create our own. We define a function as follows:

sum2 <- function(a, b){
  return(a + b)
}

res <- sum2(a = 5, b = 50)
res
## [1] 55

Note that the function itself, and the result returned by the function are both objects!

Help!

R has an inbuilt help facility which provides more information about any function:

?summary
help(summary)
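If you do not know the exact name of a function, ?? searches the help pages of all installed packages (it is shorthand for help.search()):

??regression    # Same as help.search("regression")
apropos("mean") # Names of loaded objects matching a pattern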

The quality of the help varies hugely across packages.

Stackoverflow is a good resource for many standard tasks.

For custom packages it is often helpful to check the package's issues page on GitHub.

E.g. for ggplot2: https://github.com/tidyverse/ggplot2/issues

Or, indeed, any search engine #LMDDGTFY.

Exercise One

Data manipulation and exploration

Reading data into R (.csv)

sim.df <- read.csv(file = "file.csv")
  • sim.df is an R data.frame object (you could call it anything)
  • file.csv is a .csv file with your data
  • <- is the assignment operator
  • In order for R to access file.csv, it will have to be saved in your current working directory
  • Until recently, the standard advice was to use getwd() and setwd()
  • Try to avoid this! Instead, create a new RStudio project and use relative paths:
sim.df <- read.csv(file = "./data/file.csv")
  • More details can be found here

Reading data into R (STATA .dta, SPSS .sav)

install.packages("haven") # Install the "haven" package
library("haven") ## Loads an additional package that deals with STATA files
sim.df <- haven::read_dta(file = "./data/file.dta") 
sim.df <- haven::read_sav(file = "./data/file.sav") 
  • The haven package is not part of the base R installation, so we need to install it ourselves
  • sim.df is still an R data.frame object
  • file.dta is a .dta (used by STATA) file with your data
  • file.sav is a .sav (used by SPSS) file with your data
  • <- is the usual assignment operator

Descriptive statistics functions

R includes a number of helpful functions for exploring your data:

View(sim.df) # View data in spreadsheet-style
names(sim.df) # Names of the variables in your data
head(sim.df) # First n (six, by default) rows of the data
tail(sim.df) # Last n (six, by default) rows of the data
str(sim.df) # "Structure" of any R object
summary(sim.df) # Summary statistics for each of the columns in the data
dim(sim.df) # Dimensions of data
mean(sim.df$var) # Mean of a numeric vector
sd(sim.df$var) # Standard deviation of a numeric vector
range(sim.df$var) # Range of a numeric vector
quantile(sim.df$var) # Quantiles of a numeric vector

head

str

str() applied to data frames is particularly useful for checking the types of the variables in your data

str(sim.df) 
## 'data.frame':    1000 obs. of  4 variables:
##  $ x: num  1.807 -1.249 -0.136 0.951 -1.271 ...
##  $ y: num  1.134 0.596 0.8 2.265 0.889 ...
##  $ z: num  0.0869 0.2752 0.7046 0.6106 0.8831 ...
##  $ g: Factor w/ 6 levels "a","b","c","d",..: 4 5 1 1 3 6 1 3 6 5 ...

summary

summary(sim.df) 
##        x                    y                 z             g      
##  Min.   :-2.7467901   Min.   :-3.2166   Min.   :0.0007827   a:177  
##  1st Qu.:-0.6505946   1st Qu.:-0.2343   1st Qu.:0.2568557   b:161  
##  Median : 0.0003546   Median : 0.4703   Median :0.4932571   c:173  
##  Mean   : 0.0161324   Mean   : 0.4774   Mean   :0.4954730   d:141  
##  3rd Qu.: 0.6626705   3rd Qu.: 1.1694   3rd Qu.:0.7341048   e:177  
##  Max.   : 3.0778088   Max.   : 3.5594   Max.   :0.9994499   f:171

pairs

pairs() produces a matrix of bivariate scatterplots

pairs(sim.df)

range

Suppose you would like to find the range of a variable in your data.frame. The following will not work:

range(sim.df) 
## Error in FUN(X[[i]], ...): only defined on a data frame with all numeric variables

Right, we cannot take the range of the entire data! Some of the variables are not numeric, and therefore do not have a range…

range

Instead, we need to access one variable. There are several ways to do this:

  • We can subset using square brackets, or
  • We can use the [[ operator and access the variable of interest by position, or
  • We can use the $ operator and access the variable of interest by name

Note that the last two treat the data.frame as a list whose elements are the variables

range(sim.df[,1])
range(sim.df[[1]])
range(sim.df$x) 
## [1] -2.746790  3.077809

Plots and graphs

Introduction

  • Plots are one of the great strengths of R
  • There are two main frameworks for plotting:
    • Base R graphics
    • ggplot
  • Which should you use? It is a matter of preference!

Base R plots

The basic plotting syntax is very simple. plot(x_var, y_var) will give you a scatter plot:

plot(sim.df$x, sim.df$y)

Base R plots

Hmm, let's work on that.

Base R plots

The plot function takes a number of arguments (?plot for a full list). The fewer you specify, the uglier your plot:

plot(x = sim.df$x, y = sim.df$y, 
     xlab = "X variable", 
     ylab = "Y variable", 
     main = "Awesome plot title",
     pch = 19, # Solid points
     cex = 0.5, # Smaller points
     bty = "n", # Remove surrounding box
     col = sim.df$g # Colour by grouping variable
     )

Base R plots

Base R plots

The default behaviour of plot() depends on the type of input variables for the x and y arguments. If x is a factor variable, and y is numeric, then R will produce a boxplot:

plot(x = sim.df$g, y = sim.df$x)

Base R plots

ggplot

A very popular alternative to base R plots is the ggplot2 library (the 2 in the name refers to the second iteration, which is the standard). This is a separate package (i.e. it is not a part of the base R environment) but is very widely used.

  • Based on the Grammar of Graphics data visualisation scheme:

Wilkinson, L. (2005). The Grammar of Graphics 2nd Ed. Heidelberg: Springer. https://doi.org/10.1007/0-387-28695-0

Wickham, H. (2010). A Layered Grammar of Graphics. Journal of Computational and Graphical Statistics, 19(1), 3–28. https://doi.org/10.1198/jcgs.2009.07098

  • Graphs are broken into multiple layers

  • Layers can be recycled across multiple plots

ggplot

Let's recreate the previous scatter plot using ggplot:

library("ggplot2")
ggplot(data = sim.df, aes(x = x, y = y, col = g)) +
  # Add scatterplot
  geom_point() +
  # Change axes labels and plot title
  labs(x = "X variable",
       y = "Y variable",
       title = "Awesome plot title") +
  # Change default grey theme to black and white
  theme_bw()

ggplot

ggplot

One nice feature of ggplot is that it is very easy to create facet plots:

library("ggplot2")
ggplot(data = sim.df, aes(x = x, y = y, col = g)) + 
  geom_point() +
  labs(x = "X variable",
       y = "Y variable",
       title = "Awesome plot title") +
  theme_bw() +
  # Separate plots by variable g
  facet_wrap(~ g)

ggplot

Other examples of pretty R graphics

Other examples of pretty R graphics

Exercise Two

Linear Regression Models

lm

Linear regression models in R are implemented using the lm function.

lm.fit <- lm(formula = y ~ x, data = sim.df)

The formula argument is the specification of the model, and the data argument is the data on which you would like the model to be estimated.

lm.fit
## 
## Call:
## lm(formula = y ~ x, data = sim.df)
## 
## Coefficients:
## (Intercept)            x  
##      0.4722       0.3269

lm

We can specify multivariate models:

lm.multi.fit <- lm(formula = y ~ x + z, data = sim.df)

Interaction models:

lm.inter.fit <- lm(formula = y ~ x * z, data = sim.df)

Note that the main effects of x and z are also included when the interaction term is specified with *.
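Indeed, the * operator expands to the main effects plus their interaction, so an equivalent specification is:

lm.inter.fit2 <- lm(formula = y ~ x + z + x:z, data = sim.df)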

Fixed-effect models:

lm.fe.fit <- lm(formula = y ~ x + g, data = sim.df)

And many more!

lm

The lm function returns a list containing a large amount of useful output.

When we call the fitted object (e.g. lm.fit), we are presented only with the estimated coefficients.

For more information about the estimated model, use summary(fitted.model):

lm.fit.summary <- summary(lm.fit)
lm.fit.summary

lm

## 
## Call:
## lm(formula = y ~ x, data = sim.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2203 -0.6610 -0.0249  0.6988  2.8859 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.47217    0.03202  14.747   <2e-16 ***
## x            0.32686    0.03300   9.906   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.012 on 998 degrees of freedom
## Multiple R-squared:  0.08953,    Adjusted R-squared:  0.08862 
## F-statistic: 98.14 on 1 and 998 DF,  p-value: < 2.2e-16

lm

As with any other function, summary(fitted.model) returns an object. Here, it is a list. What is saved as the output of this function?

names(lm.fit.summary)
##  [1] "call"          "terms"         "residuals"     "coefficients" 
##  [5] "aliased"       "sigma"         "df"            "r.squared"    
##  [9] "adj.r.squared" "fstatistic"    "cov.unscaled"

If we want to extract other information of interest from the fitted model object, we can use the $ operator to do so:

lm.fit.summary$r.squared
## [1] 0.08952873

lm

Accessing elements from saved models can be very helpful in making comparisons across models.

Suppose we want to extract and compare \(R^2\) across different models.

lm.r2 <- summary(lm.fit)$r.squared
lm.multi.r2 <- summary(lm.multi.fit)$r.squared
lm.inter.r2 <- summary(lm.inter.fit)$r.squared

r2.compare <- data.frame(
  model = c("bivariate", "multivariate", "interaction"), 
  r.squared = c(lm.r2, 
                lm.multi.r2, 
                lm.inter.r2)
)

lm

We can print the data frame containing values of \(R^2\):

r2.compare
##          model  r.squared
## 1    bivariate 0.08952873
## 2 multivariate 0.08976396
## 3  interaction 0.10851819

Or we can plot them:

ggplot(r2.compare, aes(x = model, y = r.squared))+
  geom_point(size = 4) +
  # Use `expression` to add 2 as a superscript to R
  ggtitle(expression(paste(R^{2}, " ", "Comparison"))) +
  theme_bw()

lm

lm diagnostics

There are a number of functions that are helpful for producing model diagnostics (a short example follows the list):

  • residuals(fitted.model) extracts the residuals from a fitted model
  • coefficients(fitted.model) extracts coefficients
  • fitted(fitted.model) extracts fitted values
  • plot(fitted.model) is a convenience function for producing a number of useful diagnostic plots
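For example, applied to the bivariate model from earlier (output omitted; the coefficient values are those printed for lm.fit above):

coefficients(lm.fit)    # Named vector with the intercept (0.4722) and slope (0.3269)
head(residuals(lm.fit)) # First six residuals
head(fitted(lm.fit))    # First six fitted values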

lm residual plot

We can easily plot the residuals from a fitted model against an explanatory variable of interest.

Here we are going to exploit the modular nature of ggplot:

# First, let's create common lower layers
first.layer <- ggplot() +
  # Draw horizontal line with y-intercept of 0
  geom_hline(yintercept = 0) +
  labs(y = "residuals") +
  theme_bw()

lm residual plot

# Now, we will create the first plot by adding a scatterplot of residuals for x
first.layer +
  geom_point(aes(x = sim.df$x,
                 y = residuals(lm.multi.fit)),
             # Change colour to grey
             colour = "grey") +
  labs(x = "x")
# Create another scatterplot for z
first.layer + 
  geom_point(aes(x = sim.df$z,
                 y = residuals(lm.multi.fit)),
             colour = "grey") +
  labs(x = "z")

lm residual plot

lm residual plot

Alternatively, we can do that in base R:

par(mfrow = c(1,2)) # Divide the plotting region into 1 row and 2 cols
plot(x = sim.df$x, y = residuals(lm.multi.fit), 
     xlab = "x", ylab = "residuals",
     pch = 19, # Filled circles
     col = "grey", # Change colour
     cex = 0.5) # Make point size smaller
abline(h = 0) # Add a horizontal line with y-intercept of 0
plot(x = sim.df$z, y = residuals(lm.multi.fit), 
     xlab = "z", ylab = "residuals",
     pch = 19, # Filled circles
     col = "grey", # Change colour
     cex = 0.5) # Make point size smaller
abline(h = 0) # Add a horizontal line with y-intercept of 0

lm residual plot

Generalized Linear Models

glm

To estimate a range of generalized linear models, the glm function is particularly helpful.

First, let us transform our outcome variable from a continuous measure to a binary measure:

sim.df$y_bin <- as.numeric(sim.df$y > median(sim.df$y))
table(sim.df$y_bin)
## 
##   0   1 
## 500 500

glm

Now we can estimate our model:

logit.fit <- glm(formula = y_bin ~ x + z, data = sim.df, family = "binomial")

Where:

  • formula is the model specification
  • data is our simulated data
  • family is a description of the error distribution and link function to be used (a quick syntax check follows this list)
    • binomial, poisson, gaussian etc…
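With family = "gaussian" (whose default link is the identity), glm reproduces the ordinary linear regression estimates from lm, which is a useful check of the syntax:

gauss.fit <- glm(formula = y ~ x, data = sim.df, family = "gaussian")
coefficients(gauss.fit) # Matches the coefficients of lm.fit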

glm

summary(logit.fit)
## 
## Call:
## glm(formula = y_bin ~ x + z, family = "binomial", data = sim.df)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.80978  -1.12580  -0.01668   1.12790   1.72617  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.09573    0.13140  -0.729    0.466    
## x            0.55053    0.07159   7.690 1.47e-14 ***
## z            0.17674    0.22990   0.769    0.442    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1386.3  on 999  degrees of freedom
## Residual deviance: 1320.9  on 997  degrees of freedom
## AIC: 1326.9
## 
## Number of Fisher Scoring iterations: 4

glm

Ok, but no one actually thinks in terms of log-odds, so let's translate them into something more meaningful:

logit.OR <- exp(cbind(OR = coef(logit.fit), confint(logit.fit)))
  • coef extracts the coefficients
  • confint extracts the confidence intervals
  • cbind binds the vectors together as separate columns
  • exp exponentiates the log-odds ratios
round(logit.OR, digits = 4)
##                 OR  2.5 % 97.5 %
## (Intercept) 0.9087 0.7021 1.1755
## x           1.7342 1.5100 1.9997
## z           1.1933 0.7606 1.8740

glm

Almost all of the convenience functions that we used for lm are also applicable to glm models:

summary(logit.fit)
plot(logit.fit)
residuals(logit.fit)
coefficients(logit.fit)
fitted(logit.fit)

Other models

There are a number of external packages that can make fitting other model types (relatively) straightforward:

  • lme4 - Linear, generalised linear, and nonlinear hierarchical models
  • mgcv - generalised additive models
  • survival - survival analysis
  • glmnet - lasso and elastic net regression models
  • randomForest - random forest models from machine learning
  • rjags and rstan - Bayesian models

To use a package that is not a part of the base R installation, use:

install.packages("lmer4")
library("lmer4")

predict

We can retrieve the fitted values from the model using fitted(), but we may be interested in calculating predicted values for arbitrary levels of our covariates.

sim_data <- data.frame(x = c(0, 1))
y_hat <- predict(object = lm.fit, newdata = sim_data)
y_hat
##         1         2 
## 0.4721722 0.7990340

Here, I am creating a data.frame with two observations of one variable (x).

I am then using the predict function, where

  • object = lm.fit tells R the model object for which predictions are desired
  • newdata = sim_data tells R the values at which I would like predictions (a quick check by hand follows)
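We can verify these values by hand: each prediction is just the intercept plus the slope times x, using the coefficients from lm.fit:

b <- coefficients(lm.fit)
unname(b[1] + b[2] * c(0, 1)) # Intercept + slope * x, for x = 0 and x = 1
## [1] 0.4721722 0.7990340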

predict

We can use the same syntax to retrieve predictions for (marginally) more interesting models:

sim.data <- data.frame(x = c(0, 0, 1, 1), z = c(0, 1, 0, 1))
sim.data
##   x z
## 1 0 0
## 2 0 1
## 3 1 0
## 4 1 1
y.hat <- predict(lm.multi.fit, newdata = sim.data)
y.hat
##         1         2         3         4 
## 0.4439052 0.5009562 0.7707532 0.8278042

predict

This can be especially useful when trying to visualise interactive models:

sim.data.z0 <- data.frame(x = seq(from = -2, to = 2, by = 0.01), z = 0)
sim.data.z1 <- data.frame(x = seq(from = -2, to = 2, by = 0.01), z = 1)
y.hat.z0 <- predict(lm.inter.fit, newdata = sim.data.z0)
y.hat.z1 <- predict(lm.inter.fit, newdata = sim.data.z1)
  • seq generates a regular sequence from one value to another in given increments
  • I am creating two data frames for prediction, in both cases varying the value of x, but first setting z to 0, and then setting z to 1

predict

# Create a plot of the data
plot(sim.df$x, sim.df$y, cex = 0.5, pch = 19, 
     col = "gray", bty = "n",
     xlab = "X", ylab = "Y", 
     main = "Fitted values for sim_data")

# Add a prediction line for z = 0
lines(x = sim.data.z0$x, y = y.hat.z0, lwd = 2)

# Add a prediction line for z = 1
lines(x = sim.data.z1$x, y = y.hat.z1, lwd = 2, col = "red")

# Add a legend
legend("bottomright", legend = c("z = 0", "z = 1"), col = c("black", "red"), lty = 1) 

predict

Can we just use margins?

Yes! Thomas Leeper has created a package that implements the functionality of STATA's margins command:

library("margins")

margins::cplot(lm.inter.fit, x = "x", dx = "z", what = "effect")

cplot is equivalent to marginsplot in STATA.

The specification above tells R that we want to plot the marginal effect of z on y across the range of x.

Can we just use margins?

Thank you!

Any questions?

Other helpful functions

objects() & ls() # Which objects are currently loaded in my working environment?
rm() # Remove objects from my current environment
save() # Save R object(s) to disk
is.na() # Is this object a missing value? Or, which elements in this vector are missing?
rnorm() # Generate random numbers from a normal distribution
runif() # Generate random numbers from a uniform distribution
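A couple of quick illustrations (the rnorm()/runif() draws are random, so their output is not shown; set.seed() makes them reproducible):

is.na(c(1, NA, 3))
## [1] FALSE  TRUE FALSE
set.seed(1234)                 # Fix the random number generator for reproducibility
rnorm(n = 3, mean = 0, sd = 1) # Three draws from a standard normal
runif(n = 3, min = 0, max = 1) # Three draws from a uniform on [0, 1]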

Other helpful packages

library("readr")
library("dplyr")
library("tidyr")
  • If you need to get the most speed out of data.frame operations:
library("data.table")
  • For dealing with time series data:
library("zoo")
library("lubridate")

Further reading

  • Crawley, M. J. (2012). The R Book 2nd Ed. Hoboken, NJ: Wiley-Blackwell.
  • Dalgaard, P. (2008). Introductory Statistics with R 2nd Ed. Heidelberg: Springer.
  • Fox, J. (2018). An R Companion to Applied Regression 3rd Ed. Thousand Oaks, CA: Sage Publications.
  • Matloff, N. (2011). The Art of R Programming. San Francisco: No Starch Press.
  • Wickham, H. (2014). Advanced R. Boca Raton, FL: Chapman & Hall/CRC.
  • Wickham, H., & Grolemund, G. (2017). R for Data Science. Sebastopol: O’Reilly Media.