15/05/2019
You can find the
It is worth noting that both the assignment and the slides were created using R and RStudio.
Because:
On the other hand:
More discussion on this can be found here.
1 + 1
## [1] 2
5 - 3
## [1] 2
6 / 2
## [1] 3
4 * 4
## [1] 16
sum(<numbers>)     # Sum
mean(<numbers>)    # Mean
median(<numbers>)  # Median
sd(<numbers>)      # Standard deviation
log(<number>)      # Logarithm
exp(<number>)      # Exponent
sqrt(<number>)     # Square root
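For instance, applying a few of these to a small made-up vector (the values here are purely for illustration):

```r
vals <- c(2, 4, 6, 8)  # a made-up example vector
sum(vals)    # 20
mean(vals)   # 5
sqrt(16)     # 4
log(exp(1))  # 1
```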
R also understands logical operators:
`<` : less than
`>` : greater than
`==` : equal to (note, not `=`)
`>=` : greater than or equal to
`<=` : less than or equal to
`!=` : not equal to
`&` : and
`|` : or
objects
We create new objects using the assignment operator `<-`:
x1 <- 10
print(x1)
## [1] 10
x2 <- 4
x3 <- x1 - x2
print(x3)
## [1] 6
vector
- the basic building block of R.
A vector is a collection of elements which all share the same type.
We use the `c()` function to concatenate observations into a single vector.
num_vec <- c(150, 178, 67.7, 905, 12)
print(num_vec)
## [1] 150.0 178.0 67.7 905.0 12.0
num_vec
## [1] 150.0 178.0 67.7 905.0 12.0
vector
- the basic building block of R.
There are no scalars (atomic units) in R.
A scalar is just a vector of length 1:
char_vec <- c("apple", "pear", "plum", "pineapple", "strawberry")
length(char_vec)
## [1] 5
char_scalar <- "banana"
length(char_scalar)
## [1] 1
vector
- the basic building block of R.
When creating new objects, keep in mind reserved words and try to avoid using them in names:
log_vec <- c(FALSE, FALSE, FALSE, TRUE, TRUE)
log_vec
## [1] FALSE FALSE FALSE TRUE TRUE
log_vec2 <- c(F, F, F, T, T)
log_vec2
## [1] FALSE FALSE FALSE TRUE TRUE
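One caution on the `T`/`F` shorthand (a sketch of a well-known pitfall): `TRUE` and `FALSE` are reserved words, but `T` and `F` are ordinary variables that merely default to those values, so they can be overwritten:

```r
T <- 0   # legal! T is just a variable, not a reserved word
c(T, T)  # now 0 0, not TRUE TRUE
rm(T)    # remove our masking variable...
c(T, T)  # ...and T is TRUE again
```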
vector
- the basic building block of R.
A factor is similar to a character vector, but here each unique element is associated with a numerical value which represents a category:
fac_vec <- as.factor(c("a", "b", "c", "b", "b", "c"))
fac_vec
## [1] a b c b b c
## Levels: a b c
as.numeric(fac_vec)
## [1] 1 2 3 2 2 3
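This code-based representation has a well-known pitfall, illustrated here with a made-up factor: `as.numeric()` on a factor returns the level codes, not the original values, so a factor of number-like strings must go through `as.character()` first:

```r
num_fac <- as.factor(c("10", "20", "10"))
as.numeric(num_fac)                # level codes: 1 2 1
as.numeric(as.character(num_fac))  # actual values: 10 20 10
```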
vector
- the basic building block of R.
A date is similar to a numeric vector, as it stores the number of days, but it can assume a variety of forms:
date_vec <- as.Date(c("2001/07/06", "2005/05/05", "2010/06/05", "2015/07/05", "2017/08/06"))
date_vec
## [1] "2001-07-06" "2005-05-05" "2010-06-05" "2015-07-05" "2017-08-06"
as.numeric(date_vec)
## [1] 11509 12908 14765 16621 17384
format(date_vec, "%d/%m/%Y")
## [1] "06/07/2001" "05/05/2005" "05/06/2010" "05/07/2015" "06/08/2017"
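Because dates are stored as day counts, arithmetic on them works directly. The sketch below reuses the first two dates from the vector above:

```r
d1 <- as.Date("2001/07/06")
d2 <- as.Date("2005/05/05")
d2 - d1  # Time difference of 1399 days (i.e. 12908 - 11509)
d1 + 30  # "2001-08-05", thirty days later
```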
vector
subsetting
To subset a `vector`, use square brackets to index the elements you would like via `object[index]`:
char_vec
## [1] "apple" "pear" "plum" "pineapple" "strawberry"
char_vec[1:3]
## [1] "apple" "pear" "plum"
vector
subsetting
Instead of passing indices, it is also possible to use a logical vector to specify the required elements:
log_vec
## [1] FALSE FALSE FALSE TRUE TRUE
char_vec[log_vec]
## [1] "pineapple" "strawberry"
vector
operations
In R, mathematical operations on vectors occur elementwise:
fib <- c(1, 1, 2, 3, 5, 8, 13, 21)
fib[1:7]
## [1] 1 1 2 3 5 8 13
fib[2:8]
## [1] 1 2 3 5 8 13 21
fib[1:7] + fib[2:8]
## [1] 2 3 5 8 13 21 34
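Elementwise operations also apply when the vectors have different lengths: the shorter vector is recycled. A small sketch (recycling is silent when the lengths divide evenly, which can hide bugs; R warns otherwise):

```r
fib <- c(1, 1, 2, 3, 5, 8, 13, 21)
fib * 10           # the single value 10 is recycled across all 8 elements
fib + c(100, 200)  # c(100, 200) is repeated 4 times
```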
vector
operations
It is also possible to perform logical operations on vectors:
fib <- c(1, 1, 2, 3, 5, 8, 13, 21)
fib_gt_5 <- fib > 5
fib_gt_5
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
We can also combine logical operators:
fib_gt_5_or_ls_2 <- fib > 5 | fib < 2
fib_gt_5_or_ls_2
## [1] TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE
matrix
A matrix is just a collection of vectors! I.e. it is a 2-dimensional vector that has 2 additional attributes: the number of rows and columns.
As for vectors, all elements of a matrix must be of the same type.
mat <- matrix(data = 1:25, nrow = 5, ncol = 5)
mat
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    6   11   16   21
## [2,]    2    7   12   17   22
## [3,]    3    8   13   18   23
## [4,]    4    9   14   19   24
## [5,]    5   10   15   20   25
list
A `list` is a collection of R objects of different types:
l <- list(element1 = num_vec, element2 = mat[1:2, 1:2], element3 = "Farmers' Market")
l
## $element1
## [1] 150.0 178.0  67.7 905.0  12.0
## 
## $element2
##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
## 
## $element3
## [1] "Farmers' Market"
list
subsetting
Individual elements of a `list` object can be subset using the `$` or `[[` operators:
l$element2
##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
l[[2]]
##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
data.frame
- the workhorse of data analysis.
A `data.frame` is an R object in which the columns can be of different types.
Despite their matrix-like appearance, data frames in R are lists, where all elements are of equal length!
df <- data.frame(weight = num_vec, fruit = char_vec, berry = log_vec)
df
##   weight      fruit berry
## 1  150.0      apple FALSE
## 2  178.0       pear FALSE
## 3   67.7       plum FALSE
## 4  905.0  pineapple  TRUE
## 5   12.0 strawberry  TRUE
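Columns of a `data.frame` can also be pulled out, or added, by name with the `$` operator. A short sketch with made-up values:

```r
df <- data.frame(weight = c(150, 178, 67.7),
                 fruit  = c("apple", "pear", "plum"))
df$weight                        # extract one column as a vector
df$log_weight <- log(df$weight)  # add a derived column
ncol(df)                         # 3
```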
matrix
and data.frame
subsetting
To subset a `matrix` or `data.frame`, you need to specify both rows and columns:
mat[1:3, 1:3]
##      [,1] [,2] [,3]
## [1,]    1    6   11
## [2,]    2    7   12
## [3,]    3    8   13
df[1, ]
##   weight fruit berry
## 1    150 apple FALSE
matrix
and data.frame
subsetting
We can also subset to remove rows or columns by using the `-` operator applied to the `c` function:
mat[-c(1:3), -c(1:3)]
##      [,1] [,2]
## [1,]   19   24
## [2,]   20   25
In the code, `1:3` creates a vector of the integers 1, 2 and 3, and the `-` operator negates these. We wrap the vector in the `c` function so that `-` applies to each element, and not just the first element.
functions
- the backbone of R operations.
All operations on objects in R are done with functions.
fun(arg1, arg2, ...)
Where:
`fun` is the name of the function
`arg1` is the first argument passed to the function
`arg2` is the second argument passed to the function
function
example
Let's consider the `mean()` function. This function takes two main arguments:
mean(x, na.rm = FALSE)
Where `x` is a numeric vector, and `na.rm` is a logical value that indicates whether we'd like to remove missing values (`NA`).
vec <- c(1, 2, 3, 4, 5)
mean(x = vec, na.rm = FALSE)
## [1] 3
vec <- c(1, 2, 3, NA, 5)
mean(x = vec, na.rm = TRUE)
## [1] 2.75
function
example
We can also perform calculations on the output of a function:
mean(num_vec) * 3
## [1] 787.62
Which means that we can also have nested functions:
log(mean(num_vec))
## [1] 5.570403
We can also assign the output of any function to a new object for use later:
log_fruit <- log(mean(num_vec)) # The logarithm of average fruit weight
functions
in disguise!
All the basic operations on R objects that we have encountered so far are, in fact, `functions`:
`+`(1,1)
## [1] 2
`-`(5,3)
## [1] 2
`[`(char_vec, 1:3)
## [1] "apple" "pear" "plum"
functions
Functions are also objects, and we can create our own. We define a function as follows:
sum2 <- function(a, b){
  return(a + b)
}
res <- sum2(a = 5, b = 50)
res
## [1] 55
Note that the function itself, and the result returned by the function are both objects!
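Function arguments can also take default values, which are used whenever the caller does not supply that argument. A minimal sketch (`pow` is a made-up name for illustration):

```r
pow <- function(x, p = 2) {
  return(x^p)
}
pow(4)         # uses the default p = 2, so 16
pow(4, p = 3)  # 64
```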
R has an inbuilt help facility which provides more information about any function:
?summary
help(summary)
The quality of the help varies hugely across packages.
Stackoverflow is a good resource for many standard tasks.
For custom packages it is often helpful to check the issues page on GitHub.
E.g. for `ggplot2`: https://github.com/tidyverse/ggplot2/issues
Or, indeed, any search engine #LMDDGTFY.
sim.df <- read.csv(file = "file.csv")
`df` is an R data.frame object (you could call this anything)
`file.csv` is a .csv file with your data
`<-` is the assignment operator
If you pass only the file name `file.csv`, it will have to be saved in your current working directory
You can check and set the working directory with `getwd()` and `setwd()`
Otherwise, provide a path:
sim.df <- read.csv(file = "./data/file.csv")
install.packages("haven") # Install the "haven" package
library("haven")          # Loads an additional package that deals with STATA files
sim.df <- haven::read_dta(file = "./data/file.dta")
sim.df <- haven::read_sav(file = "./data/file.sav")
The `haven` package is not automatically installed in R so we need to do that ourselves
`df` is still an R data.frame object
`file.dta` is a .dta (used by STATA) file with your data
`file.sav` is a .sav (used by SPSS) file with your data
`<-` is the usual assignment operator
functions
R includes a number of helpful functions for exploring your data:
View(sim.df)          # View data in spreadsheet-style
names(sim.df)         # Names of the variables in your data
head(sim.df)          # First n (six, by default) rows of the data
tail(sim.df)          # Last n (six, by default) rows of the data
str(sim.df)           # "Structure" of any R object
summary(sim.df)       # Summary statistics for each of the columns in the data
dim(sim.df)           # Dimensions of data
mean(sim.df$var)      # Mean of a numeric vector
sd(sim.df$var)        # Standard deviation of a numeric vector
range(sim.df$var)     # Range of a numeric vector
quantile(sim.df$var)  # Quantiles of a numeric vector
head
By varying the argument `n`, we can adjust the number of top rows we want to inspect:
head(sim.df, n = 5)
##            x         y          z g
## 1  1.8068729 1.1338624 0.08688987 d
## 2 -1.2486526 0.5960640 0.27517432 e
## 3 -0.1355884 0.8004329 0.70459357 a
## 4  0.9505454 2.2648623 0.61059223 a
## 5 -1.2711097 0.8887924 0.88313813 c
Can also be used in combination with `View()`:
View(head(sim.df, n = 50))
str
`str()` applied to data frames is particularly useful in inferring the types of variables in the data:
str(sim.df)
## 'data.frame': 1000 obs. of 4 variables:
##  $ x: num 1.807 -1.249 -0.136 0.951 -1.271 ...
##  $ y: num 1.134 0.596 0.8 2.265 0.889 ...
##  $ z: num 0.0869 0.2752 0.7046 0.6106 0.8831 ...
##  $ g: Factor w/ 6 levels "a","b","c","d",..: 4 5 1 1 3 6 1 3 6 5 ...
summary
summary(sim.df)
##        x                   y                 z            g    
##  Min.   :-2.7467901   Min.   :-3.2166   Min.   :0.0007827   a:177
##  1st Qu.:-0.6505946   1st Qu.:-0.2343   1st Qu.:0.2568557   b:161
##  Median : 0.0003546   Median : 0.4703   Median :0.4932571   c:173
##  Mean   : 0.0161324   Mean   : 0.4774   Mean   :0.4954730   d:141
##  3rd Qu.: 0.6626705   3rd Qu.: 1.1694   3rd Qu.:0.7341048   e:177
##  Max.   : 3.0778088   Max.   : 3.5594   Max.   :0.9994499   f:171
pairs
`pairs()` produces a matrix of bivariate scatterplots:
pairs(sim.df)
range
Suppose you would like to find the range of a variable in your `data.frame`. The following will not work:
range(sim.df)
## Error in FUN(X[[i]], ...): only defined on a data frame with all numeric variables
Right, we cannot take the range of the entire data! Some of the variables are not numeric, and therefore do not have a range…
range
Instead, we need to access one variable. There are a few ways to do this:
Use matrix-style subsetting to access the variable of interest by position, e.g. `sim.df[, 1]`
Use the `[[` operator and access the variable of interest by position, or
Use the `$` operator and access the variable of interest by name
Note that the last two treat the `data.frame` as a list with variables as elements
range(sim.df[,1])
range(sim.df[[1]])
range(sim.df$x)
## [1] -2.746790 3.077809
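To apply the same function to every column at once, `sapply()` loops a function over the elements of a list — and a `data.frame` is a list of columns. A sketch with a small made-up data frame (not the `sim.df` used above):

```r
toy.df <- data.frame(x = c(-1, 0, 2), y = c(3, 5, 1))
sapply(toy.df, range)  # one column of output per variable
```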
ggplot
The basic plotting syntax is very simple. `plot(x_var, y_var)` will give you a scatter plot:
plot(sim.df$x, sim.df$y)
Hmm, let's work on that.
The plot function takes a number of arguments (`?plot` for a full list). The fewer you specify, the uglier your plot:
plot(x = sim.df$x,
     y = sim.df$y,
     xlab = "X variable",
     ylab = "Y variable",
     main = "Awesome plot title",
     pch = 19,       # Solid points
     cex = 0.5,      # Smaller points
     bty = "n",      # Remove surrounding box
     col = sim.df$g  # Colour by grouping variable
)
The default behaviour of `plot()` depends on the type of input variables for the `x` and `y` arguments. If `x` is a factor variable, and `y` is numeric, then R will produce a boxplot:
plot(x = sim.df$g, y = sim.df$x)
ggplot
A very popular alternative to base R plots is the `ggplot2` library (the 2 in the name refers to the second iteration, which is the standard). This is a separate package (i.e. it is not a part of the base R environment) but is very widely used.
Wilkinson, L. (2005). The Grammar of Graphics 2nd Ed. Heidelberg: Springer. https://doi.org/10.1007/0-387-28695-0
Wickham, H. (2010). A Layered Grammar of Graphics. Journal of Computational and Graphical Statistics, 19(1), 3–28. https://doi.org/10.1198/jcgs.2009.07098
Graphs are broken into multiple layers
Layers can be recycled across multiple plots
ggplot
Let's recreate the previous scatter plot using `ggplot`:
library("ggplot2")
ggplot(data = sim.df, aes(x = x, y = y, col = g)) +
  # Add scatterplot
  geom_point() +
  # Change axes labels and plot title
  labs(x = "X variable", y = "Y variable", title = "Awesome plot title") +
  # Change default grey theme to black and white
  theme_bw()
ggplot
ggplot
One nice feature of `ggplot` is that it is very easy to create facet plots:
library("ggplot2")
ggplot(data = sim.df, aes(x = x, y = y, col = g)) +
  geom_point() +
  labs(x = "X variable", y = "Y variable", title = "Awesome plot title") +
  theme_bw() +
  # Separate plots by variable g
  facet_wrap(~ g)
ggplot
lm
Linear regression models in R are implemented using the `lm` function.
lm.fit <- lm(formula = y ~ x, data = sim.df)
The `formula` argument is the specification of the model, and the `data` argument is the data on which you would like the model to be estimated.
lm.fit
## 
## Call:
## lm(formula = y ~ x, data = sim.df)
## 
## Coefficients:
## (Intercept)            x  
##      0.4722       0.3269
lm
We can specify multivariate models:
lm.multi.fit <- lm(formula = y ~ x + z, data = sim.df)
Interaction models:
lm.inter.fit <- lm(formula = y ~ x * z, data = sim.df)
Note that the direct effects of `x` and `z` are also included when an interaction term is specified.
Fixed-effect models:
lm.fe.fit <- lm(formula = y ~ x + g, data = sim.df)
And many more!
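The `*` in the interaction specification is shorthand: `y ~ x * z` expands to `y ~ x + z + x:z`, which we can verify from the coefficient names. A sketch with simulated data (not the `sim.df` used elsewhere):

```r
set.seed(1)
d <- data.frame(x = rnorm(20), z = rnorm(20))
d$y <- d$x + d$z + rnorm(20)
fit <- lm(y ~ x * z, data = d)
names(coef(fit))  # "(Intercept)" "x" "z" "x:z"
```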
lm
The `lm` function returns a long list of useful output.
When we call the fitted object (e.g. `lm.fit`), we are presented only with the estimated coefficients.
For some more information on the estimated model, use `summary(fitted.model)`:
lm.fit.summary <- summary(lm.fit)
lm.fit.summary
lm
## 
## Call:
## lm(formula = y ~ x, data = sim.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2203 -0.6610 -0.0249  0.6988  2.8859 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.47217    0.03202  14.747   <2e-16 ***
## x            0.32686    0.03300   9.906   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.012 on 998 degrees of freedom
## Multiple R-squared:  0.08953, Adjusted R-squared:  0.08862 
## F-statistic: 98.14 on 1 and 998 DF,  p-value: < 2.2e-16
lm
As with any other function, `summary(fitted.model)` returns an object. Here, it is a list. What is saved as the output of this function?
names(lm.fit.summary)
##  [1] "call"          "terms"         "residuals"     "coefficients" 
##  [5] "aliased"       "sigma"         "df"            "r.squared"    
##  [9] "adj.r.squared" "fstatistic"    "cov.unscaled"
If we want to extract other information of interest from the fitted model object, we can use the `$` operator to do so:
lm.fit.summary$r.squared
## [1] 0.08952873
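The same `$` approach retrieves the full coefficient table, which `summary()` stores as a matrix. Sketched here with simulated data rather than `sim.df`:

```r
set.seed(1)
d <- data.frame(x = rnorm(50))
d$y <- 2 * d$x + rnorm(50)
fit.sum <- summary(lm(y ~ x, data = d))
fit.sum$coefficients                   # estimates, SEs, t- and p-values
fit.sum$coefficients["x", "Estimate"]  # one specific cell
```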
lm
Accessing elements from saved models can be very helpful in making comparisons across models.
Suppose we want to extract and compare \(R^2\) across different models.
lm.r2 <- summary(lm.fit)$r.squared
lm.multi.r2 <- summary(lm.multi.fit)$r.squared
lm.inter.r2 <- summary(lm.inter.fit)$r.squared
r2.compare <- data.frame(
  model = c("bivariate", "multivariate", "interaction"),
  r.squared = c(lm.r2, lm.multi.r2, lm.inter.r2)
)
lm
We can print the data frame containing values of \(R^2\):
r2.compare
##          model  r.squared
## 1    bivariate 0.08952873
## 2 multivariate 0.08976396
## 3  interaction 0.10851819
Or we can plot them:
ggplot(r2.compare, aes(x = model, y = r.squared)) +
  geom_point(size = 4) +
  # Use `expression` to add 2 as a superscript to R
  ggtitle(expression(paste(R^{2}, " ", "Comparison"))) +
  theme_bw()
lm
lm
diagnostics
There are a number of functions that are helpful in producing model diagnostics:
`residuals(fitted.model)` extracts the residuals from a fitted model
`coefficients(fitted.model)` extracts coefficients
`fitted(fitted.model)` extracts fitted values
`plot(fitted.model)` is a convenience function for producing a number of useful diagnostics plots
lm
residual plot
We can easily plot the residuals from a fitted model against an explanatory variable of interest.
Here we are going to exploit the modular nature of `ggplot`:
# First, let's create common lower layers
first.layer <- ggplot() +
  # Draw horizontal line with y-intercept of 0
  geom_hline(yintercept = 0) +
  labs(y = "residuals") +
  theme_bw()
lm
residual plot
# Now, we will create the first plot by adding a scatterplot of residuals for x
first.layer +
  geom_point(aes(x = sim.df$x, y = residuals(lm.multi.fit)),
             # Change colour to grey
             colour = "grey") +
  labs(x = "x")

# Create another scatterplot for z
first.layer +
  geom_point(aes(x = sim.df$z, y = residuals(lm.multi.fit)),
             colour = "grey") +
  labs(x = "z")
lm
residual plot
lm
residual plot
Alternatively, we can do that in base R:
par(mfrow = c(1, 2)) # Divide the plotting region into 1 row and 2 cols
plot(x = sim.df$x,
     y = residuals(lm.multi.fit),
     xlab = "x",
     ylab = "residuals",
     pch = 19,     # Filled circles
     col = "grey", # Change colour
     cex = 0.5)    # Make point size smaller
abline(h = 0) # Add a horizontal line with y-intercept of 0
plot(x = sim.df$z,
     y = residuals(lm.multi.fit),
     xlab = "z",
     ylab = "residuals",
     pch = 19,     # Filled circles
     col = "grey", # Change colour
     cex = 0.5)    # Make point size smaller
abline(h = 0) # Add a horizontal line with y-intercept of 0
lm
residual plot
glm
To estimate a range of generalized linear models, the `glm` function is particularly helpful.
First, let us transform our outcome variable from a continuous measure to a binary measure:
sim.df$y_bin <- as.numeric(sim.df$y > median(sim.df$y))
table(sim.df$y_bin)
## 
##   0   1 
## 500 500
glm
Now we can estimate our model:
logit.fit <- glm(formula = y_bin ~ x + z,
                 data = sim.df,
                 family = "binomial")
Where:
`formula` is the model specification
`data` is our simulated data
`family` is a description of the error distribution and link function to be used: `binomial`, `poisson`, `gaussian`, etc.
glm
summary(logit.fit)
## 
## Call:
## glm(formula = y_bin ~ x + z, family = "binomial", data = sim.df)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.80978  -1.12580  -0.01668   1.12790   1.72617  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.09573    0.13140  -0.729    0.466    
## x            0.55053    0.07159   7.690 1.47e-14 ***
## z            0.17674    0.22990   0.769    0.442    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1386.3  on 999  degrees of freedom
## Residual deviance: 1320.9  on 997  degrees of freedom
## AIC: 1326.9
## 
## Number of Fisher Scoring iterations: 4
glm
Ok, but no-one actually thinks in terms of log-odds, so let's translate that into something more meaningful:
logit.OR <- exp(cbind(OR = coef(logit.fit), confint(logit.fit)))
`coef` extracts the coefficients
`confint` extracts the confidence intervals
`cbind` binds the vectors together as separate columns
`exp` exponentiates the log-odds ratios
round(logit.OR, digits = 4)
##                 OR  2.5 % 97.5 %
## (Intercept) 0.9087 0.7021 1.1755
## x           1.7342 1.5100 1.9997
## z           1.1933 0.7606 1.8740
glm
Almost all of the convenience functions that we used for `lm` are also applicable to `glm` models:
summary(logit.fit)
plot(logit.fit)
residuals(logit.fit)
coefficients(logit.fit)
fitted(logit.fit)
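`predict()` also works for `glm` objects; with `type = "response"` it applies the inverse link function, returning predicted probabilities rather than log-odds. A sketch with simulated data (not the `sim.df` model above):

```r
set.seed(1)
d <- data.frame(x = rnorm(100))
d$y_bin <- rbinom(100, size = 1, prob = plogis(d$x))
fit <- glm(y_bin ~ x, data = d, family = "binomial")
p_hat <- predict(fit, newdata = data.frame(x = c(-1, 0, 1)),
                 type = "response")
p_hat  # three predicted probabilities, each between 0 and 1
```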
There are a number of external packages that can make fitting other model types (relatively) straightforward:
`lme4` - Linear, generalised linear, and nonlinear hierarchical models
`mgcv` - generalised additive models
`survival` - survival analysis
`glmnet` - lasso and elastic net regression models
`randomForest` - random forest models from machine learning
`rjags` and `rstan` - Bayesian models
To use a package that is not a part of the base R installation, use:
install.packages("lme4")
library("lme4")
predict
We can retrieve the fitted values from the model using `fitted()`, but we may be interested in calculating predicted values for arbitrary levels of our covariates.
sim_data <- data.frame(x = c(0, 1))
y_hat <- predict(object = lm.fit, newdata = sim_data)
y_hat
##         1         2 
## 0.4721722 0.7990340
Here, I am creating a `data.frame` with two observations of one variable (`x`).
I am then using the `predict` function, where
`object = lm.fit` tells R the model object for which prediction is desired
`newdata = sim_data` tells R the values for which I would like predictions
predict
We can use the same syntax to retrieve predictions for (marginally) more interesting models:
sim.data <- data.frame(x = c(0, 0, 1, 1), z = c(0, 1, 0, 1))
sim.data
##   x z
## 1 0 0
## 2 0 1
## 3 1 0
## 4 1 1
y.hat <- predict(lm.multi.fit, newdata = sim.data)
y.hat
##         1         2         3         4 
## 0.4439052 0.5009562 0.7707532 0.8278042
predict
This can be especially useful when trying to visualise interactive models:
sim.data.z0 <- data.frame(x = seq(from = -2, to = 2, by = 0.01), z = 0)
sim.data.z1 <- data.frame(x = seq(from = -2, to = 2, by = 0.01), z = 1)
y.hat.z0 <- predict(lm.inter.fit, newdata = sim.data.z0)
y.hat.z1 <- predict(lm.inter.fit, newdata = sim.data.z1)
`seq` generates a regular sequence `from` one value `to` another value `by` given increments
We predict across a range of `x`, but first setting `z` to 0, and then setting `z` to 1
predict
# Create a plot of the data
plot(sim.df$x, sim.df$y,
     cex = 0.5, pch = 19, col = "gray", bty = "n",
     xlab = "X", ylab = "Y",
     main = "Fitted values for sim_data")
# Add a prediction line for z = 0
lines(x = sim.data.z0$x, y = y.hat.z0, lwd = 2)
# Add a prediction line for z = 1
lines(x = sim.data.z1$x, y = y.hat.z1, lwd = 2, col = "red")
# Add a legend
legend("bottomright",
       legend = c("z = 0", "z = 1"),
       col = c("black", "red"),
       lty = 1)
predict
margins
?
Yes! Thomas Leeper has created a package that implements the functionality of the STATA `margins` command:
library("margins")
margins::cplot(lm.inter.fit, x = "x", dx = "z", what = "effect")
`cplot` is equivalent to `marginsplot` in STATA.
The specification above tells R that we want to plot the marginal effect of `z` on `y` across the range of `x`.
margins
?
objects() & ls() # Which objects are currently loaded in my working environment?
rm()             # Remove objects from my current environment
save()           # Save R object(s) to disk
is.na()          # Is this object a missing value? Or, which elements in this vector are missing?
rnorm()          # Generate random numbers from a normal distribution
runif()          # Generate random numbers from a uniform distribution
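A quick sketch combining a few of these (`set.seed` makes random draws reproducible):

```r
set.seed(42)
draws <- rnorm(5, mean = 0, sd = 1)  # five standard-normal draws
draws[2] <- NA                       # introduce a missing value
is.na(draws)                         # FALSE TRUE FALSE FALSE FALSE
mean(draws, na.rm = TRUE)            # mean of the four observed values
```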
library("readr")
library("dplyr")
library("tidyr")
library("data.table")
library("zoo")
library("lubridate")