15/05/2019
You can find the
It is worth noting that both the assignment and the slides were created using R and RStudio.
Because:
On the other hand:
More discussion on this can be found here.
1 + 1
## [1] 2
5 - 3
## [1] 2
6 / 2
## [1] 3
4 * 4
## [1] 16
sum(<numbers>)    # Sum
mean(<numbers>)   # Mean
median(<numbers>) # Median
sd(<numbers>)     # Standard deviation
log(<number>)     # Natural logarithm
exp(<number>)     # Exponential
sqrt(<number>)    # Square root
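As a quick illustration of the functions above, here is a minimal sketch on a small made-up vector (the name numbers and its values are assumptions chosen only for this example):

```r
# A small made-up vector, used only for illustration
numbers <- c(2, 4, 4, 6)

sum(numbers)     # 16
mean(numbers)    # 4
median(numbers)  # 4
sd(numbers)      # sample standard deviation, sqrt(8 / 3), about 1.63
log(exp(1))      # 1, because log() is the natural logarithm by default
sqrt(16)         # 4
```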
R also understands comparison and logical operators:
<: less than
>: greater than
==: equal to (note, not =)
>=: greater than or equal to
<=: less than or equal to
!=: not equal to
&: and
|: or
objects
<-
x1 <- 10
print(x1)
## [1] 10
x2 <- 4
x3 <- x1 - x2
print(x3)
## [1] 6
vector - the basic building block of R.
A vector is a collection of elements which all share the same type.
We use the c() function to concatenate observations into a single vector.
num_vec <- c(150, 178, 67.7, 905, 12)
print(num_vec)
## [1] 150.0 178.0 67.7 905.0 12.0
num_vec
## [1] 150.0 178.0 67.7 905.0 12.0
vector - the basic building block of R.
There are no scalars (atomic units) in R.
A scalar is just a vector of length 1.
char_vec <- c("apple", "pear", "plum", "pineapple", "strawberry")
length(char_vec)
## [1] 5
char_scalar <- "banana"
length(char_scalar)
## [1] 1
vector - the basic building block of R.
When creating new objects, keep in mind reserved words and try to avoid using them in names. TRUE and FALSE are reserved; T and F are only default shortcuts that can be overwritten, so the full forms are safer.
log_vec <- c(FALSE, FALSE, FALSE, TRUE, TRUE)
log_vec
## [1] FALSE FALSE FALSE TRUE TRUE
log_vec2 <- c(F, F, F, T, T)
log_vec2
## [1] FALSE FALSE FALSE TRUE TRUE
vector - the basic building block of R.
A factor is similar to a character vector, but here each unique element is associated with a numerical value which represents a category:
fac_vec <- as.factor(c("a", "b", "c", "b", "b", "c"))
fac_vec
## [1] a b c b b c
## Levels: a b c
as.numeric(fac_vec)
## [1] 1 2 3 2 2 3
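To make the link between the category labels and the underlying integer codes more concrete, here is a small sketch using the same fac_vec. It only relies on base R functions (levels() and table()):

```r
fac_vec <- as.factor(c("a", "b", "c", "b", "b", "c"))

levels(fac_vec)                       # the categories: "a" "b" "c"
table(fac_vec)                        # counts per category: a = 1, b = 3, c = 2
codes <- as.numeric(fac_vec)          # the integer codes: 1 2 3 2 2 3
levels(fac_vec)[codes]                # maps the codes back to the labels
```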
vector - the basic building block of R.
A date is similar to a numeric vector, as it stores the number of days since 1970-01-01, but it can be displayed in a variety of formats:
date_vec <- as.Date(c("2001/07/06", "2005/05/05", "2010/06/05", "2015/07/05", "2017/08/06"))
date_vec
## [1] "2001-07-06" "2005-05-05" "2010-06-05" "2015-07-05" "2017-08-06"
as.numeric(date_vec)
## [1] 11509 12908 14765 16621 17384
format(date_vec, "%d/%m/%Y")
## [1] "06/07/2001" "05/05/2005" "05/06/2010" "05/07/2015" "06/08/2017"
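Because dates are stored as day counts, ordinary arithmetic works on them. The sketch below uses the first two dates from date_vec; it is only an illustration of the idea, not an exhaustive tour of date handling in R:

```r
date_vec <- as.Date(c("2001/07/06", "2005/05/05"))

gap <- date_vec[2] - date_vec[1]   # a "difftime" object measured in days
as.numeric(gap)                    # 1399, i.e. 12908 - 11509
date_vec[1] + 30                   # thirty days later: "2001-08-05"
format(date_vec, "%Y")             # just the years: "2001" "2005"
```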
vector subsetting
To subset a vector, use square brackets to index the elements you would like via object[index]:
char_vec
## [1] "apple" "pear" "plum" "pineapple" "strawberry"
char_vec[1:3]
## [1] "apple" "pear" "plum"
vector subsetting
Instead of passing indices, it is also possible to use a logical vector to specify the required elements:
log_vec
## [1] FALSE FALSE FALSE TRUE TRUE
char_vec[log_vec]
## [1] "pineapple" "strawberry"
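In practice the logical vector is usually computed from the data rather than typed by hand. A minimal sketch, reusing num_vec and char_vec from earlier (the name heavy and the threshold of 100 are assumptions made for this example):

```r
num_vec  <- c(150, 178, 67.7, 905, 12)
char_vec <- c("apple", "pear", "plum", "pineapple", "strawberry")

heavy <- num_vec > 100     # TRUE TRUE FALSE TRUE FALSE
char_vec[heavy]            # "apple" "pear" "pineapple"
char_vec[num_vec > 100]    # the same, without the intermediate object
which(heavy)               # positions of the TRUE values: 1 2 4
```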
vector operations
In R, mathematical operations on vectors occur elementwise:
fib <- c(1, 1, 2, 3, 5, 8, 13, 21)
fib[1:7]
## [1] 1 1 2 3 5 8 13
fib[2:8]
## [1] 1 2 3 5 8 13 21
fib[1:7] + fib[2:8]
## [1] 2 3 5 8 13 21 34
vector operations
It is also possible to perform logical operations on vectors:
fib <- c(1, 1, 2, 3, 5, 8, 13, 21)
fib_gt_5 <- fib > 5
fib_gt_5
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
We can also combine logical operators:
fib_gt_5_or_ls_2 <- fib > 5 | fib < 2
fib_gt_5_or_ls_2
## [1] TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE
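Logical vectors like these are most useful when fed back into subsetting or counting. A short sketch on the same fib vector:

```r
fib <- c(1, 1, 2, 3, 5, 8, 13, 21)

fib[fib > 5 | fib < 2]    # keep elements matching either condition: 1 1 8 13 21
fib[fib > 1 & fib < 10]   # keep elements matching both conditions: 2 3 5 8
sum(fib > 5)              # TRUE counts as 1, so this counts matches: 3
```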
matrix
A matrix is just a collection of vectors! I.e. it is a 2-dimensional vector that has 2 additional attributes: the number of rows and columns.
As for vectors, all elements of a matrix must be of the same type.
mat <- matrix(data = 1:25, nrow = 5, ncol = 5)
mat
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    6   11   16   21
## [2,]    2    7   12   17   22
## [3,]    3    8   13   18   23
## [4,]    4    9   14   19   24
## [5,]    5   10   15   20   25
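Note that matrix() fills the matrix column by column unless told otherwise. The sketch below contrasts the default with byrow = TRUE on a small example (the names mat_col and mat_row are made up for illustration):

```r
mat_col <- matrix(1:6, nrow = 2, ncol = 3)                # filled column by column
mat_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)  # filled row by row

dim(mat_col)    # number of rows and columns: 2 3
mat_col[1, ]    # first row: 1 3 5
mat_row[1, ]    # first row: 1 2 3
```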
list
A list is a collection of R objects of different types
l <- list(element1 = num_vec, element2 = mat[1:2,1:2], element3 = "Farmers' Market")
l
## $element1
## [1] 150.0 178.0 67.7 905.0 12.0
##
## $element2
##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
##
## $element3
## [1] "Farmers' Market"
list subsetting
Individual elements of a list object can be subset using $ or [[ operators:
l$element2
##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
l[[2]]
##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
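A common stumbling block is the difference between single and double square brackets on a list: [ returns a smaller list, whereas [[ returns the element itself. A minimal sketch with a cut-down version of the list above:

```r
l <- list(element1 = c(150, 178), element3 = "Farmers' Market")

class(l["element1"])     # "list": single brackets keep the list wrapper
class(l[["element1"]])   # "numeric": double brackets extract the element
l[["element1"]][2]       # second value of the extracted vector: 178
l$element3               # "Farmers' Market"
```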
data.frame - the workhorse of data analysis.
A data.frame is an R object in which the columns can be of different types.
Despite their matrix-like appearance, data frames in R are lists, where all elements are of equal length!
df <- data.frame(weight = num_vec, fruit = char_vec, berry = log_vec)
df
##   weight      fruit berry
## 1  150.0      apple FALSE
## 2  178.0       pear FALSE
## 3   67.7       plum FALSE
## 4  905.0  pineapple  TRUE
## 5   12.0 strawberry  TRUE
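The claim that a data frame is really a list of equal-length columns can be checked directly. A small sketch, rebuilding df so that the block runs on its own:

```r
df <- data.frame(weight = c(150, 178, 67.7, 905, 12),
                 fruit  = c("apple", "pear", "plum", "pineapple", "strawberry"),
                 berry  = c(FALSE, FALSE, FALSE, TRUE, TRUE))

is.list(df)          # TRUE: a data frame is a list underneath
length(df)           # 3: one list element per column
sapply(df, length)   # every column has the same length: 5 5 5
df[["weight"]]       # list-style access returns the column as a vector
```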
matrix and data.frame subsetting
To subset a matrix or data.frame, you need to specify both rows and columns:
mat[1:3, 1:3]
##      [,1] [,2] [,3]
## [1,]    1    6   11
## [2,]    2    7   12
## [3,]    3    8   13
df[1, ]
##   weight fruit berry
## 1    150 apple FALSE
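Rows and columns can also be selected with logical conditions and column names, not only with positions. The following is a sketch on the same fruit data (the threshold of 100 is an arbitrary choice for illustration):

```r
df <- data.frame(weight = c(150, 178, 67.7, 905, 12),
                 fruit  = c("apple", "pear", "plum", "pineapple", "strawberry"),
                 berry  = c(FALSE, FALSE, FALSE, TRUE, TRUE))

heavy_df <- df[df$weight > 100, ]                # rows where weight exceeds 100, all columns
berry_df <- df[df$berry, c("fruit", "weight")]   # berry rows, two named columns
heavy_df
berry_df
```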
matrix and data.frame subsetting
We can also subset to remove rows or columns by using the - operator applied to the c function:
mat[-c(1:3), -c(1:3)]
##      [,1] [,2]
## [1,]   19   24
## [2,]   20   25
In the code, 1:3 creates a vector of the integers 1, 2 and 3, and the - operator negates these. We wrap the vector in the c function so that - applies to each element, and not just the first element.
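The reason for the wrapping is operator precedence: the unary minus is evaluated before the : operator. A short sketch comparing the variants (plain parentheses work just as well as c() here):

```r
-1:3       # read as (-1):3, giving -1 0 1 2 3
-(1:3)     # negates every element: -1 -2 -3
-c(1:3)    # same result as above: -1 -2 -3

mat <- matrix(data = 1:25, nrow = 5, ncol = 5)
same <- identical(mat[-(1:3), -(1:3)], mat[-c(1:3), -c(1:3)])
same       # TRUE: both forms drop the first three rows and columns
```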
functions - the backbone of R operations.
All operations on objects in R are done with functions.
fun(arg1, arg2, ...)
Where
fun is the name of the function
arg1 is the first argument passed to the function
arg2 is the second argument passed to the function
function example
Let's consider the mean() function. This function takes two main arguments:
mean(x, na.rm = FALSE)
Where x is a numeric vector, and na.rm is a logical value that indicates whether we'd like to remove missing values (NA).
vec <- c(1, 2, 3, 4, 5)
mean(x = vec, na.rm = FALSE)
## [1] 3
vec <- c(1, 2, 3, NA, 5)
mean(x = vec, na.rm = TRUE)
## [1] 2.75
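It may help to see what happens when na.rm is left at its default. In the sketch below, a single NA makes the whole result NA, and removing the missing value by hand gives the same answer as na.rm = TRUE:

```r
vec <- c(1, 2, 3, NA, 5)

mean(vec)                  # NA: missing values propagate by default
mean(vec, na.rm = TRUE)    # 2.75: the NA is dropped before averaging
mean(vec[!is.na(vec)])     # 2.75: the same, done manually with subsetting
sum(is.na(vec))            # number of missing values: 1
```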
function example
We can also perform calculations on the output of a function:
mean(num_vec) * 3
## [1] 787.62
Which means that we can also have nested functions:
log(mean(num_vec))
## [1] 5.570403
We can also assign the output of any function to a new object for use later:
log_fruit <- log(mean(num_vec)) # The logarithm of average fruit weight
functions in disguise!
All the basic operations on R objects that we have encountered so far are, in fact, functions:
`+`(1,1)
## [1] 2
`-`(5,3)
## [1] 2
`[`(char_vec, 1:3)
## [1] "apple" "pear" "plum"
functions
Functions are also objects, and we can create our own. We define a function as follows:
sum2 <- function(a, b){
  return(a + b)
}
res <- sum2(a = 5, b = 50)
res
## [1] 55
Note that the function itself, and the result returned by the function are both objects!
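Since a function is an object, it can be inspected, copied under another name, or handed to another function. A minimal sketch using sum2 (the name add is made up for this example):

```r
sum2 <- function(a, b){
  return(a + b)
}

class(sum2)                        # "function"
add <- sum2                        # copy the function object under a new name
add(1, 2)                          # 3
sapply(c(1, 2, 3), sum2, b = 10)   # apply sum2 to each element: 11 12 13
```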
R has an inbuilt help facility which provides more information about any function:
?summary
help(summary)
The quality of the help varies hugely across packages.
Stack Overflow is a good resource for many standard tasks.
For custom packages it is often helpful to check the issues page on GitHub.
E.g. for ggplot2: https://github.com/tidyverse/ggplot2/issues
Or, indeed, any search engine #LMDDGTFY.
sim.df <- read.csv(file = "file.csv")
sim.df is an R data.frame object (you could call this anything)
file.csv is a .csv file with your data
<- is the assignment operator
To read file.csv this way, it will have to be saved in your current working directory
See getwd() and setwd().
sim.df <- read.csv(file = "./data/file.csv")
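To try this out without an existing data file, the sketch below writes a tiny made-up data frame to a temporary folder and reads it back. The folder, the file contents and the name csv_path are all assumptions for illustration only:

```r
# Build a path inside R's temporary directory (not your working directory)
csv_path <- file.path(tempdir(), "file.csv")

# Write a tiny made-up data set so that there is something to read
write.csv(data.frame(x = c(1, 2, 3), g = c("a", "b", "c")),
          file = csv_path, row.names = FALSE)

getwd()                 # where R looks for relative paths such as "file.csv"
file.exists(csv_path)   # TRUE: check that the file is where we think it is
sim.df <- read.csv(file = csv_path)
dim(sim.df)             # 3 rows, 2 columns
```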
install.packages("haven") # Install the "haven" package
library("haven") # Load the package, which deals with STATA and SPSS files
sim.df <- haven::read_dta(file = "./data/file.dta")
sim.df <- haven::read_sav(file = "./data/file.sav")
The haven package is not automatically installed in R so we need to do that ourselves
sim.df is still an R data.frame object
file.dta is a .dta (used by STATA) file with your data
file.sav is a .sav (used by SPSS) file with your data
<- is the usual assignment operator
functions
R includes a number of helpful functions for exploring your data:
View(sim.df)         # View data in spreadsheet-style
names(sim.df)        # Names of the variables in your data
head(sim.df)         # First n (six, by default) rows of the data
tail(sim.df)         # Last n (six, by default) rows of the data
str(sim.df)          # "Structure" of any R object
summary(sim.df)      # Summary statistics for each of the columns in the data
dim(sim.df)          # Dimensions of data
mean(sim.df$var)     # Mean of a numeric vector
sd(sim.df$var)       # Standard deviation of a numeric vector
range(sim.df$var)    # Range of a numeric vector
quantile(sim.df$var) # Quantiles of a numeric vector
head
By varying argument n, we can adjust the number of top rows we want to inspect:
head(sim.df, n = 5)
##            x         y          z g
## 1  1.8068729 1.1338624 0.08688987 d
## 2 -1.2486526 0.5960640 0.27517432 e
## 3 -0.1355884 0.8004329 0.70459357 a
## 4  0.9505454 2.2648623 0.61059223 a
## 5 -1.2711097 0.8887924 0.88313813 c
Can also be used in combination with View():
View(head(sim.df, n = 50))
str
str() applied to data frames is particularly useful in inferring types of variables in the data
str(sim.df)
## 'data.frame': 1000 obs. of 4 variables:
##  $ x: num 1.807 -1.249 -0.136 0.951 -1.271 ...
##  $ y: num 1.134 0.596 0.8 2.265 0.889 ...
##  $ z: num 0.0869 0.2752 0.7046 0.6106 0.8831 ...
##  $ g: Factor w/ 6 levels "a","b","c","d",..: 4 5 1 1 3 6 1 3 6 5 ...
summary
summary(sim.df)
##        x                    y                 z                 g
##  Min.   :-2.7467901   Min.   :-3.2166   Min.   :0.0007827   a:177
##  1st Qu.:-0.6505946   1st Qu.:-0.2343   1st Qu.:0.2568557   b:161
##  Median : 0.0003546   Median : 0.4703   Median :0.4932571   c:173
##  Mean   : 0.0161324   Mean   : 0.4774   Mean   :0.4954730   d:141
##  3rd Qu.: 0.6626705   3rd Qu.: 1.1694   3rd Qu.:0.7341048   e:177
##  Max.   : 3.0778088   Max.   : 3.5594   Max.   :0.9994499   f:171
pairs
pairs() produces a matrix of bivariate scatterplots
pairs(sim.df)
range
Suppose you would like to find the range of a variable in your data.frame. The following will not work:
range(sim.df)
## Error in FUN(X[[i]], ...): only defined on a data frame with all numeric variables
Right, we cannot take the range of the entire data! Some of the variables are not numeric, and therefore do not have a range…
range
Instead, we need to access one variable. There are three ways to do this.
Use matrix-style [row, column] indexing and access the variable of interest by position,
Use the [[ operator and access the variable of interest by position, or
Use the $ operator and access the variable of interest by name
Note that the last two treat data.frame as a list with variables as elements
range(sim.df[,1])
range(sim.df[[1]])
range(sim.df$x)
## [1] -2.746790 3.077809
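If the range of every numeric variable is wanted at once, one option is to pick out the numeric columns first and then apply range() to each. The sketch below simulates a small stand-in for sim.df, since the real data file is not available here; the simulated values, the seed and the name is_num are assumptions for illustration:

```r
set.seed(1)  # make the simulated stand-in reproducible

# A small stand-in for sim.df with three numeric columns and one factor
sim.df <- data.frame(x = rnorm(20),
                     y = rnorm(20),
                     z = runif(20),
                     g = factor(sample(letters[1:6], 20, replace = TRUE)))

is_num <- sapply(sim.df, is.numeric)        # TRUE for x, y, z; FALSE for g
ranges <- sapply(sim.df[is_num], range)     # one column of (min, max) per variable
ranges
```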
ggplot
The basic plotting syntax is very simple. plot(x_var, y_var) will give you a scatter plot:
plot(sim.df$x, sim.df$y)
Hmm, let's work on that.
The plot function takes a number of arguments (?plot for a full list). The fewer you specify, the uglier your plot:
plot(x = sim.df$x, y = sim.df$y,
     xlab = "X variable", ylab = "Y variable",
     main = "Awesome plot title",
     pch = 19,       # Solid points
     cex = 0.5,      # Smaller points
     bty = "n",      # Remove surrounding box
     col = sim.df$g  # Colour by grouping variable
)
The default behaviour of plot() depends on the type of input variables for the x and y arguments. If x is a factor variable, and y is numeric, then R will produce a boxplot:
plot(x = sim.df$g, y = sim.df$x)
ggplot
A very popular alternative to base R plots is the ggplot2 library (the 2 in the name refers to the second iteration, which is the standard). This is a separate package (i.e. it is not a part of the base R environment) but is very widely used.
Wilkinson, L. (2005). The Grammar of Graphics 2nd Ed. Heidelberg: Springer. https://doi.org/10.1007/0-387-28695-0
Wickham, H. (2010). A Layered Grammar of Graphics. Journal of Computational and Graphical Statistics, 19(1), 3–28. https://doi.org/10.1198/jcgs.2009.07098
Graphs are broken into multiple layers
Layers can be recycled across multiple plots
ggplot
Let's recreate the previous scatter plot using ggplot:
library("ggplot2")
ggplot(data = sim.df, aes(x = x, y = y, col = g)) +
  # Add scatterplot
  geom_point() +
  # Change axes labels and plot title
  labs(x = "X variable", y = "Y variable", title = "Awesome plot title") +
  # Change default grey theme to black and white
  theme_bw()
ggplot