Open RStudio and explore the programme. Make sure you can identify the console, the script editor, the environment window, the packages window and the help window!
Create an integer vector using the `c()` function, with each argument separated by a comma. So, a character vector of length three could be `c("a", "b", "c")`. Assign your integer vector using the assignment operator `<-` to an object with a name of your choosing.
my_vec <- c(1, 2, 3, 4, 5)
Multiply your vector by 3 and assign the result to a new object. Print the new vector, and try adding it to the original.
my_new_vec <- my_vec * 3
print(my_new_vec)
## [1] 3 6 9 12 15
print(my_new_vec + my_vec)
## [1] 4 8 12 16 20
Create a logical vector using the `c()` function. Make sure that you have a mix of `TRUE` and `FALSE` values in the vector. Use the logical vector to subset the numeric vector that you created in question 2 and print the result.
my_logical_vec <- c(T, T, T, F, F)
print(my_new_vec[my_logical_vec])
## [1] 3 6 9
Use numeric indexing to subset the first two elements of your vector into a new object called `my_short_vector`.
my_short_vector <- my_new_vec[c(1, 2)]
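Incidentally, you can get the same subset with negative indices, which drop elements rather than keep them. A minimal sketch, assuming `my_new_vec` from above:
# Drop elements 3 to 5, keeping the first two
my_first_two <- my_new_vec[-c(3, 4, 5)]
print(my_first_two)
## [1] 3 6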
This exercise relates to the `College` data set, which comes from An Introduction to Statistical Learning by James et al. (2013). It contains a number of variables for 777 different universities and colleges in the US.
The variables are:

- `Private`: Public/private indicator
- `Apps`: Number of applications received
- `Accept`: Number of applicants accepted
- `Enroll`: Number of new students enrolled
- `Top10perc`: New students from top 10% of high school class
- `Top25perc`: New students from top 25% of high school class
- `F.Undergrad`: Number of full-time undergraduates
- `P.Undergrad`: Number of part-time undergraduates
- `Outstate`: Out-of-state tuition
- `Room.Board`: Room and board costs
- `Books`: Estimated book costs
- `Personal`: Estimated personal spending
- `PhD`: Percent of faculty with Ph.D.'s
- `Terminal`: Percent of faculty with terminal degree
- `S.F.Ratio`: Student/faculty ratio
- `perc.alumni`: Percent of alumni who donate
- `Expend`: Instructional expenditure per student
- `Grad.Rate`: Graduation rate

You can either download the .csv file containing the data from the MY591 moodle page, or read the data in directly from the website.
Use the `read.csv()` function to read the data into R. Call the loaded data `college`. Make sure that you have the directory set to the correct location for the data. You can load this in R directly from the website, using:
college <- read.csv("http://www-bcf.usc.edu/~gareth/ISL/College.csv")
Or you can load it from a saved file, using:
college <- read.csv("path_to_my_file/College.csv")
Look at the data using the `View()` function. You should notice that the first column is just the name of each university. We don't really want R to treat this as data. However, it may be handy to have these names for later. Try the following commands:
rownames(college) <- college[, 1]
View(college)
You should see that there is now a `row.names` column with the name of each university recorded. This means that R has given each row a name corresponding to the appropriate university. R will not try to perform calculations on the row names. However, we still need to eliminate the first column in the data where the names are stored. Try
college <- college[, -1]
View(college)
Now you should see that the first data column is `Private`. Note that another column labeled `row.names` now appears before the `Private` column. However, this is not a data column, but rather the name that R is giving to each row.
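If you want to reassure yourself that the university names really were stored as row names, you can inspect the first few directly. A quick check, assuming the commands above ran successfully:
# The first three row names should be university names
head(rownames(college), 3)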
Use the `str()` function to look at the structure of the data. Which of the variables are numeric? Which are integer? Which are factors?
str(college)
## 'data.frame': 777 obs. of 18 variables:
## $ Private : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Apps : int 1660 2186 1428 417 193 587 353 1899 1038 582 ...
## $ Accept : int 1232 1924 1097 349 146 479 340 1720 839 498 ...
## $ Enroll : int 721 512 336 137 55 158 103 489 227 172 ...
## $ Top10perc : int 23 16 22 60 16 38 17 37 30 21 ...
## $ Top25perc : int 52 29 50 89 44 62 45 68 63 44 ...
## $ F.Undergrad: int 2885 2683 1036 510 249 678 416 1594 973 799 ...
## $ P.Undergrad: int 537 1227 99 63 869 41 230 32 306 78 ...
## $ Outstate : int 7440 12280 11250 12960 7560 13500 13290 13868 15595 10468 ...
## $ Room.Board : int 3300 6450 3750 5450 4120 3335 5720 4826 4400 3380 ...
## $ Books : int 450 750 400 450 800 500 500 450 300 660 ...
## $ Personal : int 2200 1500 1165 875 1500 675 1500 850 500 1800 ...
## $ PhD : int 70 29 53 92 76 67 90 89 79 40 ...
## $ Terminal : int 78 30 66 97 72 73 93 100 84 41 ...
## $ S.F.Ratio : num 18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
## $ perc.alumni: int 12 16 30 37 2 11 26 37 23 15 ...
## $ Expend : int 7041 10527 8735 19016 10922 9727 8861 11487 11644 8991 ...
## $ Grad.Rate : int 60 56 54 59 15 55 63 73 80 52 ...
Use the `summary()` function to produce a numerical summary of the variables in the data set.
summary(college)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate
## Min. : 10.00
## 1st Qu.: 53.00
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
What are the mean and standard deviation of the `Enroll` and `Top10perc` variables?
mean(college$Enroll)
## [1] 779.973
mean(college$Top10perc)
## [1] 27.55856
sd(college$Enroll)
## [1] 929.1762
sd(college$Top10perc)
## [1] 17.64036
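If you prefer to compute all four numbers in one step, `sapply()` can apply the same summary function over several columns at once. A sketch, assuming the `college` data from above:
# Means and SDs for both columns in a single call
sapply(college[, c("Enroll", "Top10perc")], function(x) c(mean = mean(x), sd = sd(x)))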
Now remove observations 15 through 85 from the data. What are the mean and standard deviation of the `Enroll` and `Top10perc` variables in the subset of the data that remains?
college <- college[-c(15:85), ]
mean(college$Enroll)
## [1] 784.8867
mean(college$Top10perc)
## [1] 27.51841
sd(college$Enroll)
## [1] 928.6599
sd(college$Top10perc)
## [1] 17.61432
What is the range of the `Books` variable?
range(college$Books)
## [1] 110 2340
Use the `pairs()` function to produce a scatterplot matrix of the first five columns or variables of the data. Recall that you can reference the first five columns of a matrix `A` using `A[, 1:5]`.
pairs(college[, 1:5])
Use the `plot()` function to produce a scatterplot of `S.F.Ratio` versus `Grad.Rate`. Give the axes informative labels.
plot(college$S.F.Ratio, college$Grad.Rate, xlab = "Student/Faculty Ratio", ylab = "Graduation Rate")
Experiment with making your plot more visually appealing. See `?plot` and `?par` for some ideas. If you are feeling very keen, try using `ggplot`, but you will need to load the ggplot library first: `library(ggplot2)`.
plot(college$S.F.Ratio, college$Grad.Rate,
xlab = "Student/Faculty Ratio", ylab = "Graduation Rate",
main = "A really nice plot",
pch = 19, col = "gray", bty = "n")
library(ggplot2)
ggplot(data = college, aes(x = S.F.Ratio, y = Grad.Rate, col = Private)) +
geom_point()+
xlab("Student/Faculty Ratio")+
ylab("Graduation Rate")+
ggtitle("A really nice plot")+
facet_grid(~Private)
This exercise involves the `auto` data set, available as `Auto.csv` from the MY591 website, or directly from http://www-bcf.usc.edu/~gareth/ISL/Auto.csv. Load this data into R. The data include characteristics of a number of different types of cars, recorded in the following variables:
- `mpg` = miles per gallon
- `cylinders` = number of cylinders
- `displacement` = engine displacement (I have no idea what this is)
- `horsepower` = number of horses powering the car (not really)
- `weight` = weight of the car in lbs
- `acceleration` = time in seconds for the car to go from 0-60mph
- `year` = year of manufacture
- `origin` = country of manufacture
- `name` = name of car

Unfortunately, the people who made this data decided to code their missing values with a `?`, which is an awful thing to do. When reading the data in from the `.csv` file, use `na.strings = "?"` to convert them to `NA`s. Then exclude all the `NA`s from the data.
auto <- read.csv("http://www-bcf.usc.edu/~gareth/ISL/Auto.csv", na.strings = "?")
auto <- na.omit(auto)
Note: Sometimes when you load a dataset, a qualitative variable might have a numeric value. For instance, the `origin` variable is qualitative, but has integer values of 1, 2, 3. From mysterious sources (Googling), we know that this variable is coded `1 = usa; 2 = europe; 3 = japan`. So we can convert it into a factor, using:
auto$originf <- factor(auto$origin, labels = c("usa", "europe", "japan"))
str(auto)
## 'data.frame': 392 obs. of 10 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
## $ originf : Factor w/ 3 levels "usa","europe",..: 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "na.action")= 'omit' Named int 33 127 331 337 355
## ..- attr(*, "names")= chr "33" "127" "331" "337" ...
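If you want to double-check that the factor labels line up with the original integer codes, a cross-tabulation makes any mismatch obvious. A quick sanity check, assuming the recoding above:
# Each integer code should map to exactly one label
table(auto$origin, auto$originf)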
Which of the predictors are quantitative, and which are qualitative?
Quantitative: mpg, cylinders, displacement, horsepower, weight, acceleration, year. Qualitative: name, origin, originf.
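One quick way to check this is to ask R for the class of every column. A minimal sketch using base R:
# Columns classed integer/numeric are quantitative; factors are qualitative
sapply(auto, class)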
Investigate the relationships between the variables graphically. The `pairs()` function is a good starting point; here we exclude the qualitative columns.
qualitative_columns <- c(8, 9, 10) # origin, name, originf
pairs(auto[, -qualitative_columns])
# Heavier weight correlates with lower mpg.
plot(auto$weight, auto$mpg)
# More cylinders, fewer mpg.
plot(auto$cylinders, auto$mpg)
# Cars become more efficient over time.
plot(auto$year, auto$mpg)
# Let's plot mpg vs. region of origin
plot(auto$originf, auto$mpg, ylab = "mpg")
Use the `lm()` function to perform a multiple linear regression with `mpg` as the response and all other variables except `name` as the predictors (make sure you use the factor version of the `origin` variable that we created in part a). Use the `summary()` function to print the results. Comment on the output.
lm.fit1 <- lm(mpg ~ cylinders + displacement + horsepower + weight +
                acceleration + year + originf, data = auto)
summary(lm.fit1)
##
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + weight +
## acceleration + year + originf, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.0095 -2.0785 -0.0982 1.9856 13.3608
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.795e+01 4.677e+00 -3.839 0.000145 ***
## cylinders -4.897e-01 3.212e-01 -1.524 0.128215
## displacement 2.398e-02 7.653e-03 3.133 0.001863 **
## horsepower -1.818e-02 1.371e-02 -1.326 0.185488
## weight -6.710e-03 6.551e-04 -10.243 < 2e-16 ***
## acceleration 7.910e-02 9.822e-02 0.805 0.421101
## year 7.770e-01 5.178e-02 15.005 < 2e-16 ***
## originfeurope 2.630e+00 5.664e-01 4.643 4.72e-06 ***
## originfjapan 2.853e+00 5.527e-01 5.162 3.93e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.307 on 383 degrees of freedom
## Multiple R-squared: 0.8242, Adjusted R-squared: 0.8205
## F-statistic: 224.5 on 8 and 383 DF, p-value: < 2.2e-16
i. Is there a relationship between the predictors and the response?
Yes. We can test the null hypothesis that all of the regression coefficients are zero using the F-statistic reported at the bottom of the summary. Here the F-statistic is far larger than 1, with a very small p-value, which is strong evidence against the null hypothesis of no relationship.
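If you want to see where that p-value comes from, you can recover it from the F-statistic stored in the model summary. A sketch, assuming `lm.fit1` from above:
# summary()$fstatistic holds the F value and its two degrees of freedom
fstat <- summary(lm.fit1)$fstatistic
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)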
ii. Which predictors appear to have a statistically significant relationship to the response?
Looking at the p-values associated with each predictor’s t-statistic, we see that displacement, weight, year, and origin have a statistically significant relationship, while cylinders, horsepower, and acceleration do not.
iii. What does the coefficient for the `year` variable suggest?
The regression coefficient for `year`, 0.7770269, suggests that, holding the other predictors constant, average mpg increases by about 0.78 with each additional year. In other words, cars become more fuel efficient over time, by almost 1 mpg per year.
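You can extract this coefficient directly rather than reading it off the printed table:
coef(lm.fit1)["year"]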
iv. What is the R-squared for this model? Access the r-squared value from the summary of the model object, and save it as its own object.
fit1r2 <- summary(lm.fit1)$r.squared
Use the `plot()` function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
par(mfrow = c(2, 2))
plot(lm.fit1)
There seems to be a non-linear pattern in the residuals, so a linear model may not be the best fit. From the leverage plot, observation 14 appears to have high leverage, although not a high-magnitude residual.
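To confirm which observation has the largest leverage, rather than eyeballing the plot, you can query the hat values directly. A quick check, assuming `lm.fit1` from above:
# Index of the observation with the highest leverage
which.max(hatvalues(lm.fit1))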
Use the `predict()` function to give a numerical prediction for each of these car types (defined in the `sim_example` data frames below). Base your estimates on the following model:
lm(mpg ~ cylinders + weight + acceleration + originf, data=auto)
lm.fit3 <- lm(mpg ~ cylinders + weight + acceleration + originf, data=auto)
sim_example_one <- data.frame(cylinders = 6,
acceleration = 12,
weight = 4000,
originf = "europe"
)
sim_example_two <- data.frame(cylinders = 4,
acceleration = 20,
weight = 2000,
originf = "japan"
)
predict(lm.fit3, sim_example_one)
## 1
## 16.25391
predict(lm.fit3, sim_example_two)
## 1
## 32.76226
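Note that `predict()` can also quantify the uncertainty around these point predictions via its `interval` argument. For example, a prediction interval for the first hypothetical car:
predict(lm.fit3, sim_example_one, interval = "prediction")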
Use the `predict()` function to further explore the fitted values from your model. You may want to use `summary()` again to work out some reasonable values of your covariates, such that you are not extrapolating wildly away from the data. You will need to specify values for each of the variables that you included in your model. For example:
sim_data_low_weight <- data.frame(displacement = seq(from = 68, to = 455, by = 1),
weight = 2000,
cylinders = mean(auto$cylinders),
horsepower = mean(auto$horsepower),
acceleration = mean(auto$acceleration),
year = mean(auto$year),
originf = "europe"
)
range(auto$displacement)
## [1] 68 455
range(auto$weight)
## [1] 1613 5140
sim_data_high_weight <- data.frame(displacement = seq(from = 68, to = 455, by = 1),
weight = 5000,
cylinders = mean(auto$cylinders),
horsepower = mean(auto$horsepower),
acceleration = mean(auto$acceleration),
year = mean(auto$year),
originf = "europe"
)
y_hat_low_weight <- predict(lm.fit1, newdata = sim_data_low_weight)
y_hat_high_weight <- predict(lm.fit1, newdata = sim_data_high_weight)
# Create a plot of the data
plot(auto$displacement, auto$mpg, cex = 0.5, pch = 19,
col = "gray", bty = "n",
main = "Fitted values for displacement and weight")
# Add a prediction line for z = 0
lines(x = sim_data_low_weight$displacement, y = y_hat_low_weight, lwd = 2)
# Add a prediction line for z = 1
lines(x = sim_data_high_weight$displacement, y = y_hat_high_weight, lwd = 2, col = "red")
# Add a legend
legend("topright", legend = c("weight = 2000", "weight = 5000"), col = c("black", "red"), lty = 1)
Interestingly, the fitted values reveal that while the relationship between displacement and mpg in the raw data is negative, once we condition on the other variables in the model, the coefficient is positive.
Use the `*` operator to fit linear regression models with interaction effects (you can choose which variables to interact - you almost certainly know more about the determinants of car efficiency than I do). Do any interactions appear to be statistically significant?
lm.fit2 <- lm(mpg ~ cylinders * displacement + displacement * weight, data=auto)
summary(lm.fit2)
##
## Call:
## lm(formula = mpg ~ cylinders * displacement + displacement *
## weight, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.2934 -2.5184 -0.3476 1.8399 17.7723
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.262e+01 2.237e+00 23.519 < 2e-16 ***
## cylinders 7.606e-01 7.669e-01 0.992 0.322
## displacement -7.351e-02 1.669e-02 -4.403 1.38e-05 ***
## weight -9.888e-03 1.329e-03 -7.438 6.69e-13 ***
## cylinders:displacement -2.986e-03 3.426e-03 -0.872 0.384
## displacement:weight 2.128e-05 5.002e-06 4.254 2.64e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.103 on 386 degrees of freedom
## Multiple R-squared: 0.7272, Adjusted R-squared: 0.7237
## F-statistic: 205.8 on 5 and 386 DF, p-value: < 2.2e-16
The interaction between displacement and weight is statistically significant, while the interaction between cylinders and displacement is not.
Use the `predict()` function to create fitted values from the interaction model. Use the fitted values to illustrate the conditional effect of one of your variables of interest.
# We can re-use the simulated data that we created last time and just project the interaction model onto it!
y_hat_low_weight <- predict(lm.fit2, newdata = sim_data_low_weight)
y_hat_high_weight <- predict(lm.fit2, newdata = sim_data_high_weight)
# Create a plot of the data
plot(auto$displacement, auto$mpg, cex = 0.5, pch = 19,
col = "gray", bty = "n",
main = "Conditional effect of displacement on mpg")
# Add a prediction line for z = 0
lines(x = sim_data_low_weight$displacement, y = y_hat_low_weight, lwd = 2)
# Add a prediction line for z = 1
lines(x = sim_data_high_weight$displacement, y = y_hat_high_weight, lwd = 2, col = "red")
# Add a legend
legend("topright", legend = c("weight = 2000", "weight = 5000"), col = c("black", "red"), lty = 1)
Extract the R-squared from the interaction model as well, so that the two models can be compared:
fit2r2 <- summary(lm.fit2)$r.squared
print(fit1r2)
## [1] 0.8241995
print(fit2r2)
## [1] 0.7272237
The interaction model has a noticeably lower R-squared than the full additive model, although it also uses far fewer predictors, so the comparison is not like-for-like.
Using the `median()` function, create a new variable which is equal to 1 (or, `TRUE`) when `mpg` is above the median value, and 0 (or, `FALSE`) otherwise.
auto$mpg_bin <- as.numeric(auto$mpg > median(auto$mpg))
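Since we split at the median, the two groups should be roughly equal in size (exactly equal unless there are ties at the median). A quick check:
table(auto$mpg_bin)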
Fit a logistic regression using `cylinders`, `weight`, `acceleration` and `origin` as predictors of the binary `mpg` variable you just created (again, use the factor version of `origin`). The `glm()` function will be useful, and you should set the `family` argument to be equal to `"binomial"`.
glm.fit1 <- glm(mpg_bin ~ cylinders + weight +
                  acceleration + originf, data = auto, family = "binomial")
summary(glm.fit1)
##
## Call:
## glm(formula = mpg_bin ~ cylinders + weight + acceleration + originf,
## family = "binomial", data = auto)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.50956 -0.25737 0.04781 0.39649 3.03403
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 8.7921523 1.6940903 5.190 2.1e-07 ***
## cylinders -0.4137897 0.2395222 -1.728 0.0841 .
## weight -0.0033442 0.0006335 -5.279 1.3e-07 ***
## acceleration 0.1705619 0.0846498 2.015 0.0439 *
## originfeurope 0.2036643 0.4701660 0.433 0.6649
## originfjapan 0.3487730 0.5090198 0.685 0.4932
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 543.43 on 391 degrees of freedom
## Residual deviance: 214.43 on 386 degrees of freedom
## AIC: 226.43
##
## Number of Fisher Scoring iterations: 7
Convert the coefficients (`coef()`) and their associated confidence intervals (`confint()`) into odds-ratios (`exp()`). Create a `data.frame` of these values and print it.
odds.ratios <- exp(cbind(OR = coef(glm.fit1), confint(glm.fit1)))
print(odds.ratios)
## OR 2.5 % 97.5 %
## (Intercept) 6582.3844043 268.5175349 2.114970e+05
## cylinders 0.6611400 0.4085758 1.048035e+00
## weight 0.9966614 0.9953509 9.978397e-01
## acceleration 1.1859710 1.0085434 1.406990e+00
## originfeurope 1.2258865 0.4905699 3.134691e+00
## originfjapan 1.4173275 0.5288169 3.958253e+00
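Odds-ratios are still not the most intuitive scale, so it can help to translate the model onto the probability scale for a concrete car. A sketch, where the covariate values are purely hypothetical (chosen to sit comfortably within the range of the data):
# Hypothetical car: values chosen for illustration only
hypothetical_car <- data.frame(cylinders = 4, weight = 2500,
                               acceleration = 15, originf = "japan")
# Predicted probability that mpg is above the median
predict(glm.fit1, newdata = hypothetical_car, type = "response")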