Open RStudio and explore the programme. Make sure you can identify the console, the script editor, the environment window, the packages window and the help window!
In the script window, write code that creates a vector composed of 5 single-digit integers. To concatenate a series of values, use the c() function, with each argument separated by a comma. So, a character vector of length three could be c("a", "b", "c"). Assign your integer vector to an object with a name of your choosing, using the assignment operator <-.
Multiply your vector by 3, and assign the output to a new object. Use print() to display the values of your new object.
Add together the two objects that you have created so far, printing the result. Note that R operates on vectors element-wise.
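As a minimal sketch of the vector steps above: the object names and values below are made up for illustration, so pick your own for the exercise.

```r
# Hypothetical values and names, not the ones asked for in the exercise
scores <- c(2, 4, 6)          # combine three values into a numeric vector
doubled <- scores * 2         # arithmetic is applied to every element
print(doubled)                # elements: 4 8 12

total <- scores + doubled     # element-wise addition of two vectors
print(total)                  # elements: 6 12 18
```

The first element of total is the sum of the first elements of scores and doubled, the second is the sum of the second elements, and so on.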
Create a logical vector of length five, again using the c() function. Make sure that you have a mix of TRUE and FALSE values in the vector. Use the logical vector to subset the numeric vector that you created in question 2 and print the result. Check what happens when the logical vector is shorter than the numeric vector.
Subset to just the first two elements of the numeric vector that you created in question 2 and assign the result to an object named my_short_vector.
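The two kinds of subsetting described above can be sketched as follows. The vector and its name are hypothetical, and it has six elements rather than five so that the recycling pattern is easy to see.

```r
values <- c(10, 20, 30, 40, 50, 60)   # made-up numeric vector
keep <- c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)
values[keep]                 # keeps positions 1, 3 and 5: 10 30 50

# A shorter logical vector is recycled along the longer vector
values[c(TRUE, FALSE)]       # the pattern repeats: 10 30 50

# Positional subsetting with a range of indices
first_two <- values[1:2]     # elements: 10 20
```

Recycling happens silently, so it is worth checking the length of a logical index before relying on it.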
This exercise relates to the College data set, which comes from An Introduction to Statistical Learning by James et al. (2013). It contains a number of variables for 777 different universities and colleges in the US.
The variables are:
Private: Public/private indicator
Apps: Number of applications received
Accept: Number of applicants accepted
Enroll: Number of new students enrolled
Top10perc: New students from top 10% of high school class
Top25perc: New students from top 25% of high school class
F.Undergrad: Number of full-time undergraduates
P.Undergrad: Number of part-time undergraduates
Outstate: Out-of-state tuition
Room.Board: Room and board costs
Books: Estimated book costs
Personal: Estimated personal spending
PhD: Percent of faculty with Ph.D.’s
Terminal: Percent of faculty with terminal degree
S.F.Ratio: Student/faculty ratio
perc.alumni: Percent of alumni who donate
Expend: Instructional expenditure per student
Grad.Rate: Graduation rate
You can either download the .csv file containing the data from the MY591 moodle page, or read the data in directly from the website.
Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the working directory set to the correct location for the data. You can load this in R directly from the website, using:
college <- read.csv("http://www-bcf.usc.edu/~gareth/ISL/College.csv")
Or you can load it from a saved file, using:
college <- read.csv("path_to_my_file/College.csv")
Use the str() function to look at the structure of the data. Which of the variables are numeric? Which are integer? Which are factors?
Use the summary() function to produce a numerical summary of the variables in the data set.
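A small sketch of reading and inspecting data, using a made-up four-column CSV supplied as text so that it runs without any file or network access. One assumption worth flagging: since R 4.0.0, read.csv() leaves text columns as character by default, so stringsAsFactors = TRUE is set here to get factors.

```r
# A tiny invented CSV; the column names only loosely mimic the real data
csv_text <- "name,private,apps,ratio
A,Yes,1200,12.5
B,No,5400,18.1
C,Yes,800,9.7"

toy <- read.csv(text = csv_text, stringsAsFactors = TRUE)

str(toy)       # one line per column: its type and first few values
summary(toy)   # quartiles and mean for numbers, counts for factors

class(toy$apps)     # "integer"
class(toy$ratio)    # "numeric"
class(toy$private)  # "factor"
```

If str() reports chr where you expected a factor, the stringsAsFactors default is the likely reason.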
What are the mean and standard deviation of the Enroll and Top10perc variables?
Now remove the 10th through 85th observations. What are the mean and standard deviation of the Enroll and Top10perc variables in the subset of the data that remains?
What is the range of the Books variable?
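Dropping a block of rows and summarising what remains can be sketched as below. The data frame is invented (100 rows with a single score column), so the numbers say nothing about the College data.

```r
# Made-up data frame with 100 rows, standing in for a real data set
toy <- data.frame(id = 1:100, score = 1:100)

# A negative index drops rows; note the parentheses around 10:85
toy_subset <- toy[-(10:85), ]
nrow(toy_subset)          # 24 rows remain

mean(toy_subset$score)    # mean of the remaining rows
sd(toy_subset$score)      # standard deviation of the remaining rows
range(toy_subset$score)   # smallest and largest value: 1 100
```

Writing -10:85 without the parentheses gives the sequence from -10 up to 85, which mixes negative and positive indices and produces an error.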
Use the pairs() function to produce a scatterplot matrix of the first five columns or variables of the data. Recall that you can reference the first five columns of a matrix A using A[,1:5].
Use the plot() function to produce a scatter plot of S.F.Ratio versus Grad.Rate. Give the axes informative labels.
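A sketch of both plotting calls, using the mtcars data that ships with R as a stand-in for the exercise data. The choice of variables and the label text are assumptions made for illustration.

```r
# mtcars is built into R; it stands in here for the exercise data
pairs(mtcars[, 1:5])   # scatterplot matrix of the first five columns

# Scatter plot with labelled axes and a title
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per gallon",
     main = "Fuel efficiency against weight")
```

The first argument to plot() goes on the horizontal axis and the second on the vertical axis.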
Compete with your neighbour to make the prettiest plot. You might want to look at ?plot and ?par for some ideas. If you are feeling very keen, try using ggplot, but you will need to load the ggplot2 library first: library("ggplot2").
This exercise involves the auto data set, available as Auto.csv from the MY591 website, or directly from http://www-bcf.usc.edu/~gareth/ISL/Auto.csv. Load this data into R. The data describe the characteristics of a number of different types of cars, and include the following variables:
mpg = miles per gallon
cylinders = number of cylinders
displacement = engine displacement (I have no idea what this is)
horsepower = number of horses powering the car (not really)
weight = weight of the car in lbs
acceleration = time in seconds for the car to go from 0-60mph
year = year of manufacture
origin = country of manufacture
name = name of car
Unfortunately, the people who made this data decided to code their missing values with a ?, which is an awful thing to do. When reading the data in from the .csv file, use na.strings = "?" to convert them to NAs. Then exclude all the NAs from the data.
auto <- read.csv("http://www-bcf.usc.edu/~gareth/ISL/Auto.csv", na.strings = "?")
auto <- na.omit(auto)
Note: Sometimes when you load a dataset, a categorical variable might have a numeric value. For instance, the origin variable is categorical, but has integer values of 1, 2, 3. From mysterious sources (Googling), we know that this variable is coded 1 = usa; 2 = europe; 3 = japan. So we can convert it into a factor, using:
auto$originf <- factor(auto$origin, labels = c("usa", "europe", "japan"))
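To see how factor() attaches labels to integer codes, here is a sketch on a made-up vector of codes. When levels is omitted, as in the call above, factor() sorts the unique values and applies the labels in that order; spelling out levels, as below, makes the mapping explicit.

```r
origin_code <- c(1, 3, 2, 1, 3)   # invented integer codes
origin_f <- factor(origin_code,
                   levels = c(1, 2, 3),
                   labels = c("usa", "europe", "japan"))

origin_f           # usa japan europe usa japan
table(origin_f)    # number of observations in each category
levels(origin_f)   # "usa" "europe" "japan"
```

Checking table() against the original codes is a quick way to confirm that the labels landed on the right categories.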
Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.
Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors (make sure you use the factor version of the origin variable that we created in part a). Use the summary() function to print the results. Comment on the output.
Is there a relationship between the predictors and the response?
Which predictors appear to have a statistically significant relationship to the response?
What does the coefficient for the year variable suggest?
What is the R-squared for this model? Access the r.squared value from the summary of the model object, and save it as its own object.
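Fitting a model and pulling a single value out of its summary can be sketched as follows. This uses the built-in mtcars data and an arbitrary set of predictors, so it illustrates the mechanics rather than the model asked for above.

```r
# mtcars ships with R; the predictors are chosen only for illustration
fit <- lm(mpg ~ wt + hp + factor(cyl), data = mtcars)

fit_summary <- summary(fit)
print(fit_summary)              # coefficients, standard errors, p-values

r2 <- fit_summary$r.squared     # R-squared stored as a single number
coef(fit_summary)               # the coefficient table as a matrix
```

The object returned by summary() is a list, so names(fit_summary) shows everything else that can be extracted with $.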
Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
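One way to lay out the four default diagnostic plots on a single page is sketched below, again on the built-in mtcars data with an illustrative model. The hatvalues() step is an optional numeric check on leverage, not something the exercise requires.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)   # illustrative model only

par(mfrow = c(2, 2))   # arrange the four diagnostic plots in a 2 x 2 grid
plot(fit)
par(mfrow = c(1, 1))   # restore the single-panel layout

# Numeric follow-up: which observation has the largest leverage?
lev <- hatvalues(fit)
names(which.max(lev))
```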
Use the predict() function to give a numerical prediction for each of these car types. Base your estimates on the following model:
lm(mpg ~ cylinders + weight + acceleration + originf, data=auto)
Use the predict() function to further explore the fitted values from your model. You may want to use summary() again to work out some reasonable values of your covariates such that you are not extrapolating wildly away from the data. You will need to specify values for each of the variables that you included in your model. For example:
sim_data_low_weight <- data.frame(displacement = seq(from = 68, to = 455, by = 1),
                                  weight = 2000,
                                  cylinders = mean(auto$cylinders),
                                  horsepower = mean(auto$horsepower),
                                  acceleration = mean(auto$acceleration),
                                  year = mean(auto$year),
                                  originf = "europe"
)
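The same pattern, from model to new data to fitted values, can be sketched end to end on the built-in mtcars data. The model, the object names and the chosen covariate values are all assumptions made for illustration.

```r
car_data <- mtcars
car_data$cyl_f <- factor(car_data$cyl)     # factor version of cylinders
fit <- lm(mpg ~ wt + hp + cyl_f, data = car_data)

new_cars <- data.frame(
  wt = seq(from = 2, to = 5, by = 0.5),    # vary weight within its observed range
  hp = mean(car_data$hp),                  # hold horsepower at its mean
  cyl_f = "4"                              # an existing factor level, given as text
)

new_cars$fitted_mpg <- predict(fit, newdata = new_cars)
new_cars
```

Every predictor in the model formula needs a column of the same name in the new data frame, otherwise predict() stops with an error.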
Use the * operator to fit linear regression models with interaction effects (you can choose which variables to interact; you almost certainly know more about the determinants of car efficiency than I do). Do any interactions appear to be statistically significant?
Again, use the predict() function to create fitted values from the interaction model. Use the fitted values to illustrate the conditional effect of one of your variables of interest.
How does the R-squared from this model compare to that of the non-interactive model? Again, access the R-squared value from the summary of the model object and print both R-squared values.
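A sketch of an interaction model alongside its additive counterpart, using the built-in mtcars data. The choice of weight and horsepower as the interacting variables is arbitrary.

```r
fit_add <- lm(mpg ~ wt + hp, data = mtcars)
fit_int <- lm(mpg ~ wt * hp, data = mtcars)   # expands to wt + hp + wt:hp

summary(fit_int)$coefficients   # the wt:hp row is the interaction term

# Fitted values over a small grid show how the effect of weight
# differs between low and high horsepower
grid <- expand.grid(wt = c(2, 3, 4), hp = c(100, 200))
grid$fitted_mpg <- predict(fit_int, newdata = grid)
grid

r2_add <- summary(fit_add)$r.squared
r2_int <- summary(fit_int)$r.squared
print(c(additive = r2_add, interaction = r2_int))
```

R-squared cannot fall when a term is added to a model, so the adjusted R-squared (adj.r.squared in the summary object) may be the fairer comparison.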
Using the median() function, create a new variable which is equal to 1 (or, TRUE) when mpg is above the median value, and 0 (or, FALSE) otherwise.
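A comparison against the median returns a logical vector, which can be used as it is or converted to 0/1. The sketch below uses the built-in mtcars data and hypothetical variable names.

```r
car_data <- mtcars                                      # built-in data, for illustration
car_data$high_mpg <- car_data$mpg > median(car_data$mpg)   # TRUE above the median
car_data$high_mpg_num <- as.numeric(car_data$high_mpg)     # same information coded 0/1

table(car_data$high_mpg)   # how many observations fall on each side
```

Observations exactly equal to the median are coded FALSE here, because the comparison is strict.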
Make use of this new variable to estimate a logistic regression with cylinders, weight, acceleration and origin as predictors. The dependent variable in this analysis should be the transformed version of mpg that you created in part a. The glm() function will be useful, and you should set the family argument to be equal to "binomial".
Print a summary of the fitted model object and interpret two of the coefficients.
If you struggled with iii, that is because log-odds ratios are completely non-intuitive. Let’s make some progress by converting the coefficients (coef()) and their associated confidence intervals (confint()) into odds-ratios (exp()). Create a data.frame of these values and print it.
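The conversion can be sketched on a small illustrative model: transmission type (am) against weight in the built-in mtcars data. The model is an assumption chosen only because it runs without the exercise data.

```r
# Illustrative logistic regression on built-in data
fit <- glm(am ~ wt, data = mtcars, family = "binomial")

# Put estimates and interval bounds side by side on the log-odds scale
log_odds <- cbind(estimate = coef(fit), confint(fit))

# Exponentiate everything to move to the odds-ratio scale
odds_ratios <- as.data.frame(exp(log_odds))
print(odds_ratios)
```

An odds ratio below 1 means the odds of the outcome fall as the predictor increases, and an interval that excludes 1 corresponds to a coefficient whose interval excludes 0.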