- Open RStudio and explore the programme. Make sure you can identify the `console`, the script editor, the `environment` window, the `packages` window and the `help` window!
- In the script window, write code that creates a vector composed of 5 single-digit integers. To concatenate a series of values, use the `c()` function, with each argument separated by a comma. So, a character vector of length three could be `c("a", "b", "c")`. Assign your integer vector using the assignment operator `<-` to an object with a name of your choosing.
- Multiply your vector by 3, and assign the output to a new object.
- `print` the values of your new object.
- Add together the two objects that you have created so far, `print`ing the result. Note that R operates on vectors element-wise.
- Create a logical vector of length five, again using the `c()` function. Make sure that you have a mix of `TRUE` and `FALSE` values in the vector. Use the logical vector to subset the numeric vector that you created in question 2 and `print` the result. Check what happens when the logical vector is shorter than the numeric vector.
- Subset to just the first two elements of the numeric vector that you created in question 2 and assign the result to an object named `my_short_vector`.
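As a rough illustration of the ideas in this exercise (element-wise arithmetic, logical subsetting and recycling), here is a minimal sketch. The values and object names (`my_vector`, `my_tripled`, `my_sum`, `keep`) are made up for the example and are not part of the exercise.

```r
# Hypothetical values and object names, chosen only for illustration
my_vector <- c(2L, 4L, 6L, 8L, 9L)   # integer vector of length five
my_tripled <- my_vector * 3          # multiplication is applied element-wise
my_sum <- my_vector + my_tripled     # so is the addition of two vectors
print(my_sum)

# Logical subsetting keeps the elements where the index is TRUE
keep <- c(TRUE, FALSE, TRUE, FALSE, TRUE)
print(my_vector[keep])

# A shorter logical index is recycled along the longer vector
print(my_vector[c(TRUE, FALSE)])

# Positional subsetting: the first two elements
my_short_vector <- my_vector[1:2]
print(my_short_vector)
```

Recycling means that `c(TRUE, FALSE)` is silently repeated until it matches the length of `my_vector`, which is convenient but easy to trigger by accident.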

This exercise relates to the `College` data set, which comes from An Introduction to Statistical Learning by James et al. (2013). It contains a number of variables for 777 different universities and colleges in the US.

The variables are:

- `Private`: Public/private indicator
- `Apps`: Number of applications received
- `Accept`: Number of applicants accepted
- `Enroll`: Number of new students enrolled
- `Top10perc`: New students from top 10% of high school class
- `Top25perc`: New students from top 25% of high school class
- `F.Undergrad`: Number of full-time undergraduates
- `P.Undergrad`: Number of part-time undergraduates
- `Outstate`: Out-of-state tuition
- `Room.Board`: Room and board costs
- `Books`: Estimated book costs
- `Personal`: Estimated personal spending
- `PhD`: Percent of faculty with Ph.D.s
- `Terminal`: Percent of faculty with terminal degree
- `S.F.Ratio`: Student/faculty ratio
- `perc.alumni`: Percent of alumni who donate
- `Expend`: Instructional expenditure per student
- `Grad.Rate`: Graduation rate

You can either download the .csv file containing the data from the MY591 moodle page, or read the data in directly from the website.

- Use the `read.csv()` function to read the data into `R`. Call the loaded data `college`. Make sure that you have the working directory set to the correct location for the data. You can load this in R directly from the website, using:

`college <- read.csv("http://www-bcf.usc.edu/~gareth/ISL/College.csv")`

Or you can load it from a saved file, using:

`college <- read.csv("path_to_my_file/College.csv")`

- Use the `str()` function to look at the structure of the data. Which of the variables are numeric? Which are integer? Which are factors?
- Use the `summary()` function to produce a numerical summary of the variables in the data set.
- What are the mean and standard deviation of the `Enroll` and `Top10perc` variables?
- Now remove the 10th through 85th observations. What are the mean and standard deviation of the `Enroll` and `Top10perc` variables in the subset of the data that remains?
- What is the range of the `Books` variable?
- Use the `pairs()` function to produce a scatterplot matrix of the first five columns or variables of the data. Recall that you can reference the first five columns of a matrix `A` using `A[,1:5]`.
- Use the `plot()` function to produce a scatter plot of `S.F.Ratio` versus `Grad.Rate`. Give the axes informative labels.
- Compete with your neighbour to make the prettiest plot. You might want to look at `?plot` and `?par` for some ideas. If you are feeling very keen, try using `ggplot`, but you will need to load the ggplot2 package first: `library("ggplot2")`.
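The functions named in this exercise can be tried out on any data frame first. The sketch below uses `mtcars`, a data set that ships with R, purely as a stand-in for `college`. The object names `dat` and `dat_subset` and the choice of columns are assumptions made for illustration.

```r
# `mtcars` ships with R and is used here only as a stand-in for `college`
dat <- mtcars

str(dat)        # column types at a glance
summary(dat)    # numerical summary of every column

mean(dat$mpg)   # mean of one column
sd(dat$mpg)     # standard deviation of the same column
range(dat$mpg)  # smallest and largest value

# Negative indices drop rows: here rows 10 through 20 are removed
dat_subset <- dat[-(10:20), ]
mean(dat_subset$mpg)
sd(dat_subset$mpg)

# Scatterplot matrix of the first five columns
pairs(dat[, 1:5])

# Scatter plot with informative axis labels
plot(dat$wt, dat$mpg,
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per gallon")
```

Note the parentheses in `-(10:20)`: without them, `-10:20` would be read as the sequence from -10 to 20, which is not a valid row index.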

This exercise involves the `auto` data set, available as `Auto.csv` from the MY591 website, or directly from http://www-bcf.usc.edu/~gareth/ISL/Auto.csv. Load this data into R. This data includes characteristics of a number of different types of cars. It includes the following variables:

- `mpg` = miles per gallon
- `cylinders` = number of cylinders
- `displacement` = engine displacement (I have no idea what this is)
- `horsepower` = number of horses powering the car (not really)
- `weight` = weight of the car in kg
- `acceleration` = time in seconds for the car to go from 0-60mph
- `year` = year of manufacture
- `origin` = country of manufacture
- `name` = name of car

Unfortunately, the people who made this data decided to code their missing values with a `?`, which is an awful thing to do. When reading the data in from the `.csv` file, use `na.strings = "?"` to convert them to `NA`s. Then exclude all the `NA`s from the data.

```
auto <- read.csv("http://www-bcf.usc.edu/~gareth/ISL/Auto.csv", na.strings = "?")
auto <- na.omit(auto)
```
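To see what `na.strings` changes without downloading anything, the following sketch writes a tiny, made-up CSV to a temporary file and reads it back twice. The file contents and the object names `tmp`, `raw`, `clean` and `complete` are hypothetical.

```r
# Write a tiny, made-up CSV to a temporary file so the example is self-contained
tmp <- tempfile(fileext = ".csv")
writeLines(c("name,mpg,horsepower",
             "car_a,18,130",
             "car_b,?,165",
             "car_c,26,?"),
           tmp)

# Without na.strings, the "?" entries stop the columns being read as numbers
raw <- read.csv(tmp)
str(raw)

# With na.strings = "?", they become NA and the columns are numeric
clean <- read.csv(tmp, na.strings = "?")
str(clean)

# na.omit() drops every row that contains at least one NA
complete <- na.omit(clean)
nrow(complete)
```

Because `na.omit()` removes a row as soon as any column is missing, it is worth comparing `nrow()` before and after to see how much data is lost.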

- Which of the predictors are numeric, and which are categorical?

Note: Sometimes when you load a dataset, a categorical variable might have numeric values. For instance, the `origin` variable is categorical, but has integer values of 1, 2, 3. From mysterious sources (Googling), we know that this variable is coded `1 = usa; 2 = europe; 3 = japan`. So we can convert it into a factor, using:

`auto$originf <- factor(auto$origin, labels = c("usa", "europe", "japan"))`
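A small sketch of how `factor()` maps codes to labels, using a made-up vector `codes` rather than the real data. The labels are matched to the sorted unique values of the input. Passing `levels` as well, as done here, is an optional extra that makes the mapping explicit.

```r
# Made-up integer codes, for illustration only
codes <- c(1, 3, 2, 1, 3)

# levels gives the codes in order; labels gives the name for each code
origin_f <- factor(codes,
                   levels = c(1, 2, 3),
                   labels = c("usa", "europe", "japan"))

print(origin_f)
table(origin_f)   # counts per category
```

Stating `levels` explicitly guards against a code that happens to be absent from the data, which would otherwise shift the labels onto the wrong values or raise an error.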

- Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.
- Use the `lm()` function to perform a multiple linear regression with `mpg` as the response and all other variables except `name` as the predictors (make sure you use the factor version of the `origin` variable that we created in part a). Use the `summary()` function to print the results. Comment on the output.
  - Is there a relationship between the predictors and the response?
  - Which predictors appear to have a statistically significant relationship to the response?
  - What does the coefficient for the `year` variable suggest?
  - What is the R-squared for this model? Access the `r.squared` value from the `summary` of the model object, and save it as its own object.
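The pattern of fitting a model, summarising it and pulling a single value out of the summary can be sketched as follows. This uses the built-in `mtcars` data as a stand-in for `auto`, and the choice of predictors and the names `fit`, `fit_summary` and `r2` are arbitrary assumptions.

```r
# `mtcars` is a stand-in for `auto`; the predictors are chosen arbitrarily
fit <- lm(mpg ~ wt + hp + factor(cyl), data = mtcars)

# summary() returns a list-like object, so it can be stored and inspected
fit_summary <- summary(fit)
print(fit_summary)

# The coefficient table: estimates, standard errors, t values and p-values
coef(fit_summary)

# Individual components are accessed with $
r2 <- fit_summary$r.squared
print(r2)
```

Running `names(fit_summary)` lists everything else that can be extracted in the same way.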

- Use the `plot()` function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
- We would like to know which of the following cars has the higher miles-per-gallon figure:
  - A 6 cylinder European car which weighs 4000 kg and accelerates from 0-60 in 12 seconds
  - A 4 cylinder Japanese car which weighs 2000 kg and accelerates from 0-60 in 20 seconds

Use the `predict()` function to give a numerical prediction for each of these car types. Base your estimates on the following model:

`lm(mpg ~ cylinders + weight + acceleration + originf, data=auto)`
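The mechanics of predicting for specific profiles are sketched below on the built-in `mtcars` data, which stands in for `auto`. The model, the factor column `cyl_f` and the two profiles are hypothetical. The point is only that `newdata` needs one column per predictor, named exactly as in the model formula.

```r
# Stand-in model on `mtcars`; `cyl_f` is a hypothetical factor column
cars <- mtcars
cars$cyl_f <- factor(cars$cyl)
fit <- lm(mpg ~ wt + qsec + cyl_f, data = cars)

# One row per profile; column names must match the model's predictors,
# and factor values must be levels that the model has already seen
profiles <- data.frame(
  wt    = c(3.5, 2.0),
  qsec  = c(17, 20),
  cyl_f = factor(c("6", "4"), levels = levels(cars$cyl_f))
)

predicted_mpg <- predict(fit, newdata = profiles)
print(predicted_mpg)
```

If a column is missing from `newdata`, or a factor value is not among the fitted levels, `predict()` stops with an error rather than guessing.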

- Use the `predict()` function to further explore the fitted values from your model. You may want to use `summary()` again to work out some reasonable values of your covariates such that you are not extrapolating wildly away from the data. You will need to specify values for each of the variables that you included in your model. For example:

```
# Vary displacement across its observed range and hold the other predictors fixed
sim_data_low_weight <- data.frame(
  displacement = seq(from = 68, to = 455, by = 1),
  weight = 2000,
  cylinders = mean(auto$cylinders),
  horsepower = mean(auto$horsepower),
  acceleration = mean(auto$acceleration),
  year = mean(auto$year),
  originf = "europe"
)
```
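A data frame like the one above is typically passed to `predict()` as `newdata` and the result plotted. Here is a minimal sketch of that step on the built-in `mtcars` data. The model and the names `grid` and `fitted_vals` are assumptions, and `interval = "confidence"` is an optional extra that adds lower and upper bounds.

```r
# Stand-in model on `mtcars`, for illustration only
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Vary wt across its observed range and hold hp at its mean
grid <- data.frame(
  wt = seq(from = min(mtcars$wt), to = max(mtcars$wt), length.out = 50),
  hp = mean(mtcars$hp)
)

# With interval = "confidence" the result is a matrix with columns fit, lwr, upr
fitted_vals <- predict(fit, newdata = grid, interval = "confidence")
head(fitted_vals)

# Fitted line with dashed confidence bounds
plot(grid$wt, fitted_vals[, "fit"], type = "l",
     ylim = range(fitted_vals),
     xlab = "Weight (1000 lbs)",
     ylab = "Predicted miles per gallon")
lines(grid$wt, fitted_vals[, "lwr"], lty = 2)
lines(grid$wt, fitted_vals[, "upr"], lty = 2)
```

Keeping the grid inside the observed range of the predictor is what prevents the predictions from extrapolating far beyond the data.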

- Use the `*` operator to fit linear regression models with interaction effects (you can choose which variables to interact - you almost certainly know more about the determinants of car efficiency than I do). Do any interactions appear to be statistically significant?
- Again, use the `predict()` function to create fitted values from the interaction model. Use the fitted values to illustrate the conditional effect of one of your variables of interest.
- How does the R-squared from this model compare to that of the non-interactive model? Again, access the R-squared value from the summary of the model object and print both R-squared values.
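For reference, a formula `a * b` is shorthand for `a + b + a:b`, that is, both main effects plus their interaction. The sketch below compares an additive and an interaction model on the built-in `mtcars` data. The variables and object names are chosen arbitrarily and stand in for whatever you fit on `auto`.

```r
# Stand-in models on `mtcars`, for illustration only
fit_additive    <- lm(mpg ~ wt + hp, data = mtcars)
fit_interaction <- lm(mpg ~ wt * hp, data = mtcars)  # wt + hp + wt:hp

# The interaction appears as its own row, named "wt:hp"
summary(fit_interaction)

# Extract and compare the two R-squared values
r2_additive    <- summary(fit_additive)$r.squared
r2_interaction <- summary(fit_interaction)$r.squared
print(c(additive = r2_additive, interaction = r2_interaction))

# Conditional effect: predicted mpg across wt at a low and a high value of hp
grid <- expand.grid(wt = c(2, 3, 4), hp = c(100, 200))
grid$predicted <- predict(fit_interaction, newdata = grid)
print(grid)
```

Since the additive model is nested in the interaction model, the plain R-squared cannot go down when the interaction is added, so a small increase on its own says little about whether the interaction matters.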

- Using the `median()` function, create a new variable which is equal to 1 (or `TRUE`) when `mpg` is above the median value, and 0 (or `FALSE`) otherwise.
- Make use of this new variable to estimate a logistic regression with `cylinders`, `weight`, `acceleration` and `origin` as predictors. The dependent variable in this analysis should be the transformed version of `mpg` that you created in part a. The `glm()` function will be useful, and you should set the `family` argument to be equal to `"binomial"`.
- Print a summary of the fitted model object and interpret two of the coefficients.

If you struggled with iii, that is because log-odds ratios are completely non-intuitive. Let’s make some progress by converting the coefficients (`coef()`) and their associated confidence intervals (`confint()`) into odds-ratios (`exp()`). Create a `data.frame` of these values and print it.
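The whole workflow (median split, logistic regression, conversion to odds ratios) can be sketched on simulated data. Everything here is invented for illustration: the data frame `sim`, its columns and the coefficients used to generate it are assumptions, not the `auto` data or a model answer.

```r
# Simulated data, for illustration only
set.seed(1)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y_cont <- 1 + 0.8 * sim$x1 - 0.5 * sim$x2 + rnorm(n)

# TRUE when the continuous outcome is above its median, FALSE otherwise
sim$y_high <- sim$y_cont > median(sim$y_cont)

# Logistic regression on the binary outcome
fit_logit <- glm(y_high ~ x1 + x2, data = sim, family = "binomial")
summary(fit_logit)

# Coefficients and interval bounds are on the log-odds scale;
# exp() converts both to odds ratios
ci <- suppressMessages(confint(fit_logit))
odds_ratios <- data.frame(
  odds_ratio = exp(coef(fit_logit)),
  ci_lower   = exp(ci[, 1]),
  ci_upper   = exp(ci[, 2])
)
print(odds_ratios)
```

An odds ratio above 1 means that a one-unit increase in the predictor multiplies the odds of the outcome by that factor, holding the other predictors fixed. A value below 1 means the odds shrink.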