Introduction to R Programming

Exercise 1 - R basics

Open RStudio and explore the programme. Make sure you can identify the console, the script editor, the environment window, the packages window and the help window!

In the script window, write code that creates a vector composed of 5 single-digit integers. To concatenate a series of values, use the c() function, with each argument separated by a comma. So, a character vector of length three could be c("a", "b", "c"). Assign your integer vector using the assignment operator <- to an object with a name of your choosing.
Multiply your vector by 3, and assign the output to a new object. Print the values of your new object.
Add together the two objects that you have created so far, printing the result. Note that R operates on vectors element-wise.
Create a logical vector of length five, again using the c() function. Make sure that you have a mix of TRUE and FALSE values in the vector. Use the logical vector to subset the numeric vector that you created in question 2 and print the result. Check what happens when the logical vector is shorter than the numeric vector.
Subset to just the first two elements of the numeric vector that you created in question 2 and assign the result to have the name my_short_vector.
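
If the mechanics above are unfamiliar, the following is a minimal sketch using made-up values (the object names x, y, keep and short_keep are only illustrative, and this is not the answer to the exercise). It shows element-wise arithmetic, logical subsetting, and how R recycles a logical vector that is shorter than the vector being subset.

```r
# Made-up values, chosen only for illustration
x <- c(2, 4, 6, 8)            # numeric vector of length four
y <- x * 10                   # multiplication is applied to each element
x + y                         # element-wise sum: 22 44 66 88

keep <- c(TRUE, FALSE, TRUE, FALSE)
x[keep]                       # keeps the positions where keep is TRUE: 2 6

short_keep <- c(TRUE, FALSE)
x[short_keep]                 # recycled to TRUE FALSE TRUE FALSE, so again: 2 6

x[1:3]                        # positional subsetting: the first three elements
```

Recycling happens silently, so it is worth checking the lengths of your vectors with length() before subsetting.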

Exercise 2 - exploring data

This exercise relates to the College data set, which comes from An Introduction to Statistical Learning by James et al. (2013). It contains a number of variables for 777 different universities and colleges in the US.

The variables are

Private : Public/private indicator
Apps : Number of applications received
Accept : Number of applicants accepted
Enroll : Number of new students enrolled
Top10perc : New students from top 10% of high school class
Top25perc : New students from top 25% of high school class
F.Undergrad : Number of full-time undergraduates
P.Undergrad : Number of part-time undergraduates
Outstate : Out-of-state tuition
Room.Board : Room and board costs
Books : Estimated book costs
Personal : Estimated personal spending
PhD : Percent of faculty with Ph.D.’s
Terminal : Percent of faculty with terminal degree
S.F.Ratio : Student/faculty ratio
perc.alumni : Percent of alumni who donate
Expend : Instructional expenditure per student
Grad.Rate : Graduation rate

You can either download the .csv file containing the data from the MY591 moodle page, or read the data in directly from the website.

Use the read.csv() function to read the data into R, and call the loaded data college. If you are loading a saved file, make sure that your working directory is set to the folder containing the data (see getwd() and setwd()). To load the data directly from the website, use:

college <- read.csv("http://www-bcf.usc.edu/~gareth/ISL/College.csv")

Or you can load it from a saved file, using:

college <- read.csv("path_to_my_file/College.csv")

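Use the str() function to look at the structure of the data. Which of the variables are numeric? Which are integer? Which are factors? (From R 4.0.0 onwards, read.csv() reads text columns as character vectors rather than factors unless you set stringsAsFactors = TRUE.)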
Use the summary() function to produce a numerical summary of the variables in the data set.
What are the mean and standard deviation of the Enroll and Top10perc variables?
Now remove the 10th through 85th observations. What are the mean and standard deviation of the Enroll and Top10perc variables in the subset of the data that remains?
What is the range of the Books variable?
Use the pairs() function to produce a scatterplot matrix of the first five columns or variables of the data. Recall that you can reference the first five columns of a matrix A using A[,1:5].
Use the plot() function to produce a scatter plot of S.F.Ratio versus Grad.Rate. Give the axes informative labels.
Compete with your neighbour to make the prettiest plot. You might want to look at ?plot and ?par for some ideas. If you are feeling very keen, try using ggplot2, but you will need to load the package first: library("ggplot2").
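
The functions used in this exercise can be tried out first on a data set that ships with R. The sketch below uses mtcars as a stand-in for the college data (an assumption made so that the code runs without downloading anything), so the column names differ from those in the exercise.

```r
# mtcars ships with R and stands in for the college data here
data(mtcars)

str(mtcars)                   # one line per column: type and first few values
summary(mtcars$mpg)           # quartiles, minimum, maximum and mean

mean(mtcars$mpg)
sd(mtcars$mpg)

# Negative indices drop rows; this removes rows 3 to 6 and keeps all columns
mtcars_small <- mtcars[-(3:6), ]
nrow(mtcars_small)            # 28, since mtcars has 32 rows

range(mtcars$hp)              # smallest and largest value

# Scatterplot matrix of the first three columns
pairs(mtcars[, 1:3])

# Scatter plot with labelled axes
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per gallon",
     main = "Fuel economy against weight")
```

Note the parentheses in -(3:6): writing -3:6 would instead create the sequence from -3 to 6, which R refuses to use as an index because it mixes negative and positive values.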

Exercise 3 - linear regression

This exercise involves the Auto data set, available as Auto.csv from the MY591 website or directly from http://www-bcf.usc.edu/~gareth/ISL/Auto.csv. Load the data into R. It describes the characteristics of a number of different types of cars and includes the following variables:

mpg = miles per gallon
cylinders = number of cylinders
displacement = engine displacement (I have no idea what this is)
horsepower = number of horses powering the car (not really)
weight = weight of the car in lbs
acceleration = time in seconds for the car to go from 0-60mph
year = year of manufacture
origin = country of manufacture
name = name of car

Unfortunately, the people who made this data decided to code their missing values with a ?, which is an awful thing to do. When reading the data in from the .csv file, use na.strings = "?" to convert them to NAs. Then exclude all the NAs from the data.

auto <- read.csv("http://www-bcf.usc.edu/~gareth/ISL/Auto.csv", na.strings = "?")
auto <- na.omit(auto)
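
To see why na.strings matters, here is a small sketch on a hypothetical three-row CSV supplied inline through the text argument of read.csv() (the values are invented and are not taken from Auto.csv). Without na.strings, the single ? stops R from treating the column as numeric; in R 4.0.0 and later it becomes a character vector.

```r
# Hypothetical three-row CSV, supplied inline instead of read from a file
csv_text <- "mpg,horsepower
18,130
25,?
30,95"

raw <- read.csv(text = csv_text)
is.numeric(raw$horsepower)    # FALSE: the ? forces the column to be read as text

clean <- read.csv(text = csv_text, na.strings = "?")
is.numeric(clean$horsepower)  # TRUE: the ? is now a proper NA
sum(is.na(clean$horsepower))  # 1

complete <- na.omit(clean)    # drops every row containing at least one NA
nrow(complete)                # 2
```

Keep in mind that na.omit() removes a row if any of its columns is missing, not only the columns you plan to use.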

Which of the predictors are numeric, and which are categorical?

Note: Sometimes when you load a dataset, a categorical variable might have a numeric value. For instance, the origin variable is categorical, but has integer values of 1, 2, 3. From mysterious sources (Googling), we know that this variable is coded 1 = usa; 2 = europe; 3 = japan. So we can convert it into a factor, using:

auto$originf <- factor(auto$origin, labels = c("usa", "europe", "japan"))
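
When only labels is supplied, factor() pairs the labels with the sorted unique values of the variable. A slightly more defensive variant, sketched here on hypothetical codes rather than on the auto data, names the levels explicitly and then cross-tabulates the old and new variables to check the mapping.

```r
# Hypothetical integer codes, not taken from the auto data
codes <- c(3, 1, 2, 1, 3)

# Naming the levels explicitly pairs each code with its label
region <- factor(codes,
                 levels = c(1, 2, 3),
                 labels = c("usa", "europe", "japan"))

region                        # japan usa europe usa japan
table(codes, region)          # each code should fall under exactly one label
as.integer(region)            # the underlying codes: 3 1 2 1 3
```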

Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.
Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors (make sure you use the factor version of the origin variable that we created in part a). Use the summary() function to print the results. Comment on the output.
1. Is there a relationship between the predictors and the response?
2. Which predictors appear to have a statistically significant relationship to the response?
3. What does the coefficient for the year variable suggest?
4. What is the R-squared for this model? Access the r.squared value from the summary of the model object, and save it as its own object.
Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
We would like to know which of the following cars has the higher miles per gallon:
- A 6 cylinder European car which weighs 4000 lbs and accelerates from 0-60 in 12 seconds
- A 4 cylinder Japanese car which weighs 2000 lbs and accelerates from 0-60 in 20 seconds

Use the predict() function to give a numerical prediction for each of these car types. Base your estimates on the following model:

lm(mpg ~ cylinders + weight + acceleration + originf, data=auto)
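
The general pattern for predict() is to build a data.frame with one row per hypothetical case and one column per predictor, and then pass it as newdata. The sketch below illustrates this with the built-in mtcars data as a stand-in; the variables cyl, wt, qsec and the derived factor amf are assumptions of the example, not part of the auto data.

```r
data(mtcars)

# In mtcars, am is coded 0 = automatic, 1 = manual
mtcars$amf <- factor(mtcars$am, labels = c("automatic", "manual"))

fit <- lm(mpg ~ cyl + wt + qsec + amf, data = mtcars)

# One row per hypothetical car; column names must match the predictors
new_cars <- data.frame(
  cyl  = c(6, 4),
  wt   = c(3.5, 2.0),
  qsec = c(17, 19),
  amf  = factor(c("automatic", "manual"), levels = levels(mtcars$amf))
)

preds <- predict(fit, newdata = new_cars)
preds

# Adding interval = "confidence" returns lower and upper bounds as well
predict(fit, newdata = new_cars, interval = "confidence")
```

If a predictor is missing from newdata, or is spelled differently from the model formula, predict() stops with an error, which is a useful check.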

Use the predict() function to further explore the fitted values from your model. You may want to use summary() again to work out some reasonable values of your covariates such that you are not extrapolating wildly away from the data. You will need to specify values for each of the variables that you included in your model. For example:

sim_data_low_weight <- data.frame(displacement = seq(from = 68, to = 455, by = 1), 
                                  weight = 2000, 
                                  cylinders = mean(auto$cylinders),
                                  horsepower = mean(auto$horsepower),
                                  acceleration = mean(auto$acceleration),
                                  year = mean(auto$year),
                                  originf = "europe"
                                  )

Use the * operator to fit linear regression models with interaction effects (you can choose which variables to interact - you almost certainly know more about the determinants of car efficiency than I do). Do any interactions appear to be statistically significant?
Again, use the predict() function to create fitted values from the interaction model. Use the fitted values to illustrate the conditional effect of one of your variables of interest.
How does the R-squared from this model compare to that of the model without interactions? Again, access the R-squared value from the summary of each model object and print both values.
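
As a rough template for the interaction steps, the sketch below again uses mtcars as a stand-in. The choice of wt and hp as interacting variables, and the object names, are assumptions of the example. It fits a model with and without an interaction, extracts both R-squared values, and builds a grid of predictor values to show how the fitted relationship with one variable changes across levels of the other.

```r
data(mtcars)

fit_add <- lm(mpg ~ wt + hp, data = mtcars)   # additive model
fit_int <- lm(mpg ~ wt * hp, data = mtcars)   # expands to wt + hp + wt:hp

summary(fit_int)$coefficients                 # includes a row for wt:hp

r2_add <- summary(fit_add)$r.squared
r2_int <- summary(fit_int)$r.squared
c(additive = r2_add, interaction = r2_int)

# Conditional effect of wt: fitted values across wt at a low and a high hp
grid <- expand.grid(
  wt = seq(2, 5, by = 0.5),
  hp = unname(quantile(mtcars$hp, c(0.25, 0.75)))
)
grid$fitted <- predict(fit_int, newdata = grid)
grid
```

Because the additive model is nested in the interaction model, the R-squared of the interaction model can never be lower; adjusted R-squared (adj.r.squared in the same summary object) is a fairer comparison.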

Exercise 4 - logistic regression

Using the median() function, create a new variable which is equal to 1 (or, TRUE) when mpg is above the median value, and 0 (or, FALSE) otherwise.
Make use of this new variable to estimate a logistic regression with cylinders, weight, acceleration and origin as predictors. The dependent variable in this analysis should be the transformed version of mpg that you created in part a. The glm() function will be useful, and you should set the family argument to be equal to "binomial".
Print a summary of the fitted model object and interpret two of the coefficients.
If you struggled with iii, that is because log-odds ratios are completely non-intuitive. Let’s make some progress by converting the coefficients (coef()) and their associated confidence intervals (confint()) into odds-ratios (exp()). Create a data.frame of these values and print it.
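
The whole workflow of this exercise can be sketched on simulated data (everything below is invented for illustration, including the variable names and the seed): create a binary variable from a median split, fit a logistic regression, and convert the coefficients and confidence intervals into odds ratios.

```r
set.seed(1)                           # simulated data, for illustration only
n <- 200
sim <- data.frame(x = rnorm(n))
sim$y <- 1 + 0.8 * sim$x + rnorm(n)

# Binary outcome: TRUE when y is above its median
sim$high_y <- sim$y > median(sim$y)

fit_logit <- glm(high_y ~ x, data = sim, family = "binomial")

# Exponentiate the log-odds coefficients and their interval bounds
ci <- confint(fit_logit)              # profile-likelihood intervals
odds <- data.frame(
  odds_ratio = exp(coef(fit_logit)),
  lower      = exp(ci[, 1]),
  upper      = exp(ci[, 2])
)
odds
```

An odds ratio above 1 means that a one-unit increase in the predictor multiplies the odds of the outcome being TRUE by that factor; a value below 1 means the odds shrink.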

Exercises

Tom Paskhalis
