Open RStudio and explore the programme. Make sure you can identify the console, the script editor, the environment window, the packages window and the help window!
Create a vector of integers using the c() function, with each argument separated by a comma. So, a character vector of length three could be c("a", "b", "c"). Assign your integer vector using the assignment operator <- to an object with a name of your choosing.
my_vec <- c(1, 2, 3, 4, 5)
my_new_vec <- my_vec * 3
print(my_new_vec)
## [1] 3 6 9 12 15
print(my_new_vec + my_vec)
## [1] 4 8 12 16 20
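The arithmetic above is applied element by element. The following is a minimal sketch of the same idea, which also shows how R recycles a shorter vector; the names v and w are illustrative and not part of the exercise.

```r
# Illustrative vectors; the names v and w are hypothetical
v <- c(1, 2, 3, 4)
w <- c(10, 20)

# A single number is applied to every element
v * 3    # 3 6 9 12

# Two vectors of equal length are combined position by position
v + v    # 2 4 6 8

# A shorter vector is recycled: w is reused as 10, 20, 10, 20
v + w    # 11 22 13 24
```

Recycling is silent when the longer length is a multiple of the shorter one, so it is worth checking vector lengths before combining them.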
Create a logical vector using the c() function. Make sure that you have a mix of TRUE and FALSE values in the vector. Use the logical vector to subset the numeric vector that you created in question 2 and print the result.
my_logical_vec <- c(TRUE, TRUE, TRUE, FALSE, FALSE)
print(my_new_vec[my_logical_vec])
## [1] 3 6 9
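A logical vector does not have to be typed by hand. As a hedged sketch, it can also be produced by a comparison and then used for subsetting in the same way; the names x and keep are illustrative.

```r
# Same values as my_new_vec above, recreated so the example is self-contained
x <- c(3, 6, 9, 12, 15)

# A comparison returns one TRUE/FALSE per element
keep <- x > 6    # FALSE FALSE TRUE TRUE TRUE

# Subsetting keeps only the positions where keep is TRUE
x[keep]          # 9 12 15

# TRUE counts as 1, so sum() gives the number of matches
sum(keep)        # 3

# which() returns the matching positions
which(keep)      # 3 4 5
```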
Assign the first two elements of my_new_vec to a new object called my_short_vector.
my_short_vector <- my_new_vec[c(1, 2)]
This exercise relates to the College data set, which comes from An Introduction to Statistical Learning by James et al. (2013). It contains a number of variables for 777 different universities and colleges in the US.
The variables are:
Private: Public/private indicator
Apps: Number of applications received
Accept: Number of applicants accepted
Enroll: Number of new students enrolled
Top10perc: New students from top 10% of high school class
Top25perc: New students from top 25% of high school class
F.Undergrad: Number of full-time undergraduates
P.Undergrad: Number of part-time undergraduates
Outstate: Out-of-state tuition
Room.Board: Room and board costs
Books: Estimated book costs
Personal: Estimated personal spending
PhD: Percent of faculty with Ph.D.’s
Terminal: Percent of faculty with terminal degree
S.F.Ratio: Student/faculty ratio
perc.alumni: Percent of alumni who donate
Expend: Instructional expenditure per student
Grad.Rate: Graduation rate
You can either download the .csv file containing the data from the MY591 Moodle page, or read the data in directly from the website.
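Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the directory set to the correct location for the data. You can load this in R directly from the website, using:
college <- read.csv("http://www-bcf.usc.edu/~gareth/ISL/College.csv")
Or you can load it from a saved file, using:
college <- read.csv("path_to_my_file/College.csv")
In R 4.0.0 and later, read.csv() no longer converts text columns to factors by default. Add the argument stringsAsFactors = TRUE to either call if you want Private to be read as a factor, as in the str() output shown later.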
Look at the data using the View() function. You should notice that the first column is just the name of each university. We don’t really want R to treat this as data. However, it may be handy to have these names for later. Try the following commands:
rownames(college) <- college[, 1]
View(college)
You should see that there is now a row.names column with the name of each university recorded. This means that R has given each row a name corresponding to the appropriate university. R will not try to perform calculations on the row names. However, we still need to eliminate the first column in the data where the names are stored. Try:
college <- college[, -1]
View(college)
Now you should see that the first data column is Private. Note that another column labeled row.names now appears before the Private column. However, this is not a data column but rather the name that R is giving to each row.
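The two steps above (copying the first column into the row names, then dropping it) can also be done while reading the file, using the row.names argument of read.csv(). The sketch below uses a tiny made-up CSV passed through the text argument so that it runs without the real file; the object names csv_text and toy and the values are illustrative assumptions.

```r
# A tiny inline CSV standing in for College.csv; the values are made up
csv_text <- "Name,Private,Apps
College A,Yes,1660
College B,No,2186"

# row.names = 1 uses the first column as row names and leaves it out of the data
toy <- read.csv(text = csv_text, row.names = 1, stringsAsFactors = TRUE)

rownames(toy)    # "College A" "College B"
names(toy)       # "Private" "Apps"
```

With the real file, the same arguments would be added to the read.csv() call used earlier.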
Use the str() function to look at the structure of the data. Which of the variables are numeric? Which are integer? Which are factors?
str(college)
## 'data.frame': 777 obs. of 18 variables:
## $ Private : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Apps : int 1660 2186 1428 417 193 587 353 1899 1038 582 ...
## $ Accept : int 1232 1924 1097 349 146 479 340 1720 839 498 ...
## $ Enroll : int 721 512 336 137 55 158 103 489 227 172 ...
## $ Top10perc : int 23 16 22 60 16 38 17 37 30 21 ...
## $ Top25perc : int 52 29 50 89 44 62 45 68 63 44 ...
## $ F.Undergrad: int 2885 2683 1036 510 249 678 416 1594 973 799 ...
## $ P.Undergrad: int 537 1227 99 63 869 41 230 32 306 78 ...
## $ Outstate : int 7440 12280 11250 12960 7560 13500 13290 13868 15595 10468 ...
## $ Room.Board : int 3300 6450 3750 5450 4120 3335 5720 4826 4400 3380 ...
## $ Books : int 450 750 400 450 800 500 500 450 300 660 ...
## $ Personal : int 2200 1500 1165 875 1500 675 1500 850 500 1800 ...
## $ PhD : int 70 29 53 92 76 67 90 89 79 40 ...
## $ Terminal : int 78 30 66 97 72 73 93 100 84 41 ...
## $ S.F.Ratio : num 18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
## $ perc.alumni: int 12 16 30 37 2 11 26 37 23 15 ...
## $ Expend : int 7041 10527 8735 19016 10922 9727 8861 11487 11644 8991 ...
## $ Grad.Rate : int 60 56 54 59 15 55 63 73 80 52 ...
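Reading the types off the str() output works well for 18 variables. As a complementary sketch, the class of every column can also be collected programmatically with sapply(); the toy data frame below is a hypothetical stand-in with one column of each type seen above.

```r
# Hypothetical three-column stand-in for the college data
toy <- data.frame(
  Private   = factor(c("Yes", "No", "Yes")),
  Apps      = c(1660L, 2186L, 1428L),
  S.F.Ratio = c(18.1, 12.2, 12.9)
)

# One class name per column, returned as a named character vector
col_classes <- sapply(toy, class)
col_classes

# Names of the columns that are factors
names(toy)[col_classes == "factor"]    # "Private"

# How many columns there are of each class
table(col_classes)
```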
Use the summary() function to produce a numerical summary of the variables in the data set.
summary(college)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate
## Min. : 10.00
## 1st Qu.: 53.00
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
What are the mean and standard deviation of the Enroll and Top10perc variables?
mean(college$Enroll)
## [1] 779.973
mean(college$Top10perc)
## [1] 27.55856
sd(college$Enroll)
## [1] 929.1762
sd(college$Top10perc)
## [1] 17.64036
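Calling mean() and sd() once per variable is fine for two columns. For more columns, one possible approach is to apply both functions to every column at once with sapply(). The data frame toy below is a made-up stand-in, and the name stats is illustrative.

```r
# Made-up values standing in for two columns of the college data
toy <- data.frame(
  Enroll    = c(721, 512, 336, 137),
  Top10perc = c(23, 16, 22, 60)
)

# For each column, return a named pair of summary statistics
stats <- sapply(toy, function(col) c(mean = mean(col), sd = sd(col)))

# The result is a matrix with one row per statistic and one column per variable
stats
stats["mean", "Enroll"]       # 426.5
stats["mean", "Top10perc"]    # 30.25
```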
Now remove the 15th through 85th observations. What are the mean and standard deviation of the Enroll and Top10perc variables in the subset of the data that remains?
college <- college[-c(15:85), ]
mean(college$Enroll)
## [1] 784.8867
mean(college$Top10perc)
## [1] 27.51841
sd(college$Enroll)
## [1] 928.6599
sd(college$Top10perc)
## [1] 17.61432
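Negative indices drop rows rather than select them. A small sketch on a hypothetical ten-row data frame makes the effect easy to check; the names toy and subset_toy are illustrative.

```r
# Hypothetical data frame with ten rows
toy <- data.frame(id = 1:10, value = (1:10)^2)

# Drop rows 3 to 5; the comma with nothing after it keeps all columns
subset_toy <- toy[-c(3:5), ]

nrow(subset_toy)    # 7
subset_toy$id       # 1 2 6 7 8 9 10
```

Storing the result under a new name, as here, leaves the full data available. Overwriting the original object, as in the solution above, means that running the same line a second time removes a further block of rows.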
What is the range of the Books variable?
range(college$Books)
## [1] 110 2340
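Note that range() returns the minimum and the maximum as a vector of length two, not a single number. A brief sketch with made-up book costs (the names books and r are illustrative):

```r
# Made-up book costs
books <- c(450, 750, 400, 800, 500)

# range() returns the smallest and largest value
r <- range(books)    # 400 800

# The same two numbers, obtained separately
c(min(books), max(books))

# The width of the range is the difference between them
diff(r)              # 400
```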
Use the pairs() function to produce a scatterplot matrix of the first five columns or variables of the data. Recall that you can reference the first five columns of a matrix A using A[,1:5].
pairs(college[, 1:5])
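Columns can be selected by name as well as by position, which is often easier to read and does not break if the column order changes. A minimal sketch on a hypothetical data frame:

```r
# Hypothetical data frame with four columns
toy <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12)

# Selecting the first two columns by position
by_position <- toy[, 1:2]

# Selecting the same columns by name
by_name <- toy[, c("a", "b")]

# Both approaches return the same data frame
identical(by_position, by_name)    # TRUE
```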
Use the plot() function to produce a scatter plot of S.F.Ratio versus Grad.Rate. Give the axes informative labels.
plot(college$S.F.Ratio, college$Grad.Rate, xlab = "Student/Faculty Ratio", ylab = "Graduation Rate")
See ?plot and ?par for some ideas on customising your plot. If you are feeling very keen, try using ggplot, but you will need to load the ggplot2 library first: library(ggplot2).
plot(college$S.F.Ratio, college$Grad.Rate,
     xlab = "Student/Faculty Ratio", ylab = "Graduation Rate",
     main = "A really nice plot",
     pch = 19, col = "gray", bty = "n")
library(ggplot2)
ggplot(data = college, aes(x = S.F.Ratio, y = Grad.Rate, col = Private)) +
  geom_point() +
  xlab("Student/Faculty Ratio") +
  ylab("Graduation Rate") +
  ggtitle("A really nice plot") +
  facet_grid(~Private)