POP77001 Computer Programming for Social Scientists
# A tibble: 1 × 296
`Duration (in seconds)` Q2 Q3 Q4 Q5 Q6_1 Q6_2 Q6_3 Q6_4 Q6_5
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 121 30-34 Man India No <NA> <NA> <NA> <NA> <NA>
# ℹ 286 more variables: Q6_6 <chr>, Q6_7 <chr>, Q6_8 <chr>, Q6_9 <chr>,
# Q6_10 <chr>, Q6_11 <chr>, Q6_12 <chr>, Q7_1 <chr>, Q7_2 <chr>, Q7_3 <chr>,
# Q7_4 <chr>, Q7_5 <chr>, Q7_6 <chr>, Q7_7 <chr>, Q8 <chr>, Q9 <chr>,
# Q10_1 <chr>, Q10_2 <chr>, Q10_3 <chr>, Q11 <chr>, Q12_1 <chr>, Q12_2 <chr>,
# Q12_3 <chr>, Q12_4 <chr>, Q12_5 <chr>, Q12_6 <chr>, Q12_7 <chr>,
# Q12_8 <chr>, Q12_9 <chr>, Q12_10 <chr>, Q12_11 <chr>, Q12_12 <chr>,
# Q12_13 <chr>, Q12_14 <chr>, Q12_15 <chr>, Q13_1 <chr>, Q13_2 <chr>, …
\[ \stackrel{\text{female}}{ \begin{bmatrix} 1 \\ 0 \\ 1 \\ \vdots \\ 1 \end{bmatrix} } \]
\[ \stackrel{{\scriptstyle \text{25-34}\,\text{35-44}\,\text{45-64}\,\text{65+}}}{ \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ \vdots & \vdots & \vdots & \vdots\\ 0 & 0 & 0 & 0 \end{bmatrix} } \]
Here the first row corresponds to a respondent who is between 25 and 34 years old, the second to someone between 35 and 44, and the third to a participant who is 65 or older. Note that the number of columns in this matrix is one fewer than the number of levels of our imaginary categorical variable age, because we omit the baseline (reference) category. Membership in that category can still be recovered from the matrix: if the values in all columns are \(0\) (as in the last row above), the observation must come from a respondent in the 18-24 age group.
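To make the encoding concrete, here is a minimal base R sketch that builds such an indicator matrix by hand. The vector `age` and its four toy values are made up for illustration and are not taken from the survey data; the level labels follow the matrix above.

```r
# Toy data: an imaginary categorical age variable with five levels.
# The first level ("18-24") acts as the baseline (reference) category.
age <- factor(
  c("25-34", "35-44", "65+", "18-24"),
  levels = c("18-24", "25-34", "35-44", "45-64", "65+")
)

# One indicator column per non-baseline level:
# each column is 1 where the respondent falls in that age group, 0 otherwise.
dummies <- sapply(levels(age)[-1], function(lvl) as.integer(age == lvl))

print(dummies)

# A row of all zeros identifies a respondent in the baseline category.
is_baseline <- rowSums(dummies) == 0
print(is_baseline)
```

Dropping the first level with `levels(age)[-1]` is what keeps the number of columns one below the number of levels; the all-zero check at the end recovers the omitted group.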
- Use model.matrix() from base R and apply it to the dataset to get a design matrix (you need to specify a formula as the first argument).
- Add a unique identifier for each respondent (you can use the seq_along() function in combination with any other variable to do so).
- Use the pivot_wider function from the tidyr package to create a separate column for each age group.
- Where the resulting columns contain NA’s, use the mutate function to replace the values with \(0\)’s and \(1\)’s.
- Use the pivot_longer function to convert this representation of the dataset back into its original form.
- Use the dplyr::filter() function to remove redundant rows.
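The steps above could be sketched roughly as follows on a toy data frame. This is one possible approach under stated assumptions, not the reference solution: the tibble `toy`, the helper column `value`, and the object names `design`, `wide` and `long` are all hypothetical, and the native pipe `|>` assumes R 4.1 or later.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for the survey's age question.
toy <- tibble(
  age = factor(
    c("25-34", "35-44", "65+", "18-24"),
    levels = c("18-24", "25-34", "35-44", "45-64", "65+")
  )
)

# Base R: the formula is the first argument. The result has an intercept
# column plus one indicator column per non-baseline level of age.
design <- model.matrix(~ age, data = toy)

# tidyr: add an id, spread the observed age groups into columns,
# then recode NA to 0 and everything else to 1.
# 'value' is a helper column introduced only for this sketch.
wide <- toy |>
  mutate(id = seq_along(age), age = as.character(age), value = 1L) |>
  pivot_wider(names_from = age, values_from = value) |>
  mutate(across(-id, ~ if_else(is.na(.x), 0L, 1L)))

# Back to long form: one row per id and age group,
# then drop the redundant rows where the indicator is 0.
long <- wide |>
  pivot_longer(-id, names_to = "age", values_to = "value") |>
  filter(value == 1L) |>
  select(id, age)
```

Note that `pivot_wider` only creates columns for age groups that actually occur in the data, so `wide` has no `45-64` column here, whereas `model.matrix()` creates a column for every non-baseline factor level.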