POP77032 Quantitative Text Analysis for Social Scientists
RodriguezSpirling2022_CR.rds dataset. You can find a description of it in Assignment 3.
speech_id
9 1020000009
13 1020000013
23 1020000023
28 1020000028
29 1020000029
30 1020000030
speech
9 respectively advanced to the desk of the vice president the oath prescribed by law was administered to them by the vice president and they severally subscribed to ahe oath in the official oath book
13 respectively advanced to the desk of the vice president the oath prescribed by law was administered to them by the vice president and they severally subscribed to the oath in the official oath book
23 respectively advanced to the desk of the vice president the oath prescribed by law was administered to them by the vice president and they severally subscribed to the oath in the official oath book
28 mr president i will momentarily suggest the absence of a quorum so that the roll will be called and a quorum established for the purpose of beginning the proceedings of this senate but i believe it appropriate to note at this time for the members of the senate and for all americans that senator thurmond has just taken the oath of office to the senate for the eighth time
29 this is one other respect in which he is unique among many
30 mr president i suggest absence of a quorum
   speakerid lastname  firstname chamber state gender party district nonvoting
9  102110161 DOLE      ROBERT    S       KS    M      R              voting
13 102110161 DOLE      ROBERT    S       KS    M      R              voting
23 102112641 MURKOWSKI FRANK     S       AK    M      R              voting
28 102108401 MITCHELL  GEORGE    S       ME    M      D              voting
29 102108401 MITCHELL  GEORGE    S       ME    M      D              voting
30 102108401 MITCHELL  GEORGE    S       ME    M      D              voting
session_id
9 102
13 102
23 102
28 102
29 102
30 102
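As a minimal sketch of the loading step, the snippet below writes a small stand-in data frame to a temporary .rds file and reads it back with readRDS(). The stand-in holds only the metadata of the six rows printed above. The object names cr_demo and cr, the column types, and the temporary file are assumptions made for illustration; with the real file you would pass its actual path to readRDS().

```r
# Stand-in for the real file: metadata of the six rows printed above.
# Column types here are an assumption; the real dataset may store them differently.
cr_demo <- data.frame(
  speech_id  = c(1020000009, 1020000013, 1020000023,
                 1020000028, 1020000029, 1020000030),
  lastname   = c("DOLE", "DOLE", "MURKOWSKI",
                 "MITCHELL", "MITCHELL", "MITCHELL"),
  party      = c("R", "R", "R", "D", "D", "D"),
  session_id = 102,
  stringsAsFactors = FALSE
)

# Round-trip through a temporary .rds file to illustrate readRDS().
# With the real data this would be something like
# readRDS("RodriguezSpirling2022_CR.rds"), assuming the file is in
# the working directory.
tmp <- tempfile(fileext = ".rds")
saveRDS(cr_demo, tmp)
cr <- readRDS(tmp)

# Quick structural check and a count of speeches per party
str(cr)
speeches_per_party <- table(cr$party)
print(speeches_per_party)
```

On the full dataset the same two calls, str() and table(), give a quick overview of the columns and of how speeches are distributed across parties.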
Here is an example using a synthetic dataset for the prompt words (cues) “soccer”, “computer” and “potato”.
set.seed(123)
n <- 100
cues <- c("soccer", "computer", "potato")
models <- c("6_300", "human")

# Top 10 nearest neighbours per cue per model
# Overlap: soccer = 7/10, computer = 5/10, potato = 6/10
nn_list <- list(
  "6_300" = list(
    soccer   = c("goal", "pitch", "penalty", "player", "match", "kick", "score",
                 "midfielder", "dribble", "goalkeeper"),
    computer = c("keyboard", "software", "monitor", "program", "mouse",
                 "algorithm", "compiler", "cursor", "desktop", "processor"),
    potato   = c("fries", "mashed", "chips", "roasted", "vegetable", "boiled",
                 "starch", "tuber", "harvest", "baked")
  ),
  "human" = list(
    soccer   = c("goal", "pitch", "penalty", "player", "match", "kick", "score",
                 "football", "team", "coach"),
    computer = c("keyboard", "software", "monitor", "program", "mouse",
                 "laptop", "internet", "data", "email", "screen"),
    potato   = c("fries", "mashed", "chips", "roasted", "vegetable", "boiled",
                 "food", "carrot", "soup", "garden")
  )
)

syn_mturk <- data.frame(
  cue = sample(cues, n, replace = TRUE),
  left.source = sample(models, n, replace = TRUE),
  stringsAsFactors = FALSE
)

# right.source differs from left.source
syn_mturk$right.source <- sapply(syn_mturk$left.source,
                                 \(s) sample(setdiff(models, s), 1))

# Draw one word from each model's top-10 list for the given cue
syn_mturk$left.word <- mapply(\(src, cue) sample(nn_list[[src]][[cue]], 1),
                              syn_mturk$left.source, syn_mturk$cue)
syn_mturk$right.word <- mapply(\(src, cue) sample(nn_list[[src]][[cue]], 1),
                               syn_mturk$right.source, syn_mturk$cue)

# left.choice and right.choice are mutually exclusive
syn_mturk$left.choice <- sample(c(TRUE, FALSE), n, replace = TRUE)
syn_mturk$right.choice <- !syn_mturk$left.choice
head(syn_mturk)
       cue left.source right.source left.word right.word left.choice right.choice
1   potato       6_300        human   roasted     boiled       FALSE         TRUE
2   potato       human        6_300     chips      baked       FALSE         TRUE
3   potato       human        6_300      soup  vegetable       FALSE         TRUE
4 computer       6_300        human  keyboard    program        TRUE        FALSE
5   potato       6_300        human    mashed     mashed       FALSE         TRUE
6 computer       6_300        human   program   keyboard       FALSE         TRUE
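The overlap figures quoted in the code comment (soccer = 7/10, computer = 5/10, potato = 6/10) can be checked directly with intersect(). The sketch below copies the top-10 lists from nn_list into two hypothetical objects, emb_nn and human_nn, so that it runs on its own; these names are not part of the original code.

```r
# Top-10 lists copied from nn_list; emb_nn and human_nn are illustrative names
emb_nn <- list(
  soccer   = c("goal", "pitch", "penalty", "player", "match", "kick", "score",
               "midfielder", "dribble", "goalkeeper"),
  computer = c("keyboard", "software", "monitor", "program", "mouse",
               "algorithm", "compiler", "cursor", "desktop", "processor"),
  potato   = c("fries", "mashed", "chips", "roasted", "vegetable", "boiled",
               "starch", "tuber", "harvest", "baked")
)
human_nn <- list(
  soccer   = c("goal", "pitch", "penalty", "player", "match", "kick", "score",
               "football", "team", "coach"),
  computer = c("keyboard", "software", "monitor", "program", "mouse",
               "laptop", "internet", "data", "email", "screen"),
  potato   = c("fries", "mashed", "chips", "roasted", "vegetable", "boiled",
               "food", "carrot", "soup", "garden")
)

# Number of words shared by the two lists, per cue
overlap <- sapply(names(emb_nn),
                  \(cue) length(intersect(emb_nn[[cue]], human_nn[[cue]])))

# Share of the top 10 that the two sources have in common
print(overlap / 10)
```

Because the two lists share this many words, some sampled pairs show the same word on both sides, as in row 5 of the output above.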
Because the choices in this dataset were generated at random, it is no surprise that the performance of the two models is indistinguishable.
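One way to make this comparison concrete is to count how often each source's word was chosen and test that share against 0.5. The sketch below is only an illustration: it uses the six rows printed by head(syn_mturk) as a stand-in (demo is a hypothetical name), and the exact binomial test is a choice made here rather than a method prescribed by the assignment. On the full data you would replace demo with syn_mturk.

```r
# Stand-in: the six rows shown by head(syn_mturk) above
demo <- data.frame(
  left.source  = c("6_300", "human", "human", "6_300", "6_300", "6_300"),
  right.source = c("human", "6_300", "6_300", "human", "human", "human"),
  left.choice  = c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Source of the word that was picked in each comparison
chosen <- ifelse(demo$left.choice, demo$left.source, demo$right.source)

# Wins per source and the corresponding shares
n_model <- sum(chosen == "6_300")
n_human <- sum(chosen == "human")
win_share <- c("6_300" = n_model, human = n_human) / length(chosen)
print(win_share)

# Exact binomial test of H0: the 6_300 word is chosen half of the time
bt <- binom.test(n_model, length(chosen), p = 0.5)
print(bt$p.value)
```

In these six rows each source is chosen three times, so the share is 0.5 for both and the test gives no evidence against equal performance. Six rows are far too few for a real conclusion; the point is only the shape of the calculation.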
RodriguezSpirling2022_CR.rds dataset