POP77142 Quantitative Text Analysis for Social Scientists
| | battle | good | fool | wit |
|---|---|---|---|---|
| As You Like It | 1 | 114 | 36 | 20 |
| Twelfth Night | 0 | 80 | 58 | 15 |
| Julius Caesar | 7 | 62 | 1 | 2 |
| Henry V | 13 | 89 | 4 | 3 |
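The plotting code that follows expects an object called `dtm`, which is not defined in this section. As a minimal sketch, assuming `dtm` is an ordinary base R matrix with plays as rows and words as columns, the table above could be entered by hand like this:

```r
# Counts copied from the table above: rows are plays, columns are words.
dtm <- matrix(
  c( 1, 114, 36, 20,
     0,  80, 58, 15,
     7,  62,  1,  2,
    13,  89,  4,  3),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("As You Like It", "Twelfth Night", "Julius Caesar", "Henry V"),
    c("battle", "good", "fool", "wit")
  )
)

# Each row is one document vector over the four words.
dtm["Henry V", ]

# Keeping only two words gives the 2-D coordinates that can be plotted.
dtm[, c("fool", "battle")]
```

In practice the matrix would come from a tokenised corpus rather than being typed in; the hand-built version is only meant to make the example reproducible.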
library("ggplot2")

# Plot each play as a vector in the space spanned by "fool" and "battle"
dtm |>
  (\(dtm) data.frame(
    docs = rownames(dtm),
    x = dtm[, "fool"],
    y = dtm[, "battle"]
  ))() |>
  ggplot(aes(x = 0, y = 0)) +
  geom_segment(aes(xend = x, yend = y),
               arrow = arrow(type = "closed", length = unit(0.25, "inches")),
               color = "navy", linewidth = 1) +
  geom_text(aes(x = x, y = y,
                label = paste0(docs, " [", x, ",", y, "]")),
            hjust = -0.1, vjust = -0.5, size = 4) +
  labs(x = "fool", y = "battle") +
  theme_minimal() +
  theme(
    axis.title = element_text(face = "bold.italic", color = "navy", size = 14),
    axis.text = element_text(size = 12)
  ) +
  xlim(0, 80) +
  ylim(0, 30)

| | As You Like It | Twelfth Night | Julius Caesar | Henry V |
|---|---|---|---|---|
| battle | 1 | 0 | 7 | 13 |
| good | 114 | 80 | 62 | 89 |
| fool | 36 | 58 | 1 | 4 |
| wit | 20 | 15 | 2 | 3 |
Similar to document vectors, we can also represent words as vectors.
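As a minimal sketch of this idea, the term-document table above can be entered as a base R matrix whose rows are word vectors. The `cosine` helper is a hypothetical addition for illustration: cosine similarity is one common way to compare such vectors, and it is an assumption here rather than something the text above prescribes.

```r
# Counts copied from the term-document table above: rows are words, columns are plays.
tdm <- matrix(
  c(  1,  0,  7, 13,
    114, 80, 62, 89,
     36, 58,  1,  4,
     20, 15,  2,  3),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("battle", "good", "fool", "wit"),
    c("As You Like It", "Twelfth Night", "Julius Caesar", "Henry V")
  )
)

# Each row is now a word vector over the four plays.
tdm["fool", ]

# Hypothetical helper: cosine similarity between two numeric vectors.
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

cos_fool_wit    <- cosine(tdm["fool", ], tdm["wit", ])
cos_fool_battle <- cosine(tdm["fool", ], tdm["battle", ])
c(fool_wit = cos_fool_wit, fool_battle = cos_fool_battle)
```

With these counts, "fool" points in nearly the same direction as "wit" (both concentrate in the comedies) and in a very different direction from "battle".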
library("ggplot2")

# Plot each word as a vector in the space spanned by "Twelfth Night" and "Henry V"
tdm |>
  (\(tdm) data.frame(
    words = rownames(tdm),
    x = tdm[, "Twelfth Night"],
    y = tdm[, "Henry V"]
  ))() |>
  ggplot(aes(x = 0, y = 0)) +
  geom_segment(aes(xend = x, yend = y),
               arrow = arrow(type = "closed", length = unit(0.25, "inches")),
               color = "navy", linewidth = 1) +
  geom_text(aes(x = x, y = y,
                label = paste0(words, " [", x, ",", y, "]")),
            hjust = -0.1, vjust = -0.5, size = 4) +
  labs(x = "Twelfth Night", y = "Henry V") +
  theme_minimal() +
  theme(
    axis.title = element_text(face = "bold.italic", color = "navy", size = 14),
    axis.text = element_text(size = 12)
  ) +
  xlim(0, 100) +
  ylim(0, 100)

You shall know a word by the company it keeps.
J.R. Firth, 1957
Keyword-in-context with 10 matches.
[text28, 370] enacted apparently | taoiseach | tánaiste’s office
[text32, 249] ireland background | taoiseach | asked attorney
[text77, 477] months ago | taoiseach | said build
[text84, 57] indeed elected | taoiseach | venue strange
[text84, 81] teams department | taoiseach | office tánaiste
[text188, 25] asking wanted | taoiseach | answer simple
[text219, 77] going election | taoiseach | call election
[text273, 5] minister tánaiste | taoiseach | minister state
[text344, 10] expenditure reform | taoiseach | pursuant standing
[text512, 7] expenditure reform | taoiseach | completed consideration
minister ask said department government tánaiste
0.016495321 0.014144115 0.013157531 0.012263151 0.011774469 0.009211194
deputy can thank time issue raised
0.008492001 0.008270711 0.007505417 0.007265686 0.005320179 0.005320179
know asked question us now last
0.005117330 0.004932921 0.004564105 0.004564105 0.004536444 0.004416578
give aware
0.004269052 0.004259831
Could we have guessed the word from the provided context?
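The proportions above summarise which words appear near the keyword. A rough base R sketch of that idea follows, assuming a plain character vector of tokens; the token sequence is invented for illustration and `context_words` is a hypothetical helper, not a function from any package used in these slides.

```r
# Toy token sequence (illustrative only, not taken from the corpus).
toks <- c("minister", "asked", "taoiseach", "answer", "question",
          "deputy", "asked", "taoiseach", "call", "election")

# Hypothetical helper: collect the tokens within `window` positions of each match.
context_words <- function(toks, keyword, window = 2) {
  hits <- which(toks == keyword)
  unlist(lapply(hits, function(i) {
    # Positions around the match, clipped to the sequence and excluding the match itself.
    idx <- setdiff(max(1, i - window):min(length(toks), i + window), i)
    toks[idx]
  }))
}

ctx <- context_words(toks, "taoiseach", window = 2)

# Relative frequency of each context word, most frequent first.
ctx_prop <- sort(prop.table(table(ctx)), decreasing = TRUE)
ctx_prop
```

A word whose context distribution looks like this one is likely to play a similar role in the text, which is the intuition behind the question above.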
Feature co-occurrence matrix of: 157,414 by 157,414 features.
features
features seanad éireann accepted finance bill 2024 without
seanad 58 565 19 7 530 33 29
éireann 565 184 5 16 50 6 5
accepted 19 5 98 20 69 1 16
finance 7 16 20 176 1134 27 28
bill 530 50 69 1134 2056 535 340
2024 33 6 1 27 535 78 47
without 29 5 16 28 340 47 274
recommendation 5 4 101 12 36 15 16
wish 7 5 7 17 141 7 12
advise 1 8 0 4 12 0 1
features
features recommendation wish advise
seanad 5 7 1
éireann 4 5 8
accepted 101 7 0
finance 12 17 4
bill 36 141 12
2024 15 7 0
without 16 12 1
recommendation 106 6 3
wish 6 94 569
advise 3 569 10
[ reached max_feat ... 157,404 more features, reached max_nfeat ... 157,404 more features ]
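To make the structure of such a matrix concrete, here is a small base R sketch that counts how often two words occur within a fixed window of each other. It is only an illustration under stated assumptions: `cooccurrence` is a hypothetical helper, the tokens are invented, and conventions such as how the diagonal is counted or whether distant pairs are down-weighted differ between implementations.

```r
# Hypothetical helper: symmetric co-occurrence counts within `window` tokens.
cooccurrence <- function(toks, window = 2) {
  vocab <- unique(toks)
  m <- matrix(0, length(vocab), length(vocab), dimnames = list(vocab, vocab))
  n <- length(toks)
  for (i in seq_len(n - 1)) {
    # Look ahead only, then mirror, so each pair of positions is counted once.
    for (j in (i + 1):min(n, i + window)) {
      m[toks[i], toks[j]] <- m[toks[i], toks[j]] + 1
      if (toks[i] != toks[j]) {
        m[toks[j], toks[i]] <- m[toks[j], toks[i]] + 1
      }
    }
  }
  m
}

# Toy token sequence (illustrative only).
toks <- c("minister", "asked", "taoiseach", "minister", "answered")
m <- cooccurrence(toks, window = 2)
m
```

Each row of the result can be read as a word vector whose dimensions are other words rather than documents.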
Feature co-occurrence matrix of: 6 by 6 features.
features
features minister government tánaiste taoiseach party election
minister 7792 4988 1204 1789 670 85
government 4988 4896 517 1277 1185 168
tánaiste 1204 517 68 999 81 9
taoiseach 1789 1277 999 308 131 67
party 670 1185 81 131 700 60
election 85 168 9 67 60 154
\[ PMI(w, c) = \log_2 \frac{P(w, c)}{P(w) P(c)} \]
\[ P(w = taoiseach, c = party) = \frac{131}{40378} = 0.0032 \]
\[ P(w = taoiseach) = \frac{4571}{40378} = 0.1132 \]
\[ P(c = party) = \frac{2827}{40378} = 0.07 \]
\[ PMI(w = taoiseach, c = party) = \log_2 \frac{0.0032}{0.1132 \times 0.07} = -1.31 \]
\[ PPMI(w = taoiseach, c = party) = \max(PMI(w, c), 0) = 0 \]
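The same calculation can be carried out for every cell at once. The sketch below re-enters the 6 by 6 counts as a base R matrix (the object name `fcm_mat` is an assumption for illustration) and applies the PMI formula directly. Because it does not round the intermediate probabilities, it gives roughly -1.29 for the taoiseach and party pair, slightly different from the -1.31 obtained from the rounded figures above.

```r
feats <- c("minister", "government", "tánaiste", "taoiseach", "party", "election")

# Counts copied from the 6 by 6 co-occurrence matrix above.
fcm_mat <- matrix(
  c(7792, 4988, 1204, 1789,  670,  85,
    4988, 4896,  517, 1277, 1185, 168,
    1204,  517,   68,  999,   81,   9,
    1789, 1277,  999,  308,  131,  67,
     670, 1185,   81,  131,  700,  60,
      85,  168,    9,   67,   60, 154),
  nrow = 6, byrow = TRUE,
  dimnames = list(feats, feats)
)

total <- sum(fcm_mat)               # 40378, the denominator used above
p_wc  <- fcm_mat / total            # joint probabilities P(w, c)
p_w   <- rowSums(fcm_mat) / total   # marginal probabilities P(w)
p_c   <- colSums(fcm_mat) / total   # marginal probabilities P(c)

# PMI for every word-context pair; outer() builds the matrix of P(w) * P(c).
pmi <- log2(p_wc / outer(p_w, p_c))

# Positive PMI replaces negative values with zero.
ppmi <- pmax(pmi, 0)

round(pmi["taoiseach", "party"], 2)
ppmi["taoiseach", "party"]
```

A count of zero would produce a PMI of negative infinity; clipping to zero in the PPMI step also takes care of that case.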
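We can use the text2vec package to fit a GloVe model to our co-occurrence matrix.

# Fit GloVe model
glove <- text2vec::GloVe$new(
  rank = 50,   # dimensionality of the word vectors
  x_max = 10   # co-occurrence counts at or above this value get full weight
)
wv <- glove$fit_transform(
  x = dail_33_fcm,
  n_iter = 10,              # maximum number of epochs
  convergence_tol = 0.01,   # stop early once the relative improvement in loss falls below this
  n_threads = 8
)

INFO [10:11:44.738] epoch 1, loss 0.1110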
INFO [10:11:46.881] epoch 2, loss 0.0857
INFO [10:11:49.002] epoch 3, loss 0.0757
INFO [10:11:51.120] epoch 4, loss 0.0711
INFO [10:11:53.249] epoch 5, loss 0.0682
INFO [10:11:55.368] epoch 6, loss 0.0662
INFO [10:11:57.489] epoch 7, loss 0.0648
INFO [10:11:59.621] epoch 8, loss 0.0637
INFO [10:12:01.758] epoch 9, loss 0.0628
INFO [10:12:03.899] epoch 10, loss 0.0621