POP77032 Quantitative Text Analysis for Social Scientists
The news.tsv file contains blobs of news articles, but do check the behaviors.tsv file as well, as it contains the actual news engagement metrics.

In R, we can use the quanteda.textmodels package to fit a Naive Bayes classifier.

# Simulate some textual data
set.seed(123)
n_docs <- 100; n_features <- 50; n_classes <- 2
vocab <- sample(quanteda::stopwords("en"), n_features)
classes <- sample(1:n_classes, n_docs, replace = TRUE)
# Simulate texts as random draws from the vocab with
# different word frequencies for each class
# Pre-allocate a character vector to hold the simulated texts
txts <- character(n_docs)
txts[classes == 1] <- vapply(
  1:sum(classes == 1),
  function(x) paste(
    sample(vocab, sample(10:20, 1), replace = TRUE,
           prob = c(rep(0.01, 25), rep(0.03, 25))),
    collapse = " "
  ),
  character(1)
)
txts[classes == 2] <- vapply(
  1:sum(classes == 2),
  function(x) paste(
    sample(vocab, sample(10:20, 1), replace = TRUE,
           prob = c(rep(0.03, 25), rep(0.01, 25))),
    collapse = " "
  ),
  character(1)
)
head(txts)

[1] "other didn't won't until both be until because both ours let's them they'd more until was isn't be"
[2] "each when his too ours them herself isn't he he's isn't here's let's he over"
[3] "couldn't such he's couldn't am other until more doing he we're why"
[4] "same be so until she'd who couldn't you'll a isn't he's up here's such herself ours to no let's"
[5] "why up very so very he until up he's them let's there as too"
[6] "do no do she'll she'd when until until why each few who"
# Create a document-feature matrix from the simulated texts
txts_dfm <- quanteda::dfm(quanteda::tokens(txts))
quanteda::docvars(txts_dfm, "id") <- 1:n_docs
quanteda::docvars(txts_dfm, "class") <- classes
# Split the documents into a training (80%) and a test (20%) set
train <- sample(1:n_docs, 0.8 * n_docs)
test <- setdiff(1:n_docs, train)
txts_dfm_train <- quanteda::dfm_subset(txts_dfm, subset = 1:n_docs %in% train)
txts_dfm_test <- quanteda::dfm_subset(txts_dfm, subset = 1:n_docs %in% test)
We can now fit the Naive Bayes classifier on the training set and inspect it:

nb_model <- quanteda.textmodels::textmodel_nb(txts_dfm_train, quanteda::docvars(txts_dfm_train, "class"))
summary(nb_model)
Call:
textmodel_nb.dfm(x = txts_dfm_train, y = quanteda::docvars(txts_dfm_train,
"class"))
Class Priors:
(showing first 2 elements)
1 2
0.5 0.5
Estimated Feature Scores:
other didn't won't until both be because ours let's
1 0.01207 0.01034 0.01207 0.01034 0.03103 0.006897 0.02586 0.008621 0.01552
2 0.03367 0.02196 0.02489 0.05124 0.01025 0.014641 0.01611 0.039531 0.02635
them they'd more was isn't each when his too
1 0.006897 0.03448 0.036207 0.03276 0.01034 0.027586 0.029310 0.032759 0.01034
2 0.027818 0.01025 0.005857 0.01171 0.03221 0.008785 0.005857 0.007321 0.03075
herself he he's here's over couldn't such am doing
1 0.02931 0.01379 0.02241 0.006897 0.02759 0.022414 0.02586 0.022414 0.03103
2 0.01171 0.02782 0.01611 0.024890 0.01025 0.008785 0.01025 0.008785 0.01025
we're why same
1 0.020690 0.01034 0.01379
2 0.001464 0.02635 0.02343
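To see where the class priors and feature scores above come from, here is a minimal Python sketch of Naive Bayes parameter estimation on a made-up toy count matrix (the data, variable names, and add-one smoothing constant are illustrative; quanteda's textmodel_nb() applies this kind of Laplace smoothing by default):

```python
import numpy as np

# Hypothetical toy document-feature counts: 4 docs, 3 features, 2 classes
X = np.array([[2, 0, 1],
              [1, 1, 0],
              [0, 2, 3],
              [0, 3, 2]])
y = np.array([1, 1, 2, 2])

# Class priors estimated from class frequencies: here 0.5/0.5,
# matching the uniform priors shown above
priors = np.array([np.mean(y == c) for c in (1, 2)])

# Per-class feature likelihoods with Laplace (add-one) smoothing:
# sum the counts within each class, add 1, and normalise per class
counts = np.vstack([X[y == c].sum(axis=0) for c in (1, 2)])
likelihoods = (counts + 1) / (counts + 1).sum(axis=1, keepdims=True)

print(priors)       # [0.5 0.5]
print(likelihoods)  # each row sums to 1
```

Each row of `likelihoods` corresponds to one class, just like the two rows of Estimated Feature Scores in the output above.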
Naive Bayes can only take into account features that occur in both the training and the test set, but we can make the feature sets identical using dfm_match() before predicting:

txts_dfm_test <- quanteda::dfm_match(txts_dfm_test, features = quanteda::featnames(txts_dfm_train))
predict(nb_model, newdata = txts_dfm_test, type = "probability")
docs 1 2
text2 4.725997e-02 9.527400e-01
text7 9.999994e-01 6.077527e-07
text8 9.999648e-01 3.517497e-05
text13 9.997856e-01 2.144390e-04
text15 1.492684e-01 8.507316e-01
text18 1.330139e-06 9.999987e-01
text22 1.717026e-03 9.982830e-01
text23 2.032627e-03 9.979674e-01
text25 3.922640e-04 9.996077e-01
text31 2.435133e-06 9.999976e-01
text36 2.812432e-01 7.187568e-01
text39 9.999993e-01 6.911338e-07
text57 1.814028e-03 9.981860e-01
text67 6.649492e-05 9.999335e-01
text70 4.873724e-02 9.512628e-01
text73 9.984123e-01 1.587729e-03
text78 9.998881e-01 1.118688e-04
text80 9.999931e-01 6.888002e-06
text84 9.325479e-01 6.745211e-02
text99 8.772226e-01 1.227774e-01
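In Python, the feature-alignment step that dfm_match() performs is handled by reusing the fitted vectorizer: words seen only in the test set are silently dropped. A small sketch with made-up texts (the example sentences are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["until both be until", "them they'd more until"]
test_texts = ["until some unseen words both"]

# Fit the vocabulary on the training texts only
vec = CountVectorizer()
X_train = vec.fit_transform(train_texts)

# Transforming the test texts with the same fitted vectorizer plays
# the role of dfm_match(): test-only words are not added as features
X_test = vec.transform(test_texts)
print(X_train.shape[1] == X_test.shape[1])  # True: same feature columns
```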
import numpy as np
from nltk.corpus import stopwords
rng = np.random.default_rng(123)
n_docs = 100
n_features = 50
n_classes = 2
vocab = rng.choice(stopwords.words("english"), size = n_features, replace = False).tolist()
classes = rng.integers(1, n_classes + 1, size = n_docs)
txts = [
" ".join(rng.choice(
vocab, size = rng.integers(10, 21), replace = True,
p = [0.01] * 25 + [0.03] * 25 if c == 1 else [0.03] * 25 + [0.01] * 25
))
for c in classes
]
txts[:6]

["shouldn after am can hasn with don doesn't again mustn", "t m shouldn't they'd doesn down no that t were no those those that'll if i'll they t", "no our with those they'll above we've we'd them didn't doesn't don m am mustn they mustn't were", "we'd did about hasn with doesn't with don where down don above wouldn were just they'd", "from couldn m mustn't by after mustn with we'll for should we'll again mustn't during", "shouldn they'll each that m couldn doesn't did them after we'll don should am where they'll it'd"]
In Python we can rely on the sklearn library to fit a Naive Bayes classifier.
MultinomialNB()
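Putting the Python pieces together, here is a minimal end-to-end sketch of the same workflow: simulate class-dependent texts, vectorize, split, fit MultinomialNB, and obtain class probabilities. It uses a small made-up vocabulary instead of the NLTK stopwords so it runs without downloading corpora; the variable names and the 80/20 split mirror the R code above but are otherwise illustrative:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(123)
# Made-up 10-word vocabulary: the first 5 words are rarer in class 1,
# the last 5 rarer in class 2
vocab = ["until", "both", "be", "them", "more",
         "was", "each", "when", "his", "too"]
classes = rng.integers(1, 3, size=100)
txts = [
    " ".join(rng.choice(
        vocab, size=rng.integers(10, 21),
        p=[0.05] * 5 + [0.15] * 5 if c == 1 else [0.15] * 5 + [0.05] * 5
    ))
    for c in classes
]

# Vectorize, split 80/20, fit, and predict class probabilities
X = CountVectorizer().fit_transform(txts)
X_train, X_test, y_train, y_test = train_test_split(
    X, classes, test_size=0.2, random_state=123)
nb = MultinomialNB().fit(X_train, y_train)
probs = nb.predict_proba(X_test)  # one row per test doc; rows sum to 1
```

`predict_proba` returns the same kind of per-class probability table that `predict(..., type = "probability")` produced in R above.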