Week 5: Supervised Modelling

POP77032 Quantitative Text Analysis for Social Scientists

Tom Paskhalis

Overview

  • Human Annotation
  • Reliability Measures
  • Training Set
  • Feature Engineering

Human Annotation

Manifesto Project Codebook: Economy

What is the Policy Category?

402 - Incentives: Positive

Wisdom of Crowds

(Thomas Barker, Amgueddfa Cymru – Museum Wales)

Vox Populi

A weight-judging competition was carried on at the annual show of the West of England Fat Stock and Poultry Exhibition recently held at Plymouth. A fat ox having been selected, competitors bought stamped and numbered cards, for 6d. each, on which to inscribe their respective names, addresses, and estimates of what the ox would weigh after it had been slaughtered and “dressed.” Those who guessed most successfully received prizes. About 800 tickets were issued, which were kindly lent me for examination after they had fulfilled their immediate purpose. These afforded excellent material. The judgments were unbiassed by passion and uninfluenced by oratory and the like. The sixpenny fee deterred practical joking, and the hope of a prize and the joy of competition prompted each competitor to do his best. The competitors included butchers and farmers, some of whom were highly expert in judging the weight of cattle; others were probably guided by such information as they might pick up, and by their own fancies. The average competitor was probably as well fitted for making a just estimate of the dressed weight of the ox, as an average voter is of judging the merits of most political issues on which he votes, and the variety among the voters to judge justly was probably much the same in either case.

Francis Galton (1907)

According to the democratic principle of “one vote one value,” the middlemost estimate expresses the vox populi, every other estimate being condemned as too low or too high by a majority of the voters […]. Now the middlemost estimate is 1207 lb., and the weight of the dressed ox proved to be 1198 lb.; so the vox populi was in this case 9 lb., or 0.8 per cent. of the whole weight too high.

Francis Galton (1907)

Wisdom of Crowds in Text Annotation

  • Benoit et al. (2016) use a crowd-sourcing platform to compare 200,000+ crowd-sourced annotations of sentences in UK party manifestos (1987-2010) to 100,000+ expert evaluations.

Could we just use an LLM?

  • Probably, but similar validation standards need to be applied.

Human Annotation

  • Any concept measured from text with no obvious quantitative benchmark (e.g. economic performance) needs to be extensively validated.
  • In social sciences this is typically done through human annotation or expert coding.
  • One should aim to have multiple annotators/experts labelling the same unit of analysis (sentences, speeches, etc.)
  • The units chosen for annotation should be relatively small (e.g. sentences, paragraphs) as human attention span is limited.
  • In addition, focussing on smaller units helps with making the task easier to scale.

Reliability vs Validity

(Krippendorff, 2019)

Evaluating Human Annotation

  • When assessing the quality of human annotations, several aspects should be considered:
\[ \begin{array}{l|l|l} \textbf{Type} & \textbf{Test Design} & \textbf{Cause of Disagreement} \\ \hline \text{Stability} & \text{test-retest} & \text{intraobserver inconsistencies} \\ \text{Reliability} & \text{test-test} & \text{interobserver disagreements} \\ \text{Validity} & \text{test-standard} & \text{deviations from a standard} \end{array} \]

Measures of Agreement

  • Percent agreement: \(\frac{\text{number of agreeing annotations}}{\text{total number of annotations}} \times 100\%\)
  • Correlation: Pearson’s \(r\) for continuous scales or Spearman’s \(\rho\) for ordinal.
  • Measures of agreement:
    • Account not only for observed agreement but also for the possibility of agreement occurring by chance.
    • Cohen’s \(\kappa\) - most common \[\kappa = \frac{P_o - P_e}{1 - P_e}\] where \(P_o\) is the observed agreement and \(P_e\) is the expected agreement by chance.
    • Krippendorff’s \(\alpha\) - a generalisation of Cohen’s \(\kappa\) to multiple annotators and different scales. \[\alpha = 1 - \frac{D_o}{D_e}\] where \(D_o\) is the observed disagreement and \(D_e\) is the expected disagreement by chance.

Reliability Data Matrix

  • A canonical representation of inter-coder agreement is a reliability data matrix.
  • Typically, it is arranged with annotators in rows and units in columns
    (see Krippendorff for more details).
  • In practice, you can always work with a transpose of this.
  • In the simplest case we have 2 annotators labelling binary data:
\[ \begin{array}{l|cccccccccc} \text{Unit} & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 \\ \hline \text{Coder 1} & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ \text{Coder 2} & 0 & 1 & 1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 \end{array} \]

Calculating Reliability

coder1 <- c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0)
coder2 <- c(0, 1, 1, 0, 0, 1, 0, 1, 0, 0)
  • Coder 1 and Coder 2 agree on 6 out of 10 units, so the percent agreement is \(60\%\), which might look reasonable at first glance.
sum(coder1 == coder2)
[1] 6
sum(coder1 == coder2)/length(coder1) * 100
[1] 60
  • There are 4 observed disagreements, with the following coincidence matrix:
table(coder2, coder1) + table(coder1, coder2)
      coder1
coder2  0  1
     0 10  4
     1  4  2
  • Note that unlike contingency tables, coincidence matrices are symmetrical around the diagonal.

Krippendorff’s \(\alpha\)

  • Looking back at our coincidence matrix:

\[ \begin{array}{c|cc|c} & 0 & 1 & \text{Total} \\ \hline 0 & 10 & 4 & 14 \\ 1 & 4 & 2 & 6 \\ \hline \text{Total} & 14 & 6 & 20 \end{array} \]

  • Had all annotations been made randomly, we would expect to observe the following coincidence matrix:

\[ \begin{array}{c|cc|c} & 0 & 1 & \text{Total} \\ \hline 0 & 9.6 & 4.4 & 14 \\ 1 & 4.4 & 1.6 & 6 \\ \hline \text{Total} & 14 & 6 & 20 \end{array} \]

where \(e_{01} = e_{10} = n_0 \times n_1/(n-1)\)

Thus, \(\text{Krippendorff's } \alpha = 1 - \frac{D_o}{D_e} = 1 - \frac{4}{4.4211} = 0.0952\), which is quite low.
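The hand calculation above can be reproduced in base R, using the same two coders:

```r
coder1 <- c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0)
coder2 <- c(0, 1, 1, 0, 0, 1, 0, 1, 0, 0)

# Marginal value counts pooled across both coders
n0 <- sum(c(coder1, coder2) == 0)  # 14
n1 <- sum(c(coder1, coder2) == 1)  # 6
n <- n0 + n1                       # 20 annotations in total

D_o <- sum(coder1 != coder2)       # observed disagreements: 4
D_e <- n0 * n1 / (n - 1)           # expected disagreements by chance: 4.4211

alpha <- 1 - D_o / D_e
round(alpha, 4)                    # 0.0952
```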

Calculating Reliability in R

  • In practice, we would just use software to calculate reliability.
  • E.g. in R we could simply use the irr package:
library("irr")
irr::kripp.alpha(rbind(coder1, coder2), method = "nominal")
 Krippendorff's alpha

 Subjects = 10 
   Raters = 2 
    alpha = 0.0952 
  • For comparison, we could also calculate Cohen’s \(\kappa\):
irr::kappa2(data.frame(coder1, coder2), weight = "unweighted")
 Cohen's Kappa for 2 Raters (Weights: unweighted)

 Subjects = 10 
   Raters = 2 
    Kappa = 0.0909 

        z = 0.323 
  p-value = 0.747 
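As a check on the \(\kappa\) formula, the same number can be computed by hand from \(P_o\) and \(P_e\):

```r
coder1 <- c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0)
coder2 <- c(0, 1, 1, 0, 0, 1, 0, 1, 0, 0)

# Observed agreement: proportion of identically labelled units
p_o <- mean(coder1 == coder2)                 # 0.6

# Chance agreement: product of marginal proportions, summed over categories
p_e <- mean(coder1 == 0) * mean(coder2 == 0) +
  mean(coder1 == 1) * mean(coder2 == 1)       # 0.8 * 0.6 + 0.2 * 0.4 = 0.56

kappa <- (p_o - p_e) / (1 - p_e)
round(kappa, 4)                               # 0.0909, matching irr::kappa2
```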

Supervised Learning

Supervised vs Unsupervised

Unsupervised modelling: learning latent structure from unlabelled data (e.g. principal component analysis of a DTM).

Supervised modelling: learning a relationship between inputs and labelled data (e.g. sentiment analysis using a training set of positive and negative reviews).

Dictionaries and Supervised Learning

  • A dictionary could be considered a form of supervised learning.
  • Association between a feature and a category is based on reading of text(s).
  • It can be done either by human(s) or by machine(s).
  • But the texts used to build a dictionary are often different from those to which it is applied.
  • With more traditional supervised learning techniques, the association between a feature and a category is derived from data.

Dictionaries and Supervised Learning: Performance

(Barberá, Boydstun, Linn, McMahon & Nagler, 2021)

Basic Principles of Supervised Learning

  • We have some labelled data that we can use to develop a classifier.
  • The data is split into:
    • training set for the classifier to “learn” relationships between features and outcome
    • validation/development set for tuning and adjusting any hyperparameters
    • test set for evaluating the performance of the classifier
  • The idea is to build a classifier that can generalise to previously unseen samples.
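A minimal sketch of such a split in R; the 70/15/15 proportions and the number of labelled documents are illustrative, not prescribed:

```r
set.seed(42)                # for reproducibility
n <- 1000                   # hypothetical number of labelled documents
idx <- sample(n)            # random permutation of document indices

train_idx <- idx[1:700]     # 70% training set
val_idx   <- idx[701:850]   # 15% validation/development set
test_idx  <- idx[851:1000]  # 15% test set
```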

Evaluating Classifier

\[ \begin{array}{c|cc} \textbf{Predicted / True} & \textbf{Positive} & \textbf{Negative} \\ \hline \textbf{Positive} & TP & FP \\ \textbf{Negative} & FN & TN \\ \end{array} \]

  • Accuracy: \(\frac{TP + TN}{TP + FP + FN + TN}\)
  • Precision: \(\frac{TP}{TP + FP}\)
  • Recall: \(\frac{TP}{TP + FN}\)
  • F1 Score (harmonic mean of precision and recall): \(2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\)
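The four metrics for a hypothetical confusion matrix (the counts below are made up for illustration):

```r
# Hypothetical confusion-matrix counts
TP <- 40; FP <- 10; FN <- 20; TN <- 30

accuracy  <- (TP + TN) / (TP + FP + FN + TN)  # 0.7
precision <- TP / (TP + FP)                   # 0.8
recall    <- TP / (TP + FN)                   # 0.667
f1 <- 2 * precision * recall / (precision + recall)
round(c(accuracy = accuracy, precision = precision,
        recall = recall, f1 = f1), 3)
```

Note that a classifier can have high precision and low recall (or vice versa); the F1 score, as a harmonic mean, penalises such imbalance more than a simple average would.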

Naive Bayes

Naive Bayes Classification

  • Motivation: We want to classify a document into one of several categories based on its features (e.g. words).
  • Naive Bayes is a simple probabilistic classifier based on Bayes’ theorem with the “naive” assumption of independence between features in a document.
  • It is fast, simple, and often performs well in text classification tasks.

Bayes’ Theorem

  • Recall how conditional probability works:

\[ P(A|B) = \frac{P(A, B)}{P(B)} \]

  • E.g. When throwing a die, the probability of getting a 2 given that we got an even number is \(P(2|\text{even}) = \frac{1/6}{1/2} = \frac{1}{3}\)
  • Of course, it is also true that \(P(B|A) = \frac{P(B, A)}{P(A)}\)
  • But since \(P(A, B) = P(B, A)\), it must be that \(P(A|B)P(B) = P(B|A)P(A)\) and thus we have Bayes’ theorem:

\[ \color{red}{P(A|B) = \frac{P(A)P(B|A)}{P(B)}} \]
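A quick numerical check of the theorem with the die example above:

```r
# A = "roll a 2", B = "roll an even number"
p_A <- 1 / 6
p_B <- 1 / 2
p_B_given_A <- 1                  # a 2 is always even

p_A_given_B <- p_A * p_B_given_A / p_B
p_A_given_B                       # 1/3, as computed directly above
```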

Naive Bayes Setup

  • E.g. we are interested in whether a document contains positive or negative sentiment.
  • Consider \(J\) features distributed across \(N\) documents each assigned to one of \(K\) classes.
  • Using Bayes’ Theorem at the word level we could express it as:

\[ P(c_k | w_j) = \frac{P(c_k) P(w_j | c_k)}{P(w_j)} \]

  • For 2 classes we could write it as:

\[ P(c_k | w_j) = \frac{P(c_k) P(w_j | c_k)}{P(c_k) P(w_j | c_k) + P(c_{\neg{k}}) P(w_j | c_{\neg{k}})} \]

where \(c_{\neg{k}}\) is the class alternative to class \(c_k\).

Naive Bayes: Word Likelihoods

\[ P(c_k | w_j) = \frac{P(c_k) \color{red}{P(w_j | c_k)}}{P(c_k) \color{red}{P(w_j | c_k)} + P(c_{\neg{k}}) \color{red}{P(w_j | c_{\neg{k}})}} \]

  • The term \(\color{red}{P(w_j | c_k)}\) is word likelihood conditional on class.
  • The MLE estimate for this is simply the proportion of times the word \(j\) occurs in class \(k\), but it is more common to use Laplace smoothing by adding \(1\) to each observed count within class to avoid zero probabilities.
  • In practice, since \(P(c_k) P(w_j | c_k) + P(c_{\neg{k}}) P(w_j | c_{\neg{k}}) = P(w_j)\) which is the same for all classes, the denominator isn’t needed for classification.

Naive Bayes: Prior Probabilities

\[ P(c_k | w_j) = \frac{\color{red}{P(c_k)} P(w_j | c_k)}{\color{red}{P(c_k)} P(w_j | c_k) + \color{red}{P(c_{\neg{k}})} P(w_j | c_{\neg{k}})} \]

  • The terms \(\color{red}{P(c_k)}\) and \(\color{red}{P(c_{\neg{k}})}\) are the class prior probabilities.
  • In supervised learning, these are typically estimated from the training data as the proportion of documents in each class.

Naive Bayes: Posterior Probabilities

\[ \color{red}{P(c_k | w_j)} = \frac{P(c_k) P(w_j | c_k)}{P(c_k) P(w_j | c_k) + P(c_{\neg{k}}) P(w_j | c_{\neg{k}})} \]

  • The term \(\color{red}{P(c_k | w_j)}\) is the posterior probability of class \(c_k\) given the presence of word \(w_j\).
  • Which is, fundamentally, our quantity of interest.

Naive Bayes: Documents

  • Of course, in practice we would like to use more than a single feature \(w_j\) to predict class membership.
  • The “naive” assumption of Naive Bayes is that features are conditionally independent:

\[ P(c_k | d_i) = P(c_k) \prod_{j=1}^J \frac{P(w_j|c_k)}{P(w_j)} \]

  • It is naive because it (wrongly) assumes:
    • conditional independence of feature counts
    • positional independence of features in a document
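To make these pieces concrete, here is a minimal Naive Bayes classifier in base R; all documents, words, and class names are hypothetical toy examples, and the priors are set uniform for simplicity:

```r
# Toy training data: bags of words per class (hypothetical examples)
train <- list(
  pos = c("good", "great", "great", "film"),
  neg = c("bad", "boring", "bad", "film")
)
vocab <- unique(unlist(train))

# Class priors, here uniform; in practice, estimated as the
# proportion of training documents in each class
prior <- c(pos = 0.5, neg = 0.5)

# Laplace-smoothed word likelihoods P(w_j | c_k):
# (count + 1) / (class total + vocabulary size)
likelihood <- sapply(names(train), function(cl) {
  counts <- tabulate(factor(train[[cl]], levels = vocab),
                     nbins = length(vocab))
  (counts + 1) / (sum(counts) + length(vocab))
})
rownames(likelihood) <- vocab

# Classify a new document via log posteriors; the denominator P(d_i)
# is constant across classes, so it can be dropped
classify <- function(doc) {
  doc <- doc[doc %in% vocab]  # ignore out-of-vocabulary words
  scores <- log(prior) + colSums(log(likelihood[doc, , drop = FALSE]))
  names(which.max(scores))
}

classify(c("great", "film"))  # "pos"
```

Working on the log scale avoids numerical underflow when multiplying many small word probabilities, which matters for realistically long documents.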

Next

  • Tutorial: Supervised Modelling
  • Next week: Unsupervised modelling
  • Assignment 2: Due 15:59 on Wednesday, 4th March (submission on Blackboard)