Week 4 Tutorial:
Modelling Text

POP77142 Quantitative Text Analysis for Social Scientists

Similarity

  • It is sometimes desirable to compare two texts to see how similar they are.
  • How do we do that?
  • Recall that in the DTM, a single document is represented as a vector.
  • As long as we can compare two vectors, we can compare two texts.
  • Some desirable properties of a similarity measure:
    • Maximum similarity when a document is compared to itself.
    • Minimum similarity when the documents have no words in common (orthogonal documents).
    • Symmetry: the similarity between document A and B is the same as between B and A.

Inner Product

  • One straightforward way to compare two vectors is to calculate their inner product.
  • For two documents \(\mathbf{W}_A\) and \(\mathbf{W}_B\) with \(J\) features, the inner product is: \[ \mathbf{W}_A \cdot \mathbf{W}_B = \sum_{j=1}^J W_{A,j} \times W_{B,j} \]
  • E.g. for two documents with 3 features: \[ \begin{align*} \mathbf{W}_A &= [1, 0, 1] \\ \mathbf{W}_B &= [0, 1, 2] \\ \mathbf{W}_A \cdot \mathbf{W}_B &= 1 \times 0 + 0 \times 1 + 1 \times 2 = 2 \end{align*} \]
  • The major downside: the inner product is sensitive to vector length, so longer documents yield larger values regardless of content (see the check below).
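  • As a quick check, a minimal base-R sketch using the toy vectors above:

W_A <- c(1, 0, 1)
W_B <- c(0, 1, 2)
sum(W_A * W_B)  # element-wise multiplication, then sum = inner product
[1] 2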

Cosine Similarity

  • To alleviate this problem, we can normalise each vector before calculating the inner product.
  • We will normalise each vector by its magnitude.
  • The magnitude is the square root of the sum of squares of all elements (equivalently, the Euclidean distance between the vector and the origin): \[ \|\mathbf{W}\| = \sqrt{\sum_{j=1}^J W_j^2} \]
  • A document \(\mathbf{W}\) can then be normalised by dividing it by its magnitude: \[ \frac{\mathbf{W}_A}{\|\mathbf{W}_A\|} = \left[\frac{W_{A,1}}{\|\mathbf{W}_A\|}, \frac{W_{A,2}}{\|\mathbf{W}_A\|}, \ldots, \frac{W_{A,J}}{\|\mathbf{W}_A\|}\right] \]
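  • For example, normalising \(\mathbf{W}_A = [1, 0, 1]\) in base R (a minimal sketch):

W_A <- c(1, 0, 1)
W_A / sqrt(sum(W_A^2))  # divide by magnitude; result has magnitude 1
[1] 0.7071068 0.0000000 0.7071068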

Cosine Similarity

  • The cosine similarity between two documents \(\mathbf{W}_A\) and \(\mathbf{W}_B\) is then: \[ \cos(\mathbf{W}_A, \mathbf{W}_B) = \frac{\mathbf{W}_A \cdot \mathbf{W}_B}{\|\mathbf{W}_A\| \times \|\mathbf{W}_B\|} \]
  • It is called “cosine similarity” because it equals the cosine of the angle between the two vectors: 1 for vectors pointing in the same direction, 0 for orthogonal vectors.

Example: Cosine Similarity

  • For the two vectors \(\mathbf{W}_A = [1, 0, 1]\) and \(\mathbf{W}_B = [0, 1, 2]\): \[ \begin{align*} \|\mathbf{W}_A\| &= \sqrt{1^2 + 0^2 + 1^2} = \sqrt{2} \\ \|\mathbf{W}_B\| &= \sqrt{0^2 + 1^2 + 2^2} = \sqrt{5} \\ \cos(\mathbf{W}_A, \mathbf{W}_B) &= \frac{1 \times 0 + 0 \times 1 + 1 \times 2}{\sqrt{2} \times \sqrt{5}} = \frac{2}{\sqrt{10}} \approx 0.63 \end{align*} \]
  • In R we can easily implement this ourselves:
W_A <- c(1, 0, 1)
W_B <- c(0, 1, 2)

# inner product divided by the product of the two vector magnitudes
cosine_similarity <- sum(W_A * W_B) / (sqrt(sum(W_A^2)) * sqrt(sum(W_B^2)))
cosine_similarity
[1] 0.6324555

Exercise 1: Modelling Text

  • Let’s revisit the Federalist Papers that we looked at in the lecture.
  • Now we will consider the full dataset as opposed to a tiny subset.
  • The essays are available in the federalist.csv file on Blackboard.
  • Calculate \(\hat{\mu}\) for each author, using all words that appear in the corpus.
  • Use cosine similarity to compare each author’s \(\hat{\mu}\) with the word-rate vector of each disputed essay (those that have NA in the author column).

federalist_papers <- readr::read_csv(
  "../data/federalist.csv"
)
head(federalist_papers)
# A tibble: 6 × 4
  paper_number paper_numeric author   text                                      
  <chr>                <dbl> <chr>    <chr>                                     
1 No. 1                    1 hamilton "AFTER an unequivocal experience of the i…
2 No. 2                    2 jay      "WHEN the people of America reflect that …
3 No. 3                    3 jay      "IT IS not a new observation that the peo…
4 No. 4                    4 jay      "MY LAST paper assigned several reasons w…
5 No. 5                    5 jay      "QUEEN ANNE, in her letter of the 1st Jul…
6 No. 6                    6 hamilton "THE three last numbers of this paper hav…
table(federalist_papers$author, useNA = "ifany")

hamilton      jay  madison     <NA> 
      51        5       18       11 
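One possible starting point is sketched below; it assumes the quanteda and quanteda.textstats packages, which the exercise does not prescribe:

library(quanteda)
library(quanteda.textstats)

# build a document-feature matrix over all words in the corpus
corp  <- corpus(federalist_papers, text_field = "text")
dfmat <- dfm(tokens(corp, remove_punct = TRUE))

# split essays with known authorship from the disputed ones
known    <- dfm_subset(dfmat, !is.na(author))
disputed <- dfm_subset(dfmat, is.na(author))

# mu-hat per author: pool counts within author, then convert to word rates
mu_hat <- dfm_weight(dfm_group(known, groups = author), scheme = "prop")

# cosine similarity between each disputed essay and each author's mu-hat
textstat_simil(disputed, mu_hat, margin = "documents", method = "cosine")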

Exercise 2: Topic Modelling

  • In this exercise we will revisit the Irish party manifestos.
  • Load the file ireland_ge_2020-24_manifestos.csv containing the manifestos for the 2020 and 2024 general elections.
  • Fit LDA models with 3, 5, and 10 topics.
  • Compare the log-likelihood and top terms across the three models, and the overall topics across manifestos.
  • Try labelling some topics based on the top terms.

manifestos <- readr::read_csv(
  "../data/ireland_ge_2020-24_manifestos.csv"
)
str(manifestos)
spc_tbl_ [17 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ party: chr [1:17] "AO" "FF" "FG" "GR" ...
 $ year : num [1:17] 2024 2024 2024 2024 2024 ...
 $ text : chr [1:17] "Our\nCommon Sense\n  Manifesto 2024\n\n                Opening statement\nIn the last year Aontú has come of ag"| __truncated__ "MOVING FORWARD. TOGETHER.\n\nAG BOGADH AR AGHAIDH. LE CHÉILE.\nGeneral Election Manifesto 2024\n\n\n\n\n       "| __truncated__ "General Election 2024\n   M A N I F E S TO\n                        1\n\nFINE GAEL | GENERAL ELECTION MANIFESTO"| __truncated__ "towards\n2030\na decade of change\nvolume II\nGreen Party Manifesto 2024\n\n\n\n\n              greens\n       "| __truncated__ ...
 - attr(*, "spec")=
  .. cols(
  ..   party = col_character(),
  ..   year = col_double(),
  ..   text = col_character()
  .. )
 - attr(*, "problems")=<externalptr>