Week 9: Embeddings

POP77032 Quantitative Text Analysis for Social Scientists

Tom Paskhalis

Overview

  • Vector space arithmetic
  • Embeddings in social sciences
  • Choosing embeddings
  • Dimension reduction

Vector Space Distances

Distance Metrics

  • Once we have word embeddings, we can measure similarity/distance between words.
  • Several distance metrics are commonly used:
    • Cosine similarity: measures the cosine of the angle between two vectors
      (ranges from -1 to 1).
    • Jaccard similarity: measures the overlap between two sets of features
      (ranges from 0 to 1).
    • Euclidean distance: straight-line distance between two points in vector space
      (ranges from 0 to infinity).
  • For word embeddings, cosine similarity is most commonly used, as it focuses on the direction of vectors rather than their magnitude.
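A quick base-R illustration of the three metrics on two toy term-frequency vectors (the numbers are made up for illustration):

```r
# Two toy term-frequency vectors over the same 4-term vocabulary
a <- c(2, 0, 1, 3)
b <- c(1, 1, 0, 2)

# Cosine similarity: dot product divided by the product of magnitudes
cosine_sim <- sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Jaccard similarity on the sets of terms with non-zero counts
jaccard_sim <- sum(a > 0 & b > 0) / sum(a > 0 | b > 0)

# Euclidean distance: straight-line distance between the two points
euclidean_dist <- sqrt(sum((a - b)^2))

round(c(cosine = cosine_sim, jaccard = jaccard_sim, euclidean = euclidean_dist), 3)
```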

Recap: Cosine Similarity

  • Recall how we calculated cosine similarity between two documents \(A\) and \(B\) using their vector representations \(\mathbf{W}_A\) and \(\mathbf{W}_B\):

    \[ \cos(\mathbf{W}_A, \mathbf{W}_B) = \frac{\mathbf{W}_A \cdot \mathbf{W}_B}{\|\mathbf{W}_A\| \times \|\mathbf{W}_B\|} \] where \(\|\mathbf{W}\| = \sqrt{\sum_{j=1}^J W_j^2}\) is the magnitude of the vector.

  • The same idea can be applied to words represented as vectors.

  • Values close to 0 indicate orthogonal (unrelated) vectors.

  • For word embeddings, higher values indicate more similar meanings.
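The formula above translates directly into a few lines of base R; a minimal sketch with toy vectors:

```r
# Cosine similarity between two vectors, following the formula above
cosine <- function(w_a, w_b) {
  sum(w_a * w_b) / (sqrt(sum(w_a^2)) * sqrt(sum(w_b^2)))
}

cosine(c(1, 0), c(0, 1))  # orthogonal vectors: 0
cosine(c(1, 2), c(2, 4))  # same direction, different magnitude: 1
```

The second call shows why cosine similarity ignores magnitude: scaling a vector leaves the angle, and hence the similarity, unchanged.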

Finding Similar Words

  • Let’s find the most similar words to taoiseach in our Dáil 33 corpus.
find_similar <- function(target_word, word_vectors, top_n = 10) {
  # Calculate similarities between target word and all words
  similarities <- text2vec::sim2(
    x = matrix(word_vectors[target_word, ], nrow = 1),
    y = word_vectors,
    method = "cosine",
    norm = "l2"
  )
  
  # Convert to named vector and sort
  similarities <- as.vector(similarities)
  names(similarities) <- rownames(word_vectors)
  similarities <- sort(similarities, decreasing = TRUE)
  
  head(similarities, top_n + 1)  # +1 because the first entry is the target word itself
}
find_similar("taoiseach", word_vectors, top_n = 10)
 taoiseach   tánaiste   minister government minister's  yesterday     asking 
 1.0000000  0.9511834  0.9037884  0.8171245  0.8059394  0.8049395  0.8016647 
     asked     stated      today       said 
 0.7956373  0.7837272  0.7813493  0.7764243 

Word Vector Arithmetic

  • One remarkable property of word embeddings is that vector differences encode semantic relationships.

  • This allows arithmetic operations on word vectors to discover analogies.

  • Classic example from Mikolov et al. (2013): \[ \vec{king} - \vec{man} + \vec{woman} \approx \vec{queen} \]

  • The idea is to subtract the “male” concept, add the “female” concept.

Example: Vector Arithmetic in Dáil

  • Let’s explore political analogies in the context of the Dáil corpus.

\[ \vec{government} - \vec{taoiseach} + \vec{tánaiste} \approx \text{?} \]

word_analogy <- function(word_vectors, positive, negative, top_n = 5) {
  # positive: words to add, negative: words to subtract
  result_vec <- matrix(0, nrow = 1, ncol = ncol(word_vectors))
  
  # Add positive words
  for (word in positive) {
    result_vec <- result_vec + matrix(word_vectors[word, ], nrow = 1)
  }
  
  # Subtract negative words
  for (word in negative) {
    result_vec <- result_vec - matrix(word_vectors[word, ], nrow = 1)
  }
  
  # Find most similar words to the result
  similarities <- text2vec::sim2(
    x = result_vec,
    y = word_vectors,
    method = "cosine",
    norm = "l2"
  )
  
  similarities <- as.vector(similarities)
  names(similarities) <- rownames(word_vectors)
  
  # Remove words used in the operation
  similarities <- similarities[!names(similarities) %in% c(positive, negative)]
  
  similarities <- sort(similarities, decreasing = TRUE)
  head(similarities, top_n)
}

Example: Vector Arithmetic in Dáil

\[ \vec{government} - \vec{taoiseach} + \vec{tánaiste} \approx \text{?} \]

word_analogy(word_vectors, 
             positive = c("government", "tánaiste"), 
             negative = c("taoiseach"),
             top_n = 5)
government's     minister        clear      believe    intention 
   0.8116223    0.8108598    0.7986580    0.7716991    0.7657963 

\[ \vec{minister} - \vec{government} + \vec{party} \approx \text{?} \]

word_analogy(word_vectors,
             positive = c("minister", "party"),
             negative = c("government"),
             top_n = 5)
   deputy colleague  deputies     asked     speak 
0.8005498 0.8003592 0.7961536 0.7500455 0.7410448 

Embeddings in Social Sciences

Embeddings in Social Sciences

  • NLP researchers are largely focussed on how useful embeddings are for downstream tasks (e.g. classification, part-of-speech tagging, etc.)
  • For social scientists, the interpretability of embeddings is often more important than their predictive performance.
  • E.g. if the distance between words “immigrants” and “hardworking” is smaller for liberals than for conservatives, this may be an interesting finding in itself.
  • However, this requires careful validation of the embedding space to ensure that it captures meaningful semantic relationships.

Example: Social Class

(Boris Ioganson, State Tretyakov Gallery)

Dimensions of Class

  • Social class: a hierarchical distinction between groups of people based on social standing.
  • Some aspects that social class can be based on:
    • Wealth (e.g. Georg Simmel)
    • Relationship to the means of production (e.g. Karl Marx)
    • Education
    • Gender
    • Race
    • Consumption patterns

Kozlowski, Taddy & Evans (2019)

  • Use Google Ngram corpus of 5-grams divided by decade between 1900 and 1999.
  • Derive cultural dimensions of word embeddings using word2vec skipgram architecture.
  • Validate on similar data for 2000-2012 along 3 sociological axes:
    • affluence/class
    • gender
    • race
  • Validation relies on MTurkers rating words on a 0-100 scale along these dimensions.
  • For historical validation they construct 20 cultural dimensions derived from An Atlas of Semantic Profiles for 360 Words by Jenkins, Russell & Suci (1958).

Scale Construction

  • Kozlowski, Taddy & Evans (2019) argue that the reason why: \[ \vec{king} - \vec{man} + \vec{woman} \approx \vec{queen} \]

  • holds is that \((\vec{woman} - \vec{man})\) closely corresponds to a ‘gender dimension’.

  • E.g. in the case of class they find that: \[ \vec{hockey} + \vec{affluence} - \vec{poverty} \approx \vec{lacrosse} \]

  • Thus, by taking the mean of the differences over all antonym pairs in a set \(P\) spanning a given dimension: \[ \frac{\sum_{p \in P} (\vec{p}_1 - \vec{p}_2)}{|P|} \] where \(\vec{p}_1\) and \(\vec{p}_2\) are the embeddings of the two words in pair \(p\),

one can construct a scale that positions words along that dimension.
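A sketch of this construction in R; the antonym pairs are illustrative, and the random `word_vectors` matrix stands in for real embeddings such as those trained on the Dáil corpus:

```r
# Toy stand-in for a word embedding matrix (words as row names)
set.seed(42)
vocab <- c("rich", "poor", "affluence", "poverty", "hockey", "lacrosse")
word_vectors <- matrix(rnorm(length(vocab) * 50), nrow = length(vocab),
                       dimnames = list(vocab, NULL))

# Illustrative antonym pairs spanning an 'affluence' dimension
pairs <- list(c("rich", "poor"), c("affluence", "poverty"))

# The mean of the pairwise differences defines the dimension vector
build_dimension <- function(word_vectors, pairs) {
  diffs <- sapply(pairs, function(p) word_vectors[p[1], ] - word_vectors[p[2], ])
  rowMeans(diffs)
}

# Position a word on the scale via cosine similarity with the dimension
project_word <- function(word, dim_vec, word_vectors) {
  v <- word_vectors[word, ]
  sum(v * dim_vec) / (sqrt(sum(v^2)) * sqrt(sum(dim_vec^2)))
}

affluence_dim <- build_dimension(word_vectors, pairs)
project_word("hockey", affluence_dim, word_vectors)
```

With real embeddings, a positive projection would place the word towards the ‘affluence’ pole and a negative one towards ‘poverty’.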

Mapping Words to Class

(Kozlowski, Taddy & Evans, 2019)

Social Class over Time

(Kozlowski, Taddy & Evans, 2019)

Choosing Embeddings

  • Applying word embeddings comes with a number of analytical choices.
  • These choices come on top of previously discussed preprocessing decisions.
  • One must choose between using ‘off-the-shelf’ embeddings and training one’s own.
  • In either case, the following key decisions have to be made (or were already made for pre-trained embeddings):
    • Corpus
    • Context window size
    • Dimensionality of the embedding
    • Embedding algorithm
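To see where these choices surface in code, here is a sketch of training GloVe embeddings with text2vec on a toy corpus (the sentences are made up; in practice tokens would come from the Dáil corpus):

```r
library("text2vec")

# Toy corpus stand-in for real training data
texts <- c("the taoiseach answered questions from the opposition",
           "the tánaiste defended the government budget",
           "deputies debated the housing budget in the chamber")
tokens <- word_tokenizer(texts)

it <- itoken(tokens)
vocab <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(vocab)

# Decision: context window size (here 5 words either side)
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5)

# Decisions: embedding algorithm (GloVe) and dimensionality (rank = 50)
glove <- GlobalVectors$new(rank = 50, x_max = 10)
word_vectors <- glove$fit_transform(tcm, n_iter = 10)
dim(word_vectors)  # vocabulary size x embedding dimensionality
```

In practice, the context vectors (`glove$components`) are often added to the main vectors before use.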

Validating Embeddings

  • Similar to other text analysis tasks, validating embeddings is essential.
  • E.g. Rodriguez & Spirling (2022) propose a Turing-test style assessment:
    1. Ask humans (e.g. MTurkers) to produce closest analogies to a given prompt word.
    2. Extract machine-generated nearest neighbors from a chosen embedding for the same prompt word.
    3. Ask a different set of humans to choose a closer match between human- and machine-derived analogies.
    4. Compute probability that machine-generated analogies are picked over human-derived ones.
  • A similar approach (step 2–4) could be used to compare across different computer-based embeddings.

Comparing Embeddings

  • Applying this approach to the 102nd-111th US Congresses (~1.4M documents):

(Rodriguez & Spirling, 2022)

Dimension Reduction

Why Dimension Reduction?

  • Word embeddings are typically high-dimensional (50-300 dimensions).
  • But visualising and interpreting relationships in such high-dimensional spaces is difficult.
  • This is of particular concern for social scientists interested in interpretability.
  • Dimension reduction transforms high-dimensional data into lower dimensions while preserving important structure.
  • Goal: retain as much meaningful information as possible with fewer dimensions.

Principal Component Analysis

  • Principal Component Analysis (PCA) identifies the directions (principal components) along which the data varies the most.
  • First principal component (PC1): Direction of maximum variance in the data.
  • Second principal component (PC2): Direction of maximum remaining variance, orthogonal to PC1.
  • And so on for subsequent components.
  • Each principal component is a linear combination of original features.

Correlated Data

  • Let’s apply PCA to a simple synthetic dataset with 2 correlated variables:
set.seed(123)
n <- 100
x1 <- rnorm(n, mean = 5, sd = 2)
x2 <- 0.8 * x1 + rnorm(n, mean = 2, sd = 1)
pca_data <- data.frame(x1 = x1, x2 = x2)
cor.test(pca_data$x1, pca_data$x2, method = "pearson")

    Pearson's product-moment correlation

data:  pca_data$x1 and pca_data$x2
t = 14.479, df = 98, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7508279 0.8793420
sample estimates:
      cor 
0.8255038 

Correlated Data

library("ggplot2")

ggplot(pca_data, aes(x = x1, y = x2)) +
  geom_point(color = "navy", size = 2, alpha = 0.6) +
  labs(x = "x1",
       y = "x2") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 14)) +
  coord_fixed(ratio = 1)
  • Notice the positive correlation: as \(x_1\) increases, so does \(x_2\).

Applying PCA

  • To apply PCA in R we can use the built-in prcomp() function.
pca <- prcomp(pca_data, center = TRUE, scale. = TRUE)
summary(pca)
Importance of components:
                          PC1     PC2
Standard deviation     1.3511 0.41773
Proportion of Variance 0.9127 0.08725
Cumulative Proportion  0.9127 1.00000
  • We can then extract the loading vectors (eigenvectors), which show how each of the original features contributes to each component:
pca$rotation
         PC1        PC2
x1 0.7071068  0.7071068
x2 0.7071068 -0.7071068

Interpreting PCA Results

  • Eigenvalues: Indicate the amount of variance explained by each component.
  • Proportion of variance explained: \(\frac{\lambda_i}{\sum_{j=1}^p \lambda_j}\)
  • Typically report cumulative variance explained by first few components.
  • Loadings: Weights showing how original features contribute to each PC.
  • Principal components are orthogonal (uncorrelated with each other).
  • PCA assumes linear relationships between variables.
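These quantities can be read directly off a prcomp object; a short sketch using the synthetic data from above (recreated so the snippet stands alone):

```r
# Recreate the synthetic correlated data
set.seed(123)
x1 <- rnorm(100, mean = 5, sd = 2)
x2 <- 0.8 * x1 + rnorm(100, mean = 2, sd = 1)
pca <- prcomp(data.frame(x1, x2), center = TRUE, scale. = TRUE)

# Eigenvalues are the squared standard deviations of the components
eigenvalues <- pca$sdev^2

# Proportion and cumulative proportion of variance explained
prop_var <- eigenvalues / sum(eigenvalues)
cum_var <- cumsum(prop_var)

round(rbind(eigenvalue = eigenvalues, proportion = prop_var, cumulative = cum_var), 3)
```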

PCA Results

# Get principal component directions (eigenvectors)
pc1_direction <- pca$rotation[, 1]
pc2_direction <- pca$rotation[, 2]

# Center of the data
center_x <- mean(pca_data$x1)
center_y <- mean(pca_data$x2)

# Scale arrows by component standard deviation (PC1 arrow drawn longer)
scale_factor <- 2
arrow_scale1 <- scale_factor * sqrt(pca$sdev[1])
arrow_scale2 <- scale_factor * sqrt(pca$sdev[2])

# Create arrow data
arrows_df <- data.frame(
  x = c(center_x, center_x),
  y = c(center_y, center_y),
  xend = c(center_x + arrow_scale1 * pc1_direction[1],
           center_x + arrow_scale2 * pc2_direction[1]),
  yend = c(center_y + arrow_scale1 * pc1_direction[2],
           center_y + arrow_scale2 * pc2_direction[2]),
  component = c("PC1", "PC2")
)

# Plot with PCA directions
ggplot(pca_data, aes(x = x1, y = x2)) +
  geom_point(color = "navy", size = 2, alpha = 0.4) +
  geom_segment(data = arrows_df,
               aes(x = x, y = y, xend = xend, yend = yend, color = component),
               arrow = arrow(length = unit(0.3, "cm"), type = "closed"),
               linewidth = 1.5) +
  scale_color_manual(values = c("PC1" = "red", "PC2" = "darkgreen")) +
  labs(title = "Principal Components",
       subtitle = sprintf("PC1 explains %.1f%% variance, PC2 explains %.1f%%",
                         100 * pca$sdev[1]^2 / sum(pca$sdev^2),
                         100 * pca$sdev[2]^2 / sum(pca$sdev^2)),
       x = "Variable 1",
       y = "Variable 2",
       color = "Component") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 14)) +
  coord_fixed(ratio = 1)
  • PC1 (red) captures the direction of maximum variance.
  • PC2 (green) captures the remaining variance, orthogonal to PC1.

PCA for Word Embeddings

  • For word embeddings:
    • Each word is an observation (\(n\) = vocabulary size)
    • Each embedding dimension is a feature (\(p\) = embedding dimensionality, e.g., 50-300)
  • Reducing to 2D allows us to visualize semantic relationships.
  • Words close together in the original high-dimensional space should remain close in 2D.
  • Limitation: PCA may not capture all nuances of semantic similarity.

Visualizing Word Distances

  • We can visualise word embeddings by reducing dimensionality to 2D using PCA.

# Select some interesting political words
political_words <- c("taoiseach", "tánaiste", "minister", "deputy", 
                     "government", "opposition", "party", "election",
                     "budget", "housing", "health", "education")

# Get vectors for these words
political_vecs <- word_vectors[political_words, ]

# Apply PCA
pca_embed <- prcomp(political_vecs, center = TRUE, scale. = TRUE)
pca_coords <- as.data.frame(pca_embed$x[, 1:2])
pca_coords$word <- rownames(pca_coords)

# Plot
ggplot(pca_coords, aes(x = PC1, y = PC2, label = word)) +
  geom_point(color = "navy", size = 3) +
  geom_text(hjust = 0, vjust = 0, nudge_x = 0.05, nudge_y = 0.05, size = 4) +
  labs(title = "Political Terms in 2D Vector Space",
       x = "First Principal Component",
       y = "Second Principal Component") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 14))

Next

  • Tutorial: Validating embeddings
  • Next Week: Neural networks
  • Assignment 3: Due 15:59 on Wednesday, 1st April (submission on Blackboard)