Week 9: Embeddings

POP77032 Quantitative Text Analysis for Social Scientists

Tom Paskhalis

Overview

  • Vector space arithmetic
  • Embeddings in social sciences
  • Choosing embeddings
  • Dimension reduction

Vector Space Distances

Distance Metrics

  • Once we have word embeddings, we can measure similarity/distance between words.
  • Several distance metrics are commonly used:
    • Cosine similarity: measures the cosine of the angle between two vectors
      (ranges from -1 to 1).
    • Jaccard similarity: measures the overlap between two sets of features
      (ranges from 0 to 1).
    • Euclidean distance: straight-line distance between two points in vector space
      (ranges from 0 to infinity).
  • For word embeddings, cosine similarity is most commonly used, as it focuses on the direction of vectors rather than their magnitude.
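A quick base-R illustration of the three metrics on two toy term-frequency vectors (the numbers are made up for illustration):

```r
# Two toy term-frequency vectors over the same 4-term vocabulary
a <- c(2, 0, 1, 3)
b <- c(1, 1, 0, 2)

# Cosine similarity: dot product divided by the product of magnitudes
cosine_sim <- sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Jaccard similarity on the sets of terms with non-zero counts
jaccard_sim <- sum(a > 0 & b > 0) / sum(a > 0 | b > 0)

# Euclidean distance: straight-line distance between the two points
euclidean_dist <- sqrt(sum((a - b)^2))

round(c(cosine = cosine_sim, jaccard = jaccard_sim, euclidean = euclidean_dist), 3)
```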

Recap: Cosine Similarity

  • Recall how we calculated cosine similarity between two documents \(A\) and \(B\) using their vector representations \(\mathbf{W}_A\) and \(\mathbf{W}_B\):

    \[ \cos(\mathbf{W}_A, \mathbf{W}_B) = \frac{\mathbf{W}_A \cdot \mathbf{W}_B}{\|\mathbf{W}_A\| \times \|\mathbf{W}_B\|} \] where \(\|\mathbf{W}\| = \sqrt{\sum_{j=1}^J W_j^2}\) is the magnitude of the vector.

  • The same idea can be applied to words represented as vectors.

  • Values close to 0 indicate orthogonal (unrelated) vectors.

  • For word embeddings, higher values indicate more similar meanings.
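The formula above translates directly into a few lines of base R; a minimal sketch with toy vectors:

```r
# Cosine similarity between two vectors, following the formula above
cosine <- function(w_a, w_b) {
  sum(w_a * w_b) / (sqrt(sum(w_a^2)) * sqrt(sum(w_b^2)))
}

cosine(c(1, 0), c(0, 1))  # orthogonal vectors: 0
cosine(c(1, 2), c(2, 4))  # same direction, different magnitude: 1
```

The second call shows why cosine similarity ignores magnitude: scaling a vector leaves the angle, and hence the similarity, unchanged.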

Finding Similar Words

  • Let’s find the most similar words to taoiseach in our Dáil 33 corpus.
find_similar <- function(target_word, word_vectors, top_n = 10) {
  # Calculate similarities between target word and all words
  similarities <- text2vec::sim2(
    x = matrix(word_vectors[target_word, ], nrow = 1),
    y = word_vectors,
    method = "cosine",
    norm = "l2"
  )
  
  # Convert to named vector and sort
  similarities <- as.vector(similarities)
  names(similarities) <- rownames(word_vectors)
  similarities <- sort(similarities, decreasing = TRUE)
  
  head(similarities, top_n + 1)  # +1 because the first entry is the target word itself
}
find_similar("taoiseach", word_vectors, top_n = 10)
 taoiseach   tánaiste   minister government minister's  yesterday     asking 
 1.0000000  0.9511834  0.9037884  0.8171245  0.8059394  0.8049395  0.8016647 
     asked     stated      today       said 
 0.7956373  0.7837272  0.7813493  0.7764243 

Word Vector Arithmetic

  • One remarkable property of word embeddings is that vector differences encode semantic relationships.

  • This allows arithmetic operations on word vectors to discover analogies.

  • Classic example from Mikolov et al. (2013): \[ \vec{king} - \vec{man} + \vec{woman} \approx \vec{queen} \]

  • The idea is to subtract the “male” concept, add the “female” concept.

Example: Vector Arithmetic in Dáil

  • Let’s explore political analogies in the context of the Dáil corpus.

\[ \vec{government} - \vec{taoiseach} + \vec{tánaiste} \approx \text{?} \]

word_analogy <- function(word_vectors, positive, negative, top_n = 5) {
  # positive: words to add, negative: words to subtract
  result_vec <- matrix(0, nrow = 1, ncol = ncol(word_vectors))
  
  # Add positive words
  for (word in positive) {
    result_vec <- result_vec + matrix(word_vectors[word, ], nrow = 1)
  }
  
  # Subtract negative words
  for (word in negative) {
    result_vec <- result_vec - matrix(word_vectors[word, ], nrow = 1)
  }
  
  # Find most similar words to the result
  similarities <- text2vec::sim2(
    x = result_vec,
    y = word_vectors,
    method = "cosine",
    norm = "l2"
  )
  
  similarities <- as.vector(similarities)
  names(similarities) <- rownames(word_vectors)
  
  # Remove words used in the operation
  similarities <- similarities[!names(similarities) %in% c(positive, negative)]
  
  similarities <- sort(similarities, decreasing = TRUE)
  head(similarities, top_n)
}

Example: Vector Arithmetic in Dáil

\[ \vec{government} - \vec{taoiseach} + \vec{tánaiste} \approx \text{?} \]

word_analogy(word_vectors, 
             positive = c("government", "tánaiste"), 
             negative = c("taoiseach"),
             top_n = 5)
government's     minister        clear      believe    intention 
   0.8116223    0.8108598    0.7986580    0.7716991    0.7657963 

\[ \vec{minister} - \vec{government} + \vec{party} \approx \text{?} \]

word_analogy(word_vectors,
             positive = c("minister", "party"),
             negative = c("government"),
             top_n = 5)
   deputy colleague  deputies     asked     speak 
0.8005498 0.8003592 0.7961536 0.7500455 0.7410448 

Embeddings in Social Sciences

Embeddings in Social Sciences

  • NLP researchers are largely focussed on how useful embeddings are for downstream tasks (e.g. classification, part-of-speech tagging, etc.)
  • For social scientists, the interpretability of embeddings is often more important than their predictive performance.
  • E.g. if the distance between words “immigrants” and “hardworking” is smaller for liberals than for conservatives, this may be an interesting finding in itself.
  • However, this requires careful validation of the embedding space to ensure that it captures meaningful semantic relationships.

Example: Social Class

(Boris Ioganson, State Tretyakov Gallery)

Dimensions of Class

  • Social class: a hierarchical distinction between groups of people based on social standing.
  • Some aspects that social class can be based on:
    • Wealth (e.g. Georg Simmel)
    • Relationship to the means of production (e.g. Karl Marx)
    • Education
    • Gender
    • Race
    • Consumption patterns

Kozlowski, Taddy & Evans (2019)

  • Use Google Ngram corpus of 5-grams divided by decade between 1900 and 1999.
  • Derive cultural dimensions of word embeddings using word2vec skipgram architecture.
  • Validate on similar data for 2000-2012 along 3 sociological axes:
    • affluence/class
    • gender
    • race
  • Validation relies on MTurkers rating words on a 0-100 scale along these dimensions.
  • For historical validation they construct 20 cultural dimensions derived from An Atlas of Semantic Profiles for 360 Words by Jenkins, Russell & Suci (1958).

Scale Construction

  • Kozlowski, Taddy & Evans (2019) argue that the reason why: \[ \vec{king} - \vec{man} + \vec{woman} \approx \vec{queen} \]

  • holds is that \((\vec{woman} - \vec{man})\) closely corresponds to a ‘gender dimension’.

  • E.g. in the case of class they find that: \[ \vec{hockey} + \vec{affluence} - \vec{poverty} \approx \vec{lacrosse} \]

  • Thus, by taking the mean of the differences over all antonym pairs in a set \(P\) spanning a given dimension: \[ \frac{\sum_{p \in P} (\vec{p}_1 - \vec{p}_2)}{|P|} \] where \(\vec{p}_1\) and \(\vec{p}_2\) are the embeddings of the two words in pair \(p\),

one can construct a scale that positions words along that dimension.
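A sketch of this construction in R; the antonym pairs are illustrative, and the random `word_vectors` matrix stands in for real embeddings such as those trained on the Dáil corpus:

```r
# Toy stand-in for a word embedding matrix (words as row names)
set.seed(42)
vocab <- c("rich", "poor", "affluence", "poverty", "hockey", "lacrosse")
word_vectors <- matrix(rnorm(length(vocab) * 50), nrow = length(vocab),
                       dimnames = list(vocab, NULL))

# Illustrative antonym pairs spanning an 'affluence' dimension
pairs <- list(c("rich", "poor"), c("affluence", "poverty"))

# The mean of the pairwise differences defines the dimension vector
build_dimension <- function(word_vectors, pairs) {
  diffs <- sapply(pairs, function(p) word_vectors[p[1], ] - word_vectors[p[2], ])
  rowMeans(diffs)
}

# Position a word on the scale via cosine similarity with the dimension
project_word <- function(word, dim_vec, word_vectors) {
  v <- word_vectors[word, ]
  sum(v * dim_vec) / (sqrt(sum(v^2)) * sqrt(sum(dim_vec^2)))
}

affluence_dim <- build_dimension(word_vectors, pairs)
project_word("hockey", affluence_dim, word_vectors)
```

With real embeddings, a positive projection would place the word towards the ‘affluence’ pole and a negative one towards ‘poverty’.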

Mapping Words to Class

(Kozlowski, Taddy & Evans, 2019)

Social Class over Time

(Kozlowski, Taddy & Evans, 2019)

Choosing Embeddings

  • Applying word embeddings comes with a number of analytical choices.
  • These choices come on top of previously discussed preprocessing decisions.
  • One must choose between using ‘off-the-shelf’ embeddings and training one’s own.
  • In either case, the following key decisions have to be made (or were already made for pre-trained embeddings):
    • Corpus
    • Context window size
    • Dimensionality of the embedding
    • Embedding algorithm
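To see where these choices surface in code, here is a sketch of training GloVe embeddings with text2vec on a toy corpus (the sentences are made up; in practice tokens would come from the Dáil corpus):

```r
library("text2vec")

# Toy corpus stand-in for real training data
texts <- c("the taoiseach answered questions from the opposition",
           "the tánaiste defended the government budget",
           "deputies debated the housing budget in the chamber")
tokens <- word_tokenizer(texts)

it <- itoken(tokens)
vocab <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(vocab)

# Decision: context window size (here 5 words either side)
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5)

# Decisions: embedding algorithm (GloVe) and dimensionality (rank = 50)
glove <- GlobalVectors$new(rank = 50, x_max = 10)
word_vectors <- glove$fit_transform(tcm, n_iter = 10)
dim(word_vectors)  # vocabulary size x embedding dimensionality
```

In practice, the context vectors (`glove$components`) are often added to the main vectors before use.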

Validating Embeddings

  • Similar to other text analysis tasks, validating embeddings is essential.
  • E.g. Rodriguez & Spirling (2022) propose a Turing-test style assessment:
    1. Ask humans (e.g. MTurkers) to produce closest analogies to a given prompt word.
    2. Extract machine-generated nearest neighbors from a chosen embedding for the same prompt word.
    3. Ask a different set of humans to choose a closer match between human- and machine-derived analogies.
    4. Compute probability that machine-generated analogies are picked over human-derived ones.
  • A similar approach (step 2–4) could be used to compare across different computer-based embeddings.

Comparing Embeddings

  • Applying this approach to the 102nd-111th US Congresses (~1.4M documents):

(Rodriguez & Spirling, 2022)

Dimension Reduction

Why Dimension Reduction?

  • Word embeddings are typically high-dimensional (50-300 dimensions).
  • But visualising and interpreting relationships in such high-dimensional spaces is difficult.
  • This is of particular concern for social scientists interested in interpretability.
  • Dimension reduction transforms high-dimensional data into lower dimensions while preserving important structure.
  • Goal: retain as much meaningful information as possible with fewer dimensions.

Principal Component Analysis

  • Principal Component Analysis (PCA) identifies the directions (principal components) along which the data varies the most.
  • First principal component (PC1): Direction of maximum variance in the data.
  • Second principal component (PC2): Direction of maximum remaining variance, orthogonal to PC1.
  • And so on for subsequent components.
  • Each principal component is a linear combination of original features.

Correlated Data

  • Let’s apply PCA to a simple synthetic dataset with 2 correlated variables:
set.seed(123)
n <- 100
x1 <- rnorm(n, mean = 5, sd = 2)
x2 <- 0.8 * x1 + rnorm(n, mean = 2, sd = 1)
pca_data <- data.frame(x1 = x1, x2 = x2)
cor.test(pca_data$x1, pca_data$x2, method = "pearson")

    Pearson's product-moment correlation

data:  pca_data$x1 and pca_data$x2
t = 14.479, df = 98, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7508279 0.8793420
sample estimates:
      cor 
0.8255038 

Correlated Data

library("ggplot2")

ggplot(pca_data, aes(x = x1, y = x2)) +
  geom_point(color = "navy", size = 2, alpha = 0.6) +
  labs(x = "x1",
       y = "x2") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 14)) +
  coord_fixed(ratio = 1)
  • Notice the positive correlation: as \(x_1\) increases, so does \(x_2\).

Applying PCA

  • To apply PCA in R we can use the built-in prcomp() function.
pca <- prcomp(pca_data, center = TRUE, scale. = TRUE)
summary(pca)
Importance of components:
                          PC1     PC2
Standard deviation     1.3511 0.41773
Proportion of Variance 0.9127 0.08725
Cumulative Proportion  0.9127 1.00000
  • We can then extract the loading vectors (eigenvectors), which show how each of the original features contributes to each component:
pca$rotation
         PC1        PC2
x1 0.7071068  0.7071068
x2 0.7071068 -0.7071068

Interpreting PCA Results

  • Eigenvalues: Indicate the amount of variance explained by each component.
  • Proportion of variance explained: \(\frac{\lambda_i}{\sum_{j=1}^p \lambda_j}\)
  • Typically report cumulative variance explained by first few components.
  • Loadings: Weights showing how original features contribute to each PC.
  • Principal components are orthogonal (uncorrelated with each other).
  • PCA assumes linear relationships between variables.
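These quantities can be read directly off a prcomp object; a short sketch using the synthetic data from above (recreated so the snippet stands alone):

```r
# Recreate the synthetic correlated data
set.seed(123)
x1 <- rnorm(100, mean = 5, sd = 2)
x2 <- 0.8 * x1 + rnorm(100, mean = 2, sd = 1)
pca <- prcomp(data.frame(x1, x2), center = TRUE, scale. = TRUE)

# Eigenvalues are the squared standard deviations of the components
eigenvalues <- pca$sdev^2

# Proportion and cumulative proportion of variance explained
prop_var <- eigenvalues / sum(eigenvalues)
cum_var <- cumsum(prop_var)

round(rbind(eigenvalue = eigenvalues, proportion = prop_var, cumulative = cum_var), 3)
```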

PCA Results

# Get principal component directions (eigenvectors)
pc1_direction <- pca$rotation[, 1]
pc2_direction <- pca$rotation[, 2]

# Center of the data
center_x <- mean(pca_data$x1)
center_y <- mean(pca_data$x2)

# Scale arrows by component standard deviation (PC1 arrow drawn longer)
scale_factor <- 2
arrow_scale1 <- scale_factor * sqrt(pca$sdev[1])
arrow_scale2 <- scale_factor * sqrt(pca$sdev[2])

# Create arrow data
arrows_df <- data.frame(
  x = c(center_x, center_x),
  y = c(center_y, center_y),
  xend = c(center_x + arrow_scale1 * pc1_direction[1],
           center_x + arrow_scale2 * pc2_direction[1]),
  yend = c(center_y + arrow_scale1 * pc1_direction[2],
           center_y + arrow_scale2 * pc2_direction[2]),
  component = c("PC1", "PC2")
)

# Plot with PCA directions
ggplot(pca_data, aes(x = x1, y = x2)) +
  geom_point(color = "navy", size = 2, alpha = 0.4) +
  geom_segment(data = arrows_df,
               aes(x = x, y = y, xend = xend, yend = yend, color = component),
               arrow = arrow(length = unit(0.3, "cm"), type = "closed"),
               linewidth = 1.5) +
  scale_color_manual(values = c("PC1" = "red", "PC2" = "darkgreen")) +
  labs(title = "Principal Components",
       subtitle = sprintf("PC1 explains %.1f%% variance, PC2 explains %.1f%%",
                         100 * pca$sdev[1]^2 / sum(pca$sdev^2),
                         100 * pca$sdev[2]^2 / sum(pca$sdev^2)),
       x = "Variable 1",
       y = "Variable 2",
       color = "Component") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 14)) +
  coord_fixed(ratio = 1)
  • PC1 (red) captures the direction of maximum variance.
  • PC2 (green) captures the remaining variance, orthogonal to PC1.

PCA for Word Embeddings

  • For word embeddings:
    • Each word is an observation (\(n\) = vocabulary size)
    • Each embedding dimension is a feature (\(p\) = embedding dimensionality, e.g., 50-300)
  • Reducing to 2D allows us to visualize semantic relationships.
  • Words close together in the original high-dimensional space should remain close in 2D.
  • Limitation: PCA may not capture all nuances of semantic similarity.

Visualizing Word Distances

  • We can visualise word embeddings by reducing dimensionality to 2D using PCA.

# Select some interesting political words
political_words <- c("taoiseach", "tánaiste", "minister", "deputy", 
                     "government", "opposition", "party", "election",
                     "budget", "housing", "health", "education")

# Get vectors for these words
political_vecs <- word_vectors[political_words, ]

# Apply PCA
pca_embed <- prcomp(political_vecs, center = TRUE, scale. = TRUE)
pca_coords <- as.data.frame(pca_embed$x[, 1:2])
pca_coords$word <- rownames(pca_coords)

# Plot
ggplot(pca_coords, aes(x = PC1, y = PC2, label = word)) +
  geom_point(color = "navy", size = 3) +
  geom_text(hjust = 0, vjust = 0, nudge_x = 0.05, nudge_y = 0.05, size = 4) +
  labs(title = "Political Terms in 2D Vector Space",
       x = "First Principal Component",
       y = "Second Principal Component") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 14))

Next

  • Tutorial: Validating embeddings
  • Next Week: Neural networks
  • Assignment 3: Due 15:59 on Wednesday, 1st April (submission on Blackboard)