Week 12: Large Language Models

POP77032 Quantitative Text Analysis for Social Scientists

Tom Paskhalis

Overview

  • Embeddings architecture
  • Large language models
  • LLMs in social science research

Back to Embeddings

Word2Vec Architectures

(Mikolov, Chen, Corrado & Dean, 2013)

  • The Word2Vec model is, in fact, two different, but related, architectures for learning word embeddings:
    • Continuous Bag of Words (CBOW): Predicts the current word based on the context (surrounding words).
    • Skip-Gram: Predicts the surrounding words given the current word.
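The two objectives differ only in which side of the (target, context) relation is predicted. A minimal sketch of how each architecture slices raw text into training examples (function names are illustrative, not from any particular library):

```python
def skipgram_pairs(tokens, window=2):
    """Skip-Gram: (target -> context) pairs, predicting each neighbour
    from the centre word."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=2):
    """CBOW: (context -> target) pairs, predicting the centre word
    from all of its neighbours at once."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tuple(tokens[j]
                        for j in range(max(0, i - window),
                                       min(len(tokens), i + window + 1))
                        if j != i)
        pairs.append((context, target))
    return pairs

sentence = "the cat sat on the mat".split()
print(skipgram_pairs(sentence, window=1)[:3])
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat')]
```

Note that CBOW produces one example per position while Skip-Gram produces one per (target, neighbour) pair, which is part of why Skip-Gram tends to be slower but works better for rare words.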

Notes on Word2Vec

  • In both architectures the task itself (context prediction) is not of interest.
  • It only makes sense insofar as it allows the model to learn vector representations of words.
  • And those vector representations can then be shown to be “useful” in some tasks.
  • These vector representations are just states of the hidden layer of the neural network.
  • Note that the training data for the Word2Vec model is just the raw text itself.
  • This approach, often referred to as self-supervision, is one of its core innovations.
  • Coupled with the efficiency of negative sampling (replacing softmax with sigmoid), this made it possible to train the model on large corpora.
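Negative sampling can be sketched as follows: instead of normalising over the full vocabulary with a softmax, the observed (target, context) pair and a handful of sampled "noise" words are each scored with a sigmoid. A minimal NumPy illustration (all vectors are random stand-ins; names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_target, u_context, u_negatives):
    """Skip-gram with negative sampling: maximise sigmoid score of the one
    observed pair, minimise it for k sampled negatives -- k + 1 sigmoids
    instead of a softmax over the whole vocabulary."""
    pos = np.log(sigmoid(u_context @ v_target))
    neg = np.sum(np.log(sigmoid(-u_negatives @ v_target)))
    return -(pos + neg)

rng = np.random.default_rng(0)
d, k = 50, 5                      # embedding dimension, negatives per positive
v = rng.normal(size=d)            # input vector of the target word
u_pos = rng.normal(size=d)        # output vector of the observed context word
u_neg = rng.normal(size=(k, d))   # output vectors of k sampled "noise" words
print(round(neg_sampling_loss(v, u_pos, u_neg), 3))
```

The per-example cost is thus O(k·d) rather than O(|V|·d), which is what makes training on web-scale corpora feasible.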

Issues with Word2Vec

  • Being based on entire words, vanilla Word2Vec has no good way to deal with out-of-vocabulary (OOV) words.
  • A related issue is no mechanism for handling morphological variation, which is problematic for languages with rich morphology.
  • Some of these issues can be mitigated by using sub-word tokens, as in the fastText model (Bojanowski et al., 2017).
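The sub-word idea can be illustrated by generating fastText-style character n-grams; a word is then represented as the sum of its n-gram vectors, so an OOV word still gets an embedding from its pieces (fastText also keeps the whole padded word, e.g. `<where>`, as an extra token, omitted here):

```python
def char_ngrams(word, n=3):
    """Character n-grams with fastText-style boundary markers '<' and '>',
    so prefixes and suffixes are distinguishable from word-internal grams."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>']
```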

Static Embeddings

  • Fundamentally, all word embedding models we have considered so far (CBOW, Skip-Gram, GloVe, fastText) are static.
  • They learn a single vector representation for each word in the vocabulary.
  • The same word will have the same embedding regardless of the context in which it appears.
  • However, this creates problems for words with multiple meanings:
    • E.g. “river bank” and “central bank”

Contextual Embeddings

(Uszkoreit, 2017)

  • To solve this problem, the transformer architecture introduced the self-attention layer.
  • At its core, self-attention is a weighted sum of context vectors, with the bulk of the computation going into determining these weights and what gets summed.
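That weighted sum can be sketched in a few lines of NumPy; this is a single attention head with no masking or multi-head machinery, intended only to show the query–key–value structure:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention: each output row is a weighted sum of
    value vectors, with weights from scaled query-key dot products."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])          # (n_tokens, n_tokens)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over each row
    return weights @ V                              # weighted sum of contexts

rng = np.random.default_rng(1)
n, d = 4, 8                                         # 4 tokens, dimension 8
X = rng.normal(size=(n, d))                         # one embedding per token
out = self_attention(X, *(rng.normal(size=(d, d)) for _ in range(3)))
print(out.shape)  # (4, 8)
```

Because the weights depend on the other tokens in the sequence, the same word receives a different output vector in different sentences — which is exactly what static embeddings cannot do.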

Large Language Models

What is a Large Language Model?

  • Fundamentally, it is a language model, i.e. an autoregressive model \(P(w_t | w_{t-1}, w_{t-2}, \ldots, w_1)\), trained on a large corpus of text.
  • The large part refers both to:
    • size of the training data;
    • number of model parameters (weights across layers).
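The autoregressive factorisation can be illustrated with a toy bigram model, where the context is truncated to a single preceding word (the probabilities below are made up purely for illustration):

```python
# Toy bigram LM: P(w_t | w_{t-1}); an LLM conditions on the full history
# instead, but the chain-rule decomposition is the same.
bigram = {
    ("<s>", "the"): 0.5, ("the", "cat"): 0.2,
    ("cat", "sat"): 0.4, ("sat", "</s>"): 0.3,
}

def sequence_prob(tokens):
    """Chain rule: P(w_1, ..., w_T) = prod_t P(w_t | w_{t-1})."""
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        p *= bigram.get((prev, cur), 1e-8)   # crude back-off for unseen pairs
    return p

print(sequence_prob(["<s>", "the", "cat", "sat", "</s>"]))
# 0.5 * 0.2 * 0.4 * 0.3 ≈ 0.012
```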

Architectures of LLMs

(Jurafsky & Martin, 2026)

  • Decoder: takes a sequence of tokens as input and produces a sequence of tokens as output (e.g. GPT, Claude, LLaMA, Mistral, etc.)
  • Encoder: takes a sequence of tokens as input and produces a vector representation of the input (e.g. BERT, RoBERTa, etc.)
  • Encoder-Decoder: takes a sequence of tokens as input and produces a sequence of tokens as output, but these tokens don’t have to come from the same set (e.g. machine translation, speech-to-text, etc.)

Training LLMs

  • LLMs are trained in 3 stages:
    • Pre-training: the model is trained on a massive corpus of texts using self-supervision, minimising a cross-entropy loss via backpropagation.
    • Fine-tuning: the pre-trained model is further trained on a smaller, task-specific dataset that contains both instructions and correct responses.
    • Alignment: the model is further fine-tuned using reinforcement learning from human feedback (RLHF) to align the model’s behavior with human values and preferences.
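The pre-training objective can be sketched as follows: at each position the model produces a score (logit) for every vocabulary item, and the loss is the average negative log-probability of the observed next token. Random logits stand in for a real model here; the names are illustrative:

```python
import numpy as np

def cross_entropy(logits, target_ids):
    """Next-token cross-entropy: average negative log-probability the model
    assigns to each observed token (log-softmax over the vocabulary)."""
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(target_ids)), target_ids].mean()

rng = np.random.default_rng(2)
vocab, steps = 100, 6
logits = rng.normal(size=(steps, vocab))       # one row of scores per position
targets = rng.integers(0, vocab, size=steps)   # the observed next tokens
print(round(cross_entropy(logits, targets), 3))
```

For near-random logits the loss sits close to log(vocab); training drives it down by shifting probability mass onto the tokens that actually occur.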

Pretraining Corpora for LLMs

  • Given the scale requirements for training LLMs, the pretraining corpora that can match these are limited:
    • Internet (e.g. Common Crawl)
    • Wikipedia
    • Books (e.g. Google Books)
    • Code repositories (e.g. GitHub)
    • Social media (e.g. Reddit)
    • News archives (e.g. New York Times)
    • Academic papers (e.g. arXiv, PubMed)

Fine-tuning LLMs

  • General-purpose LLMs might not perform well on specific tasks/domains (e.g. legal) or in specific languages (e.g. Irish).
  • In such cases, the model can be further trained on a smaller, task-specific dataset that contains both instructions and correct responses.
  • Fine-tuning is sometimes combined with alignment to reduce the chances of the model producing harmful or biased outputs.
  • But there are also attempts to remove such content from the data at the pretraining stage.
  • The trade-off is that such measures can also reduce the model’s performance on related tasks (e.g. detection of hate speech).

Evaluating LLMs

  • Traditional NLP approaches:
    • Perplexity: how well the model predicts a sample of text. Lower perplexity indicates better performance.
  • Contemporary approaches:
    • Reasoning: ability to perform human-like reasoning tasks (e.g. arithmetic, commonsense reasoning, etc.)
    • Standardized tests: performance on standardized tests (e.g. SAT, GRE, etc.)
    • Human evaluation: Turing-style tests with human judges.
  • Social sciences:
    • Annotation quality: how accurately and consistently the model can annotate text data.

LLMs in Social Science Research

Applications of LLMs

  • LLMs have already seen a range of applications in social science research, most notably in text annotation.
  • Importantly, on some of the tasks LLMs have been shown to outperform human annotators.
  • What does this entail for the future of social science research?

Types of Annotation Tasks

(Bisbee & Spirling, 2026)

LLMs Performance over Time

(Bisbee & Spirling, 2026)

Methodological Implications

  • Both LLM and human performance are bounded above by 100% accuracy by definition.
  • But human performance may not be an upper bound on a given task.
  • Particularly so for crowdworkers and tasks that have objective ground truth.
  • Bisbee & Spirling (2026) argue for conducting sensitivity analysis:
  • E.g. by how much would annotation performance need to change for the substantive conclusions to change?
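One way to build intuition for such a sensitivity analysis (a hypothetical sketch, not the authors' actual procedure) is to simulate how a substantive estimate moves as annotation accuracy degrades, assuming symmetric label-flipping at rate (1 − accuracy):

```python
import numpy as np

rng = np.random.default_rng(3)
# True share of, say, negative-toned documents is 30% in this simulation.
true_labels = rng.random(10_000) < 0.30

for accuracy in (1.0, 0.9, 0.8, 0.7):
    flip = rng.random(true_labels.size) > accuracy   # which labels get flipped
    observed = np.where(flip, ~true_labels, true_labels)
    print(f"accuracy={accuracy:.1f}  estimated share={observed.mean():.3f}")
```

Under symmetric error the estimate is pulled towards 0.5 (expected share = p·acc + (1 − p)·(1 − acc)), so one can read off how much accuracy loss the substantive conclusion can tolerate before it flips.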

LLMs & Surveys

(Westwood, 2025)

LLMs & Surveys

  • On the one hand, some researchers have highlighted the opportunities offered by synthetic survey responses generated by LLMs (e.g. Argyle et al., 2023; Horton et al., 2026) for pilot testing.
  • On the other hand, such responses come with considerable caveats (Bisbee et al., 2024), such as:
    • Less variation in responses compared to human respondents;
    • Minor changes in prompt can lead to major changes in responses;
    • Training data is heavily skewed towards certain demographics (e.g. US-based, English-speaking, etc.)
  • Other scholars have shown the inherent risks posed by LLMs for online surveys (e.g. Westwood, 2025).

Research Design is Fundamental

(Huang et al., 2026)

Next

  • Tutorial: Designing studies with LLMs
  • Final project: Due by 23:59 on Wednesday, 22nd April (submission on Blackboard)