Week 12: Large Language Models
POP77032 Quantitative Text Analysis for Social Scientists
Overview
- Embeddings architecture
- Large language models
- LLMs in social science research
Word2Vec Architectures
(Mikolov, Chen, Corrado & Dean, 2013)
- The Word2Vec model is, in fact, two different but related architectures for learning word embeddings:
- Continuous Bag of Words (CBOW): Predicts the current word based on the context (surrounding words).
- Skip-Gram: Predicts the surrounding words given the current word.
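The Skip-Gram setup can be sketched by enumerating the (center, context) training pairs extracted from raw text within a fixed window. This is an illustrative sketch; the function name and window size are assumptions, not part of the lecture:

```python
def skipgram_pairs(tokens, window=2):
    """Yield the (center, context) pairs used as Skip-Gram training examples."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # the center word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the cat sat on the mat".split()
print(skipgram_pairs(sentence)[:3])
```

CBOW inverts the same pairs: the context words become the input and the center word the prediction target.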
Notes on Word2Vec
- In both architectures the task itself (context prediction) is not of interest.
- It only makes sense insofar as it allows the model to learn vector representations of words.
- And those vector representations can then be shown to be “useful” in some tasks.
- These vector representations are just states of the hidden layer of the neural network.
- Note that the training data for the Word2Vec model is just the raw text itself.
- This approach, often referred to as self-supervision, is one of its core innovations.
- Coupled with the efficiency of negative sampling (replacing the softmax over the full vocabulary with a handful of sigmoid-based binary classifications), this made it feasible to train the model on large corpora.
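A minimal sketch of the negative-sampling objective: the positive (center, context) pair's score is pushed up through a sigmoid, while a few sampled negative pairs are pushed down. The function names are assumptions, and the scores stand in for dot products between word vectors:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neg_sampling_loss(score_pos, scores_neg):
    """Negative-sampling loss: maximize sigmoid of the positive pair's score,
    minimize sigmoid of each sampled negative pair's score."""
    loss = -math.log(sigmoid(score_pos))
    for s in scores_neg:
        loss += -math.log(sigmoid(-s))
    return loss

# A confident positive and well-separated negatives give a small loss:
print(neg_sampling_loss(5.0, [-5.0, -4.0]))
```

The key efficiency gain is that each update touches only the positive pair plus k negatives, instead of normalizing over the entire vocabulary.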
Issues with Word2Vec
- Being based on entire words, vanilla Word2Vec has no good way to deal with out-of-vocabulary (OOV) words.
- A related issue is the lack of a mechanism for handling morphological variation, which is problematic for morphologically rich languages.
- Some of these issues can be mitigated by using sub-word tokens, as in the fastText model (Bojanowski et al., 2017).
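The sub-word idea can be illustrated by extracting boundary-marked character n-grams, as in fastText (Bojanowski et al., 2017); a word's vector is then the sum of its n-gram vectors, so even OOV words get a representation. A sketch with an assumed function name:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams with boundary markers < and >, fastText-style."""
    w = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(w) - n + 1):
            grams.append(w[i:i + n])
    return grams

# An unseen word still shares n-grams with in-vocabulary words:
print(char_ngrams("where", n_min=3, n_max=3))  # → ['<wh', 'whe', 'her', 'ere', 're>']
```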
Static Embeddings
- Fundamentally, all word embedding models we considered so far (CBOW, Skip-Gram, GloVe, fastText) are static.
- They learn a single vector representation for each word in the vocabulary.
- The same word will have the same embedding regardless of the context in which it appears.
- However, this creates problems for words with multiple meanings:
- E.g. “river bank” and “central bank”
Contextual Embeddings
(Uszkoreit, 2017)
- To solve this problem, the transformer architecture introduced the self-attention layer.
- At its core, self-attention is a weighted sum of context vectors, with the bulk of computation going into determining how these weights are computed and what gets summed.
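The weighted-sum structure of self-attention can be sketched in a few lines, assuming Q = K = V = X (i.e. omitting the learned projection matrices of a full transformer layer):

```python
import math

def self_attention(X):
    """Minimal single-head self-attention sketch with Q = K = V = X.
    Each output vector is a softmax-weighted sum of all input vectors."""
    d = len(X[0])
    out = []
    for q in X:
        # scaled dot-product scores of the query against every position
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        # softmax over positions (shift by max for numerical stability)
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # weighted sum of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out
```

Because the weights depend on the surrounding tokens, the same word receives a different output vector in different contexts — exactly what static embeddings lack.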
What is a Large Language Model?
- Fundamentally, it is a language model, i.e. an autoregressive model \(P(w_t | w_{t-1}, w_{t-2}, \ldots, w_1)\) that is trained on a large corpus of text.
- The large part refers both to:
- size of the training data;
- number of model parameters (weights across layers).
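A toy instance of such an autoregressive model is a bigram model, where the context is truncated to the previous token only and probabilities are estimated by maximum likelihood from counts. The corpus and function name are illustrative:

```python
from collections import Counter

def bigram_lm(tokens):
    """Toy autoregressive model: estimate P(w_t | w_{t-1}) from bigram counts."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens[:-1])
    return lambda prev, w: bigrams[(prev, w)] / unigrams[prev]

corpus = "the cat sat on the mat".split()
p = bigram_lm(corpus)
print(p("the", "cat"))  # "the" precedes "cat" once out of two occurrences → 0.5
```

An LLM replaces the count table with a neural network conditioning on the full preceding context, but the probabilistic object being modeled is the same.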
Architectures of LLMs
(Jurafsky & Martin, 2026)
- Decoder: takes a sequence of tokens as input and produces a sequence of tokens as output (E.g. GPT, Claude, LLaMA, Mistral, etc.)
- Encoder: takes a sequence of tokens as input and produces a vector representation of the input (E.g. BERT, RoBERTa, etc.)
- Encoder-Decoder: takes a sequence of tokens as input and produces a sequence of tokens as output, but these tokens don’t have to come from the same set (e.g. machine translation, speech-to-text, etc.)
Training LLMs
- LLMs are trained in 3 stages:
- Pre-training: the model is trained on a massive corpus of texts using self-supervised learning, minimizing a cross-entropy loss via backpropagation.
- Fine-tuning: the pre-trained model is further trained on a smaller, task-specific dataset that contains both instructions and correct responses.
- Alignment: the model is further fine-tuned using reinforcement learning from human feedback (RLHF) to align the model’s behavior with human values and preferences.
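The cross-entropy loss minimized in pre-training can be sketched for a single next-token prediction: it is the negative log-probability the model assigns to the correct token. The toy distribution and names are assumptions:

```python
import math

def cross_entropy(probs, target):
    """Cross-entropy for one next-token prediction:
    the negative log-probability of the correct token."""
    return -math.log(probs[target])

# Model distribution over a toy vocabulary; "sat" is the true next token.
probs = {"sat": 0.7, "ran": 0.2, "the": 0.1}
print(round(cross_entropy(probs, "sat"), 4))  # → 0.3567
```

Averaged over all positions in the corpus, this is the quantity backpropagation drives down during pre-training.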
Pretraining Corpora for LLMs
- Given the scale requirements for training LLMs, the pretraining corpora that can match these are limited:
- Internet (e.g. Common Crawl)
- Wikipedia
- Books (e.g. Google Books)
- Code repositories (e.g. GitHub)
- Social media (e.g. Reddit)
- News archives (e.g. New York Times)
- Academic papers (e.g. arXiv, PubMed)
Fine-tuning LLMs
- General-purpose LLMs might not perform well on specific tasks/domains (e.g. legal) or in specific languages (e.g. Irish).
- In such cases, the model can be further trained on a smaller, task-specific dataset that contains both instructions and correct responses.
- Fine-tuning is sometimes combined with alignment to reduce the chances of the model producing harmful or biased outputs.
- But there are attempts to remove such content from the data in pretraining stage as well.
- The trade-off is that such measures can also reduce the model's performance on related tasks (e.g. detection of hate speech).
Evaluating LLMs
- Traditional NLP approaches:
- Perplexity: how well the model predicts a sample of text. Lower perplexity indicates better performance.
- Contemporary approaches:
- Reasoning: ability to perform human-like reasoning tasks (e.g. arithmetic, commonsense reasoning, etc.)
- Standardized tests: performance on standardized tests (e.g. SAT, GRE, etc.)
- Human evaluation: Turing-style tests with human judges.
- Social sciences:
- Annotation quality: how accurately and consistently the model can annotate text data.
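Perplexity can be computed directly from the per-token probabilities the model assigns to a held-out text: it is the exponentiated average negative log-probability per token. A sketch with an assumed function name:

```python
import math

def perplexity(token_probs):
    """Perplexity of a sequence given the model's per-token probabilities.
    Lower values indicate the model finds the text less 'surprising'."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# A model assigning uniform probability 1/4 to each token has perplexity 4:
print(round(perplexity([0.25, 0.25, 0.25, 0.25]), 6))  # → 4.0
```

Intuitively, perplexity is the effective number of equally likely choices the model faces per token.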
LLMs in Social Science Research
Applications of LLMs
- LLMs have already seen a range of social science research applications, including text annotation and the generation of synthetic survey responses.
- Importantly, on some of the tasks LLMs have been shown to outperform human annotators.
- What does this entail for the future of social science research?
Methodological Implications
- Both LLM and human performance are bounded above by 100% accuracy by definition.
- But human performance may not be an upper bound on a given task.
- Particularly so for crowdworkers and tasks that have objective ground truth.
- Bisbee & Spirling (2026) argue for conducting sensitivity analyses:
- E.g. by how much should the annotation performance change in order for the substantive conclusions to change?
LLMs & Surveys
- On the one hand, some researchers have highlighted the opportunities offered by synthetic survey responses generated by LLMs (e.g. Argyle et al., 2023; Horton et al., 2026) for pilot testing.
- On the other hand, such responses come with considerable caveats (Bisbee et al., 2024), such as:
- Less variation in responses compared to human respondents;
- Minor changes in prompt can lead to major changes in responses;
- Training data heavily skewed towards certain demographics (e.g. US-based, English-speaking, etc.)
- Other scholars have shown the inherent risks posed by LLMs for online surveys (e.g. Westwood, 2025).
Next
- Tutorial: Designing studies with LLMs
- Final project: Due by 23:59 on Wednesday, 22nd April (submission on Blackboard)