Week 12: Large Language Models
POP77032 Quantitative Text Analysis for Social Scientists
Overview
- Embeddings architecture
- Large language models
- LLMs in social science research
Word2Vec Architectures
(Mikolov, Chen, Corrado & Dean, 2013)
- The Word2Vec model is, in fact, two different but related architectures for learning word embeddings:
- Continuous Bag of Words (CBOW): Predicts the current word based on the context (surrounding words).
- Skip-Gram: Predicts the surrounding words given the current word.
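The Skip-Gram setup can be sketched by enumerating the (center, context) training pairs extracted from raw text within a fixed window. This is an illustrative sketch; the function name and window size are assumptions, not part of the lecture:

```python
def skipgram_pairs(tokens, window=2):
    """Yield the (center, context) pairs used as Skip-Gram training examples."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # the center word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the cat sat on the mat".split()
print(skipgram_pairs(sentence)[:3])
```

CBOW inverts the same pairs: the context words become the input and the center word the prediction target.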
Notes on Word2Vec
- In both architectures the task itself (context prediction) is not of interest.
- It only makes sense insofar as it allows the model to learn vector representations of words.
- And those vector representations can then be shown to be “useful” in some tasks.
- These vector representations are just states of the hidden layer of the neural network.
- Note that the training data for the Word2Vec model is just the raw text itself.
- This approach, often referred to as self-supervision, is one of its core innovations.
- Coupled with the efficiency of negative sampling (replacing the softmax over the full vocabulary with a handful of sigmoid-based binary classifications), this made it feasible to train the model on large corpora.
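A minimal sketch of the negative-sampling objective: the positive (center, context) pair's score is pushed up through a sigmoid, while a few sampled negative pairs are pushed down. The function names are assumptions, and the scores stand in for dot products between word vectors:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neg_sampling_loss(score_pos, scores_neg):
    """Negative-sampling loss: maximize sigmoid of the positive pair's score,
    minimize sigmoid of each sampled negative pair's score."""
    loss = -math.log(sigmoid(score_pos))
    for s in scores_neg:
        loss += -math.log(sigmoid(-s))
    return loss

# A confident positive and well-separated negatives give a small loss:
print(neg_sampling_loss(5.0, [-5.0, -4.0]))
```

The key efficiency gain is that each update touches only the positive pair plus k negatives, instead of normalizing over the entire vocabulary.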
Issues with Word2Vec
- Being based on entire words, vanilla Word2Vec has no good way to deal with out-of-vocabulary (OOV) words.
- A related issue is the lack of a mechanism for handling morphological variation, which is problematic for morphologically rich languages.
- Some of these issues can be mitigated by using sub-word tokens, as in the fastText model (Bojanowski et al., 2017).
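The sub-word idea can be illustrated by extracting boundary-marked character n-grams, as in fastText (Bojanowski et al., 2017); a word's vector is then the sum of its n-gram vectors, so even OOV words get a representation. A sketch with an assumed function name:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams with boundary markers < and >, fastText-style."""
    w = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(w) - n + 1):
            grams.append(w[i:i + n])
    return grams

# An unseen word still shares n-grams with in-vocabulary words:
print(char_ngrams("where", n_min=3, n_max=3))  # → ['<wh', 'whe', 'her', 'ere', 're>']
```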
Static Embeddings
- Fundamentally, all word embedding models we considered so far (CBOW, Skip-Gram, GloVe, fastText) are static.
- They learn a single vector representation for each word in the vocabulary.
- The same word will have the same embedding regardless of the context in which it appears.
- However, this creates problems for words with multiple meanings:
- E.g. “river bank” and “central bank”
Contextual Embeddings
(Uszkoreit, 2017)
- To solve this problem, the transformer architecture introduced the self-attention layer.
- At its core, self-attention is a weighted sum of context vectors, with the bulk of computation going into determining how these weights are computed and what gets summed.
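The weighted-sum structure of self-attention can be sketched in a few lines, assuming Q = K = V = X (i.e. omitting the learned projection matrices of a full transformer layer):

```python
import math

def self_attention(X):
    """Minimal single-head self-attention sketch with Q = K = V = X.
    Each output vector is a softmax-weighted sum of all input vectors."""
    d = len(X[0])
    out = []
    for q in X:
        # scaled dot-product scores of the query against every position
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        # softmax over positions (shift by max for numerical stability)
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # weighted sum of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out
```

Because the weights depend on the surrounding tokens, the same word receives a different output vector in different contexts — exactly what static embeddings lack.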
What is a Large Language Model?
- Fundamentally, it is a language model, i.e. an autoregressive model \(P(w_t | w_{t-1}, w_{t-2}, \ldots, w_1)\) that is trained on a large corpus of text.
- The large part refers both to:
- size of the training data;
- number of model parameters (weights across layers).
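A toy instance of such an autoregressive model is a bigram model, where the context is truncated to the previous token only and probabilities are estimated by maximum likelihood from counts. The corpus and function name are illustrative:

```python
from collections import Counter

def bigram_lm(tokens):
    """Toy autoregressive model: estimate P(w_t | w_{t-1}) from bigram counts."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens[:-1])
    return lambda prev, w: bigrams[(prev, w)] / unigrams[prev]

corpus = "the cat sat on the mat".split()
p = bigram_lm(corpus)
print(p("the", "cat"))  # "the" precedes "cat" once out of two occurrences → 0.5
```

An LLM replaces the count table with a neural network conditioning on the full preceding context, but the probabilistic object being modeled is the same.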
Architectures of LLMs
(Jurafsky & Martin, 2026)
- Decoder: takes a sequence of tokens as input and produces a sequence of tokens as output (E.g. GPT, Claude, LLaMA, Mistral, etc.)
- Encoder: takes a sequence of tokens as input and produces a vector representation of the input (E.g. BERT, RoBERTa, etc.)
- Encoder-Decoder: takes a sequence of tokens as input and produces a sequence of tokens as output, but these tokens don’t have to come from the same set (e.g. machine translation, speech-to-text, etc.)
Training LLMs
- LLMs are trained in 3 stages:
- Pre-training: the model is trained on a massive corpus of texts using self-supervised learning, minimizing a cross-entropy loss via backpropagation.
- Fine-tuning: the pre-trained model is further trained on a smaller, task-specific dataset that contains both instructions and correct responses.
- Alignment: the model is further fine-tuned using reinforcement learning from human feedback (RLHF) to align the model’s behavior with human values and preferences.
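The cross-entropy loss minimized in pre-training can be sketched for a single next-token prediction: it is the negative log-probability the model assigns to the correct token. The toy distribution and names are assumptions:

```python
import math

def cross_entropy(probs, target):
    """Cross-entropy for one next-token prediction:
    the negative log-probability of the correct token."""
    return -math.log(probs[target])

# Model distribution over a toy vocabulary; "sat" is the true next token.
probs = {"sat": 0.7, "ran": 0.2, "the": 0.1}
print(round(cross_entropy(probs, "sat"), 4))  # → 0.3567
```

Averaged over all positions in the corpus, this is the quantity backpropagation drives down during pre-training.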
Pretraining Corpora for LLMs
- Given the scale requirements for training LLMs, the pretraining corpora that can match these are limited:
- Internet (e.g. Common Crawl)
- Wikipedia
- Books (e.g. Google Books)
- Code repositories (e.g. GitHub)
- Social media (e.g. Reddit)
- News archives (e.g. New York Times)
- Academic papers (e.g. arXiv, PubMed)
Fine-tuning LLMs
- General-purpose LLMs might not perform well on specific tasks/domains (e.g. legal) or in specific languages (e.g. Irish).
- In such cases, the model can be further trained on a smaller, task-specific dataset that contains both instructions and correct responses.
- Fine-tuning is sometimes combined with alignment to reduce the chances of the model producing harmful or biased outputs.
- But there are attempts to remove such content from the data in pretraining stage as well.
- The trade-off is that such measures can also reduce the model's performance on related tasks (e.g. detection of hate speech).
Evaluating LLMs
- Traditional NLP approaches:
- Perplexity: how well the model predicts a sample of text. Lower perplexity indicates better performance.
- Contemporary approaches:
- Reasoning: ability to perform human-like reasoning tasks (e.g. arithmetic, commonsense reasoning, etc.)
- Standardized tests: performance on standardized tests (e.g. SAT, GRE, etc.)
- Human evaluation: Turing-style tests with human judges.
- Social sciences:
- Annotation quality: how accurately and consistently the model can annotate text data.
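Perplexity can be computed directly from the per-token probabilities the model assigns to a held-out text: it is the exponentiated average negative log-probability per token. A sketch with an assumed function name:

```python
import math

def perplexity(token_probs):
    """Perplexity of a sequence given the model's per-token probabilities.
    Lower values indicate the model finds the text less 'surprising'."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# A model assigning uniform probability 1/4 to each token has perplexity 4:
print(round(perplexity([0.25, 0.25, 0.25, 0.25]), 6))  # → 4.0
```

Intuitively, perplexity is the effective number of equally likely choices the model faces per token.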
LLMs in Social Science Research
Applications of LLMs
- LLMs have already seen a range of social science research applications, including text annotation and the generation of synthetic survey responses.
- Importantly, on some of the tasks LLMs have been shown to outperform human annotators.
- What does this entail for the future of social science research?
Methodological Implications
- Both LLM and human performance are bounded above by 100% accuracy by definition.
- But human performance may not be an upper bound on a given task.
- Particularly so for crowdworkers and tasks that have objective ground truth.
- Bisbee & Spirling (2026) argue for conducting sensitivity analyses:
- E.g. by how much should the annotation performance change in order for the substantive conclusions to change?
LLMs & Surveys
- On the one hand, some researchers have highlighted the opportunities offered by synthetic survey responses generated by LLMs (e.g. Argyle et al., 2023; Horton et al., 2026) for pilot testing.
- On the other hand, such responses come with considerable caveats (Bisbee et al., 2024), such as:
- Less variation in responses compared to human respondents;
- Minor changes in prompt can lead to major changes in responses;
- Training data heavily skewed towards certain demographics (e.g. US-based, English-speaking, etc.)
- Other scholars have shown the inherent risks posed by LLMs for online surveys (e.g. Westwood, 2025).
Next
- Tutorial: Designing studies with LLMs
- Final project: Due by 23:59 on Wednesday, 22nd April (submission on Blackboard)