Week 1: Introduction

POP77032 Quantitative Text Analysis for Social Scientists

Tom Paskhalis

Overview

  • Module objectives
  • Prerequisites and software
  • Materials and books
  • Module meetings
  • Assessment and collaboration
  • Weekly schedule

Module Objectives

  • Introduce the fundamentals of working with text as data;
  • Extract and prepare textual data for analysis;
  • Apply key computational techniques for textual data;
  • Practice these concepts using social science examples.

Module Materials

Books

Also:

  • Christopher Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. Cambridge, MA: The MIT Press.
  • Jacob Eisenstein. 2019. Introduction to Natural Language Processing. Cambridge, MA: The MIT Press.
  • Klaus Krippendorff. 2019. Content Analysis: An Introduction to Its Methodology. 4th ed. Thousand Oaks, CA: SAGE Publications.

Additional Online Materials

Prerequisites and Software

  • Intermediate module - familiarity with basic statistical concepts and programming in R/Python is assumed.
  • Laptop with Windows/Mac/Linux OS (no Chromebooks)
  • Required software:
    • Jupyter - web-based interactive computational environment
    • Python (version 3+) - versatile programming language
    • R (version 4+) - statistical programming language
  • Additional software:

Module Meetings

Assessment

  • 3 programming exercises (40%)

  • Research paper (60%)

    • Approximately 10 pages, or 3,000-4,000 words (references excluded)
    • Due by 23:59 Wednesday, 22 April 2026

Plagiarism Policy

  • Plagiarising computer code is as serious as plagiarising text (see Google LLC v. Oracle America, Inc.)
  • All submitted programming assignments and the final project should be done individually;
  • You may discuss general approaches to solutions with your peers;
  • But do not share or view each other's code;
  • You can use online resources but give credit in the comments.

Generative AI Policy

  • The use of generative AI is permitted.
  • However:
    • No part of the module content can be used in a prompt;
    • Any use of generative AI needs to be explicitly acknowledged in the submission;
    • You need to state the version of the model used.
  • Hardware permitting, I recommend using local offline models installed on your machine.
  • For example, LM Studio offers a user-friendly interface to different models.

Module Outline

Week  Date         Topic                       Released      Due
1     21 January   Introduction
2     28 January   Words and Tokens            Assignment 1
3     4 February   Quantifying Texts
4     11 February  Dictionaries and Sentiment                Assignment 1
5     18 February  Supervised Modelling        Assignment 2
6     25 February  Unsupervised Modelling
7     4 March      -                                         Assignment 2
8     11 March     Beyond Bag-of-Words
9     18 March     Embeddings                  Assignment 3
10    25 March     Neural Networks
11    1 April      Transformers                              Assignment 3
12    8 April      Large Language Models

Next

  • Introduction to QTA